
  • http://www.cambridge.org/9780521855150

  • PERFORMANCE ANALYSIS OF COMMUNICATIONS NETWORKS AND SYSTEMS

    PIET VAN MIEGHEM, Delft University of Technology

  • cambridge university press
    Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo

    Cambridge University Press, The Edinburgh Building, Cambridge cb2 2ru, UK

    Published in the United States of America by Cambridge University Press, New York

    www.cambridge.org
    Information on this title: www.cambridge.org/9780521855150

    © Cambridge University Press 2006

    This publication is in copyright. Subject to statutory exception and to the provision of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

    First published in print format 2006

    isbn-13 978-0-511-16917-5 eBook (NetLibrary)
    isbn-10 0-511-16917-5 eBook (NetLibrary)
    isbn-13 978-0-521-85515-0 hardback
    isbn-10 0-521-85515-2 hardback

    Cambridge University Press has no responsibility for the persistence or accuracy of urls for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.

  • Waar een wil is, is een weg. (Where there is a will, there is a way.)

    to my father,
    to my wife Saskia and my sons Vincent, Nathan and Laurens

  • Contents

    Preface xi

    1 Introduction 1

    Part I Probability theory 7

    2 Random variables 9

    2.1 Probability theory and set theory 9
    2.2 Discrete random variables 16
    2.3 Continuous random variables 20
    2.4 The conditional probability 26
    2.5 Several random variables and independence 28
    2.6 Conditional expectation 34

    3 Basic distributions 37

    3.1 Discrete random variables 37
    3.2 Continuous random variables 43
    3.3 Derived distributions 47
    3.4 Functions of random variables 51
    3.5 Examples of other distributions 54
    3.6 Summary tables of probability distributions 58
    3.7 Problems 59

    4 Correlation 61

    4.1 Generation of correlated Gaussian random variables 61
    4.2 Generation of correlated random variables 67
    4.3 The non-linear transformation method 68

    4.4 Examples of the non-linear transformation method 74
    4.5 Linear combination of independent auxiliary random variables 78
    4.6 Problem 82

    5 Inequalities 83

    5.1 The minimum (maximum) and infimum (supremum) 83
    5.2 Continuous convex functions 84
    5.3 Inequalities deduced from the Mean Value Theorem 86
    5.4 The Markov and Chebyshev inequalities 87
    5.5 The Hölder, Minkowski and Young inequalities 90
    5.6 The Gauss inequality 92
    5.7 The dominant pole approximation and large deviations 94

    6 Limit laws 97

    6.1 General theorems from analysis 97
    6.2 Law of Large Numbers 101
    6.3 Central Limit Theorem 103
    6.4 Extremal distributions 104

    Part II Stochastic processes 113

    7 The Poisson process 115

    7.1 A stochastic process 115
    7.2 The Poisson process 120
    7.3 Properties of the Poisson process 122
    7.4 The nonhomogeneous Poisson process 129
    7.5 The failure rate function 130
    7.6 Problems 132

    8 Renewal theory 137

    8.1 Basic notions 138
    8.2 Limit theorems 144
    8.3 The residual waiting time 149
    8.4 The renewal reward process 153
    8.5 Problems 155

    9 Discrete-time Markov chains 157

    9.1 Definition 157

    9.2 Discrete-time Markov chain 158
    9.3 The steady-state of a Markov chain 168
    9.4 Problems 177

    10 Continuous-time Markov chains 179

    10.1 Definition 179
    10.2 Properties of continuous-time Markov processes 180
    10.3 Steady-state 187
    10.4 The embedded Markov chain 188
    10.5 The transitions in a continuous-time Markov chain 193
    10.6 Example: the two-state Markov chain in continuous-time 195
    10.7 Time reversibility 196
    10.8 Problems 199

    11 Applications of Markov chains 201

    11.1 Discrete Markov chains and independent random variables 201
    11.2 The general random walk 202
    11.3 Birth and death process 208
    11.4 A random walk on a graph 218
    11.5 Slotted Aloha 219
    11.6 Ranking of webpages 224
    11.7 Problems 228

    12 Branching processes 229

    12.1 The probability generating function 231
    12.2 The limit of the scaled random variables 233
    12.3 The probability of extinction of a branching process 237
    12.4 Asymptotic behavior of 240
    12.5 A geometric branching process 243

    13 General queueing theory 247

    13.1 A queueing system 247
    13.2 The waiting process: Lindley's approach 252
    13.3 The Beneš approach to the unfinished work 256
    13.4 The counting process 263
    13.5 PASTA 266
    13.6 Little's Law 267

    14 Queueing models 271

    14.1 The M/M/1 queue 271
    14.2 Variants of the M/M/1 queue 276
    14.3 The M/G/1 queue 283
    14.4 The GI/D/m queue 289
    14.5 The M/D/1/K queue 296
    14.6 The N*D/D/1 queue 300
    14.7 The AMS queue 304
    14.8 The cell loss ratio 309
    14.9 Problems 312

    Part III Physics of networks 317

    15 General characteristics of graphs 319

    15.1 Introduction 319
    15.2 The number of paths with hops 321
    15.3 The degree of a node in a graph 322
    15.4 Connectivity and robustness 325
    15.5 Graph metrics 328
    15.6 Random graphs 329
    15.7 The hopcount in a large, sparse graph with unit link weights 340
    15.8 Problems 346

    16 The Shortest Path Problem 347

    16.1 The shortest path and the link weight structure 348
    16.2 The shortest path tree in with exponential link weights 349
    16.3 The hopcount in the URT 354
    16.4 The weight of the shortest path 359
    16.5 The flooding time 361
    16.6 The degree of a node in the URT 366
    16.7 The minimum spanning tree 373
    16.8 The proof of the degree Theorem 16.6.1 of the URT 380
    16.9 Problems 385

    17 The efficiency of multicast 387

    17.1 General results for ( ) 388
    17.2 The random graph ( ) 392
    17.3 The -ary tree 401

    17.4 The Chuang-Sirbu law 404
    17.5 Stability of a multicast shortest path tree 407
    17.6 Proof of (17.16): ( ) for random graphs 410
    17.7 Proof of Theorem 17.3.1: ( ) for -ary trees 414
    17.8 Problem 416

    18 The hopcount to an anycast group 417

    18.1 Introduction 417
    18.2 General analysis 419
    18.3 The -ary tree 423
    18.4 The uniform recursive tree (URT) 424
    18.5 Approximate analysis 431
    18.6 The performance measure in exponentially growing trees 432

    Appendix A Stochastic matrices 435

    Appendix B Algebraic graph theory 471

    Appendix C Solutions of problems 493

    Bibliography 523

    Index 529

  • Preface

    Performance analysis belongs to the domain of applied mathematics. The major domain of application in this book concerns telecommunications systems and networks. We will mainly use stochastic analysis and probability theory to address problems in the performance evaluation of telecommunications systems and networks. The first chapter provides a motivation and a statement of several problems.

    This book aims to present methods rigorously, hence mathematically, with minimal resort to intuition. It is my belief that intuition is often gained after the result is known and rarely before the problem is solved, unless the problem is simple. Techniques and terminologies of axiomatic probability (such as definitions of probability spaces, filtration, measures, etc.) have been omitted and a more direct, less abstract approach has been adopted. In addition, most of the important formulas are interpreted in the sense of "What does this mathematical expression teach me?" This last step justifies the word "applied", since most mathematical treatises do not interpret, as interpretation carries the risk of being imprecise and incomplete.

    The field of stochastic processes is much too large to be covered in a single book and only a selected number of topics have been chosen. Most of the topics are considered classical. Perhaps the largest omission is a treatment of Brownian processes and the many related applications. A weak excuse for this omission (besides the considerable mathematical complexity) is that Brownian theory applies more to physics (analogue fields) than to system theory (discrete components). The list of omissions is rather long and only the most noteworthy are summarized: recent concepts such as martingales and the coupling theory of stochastic variables, queueing networks, scheduling rules, and the theory of long-range dependent random variables that currently governs traffic in the Internet. The confinement to stochastic analysis also excludes the recent new framework, called Network Calculus by Le Boudec and Thiran (2001). Network calculus is based on min-plus algebra and has been applied to (Inter)network problems in a deterministic setting.

    As prerequisites, familiarity with elementary probability and knowledge of the theory of functions of a complex variable are assumed. Parts of the text in small font refer to more advanced topics or to computations that can be skipped at first reading. Part I (Chapters 2-6) reviews probability theory and is included to make the remainder self-contained. The book essentially starts with Chapter 7 (Part II) on Poisson processes. The Poisson process (independent increments and discontinuous sample paths) and Brownian motion (independent increments but continuous sample paths) are considered to be the most important basic stochastic processes. We briefly touch upon renewal theory to move to Markov processes. The theory of Markov processes is regarded as a foundation for many applications in telecommunications systems, in particular queueing theory. A large part of the book is devoted to Markov processes and their applications. The last chapters of Part II dive into queueing theory. Inspired by intriguing problems in telephony at the beginning of the twentieth century, Erlang pushed queueing theory onto the scientific stage. Since his investigations, queueing theory has grown considerably. Especially during the last decade, with the advent of the Asynchronous Transfer Mode (ATM) and the worldwide Internet, many early ideas have been refined (e.g. discrete-time queueing theory, large deviation theory, scheduling control of prioritized flows of packets) and new concepts (self-similar or fractal processes) have been proposed. Part III covers current research on the physics of networks. This Part III is undoubtedly the least mature and complete. In contrast to most books, I have chosen to include the solutions to the problems in an Appendix to support self-study.

    I am grateful to colleagues and students whose input has greatly improved this text. Fernando Kuipers and Stijn van Langen have corrected a large number of misprints. Together with Fernando, Milena Janic and Almerima Jamakovic have supplied me with exercises. Gerard Hooghiemstra has made valuable comments and was always available for discussions about my viewpoints. Bart Steyaert eagerly gave the finer details of the generating function approach to the GI/D/m queue. Jan Van Mieghem has given overall comments and suggestions besides his input with the computation of correlations. Finally, I thank David Hemsley for his scrupulous corrections of the original manuscript.

    Although this book is intended to be of practical use, in the course of writing it I became more and more persuaded that mathematical rigor has ample virtues of its own.

    Per aspera ad astra (through hardships to the stars)

    January 2006 Piet Van Mieghem

  • 1

    Introduction

    The aim of this first chapter is to motivate why stochastic processes and probability theory are useful to solve problems in the domain of telecommunications systems and networks.

    In any system, or for any transmission of information, there is always a non-zero probability of failure or of error penetration. A lot of problems in quantifying the failure rate, bit error rate or the computation of redundancy to recover from hazards are successfully treated by probability theory. Often we deal in communications with a large variety of signals, calls, source-destination pairs, messages, the number of customers per region, and so on. And, most often, precise information at any time is not available or, if it is available, deterministic studies or simulations are simply not feasible due to the large number of different parameters involved. For such problems, a stochastic approach is often a powerful vehicle, as has been demonstrated in the field of physics.

    Perhaps the first impressive result of a stochastic approach was Boltzmann's and Maxwell's statistical theory. They studied the behavior of particles in an ideal gas and described how macroscopic quantities such as pressure and temperature can be related to the microscopic motion of the huge number of individual particles. Boltzmann also introduced the stochastic notion of the thermodynamic concept of entropy S,

    S = k log W

    where W denotes the total number of ways in which the ensembles of particles can be distributed in thermal equilibrium and where k is a proportionality factor, afterwards attributed to Boltzmann as the Boltzmann constant. The pioneering work of these early physicists such as Boltzmann, Maxwell and others was the germ of a large number of breakthroughs in science. Shortly after their introduction of stochastic theory in classical physics, the theory of quantum mechanics (see e.g. Cohen-Tannoudji et al., 1977) was established. This theory proposes that the elementary building blocks of nature, the atom and electrons, can only be described in a probabilistic sense. The conceptually difficult notion of a wave function, whose squared modulus expresses the probability that a set of particles is in a certain state, and Heisenberg's uncertainty relation exclude in a dramatic way our deterministic, macroscopic view on nature at the fine atomic scale. At about the same time as the theory of quantum mechanics was being

    created, Erlang applied probability theory to the field of telecommunications. Erlang succeeded to determine the number m of telephone input lines of a switch in order to serve customers with a certain blocking probability. Perhaps his most used formula is the Erlang B formula (14.17), derived in Section 14.2.2,

    Pr[N = m] = (ρ^m / m!) / (Σ_{k=0}^{m} ρ^k / k!)

    where the load or traffic intensity ρ is the ratio of the arrival rate of calls to the telephone local exchange or switch over the processing rate of the switch per line. By equating the desired blocking probability B = Pr[N = m], say B = 10^{-4}, the number m of input lines can be computed for each load ρ.
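    The table computation just described can be sketched in a few lines. This is an illustrative sketch, not code from the book: it uses the standard numerically stable recursion B(k) = ρ·B(k−1)/(k + ρ·B(k−1)), which is algebraically equivalent to the closed form (14.17), and the function names are ours.

```python
def erlang_b(m: int, rho: float) -> float:
    """Erlang B blocking probability for m lines and offered load rho,
    via the stable recursion B(0) = 1, B(k) = rho*B(k-1)/(k + rho*B(k-1))."""
    b = 1.0
    for k in range(1, m + 1):
        b = rho * b / (k + rho * b)
    return b


def min_lines(rho: float, target: float) -> int:
    """Smallest number of input lines m whose blocking probability
    does not exceed the desired level, e.g. target = 1e-4."""
    m = 0
    while erlang_b(m, rho) > target:
        m += 1
    return m
```

    For instance, `min_lines(10.0, 1e-4)` reproduces the kind of entry that the printed Erlang tables listed for a load of ρ = 10.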

    Due to its importance, books with tables relating B, m and ρ were published. Another pioneer in the field of communications that deserves to be mentioned is Shannon. Shannon explored the concept of entropy. He introduced (see e.g. Walrand, 1998) the notion of the Shannon capacity of a channel, the maximum rate at which bits can be transmitted with arbitrarily small (but non-zero) probability of errors, and the concept of the entropy rate of a source, which is the minimum average number of bits per symbol required to encode the output of a source. Many others have extended his basic ideas and so it is fair to say that Shannon founded the field of information theory.

    A recent important driver in telecommunication is the concept of quality of service (QoS). Customers can use the network to transmit different types of information such as pictures, files, voice, etc. by requiring a specific level of service depending on the type of transmitted information. For example, a telephone conversation requires that the voice packets arrive at the receiver at most some delay D ms later, while a file transfer is mostly not time critical but requires an extremely low information loss probability. The value of the mouth-to-ear delay D is clearly related to the perceived quality of the voice conversation. As long as D ≤ 150 ms, the voice conversation has toll quality, which is, roughly speaking, the quality that we are used to in classical


    telephony. When D exceeds 150 ms, rapid degradation is experienced and when D ≥ 300 ms, most of the test persons have great difficulty in understanding the conversation. However, perceived quality may change from person to person and is difficult to determine, even for telephony. For example, if the test person knows a priori that the conversation is transmitted over a mobile or wireless channel, as in GSM, he or she is willing to tolerate a lower quality. Therefore, quality of service is both related to the nature of the information and to the individual desire and perception. In future Internetworking, it is believed that customers may request a certain QoS for each type of information. Depending on the level of stringency, the network may either allow or refuse the customer. Since customers will also pay an amount related to this QoS stringency, the network function that determines whether to accept or refuse a call for service will be of crucial interest to any network operator. Let us now state the connection admission control (CAC) problem for a voice conversation to illustrate the relation to stochastic analysis: "How many customers N are allowed in order to guarantee that the ensemble of all voice packets reaches the destination within D ms with a prescribed probability?" This problem is exceptionally difficult because it depends on the voice codecs used, the specifics of the network topology, the capacity of the individual network elements, the arrival process of calls from the customers, the duration of the conversation and other details. Therefore, we will simplify the question. Let us first assume that the delay is only caused by the waiting time of a voice packet in the queue of a router (or switch). As we will see in Chapter 13, this waiting time of voice packets in a single queueing system depends on (a) the arrival process: the way voice packets arrive, and (b) the service process: how they are processed. Let us assume that the arrival process, specified by the average arrival rate λ, and the service process, specified by the average service rate μ, are known. Clearly, the arrival rate λ is connected to the number of customers N. A simplified statement of the CAC problem is, "What is the maximum allowed N such that Pr[T > D] ≤ ε?", where T denotes the waiting time. In essence, the CAC problem consists in computing the tail probability of a quantity that depends on parameters of interest. We have elaborated on the CAC problem because it is a basic design problem that appears under several disguises. A related dimensioning problem is the determination of the buffer size in a router in order not to lose more than a certain number of packets with probability ε, given the arrival and service process. The above mentioned problem of Erlang is a third example. Another example, treated in Chapter 18, is the server placement problem: how many replicated servers m are needed to guarantee that any user can access the information within j hops with probability Pr[h_N(m) ≤ j] ≥ η, where


    η is a certain level of stringency and h_N(m) is the number of hops towards the most nearby of the m servers in a network with N routers.

    The popularity of the Internet results in a number of new challenges. The traditional mathematical models, such as the Erlang B formula, assume smooth traffic flows (small correlation and Markovian in nature). However, TCP/IP traffic has been shown to be bursty (long-range dependent, self-similar and even chaotic, non-Markovian (Veres and Boda, 2000)). As a consequence, many traditional dimensioning and control problems ask for a new solution. The self-similar and long-range dependent TCP/IP traffic is mainly caused by new complex interactions between protocols and technologies (e.g. TCP/IP/ATM/SDH) and by information other than voice being transported. It is observed that the content size of information in the Internet varies considerably in size, causing the "Noah effect": although immense floods are extremely rare, their occurrence impacts Internet behavior significantly on a global scale. Unfortunately, the mathematics to cope with self-similar and long-range dependent processes turns out to be fairly complex and beyond the scope of this book.

    Finally, we mention the current interest in understanding and modeling complex networks such as the Internet, biological networks, social networks and utility infrastructures for water, gas, electricity and transport (cars, goods, trains). Since these networks consist of a huge number of nodes and links, classical and algebraic graph theory is often not suited to produce even approximate results. The beginning of probabilistic graph theory is commonly attributed to the appearance of papers by Erdős and Rényi in the late 1940s. They investigated a particularly simple growing model for a graph: start from N nodes and connect in each step an arbitrary random, not yet connected pair of nodes until all links are used. After about N/2 steps, as shown in Section 16.7.1, they observed the birth of a giant component that, in subsequent steps, swallows the smaller ones at a high rate. This phenomenon is called a phase transition and often occurs in nature. In physics it is studied in, for example, percolation theory. To some extent, the Internet's graph bears some resemblance to the Erdős-Rényi random graph. The Internet is best regarded as a dynamic and growing network, whose graph is continuously changing. Yet, in order to deploy services over the Internet, an accurate graph model that captures the relevant structural properties is desirable. As shown in Part III, a probabilistic approach based on random graphs seems an efficient way to learn about the Internet's intriguing behavior. Although the Internet's topology is not a simple Erdős-Rényi random graph, results such as the hopcount of the shortest path and the size of a multicast tree deduced from the simple random graphs provide


    a first order estimate for the Internet. Moreover, analytic formulas based on other classes of graphs than the simple random graph prove difficult to obtain. This observation is similar to queueing theory, where, besides the M/G/x class of queues, hardly any closed expressions exist.

    We hope that this brief overview provides sufficient motivation to surmount the mathematical barriers. Skill with probability theory is deemed necessary to understand complex phenomena in telecommunications. Once mastered, the power and beauty of mathematics will be appreciated.

  • Part I

    Probability theory

  • 2

    Random variables

    This chapter reviews basic concepts from probability theory. A random variable (rv) is a variable that takes certain values by chance. Throughout this book, this imprecise and intuitive definition suffices. The precise definition involves axiomatic probability theory (Billingsley, 1995).

    Here, a distinction between discrete and continuous random variables is made, although a unified approach, including also mixed cases via the Stieltjes integral ∫ g(x) dF_X(x) (Hardy et al., 1999, pp. 152-157), is possible. In general, the distribution F_X(x) = Pr[X ≤ x] holds in both cases, and

    ∫ g(x) dF_X(x) = Σ_x g(x) Pr[X = x]    where X is a discrete rv
                   = ∫ g(x) f_X(x) dx      where X is a continuous rv

    In most practical situations, the Stieltjes integral reduces to the Riemann integral; else, Lebesgue's theory of integration and measure theory (Royden, 1988) is required.
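    The reduction of the Stieltjes integral to a sum for a discrete rv can be made concrete with a toy example. This is a sketch of ours (the fair-die distribution and the helper name are not from the book):

```python
def expectation(g, pmf):
    """E[g(X)] for a discrete rv: the Stieltjes integral over dF reduces
    to the sum of g(x) weighted by the point masses Pr[X = x]."""
    return sum(g(x) * p for x, p in pmf.items())


# fair die: Pr[X = x] = 1/6 for x in {1, ..., 6}
die = {x: 1.0 / 6.0 for x in range(1, 7)}

mean = expectation(lambda x: x, die)            # E[X] = 3.5
second_moment = expectation(lambda x: x * x, die)
```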

    2.1 Probability theory and set theory

    Pascal (1623-1662) is commonly regarded as one of the founders of probability theory. In his days, there was much interest in games of chance¹ and the likelihood of winning a game. In most of these games, there was a finite number of possible outcomes and each of them was equally likely. The

    ¹ "La règle des partis", a chapter in Pascal's mathematical work (Pascal, 1954), consists of a series of letters to Fermat that discuss the following problem (together with a more complex question that is essentially a variant of the probability of gambler's ruin treated in Section 11.2.1): Consider the game in which 2 dice are thrown n times. How many times do we have to throw the 2 dice to throw double six with probability at least 1/2?


    probability of the event A of interest was defined as

    Pr[A] = n_A / N

    where n_A is the number of favorable outcomes (sample points of A) and N is the total number of possible outcomes. If the number of outcomes of an experiment is not finite, this classical definition of probability does not suffice anymore. In order to establish a coherent and precise theory, probability theory employs concepts of group or set theory.

    The set of all possible outcomes of an experiment is called the sample space Ω. A possible outcome of an experiment is called a sample point ω that is an element of the sample space Ω. An event A consists of a set of sample points. An event is thus a subset of the sample space Ω. The complement A^c of an event A consists of all sample points of the sample space that are not in (the set) A, thus A^c = Ω \ A. Clearly, (A^c)^c = A and the complement of the sample space is the empty set, Ω^c = ∅ or, vice versa, ∅^c = Ω. A family F of events is a set of events and thus a subset of the sample space that possesses particular events as elements. More precisely, a family F of events satisfies the three conditions that define a σ-field²: (a) ∅ ∈ F, (b) if A_1, A_2, ... ∈ F, then ∪_{j=1}^∞ A_j ∈ F and (c) if A ∈ F, then A^c ∈ F. These conditions guarantee that F is closed under countable unions and intersections of events.

    Events and the probability of these events are connected by a probability measure Pr[·] that assigns to each event of the family F of events of a sample space Ω a real number in the interval [0, 1]. As Axiom 1, we require that Pr[Ω] = 1. If Pr[A] = 0, the occurrence of the event A is not possible, while Pr[A] = 1 means that the event A is certain to occur. If Pr[A] = p with 0 ≤ p ≤ 1, the event A has probability p to occur.

    If the events A and B have no sample points in common, A ∩ B = ∅, the events A and B are called mutually exclusive events. As an example, the event A and its complement A^c are mutually exclusive because A ∩ A^c = ∅. Axiom 2 of a probability measure is that, for mutually exclusive events A and B, it holds that Pr[A ∪ B] = Pr[A] + Pr[B]. The definition of a probability measure and the two axioms are sufficient to build a consistent framework on which probability theory is founded. Since Pr[∅] = 0 (which follows from

    ² A field F possesses the properties:
    (i) Ω ∈ F; (ii) if A, B ∈ F, then A ∪ B ∈ F and A ∩ B ∈ F; (iii) if A ∈ F, then A^c ∈ F.
    This definition is redundant. For, we have by (ii) and (iii) that (A^c ∪ B^c)^c ∈ F. Further, by De Morgan's law (A^c ∪ B^c)^c = A ∩ B, which can be deduced from Figure 2.1, and again by (iii), the argument shows that the reduced statement (ii'), if A, B ∈ F, then A ∪ B ∈ F, is sufficient to also imply that A ∩ B ∈ F.


    Axiom 2 because Ω = Ω ∪ ∅ and Ω ∩ ∅ = ∅), for mutually exclusive events A and B it holds that Pr[A ∩ B] = 0.

    As a classical example that explains the formal definitions, let us consider the experiment of throwing a fair die. The sample space consists of all possible outcomes: Ω = {1, 2, 3, 4, 5, 6}. A particular outcome of the experiment, say ω = 3, is a sample point. One may be interested in the event A where the outcome is even, in which case A = {2, 4, 6} and A^c = {1, 3, 5}.

    If A and B are events, the union A ∪ B of these events can be written using set theory as

    A ∪ B = (A ∩ B^c) ∪ (A ∩ B) ∪ (A^c ∩ B)

    because A ∩ B^c, A ∩ B and A^c ∩ B are mutually exclusive events. The relation is immediately understood by drawing a Venn diagram as in Fig. 2.1. Taking

    [Figure] Fig. 2.1. A Venn diagram illustrating the union A ∪ B, partitioned into the mutually exclusive parts A ∩ B^c, A ∩ B and A^c ∩ B.

    the probability measure of the union yields

    Pr[A ∪ B] = Pr[(A ∩ B^c) ∪ (A ∩ B) ∪ (A^c ∩ B)]
              = Pr[A ∩ B^c] + Pr[A ∩ B] + Pr[A^c ∩ B]      (2.1)

    where the last relation follows from Axiom 2. Figure 2.1 shows that A = (A ∩ B) ∪ (A ∩ B^c) and B = (A ∩ B) ∪ (A^c ∩ B). Since the events are mutually exclusive, Axiom 2 states that

    Pr[A] = Pr[A ∩ B] + Pr[A ∩ B^c]
    Pr[B] = Pr[A ∩ B] + Pr[A^c ∩ B]

    Substitution into (2.1) yields the important relation

    Pr[A ∪ B] = Pr[A] + Pr[B] − Pr[A ∩ B]      (2.2)

    Although derived for the measure Pr[·], relation (2.2) also holds for other measures, for example, the cardinality (the number of elements) of a set.
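    With complements and Axiom 2 in hand, Pascal's dice question from the footnote earlier in this section can be settled numerically. A sketch of ours, assuming independent fair throws, so that the probability of never seeing a double six in n throws is (35/36)^n and, by the complement rule, the probability of at least one double six is 1 − (35/36)^n:

```python
# Smallest n with Pr[at least one double six in n throws of two dice] >= 1/2.
# By the complement rule: Pr[at least one] = 1 - Pr[none] = 1 - (35/36)**n.
n = 1
while 1.0 - (35.0 / 36.0) ** n < 0.5:
    n += 1
print(n)  # prints 25, the classical answer to Pascal's question
```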


    2.1.1 The inclusion-exclusion formula

    A generalization of the relation (2.2) is the inclusion-exclusion formula,

    Pr[∪_{k=1}^n A_k] = Σ_{i_1=1}^n Pr[A_{i_1}] − Σ_{i_1=1}^n Σ_{i_2=i_1+1}^n Pr[A_{i_1} ∩ A_{i_2}]
      + Σ_{i_1=1}^n Σ_{i_2=i_1+1}^n Σ_{i_3=i_2+1}^n Pr[A_{i_1} ∩ A_{i_2} ∩ A_{i_3}]
      − ... + (−1)^{n−1} Pr[A_1 ∩ A_2 ∩ ... ∩ A_n]      (2.3)

    The formula shows that the probability of the union consists of the sum of probabilities of the individual events (first term). Since sample points can belong to more than one event A_k, the first term possesses double countings. The second term removes all probabilities of sample points that belong to precisely two event sets. However, by doing so (draw a Venn diagram), we also subtract the probabilities of sample points that belong to three event sets more than needed. The third term adds these again, and so on. The inclusion-exclusion formula can be written more compactly as

    Pr[∪_{k=1}^n A_k] = Σ_{k=1}^n (−1)^{k−1} Σ_{i_1=1}^n Σ_{i_2=i_1+1}^n ... Σ_{i_k=i_{k−1}+1}^n Pr[∩_{j=1}^k A_{i_j}]      (2.4)

    or, with

    S_k = Σ_{1 ≤ i_1 < i_2 < ... < i_k ≤ n} Pr[∩_{j=1}^k A_{i_j}]

    as

    Pr[∪_{k=1}^n A_k] = Σ_{k=1}^n (−1)^{k−1} S_k      (2.5)
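    As a quick numerical sanity check of (2.5), one can compare both sides on a small finite sample space with the uniform measure. The events below are arbitrary and the helper names are ours:

```python
from itertools import combinations


def prob(a, omega):
    """Uniform probability measure on a finite sample space omega."""
    return len(a) / len(omega)


def union_prob(events, omega):
    """Right-hand side of (2.5): sum_k (-1)^(k-1) S_k, where S_k sums
    Pr over the intersections of all k-subsets of the events."""
    n = len(events)
    total = 0.0
    for k in range(1, n + 1):
        s_k = sum(prob(set.intersection(*subset), omega)
                  for subset in combinations(events, k))
        total += (-1) ** (k - 1) * s_k
    return total


omega = set(range(12))
events = [{0, 1, 2, 3}, {2, 3, 4, 5}, {0, 5, 6}, {2, 7, 8}]
# left-hand side: the probability of the union, computed directly
assert abs(union_prob(events, omega) - prob(set.union(*events), omega)) < 1e-12
```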

    Proof of the inclusion-exclusion formula³: Let B = ∪_{k=1}^{n−1} A_k and C = A_n, such that B ∪ C = ∪_{k=1}^n A_k and B ∩ C = (∪_{k=1}^{n−1} A_k) ∩ A_n = ∪_{k=1}^{n−1} (A_k ∩ A_n) by the distributive law in set theory; then application of (2.2) yields the recursion in n

    Pr[∪_{k=1}^n A_k] = Pr[∪_{k=1}^{n−1} A_k] + Pr[A_n] − Pr[∪_{k=1}^{n−1} (A_k ∩ A_n)]      (2.6)

    By direct substitution of n → n − 1, we have

    Pr[∪_{k=1}^{n−1} A_k] = Pr[∪_{k=1}^{n−2} A_k] + Pr[A_{n−1}] − Pr[∪_{k=1}^{n−2} (A_k ∩ A_{n−1})]

    while substitution in this formula of A_k → A_k ∩ A_n gives

    Pr[∪_{k=1}^{n−1} (A_k ∩ A_n)] = Pr[∪_{k=1}^{n−2} (A_k ∩ A_n)] + Pr[A_{n−1} ∩ A_n] − Pr[∪_{k=1}^{n−2} (A_k ∩ A_{n−1} ∩ A_n)]

    Substitution of the last two relations into (2.6) yields

    Pr[∪_{k=1}^n A_k] = Pr[A_{n−1}] + Pr[A_n] − Pr[A_{n−1} ∩ A_n] + Pr[∪_{k=1}^{n−2} A_k]
      − Pr[∪_{k=1}^{n−2} (A_k ∩ A_{n−1})] − Pr[∪_{k=1}^{n−2} (A_k ∩ A_n)] + Pr[∪_{k=1}^{n−2} (A_k ∩ A_{n−1} ∩ A_n)]      (2.7)

    Similarly, in a next iteration we use (2.6), after suitable modification, in the right-hand side of (2.7) to lower the upper index in each union from n − 2 to n − 3. The result collects the probabilities of the single events A_{n−2}, A_{n−1}, A_n, minus those of all their pairwise intersections, plus that of their triple intersection, together with remaining unions up to index n − 3, which starts revealing the structure of (2.3). Rather than continuing the iterations, we prove the validity of the inclusion-exclusion formula (2.3) via induction. In the case n = 2, the basic expression (2.2) is found. Assume that (2.3) holds for n; then the case for n + 1 must obey (2.6) with n → n + 1,

    Pr[∪_{k=1}^{n+1} A_k] = Pr[∪_{k=1}^n A_k] + Pr[A_{n+1}] − Pr[∪_{k=1}^n (A_k ∩ A_{n+1})]

    Substitution of (2.3) into both unions on the right-hand side yields, after suitable grouping of the terms, precisely the sums of (2.3) with n replaced by n + 1: the first union supplies all intersections among A_1, ..., A_n; the second union, applied to the events A_k ∩ A_{n+1}, supplies (with opposite sign, hence with the order of each intersection raised by one) all intersections that contain A_{n+1}; and the single term Pr[A_{n+1}] completes the first sum. This proves (2.3).

    ³ Another proof (Grimmett and Stirzaker, 2001, p. 56) uses the indicator function defined in Section 2.2.1. Useful indicator function relations are

    1_{A^c} = 1 − 1_A
    1_{A∩B} = 1_A 1_B
    1_{A∪B} = 1 − 1_{(A∪B)^c} = 1 − 1_{A^c ∩ B^c} = 1 − 1_{A^c} 1_{B^c} = 1 − (1 − 1_A)(1 − 1_B)
            = 1_A + 1_B − 1_A 1_B = 1_A + 1_B − 1_{A∩B}

    Generalizing the last relation yields

    1 − 1_{∪_{k=1}^n A_k} = ∏_{k=1}^n (1 − 1_{A_k})

    Multiplying out and taking the expectations using (2.13) leads to (2.3).

Although impressive, the inclusion-exclusion formula is useful when dealing with dependent random variables because of its general nature. In particular, if $\Pr\big[\cap_{j=1}^{k} A_{i_j}\big] = p_k$ is not a function of the specific indices $i_j$, the inclusion-exclusion formula (2.4) becomes more attractive,

\Pr\big[\cup_{k=1}^{n} A_k\big] = \sum_{k=1}^{n} (-1)^{k-1} \sum_{1 \le i_1 < i_2 < \cdots < i_k \le n} p_k = \sum_{k=1}^{n} (-1)^{k-1} \binom{n}{k} p_k

An application of the latter formula to multicast can be found in Chapter 17, and many others are in Feller (1970, Chapter IV). Sometimes it is useful to reason with the complement of the union, $(\cup_{k=1}^{n} A_k)^c = \Omega \setminus \cup_{k=1}^{n} A_k = \cap_{k=1}^{n} A_k^c$. Applying Axiom 2 to $(\cup_{k=1}^{n} A_k) \cup (\cup_{k=1}^{n} A_k)^c = \Omega$,

\Pr\big[(\cup_{k=1}^{n} A_k)^c\big] = \Pr[\Omega] - \Pr\big[\cup_{k=1}^{n} A_k\big]

and using Axiom 1 and the inclusion-exclusion formula (2.5), we obtain

\Pr\big[(\cup_{k=1}^{n} A_k)^c\big] = 1 - \sum_{k=1}^{n} (-1)^{k-1} S_k = \sum_{k=0}^{n} (-1)^{k} S_k   (2.8)


with the convention that $S_0 = 1$. Boole's inequalities

\Pr\big[\cup_{k=1}^{n} A_k\big] \le \sum_{k=1}^{n} \Pr[A_k]   (2.9)

\Pr\big[\cap_{k=1}^{n} A_k\big] \ge 1 - \sum_{k=1}^{n} \Pr[A_k^c]

are derived as consequences of the inclusion-exclusion formula (2.3). Only if all events are mutually exclusive does the equality sign in (2.9) hold, whilst the inequality sign follows from the fact that possible overlaps in events are, in contrast to the inclusion-exclusion formula (2.3), not subtracted.

The inclusion-exclusion formula is of a more general nature and also applies to other measures on sets than $\Pr[\cdot]$, for example to the cardinality as mentioned above. For the cardinality of a set $A$, which is usually denoted by $|A|$, the inclusion-exclusion variant of (2.8) is

\big|(\cup_{k=1}^{n} A_k)^c\big| = \sum_{k=0}^{n} (-1)^{k} S_k   (2.10)

where the total number of elements in the sample space is $S_0 = |\Omega|$ and

S_k = \sum_{1 \le i_1 < i_2 < \cdots < i_k \le n} \big|\cap_{j=1}^{k} A_{i_j}\big|

A nice illustration of the above formula (2.10) applies to the sieve of Eratosthenes (Hardy and Wright, 1968, p. 4), a procedure to construct the table of prime numbers⁴ up to $x$. Consider the increasing sequence of integers

S = \{2, 3, 4, \ldots, x\}

and remove successively all multiples of 2 (even numbers starting from 4, 6, ...), all multiples of 3 (starting from $3^2$ and not yet removed previously), all multiples of 5, all multiples of the next number larger than 5 and still in the list (which is the prime 7) and so on, up to all multiples of the largest possible prime divisor that is equal to or smaller than $[\sqrt{x}]$. Here $[y]$ is the largest integer smaller than or equal to $y$. The remaining numbers in the list are prime numbers. Let us now compute the number of primes $\pi(x)$ smaller than or equal to $x$ by using the inclusion-exclusion formula (2.10).

⁴ An integer number $p$ is prime if $p > 1$ and $p$ has no other integer divisors than 1 and itself. The sequence of the first primes is 2, 3, 5, 7, 11, 13, etc. If $a$ and $b$ are divisors of $n$, then $n = ab$, from which it follows that $a$ and $b$ cannot both exceed $\sqrt{n}$. Hence, any composite number $n$ is divisible by a prime that does not exceed $\sqrt{n}$.


The number of primes smaller than a real number $x$ is $\pi(x)$ and, evidently, if $p_n$ denotes the $n$-th prime, then $\pi(p_n) = n$. Let $A_j$ denote the set of the multiples of the $j$-th prime $p_j$ that belong to $S$. The number $n$ of such sets in the sieve of Eratosthenes is equal to the number of primes smaller than or equal to $[\sqrt{x}]$, hence, $n = \pi(\sqrt{x})$. If $m \in (\cup_{j=1}^{n} A_j)^c$, this means that $m$ is not divisible by any prime number smaller than $\sqrt{x}$ and that $m$ is a prime number lying between $\sqrt{x} < m \le x$. The cardinality of the set $(\cup_{j=1}^{n} A_j)^c$, the number of primes between $\sqrt{x} < m \le x$, is

\big|(\cup_{j=1}^{n} A_j)^c\big| = \pi(x) - \pi(\sqrt{x})

On the other hand, if $m \in \cap_{q=1}^{k} A_{i_q}$ for $1 \le i_1 < i_2 < \cdots < i_k \le n$, then $m$ is a multiple of $p_{i_1} p_{i_2} \cdots p_{i_k}$ and the number of multiples of the integer $p_{i_1} p_{i_2} \cdots p_{i_k}$ in $S$ is

\big|\cap_{q=1}^{k} A_{i_q}\big| = \left[\frac{x}{p_{i_1} p_{i_2} \cdots p_{i_k}}\right]

Applying the inclusion-exclusion formula (2.10) with $S_0 = |S| = x - 1$ and $n = \pi(\sqrt{x})$ gives

\pi(x) - \pi(\sqrt{x}) = x - 1 - \sum_{k=1}^{n} (-1)^{k-1} \sum_{1 \le i_1 < \cdots < i_k \le n} \left[\frac{x}{p_{i_1} p_{i_2} \cdots p_{i_k}}\right]

The knowledge of the prime numbers smaller than or equal to $[\sqrt{x}]$, i.e. the first $n = \pi(\sqrt{x})$ primes, suffices to compute the number of primes $\pi(x)$ smaller than or equal to $x$ without explicitly knowing the primes lying between $\sqrt{x}$ and $x$.

    2.2 Discrete random variables

Discrete random variables are real functions defined on a discrete probability space $\Omega$ as $X : \Omega \to \mathbb{R}$ with the property that the event

\{\omega \in \Omega : X(\omega) = x\} \in \mathcal{F}

for each $x \in \mathbb{R}$. The event $\{\omega \in \Omega : X(\omega) = x\}$ is further abbreviated as $\{X = x\}$. A discrete probability density function (pdf) $\Pr[X = x]$ has the following properties:

(i) $0 \le \Pr[X = x] \le 1$ for real $x$ that are possible outcomes of an experiment. The set of $x$-values can be finite or countably infinite and constitutes the discrete probability space.

(ii) $\sum_x \Pr[X = x] = 1$

In the classical example of throwing a die, the discrete probability space is $\Omega = \{1, 2, 3, 4, 5, 6\}$ and, since each of the six faces of the (fair) die is equally possible as outcome, $\Pr[X = k] = 1/6$ for each $k \in \Omega$.

2.2.1 The expectation

An important operator acting on a discrete random variable $X$ is the expectation, defined as

E[X] = \sum_x x \Pr[X = x]   (2.11)

The expectation $E[X]$ is also called the mean or average or first moment of $X$. More generally, if $X$ is a discrete random variable and $g$ is a function, then $Y = g(X)$ is also a discrete random variable with expectation $E[Y]$ equal to

E[g(X)] = \sum_x g(x) \Pr[X = x]   (2.12)

A special and often used function in probability theory is the indicator function $1_C$, defined as 1 if the condition $C$ is true and zero otherwise. For example,

E[1_{X \in A}] = \sum_x 1_{x \in A} \Pr[X = x] = \sum_{x \in A} \Pr[X = x] = \Pr[X \in A]

E[1_{X = x}] = \Pr[X = x]   (2.13)

The higher moments of a random variable are defined as the case where $g(x) = x^n$,

E[X^n] = \sum_x x^n \Pr[X = x]   (2.14)

From the definition (2.11), it follows that the expectation is a linear operator,

E\Big[\sum_{k=1}^{n} a_k X_k\Big] = \sum_{k=1}^{n} a_k E[X_k]

The variance of $X$ is defined as

\mathrm{Var}[X] = E\big[(X - E[X])^2\big]   (2.15)

The variance is always non-negative. Using the linearity of the expectation operator and $\mu = E[X]$, we rewrite (2.15) as

\mathrm{Var}[X] = E[X^2] - \mu^2   (2.16)

Since $\mathrm{Var}[X] \ge 0$, relation (2.16) indicates that $E[X^2] \ge (E[X])^2$. Often the standard deviation, defined as $\sigma = \sqrt{\mathrm{Var}[X]}$, is used. An interesting variational principle of the variance follows, for the variable $a$, from

E\big[(X - a)^2\big] = E\big[(X - \mu)^2\big] + (\mu - a)^2

which is minimized at $a = \mu = E[X]$ with value $\mathrm{Var}[X]$. Hence, the best least-square approximation of the random variable $X$ is the number $E[X]$.
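The die example makes these definitions concrete. A minimal sketch in Python, using exact rational arithmetic so that (2.11), (2.14) and (2.16) can be verified without rounding:

```python
from fractions import Fraction

# pdf of a fair die: Pr[X = k] = 1/6 for k in {1,...,6}
pdf = {k: Fraction(1, 6) for k in range(1, 7)}

mean = sum(k * p for k, p in pdf.items())              # E[X], definition (2.11)
second_moment = sum(k**2 * p for k, p in pdf.items())  # E[X^2], definition (2.14)
var = second_moment - mean**2                          # Var[X] via (2.16)

print(mean)  # 7/2
print(var)   # 35/12
```

The variational principle can be checked as well: `sum((k - a)**2 * p for k, p in pdf.items())` is minimal exactly at `a = mean`.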

2.2.2 The probability generating function

The probability generating function (pgf) of a discrete random variable $X$ is defined, for complex $z$, as

\varphi_X(z) = E[z^X] = \sum_x z^x \Pr[X = x]   (2.17)

where the last equality follows from (2.12). If $X$ is integer-valued and non-negative, then the pgf is the Taylor expansion of the complex function $\varphi_X(z)$. Commonly the latter restriction applies; otherwise the substitution $z = e^{it}$ is used, such that (2.17) expresses the Fourier series of $E[e^{itX}]$. The importance of the pgf mainly lies in the fact that the theory of functions can be applied. Numerous examples of the power of analysis will be illustrated. Concentrating on non-negative integer random variables $X$,

\varphi_X(z) = \sum_{k=0}^{\infty} \Pr[X = k] z^k   (2.18)

and the Taylor coefficients obey

\Pr[X = k] = \frac{1}{k!} \frac{d^k \varphi_X(z)}{dz^k}\Big|_{z=0}   (2.19)

= \frac{1}{2\pi i} \oint_{C(0)} \frac{\varphi_X(z)}{z^{k+1}} dz   (2.20)

where $C(0)$ denotes a contour around $z = 0$. Both are inversion formulae⁵. Since the general form $E[g(X)]$ is completely defined when $\Pr[X = k]$ is known, the knowledge of the pgf results in a complete alternative description,

E[g(X)] = \sum_{k=0}^{\infty} \frac{g(k)}{k!} \frac{d^k \varphi_X(z)}{dz^k}\Big|_{z=0}   (2.21)

Sometimes it is more convenient to compute values of interest directly from (2.17) rather than from (2.21). For example, $k$-fold differentiation of $\varphi_X(z) = E[z^X]$ yields

\frac{d^k \varphi_X(z)}{dz^k} = E\big[X(X-1)\cdots(X-k+1) z^{X-k}\big] = k!\, E\Big[\binom{X}{k} z^{X-k}\Big]

such that

E\Big[\binom{X}{k}\Big] = \frac{1}{k!} \frac{d^k \varphi_X(z)}{dz^k}\Big|_{z=1}   (2.22)

Similarly, let $z = e^t$; then

\frac{d^k E[e^{tX}]}{dt^k} = E[X^k e^{tX}]

from which the moments follow as

E[X^k] = \frac{d^k \varphi_X(e^t)}{dt^k}\Big|_{t=0}   (2.23)

and, more generally,

E[(X - a)^k] = \frac{d^k E[e^{t(X-a)}]}{dt^k}\Big|_{t=0}   (2.24)

⁵ A similar inversion formula for Fourier series exists (see e.g. Titchmarsh (1948)).

2.2.3 The logarithm of the probability generating function

The logarithm of the probability generating function is defined as

L_X(z) = \log(\varphi_X(z)) = \log E[z^X]   (2.25)

from which $L_X(1) = 0$ because $\varphi_X(1) = 1$. The derivative $L_X'(z) = \frac{\varphi_X'(z)}{\varphi_X(z)}$ shows that $L_X'(1) = \varphi_X'(1)$, while from $L_X''(z) = \frac{\varphi_X''(z)}{\varphi_X(z)} - \big(\frac{\varphi_X'(z)}{\varphi_X(z)}\big)^2$, it follows that $L_X''(1) = \varphi_X''(1) - (\varphi_X'(1))^2$. These first few derivatives are interesting because they are related directly to probabilistic quantities. Indeed, from (2.23), we observe that

E[X] = \varphi_X'(1) = L_X'(1)   (2.26)

and from $E[X^2] = \varphi_X''(1) + \varphi_X'(1)$,

\mathrm{Var}[X] = \varphi_X''(1) + \varphi_X'(1) - (\varphi_X'(1))^2 = L_X''(1) + L_X'(1)   (2.27)

2.3 Continuous random variables

Although most of the concepts defined above for discrete random variables are readily transferred to continuous random variables, the calculus is in general more difficult. Indeed, instead of reasoning on the pdf, it is more convenient to work with the probability distribution function, defined for both discrete and continuous random variables as

F_X(x) = \Pr[X \le x]   (2.28)

Clearly, we have $\lim_{x \to -\infty} F_X(x) = 0$, while $\lim_{x \to +\infty} F_X(x) = 1$. Further, $F_X(x)$ is non-decreasing in $x$ and

\Pr[a < X \le b] = F_X(b) - F_X(a)   (2.29)

This relation follows from the observations $\{X \le a\} \cup \{a < X \le b\} = \{X \le b\}$ and $\{X \le a\} \cap \{a < X \le b\} = \emptyset$. For mutually exclusive events $A \cap B = \emptyset$, Axiom 2 in Section 2.1 states that $\Pr[A \cup B] = \Pr[A] + \Pr[B]$, which proves (2.29). As a corollary of (2.29), $F_X(x)$ is continuous at the right, which follows from (2.29) by denoting $b = a + \epsilon$ for any $\epsilon > 0$. Less precisely, it follows from the equality sign at the right, $X \le b$, and the inequality sign at the left, $a < X$. Hence, $F_X(x)$ is not necessarily continuous at the left, which implies that $F_X(x)$ is not necessarily continuous and that $F_X(x)$ may possess jumps. But even if $F_X(x)$ is continuous, the pdf is not necessarily continuous⁶.

The pdf of a continuous random variable is defined as

f_X(x) = \frac{dF_X(x)}{dx}   (2.30)

⁶ Weierstrass was the first to present a continuous non-differentiable function,

f(x) = \sum_{n=0}^{\infty} b^n \cos(a^n \pi x)

where $0 < b < 1$ and $a$ is an odd positive integer. Since the series is uniformly convergent for any $x$, $f(x)$ is continuous everywhere. Titchmarsh (1964, Chapter IX) demonstrates for $ab > 1 + \frac{3\pi}{2}$ that $\frac{f(x+h) - f(x)}{h}$ takes arbitrarily large values such that $f'(x)$ does not exist. Another class of continuous non-differentiable functions are the sample paths of a Brownian motion. The Cantor function, which is discussed in (Berger, 1993, p. 21) and (Billingsley, 1995, p. 407), is another classical, noteworthy function with peculiar properties.

Assuming that $F_X(x)$ is differentiable at $x$, from (2.29), we have for small, positive $h$,

\Pr[x < X \le x + h] = F_X(x + h) - F_X(x) = \frac{dF_X(x)}{dx} h + O(h^2)

Using the definition (2.30) indicates that, if $F_X(x)$ is differentiable at $x$,

f_X(x) = \lim_{h \to 0} \frac{\Pr[x < X \le x + h]}{h}   (2.31)

If $f_X(x)$ is finite, then $\lim_{h \to 0} \Pr[x < X \le x + h] = \Pr[X = x] = 0$, which means that, for well-behaved (i.e. $F_X(x)$ is differentiable for most $x$) continuous random variables $X$, the probability of the event that $X$ precisely equals $x$ is zero⁷. Hence, for well-behaved continuous random variables where $\Pr[X = x] = 0$ for all $x$, the inequality signs in the general formula (2.29) can be relaxed,

\Pr[a < X \le b] = \Pr[a \le X \le b] = \Pr[a \le X < b] = \Pr[a < X < b]

If $f_X(x)$ is not finite, then $F_X(x)$ is not differentiable at $x$, such that

\lim_{h \to 0} F_X(x + h) - F_X(x) = \Pr[X = x] \ne 0

This means that $F_X$ jumps upwards at $x$ over $\Pr[X = x]$. In that case, there is a probability mass with magnitude $\Pr[X = x]$ at the point $x$. Although the second definition (2.31) is, strictly speaking, not valid in that case, one sometimes denotes the pdf at $x = a$ by $f_X(x) = \Pr[X = a]\, \delta(x - a)$, where $\delta(x)$ is the Dirac impulse or delta function with basic property that $\int_{-\infty}^{+\infty} \delta(x) dx = 1$. Even apart from the above-mentioned difficulties for certain classes of non-differentiable but continuous functions, the fact that probabilities are always confined to the region [0,1] may suggest that $0 \le f_X(x) \le 1$. However, the second definition (2.31) shows that $f_X(x)$ can be much larger than 1. For example, if $X$ is a Gaussian random variable with mean $\mu$ and variance $\sigma^2$ (see Section 3.2.3), then $f_X(\mu) = \frac{1}{\sigma\sqrt{2\pi}}$ can be made arbitrarily large. In fact,

\lim_{\sigma \to 0} \frac{\exp\big(-\frac{(x-\mu)^2}{2\sigma^2}\big)}{\sigma\sqrt{2\pi}} = \delta(x - \mu)

⁷ In Lebesgue measure theory (Titchmarsh, 1964; Billingsley, 1995), it is said that a countable, finite or enumerable (i.e. function evaluations at individual points) set is measurable, but its measure is zero.


2.3.1 Transformation of random variables

It frequently appears useful to know how to compute $F_Y(y)$ for $Y = g(X)$. Only if the inverse function $g^{-1}$ exists is the event $\{g(X) \le y\}$ equivalent to $\{X \le g^{-1}(y)\}$ if $g' > 0$ and to $\{X \ge g^{-1}(y)\}$ if $g' < 0$. Hence,

F_Y(y) = \Pr[g(X) \le y] = \begin{cases} F_X(g^{-1}(y)) & g' > 0 \\ 1 - F_X(g^{-1}(y)) & g' < 0 \end{cases}   (2.32)

For well-behaved continuous random variables, we may rewrite (2.31) in terms of differentials,

f_X(x)\, dx = \Pr[x < X \le x + dx]

and, similarly for $Y$,

f_Y(y)\, dy = \Pr[y < Y = g(X) \le y + dy]

If $g$ is increasing, then the event $\{y < g(X) \le y + dy\}$ is equivalent to $\{g^{-1}(y) < X \le g^{-1}(y + dy)\} = \{x < X \le x + dx\}$, such that

f_Y(y)\, dy = f_X(x)\, dx

If $g$ is decreasing, we find that $f_Y(y)\, dy = -f_X(x)\, dx$. Thus, if $g^{-1}$ and $g'$ exist, the relation between the pdf of a well-behaved continuous random variable $X$ and that of the transformed random variable $Y = g(X)$ is

f_Y(y) = f_X(x) \left|\frac{dx}{dy}\right| = \frac{f_X(g^{-1}(y))}{|g'(g^{-1}(y))|}

This expression also follows by straightforward differentiation of (2.32). The chi-square distribution introduced in Section 3.3.3 is a nice example of the transformation of random variables.

2.3.2 The expectation

Analogously to the discrete case, we define the expectation of a continuous random variable $X$ as

E[X] = \int_{-\infty}^{\infty} x f_X(x)\, dx   (2.33)

In addition, for the expectation to exist⁸, we require $\int_{-\infty}^{\infty} |x| f_X(x)\, dx < \infty$. If $X$ is a continuous random variable and $g$ is a continuous function, then $Y = g(X)$ is also a continuous random variable with expectation $E[Y]$ equal to

E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x)\, dx   (2.34)

It is often useful to express the expectation $E[X]$ of a non-negative random variable $X$ in tail probabilities. Upon integration by parts,

E[X] = \int_0^{\infty} x f_X(x)\, dx = -x(1 - F_X(x))\Big|_0^{\infty} + \int_0^{\infty} (1 - F_X(x))\, dx = \int_0^{\infty} (1 - F_X(x))\, dx   (2.35)

The case for a non-positive random variable is derived analogously,

E[X] = \int_{-\infty}^{0} x f_X(x)\, dx = x F_X(x)\Big|_{-\infty}^{0} - \int_{-\infty}^{0} F_X(x)\, dx = -\int_{-\infty}^{0} F_X(x)\, dx

The general case follows by addition:

E[X] = \int_0^{\infty} (1 - F_X(x))\, dx - \int_{-\infty}^{0} F_X(x)\, dx

A similar expression exists for discrete random variables. In general, for any discrete random variable $X$, we can write

E[X] = \sum_{k=-\infty}^{\infty} k \Pr[X = k] = \sum_{k=-\infty}^{-1} k \Pr[X = k] + \sum_{k=0}^{\infty} k \Pr[X = k]

= \sum_{k=-\infty}^{-1} k (\Pr[X \le k] - \Pr[X \le k-1]) + \sum_{k=0}^{\infty} k (\Pr[X \ge k] - \Pr[X \ge k+1])

= \sum_{k=-\infty}^{-1} k \Pr[X \le k] - \sum_{k=-\infty}^{-2} (k+1) \Pr[X \le k] + \sum_{k=1}^{\infty} k \Pr[X \ge k] - \sum_{k=1}^{\infty} (k-1) \Pr[X \ge k]

= -\Pr[X \le -1] - \sum_{k=-\infty}^{-2} \Pr[X \le k] + \sum_{k=1}^{\infty} \Pr[X \ge k]

or the mean of a discrete random variable $X$ expressed in tail probabilities⁹ is

E[X] = \sum_{k=1}^{\infty} \Pr[X \ge k] - \sum_{k=-\infty}^{-1} \Pr[X \le k]   (2.36)

⁸ This requirement is borrowed from measure theory and Lebesgue integration (Titchmarsh, 1964, Chapter X; Royden, 1988, Chapter 4), where a measurable function $f$ is said to be integrable (in the Lebesgue sense) over $E$ if $f^+ = \max(f(x), 0)$ and $f^- = \max(-f(x), 0)$ are both integrable over $E$. Although this restriction seems only of theoretical interest, in some applications (see the Cauchy distribution defined in (3.38)) the Riemann integral may exist where the Lebesgue integral does not. For example, $\int_0^{\infty} \frac{\sin x}{x} dx$ equals, in the Riemann sense, $\frac{\pi}{2}$ (which is a standard exercise in contour integration), but this integral does not exist in the Lebesgue sense. Only for improper integrals (integration interval is infinite) may Riemann integration exist where Lebesgue integration does not. However, in most other cases (integration over a finite interval), Lebesgue integration is more general. For instance, if $f(x) = 1_{\{x \text{ is rational}\}}$, then $\int_0^1 f(x) dx$ does not exist in the Riemann sense (since upper and lower sums do not converge to each other). However, $\int_0^1 f(x) dx = 0$ in the Lebesgue sense (since $f$ differs from 0 only on a set of measure zero, namely all rational numbers in $[0,1]$). In probability theory and measure theory, Lebesgue integration is assumed.

⁹ We remark that

E[X] = \sum_{k=-\infty}^{\infty} k (\Pr[X \ge k] - \Pr[X \ge k+1]) \ne \sum_{k=-\infty}^{\infty} k \Pr[X \ge k] - \sum_{k=-\infty}^{\infty} k \Pr[X \ge k+1]

because the series in the second line are diverging. In fact, since $\lim_{k \to -\infty} \Pr[X \ge k] = 1$, there exists a finite integer $m$ such that, for any real, arbitrarily small $\epsilon > 0$, it holds that $\Pr[X \ge k] \ge 1 - \epsilon$ for all $k \le -m$. Hence,

\sum_{k=-\infty}^{\infty} k \Pr[X \ge k] \le (1 - \epsilon) \sum_{k=-\infty}^{-m} k + C

where $C = \sum_{k=-m+1}^{\infty} k \Pr[X \ge k]$ is finite, while the first sum diverges to $-\infty$; the same argument applies to $\sum_k k \Pr[X \ge k+1]$.
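The tail-probability expression (2.36) is easy to exercise on a non-negative example of our choosing, a geometric random variable on $\{1, 2, \ldots\}$ with $\Pr[X = k] = p(1-p)^{k-1}$, where $\Pr[X \ge k] = (1-p)^{k-1}$ and both sides of (2.36) should equal $1/p$:

```python
# Geometric X on {1, 2, ...}: Pr[X = k] = p (1-p)^(k-1), Pr[X >= k] = (1-p)^(k-1).
# (2.36) reduces to E[X] = sum_{k>=1} Pr[X >= k] since X takes no negative values.
p = 0.25
terms = 500   # truncation; the neglected tail is of order (1-p)**terms, negligible

mean_by_definition = sum(k * p * (1 - p) ** (k - 1) for k in range(1, terms + 1))
mean_by_tail_sum = sum((1 - p) ** (k - 1) for k in range(1, terms + 1))

print(mean_by_definition)  # ~4.0, i.e. 1/p
print(mean_by_tail_sum)    # ~4.0 as well
```

The tail sum converges just as fast as the defining sum here, but it needs only the distribution function, not the pmf, which is often the quantity that is actually available.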

2.3.3 The probability generating function

The probability generating function (pgf) of a continuous random variable $X$ is defined, for complex $z$, as the Laplace transform

\varphi_X(z) = E[e^{-zX}] = \int_{-\infty}^{\infty} e^{-zx} f_X(x)\, dx   (2.37)

Again, in some cases, it may be more convenient to use $z = it$, in which case the double-sided Laplace transform reduces to a Fourier transform. The strength of these transforms is based on their numerous properties, especially the inverse transform,

f_X(x) = \frac{1}{2\pi i} \int_{c - i\infty}^{c + i\infty} \varphi_X(z) e^{zx}\, dz   (2.38)

where $c$ is the smallest real value of $\mathrm{Re}(z)$ for which the integral in (2.37) converges. Similarly as for discrete random variables, we have $\varphi_X(z) = E[e^{-zX}]$ and

E[X^k] = (-1)^k \frac{d^k \varphi_X(z)}{dz^k}\Big|_{z=0}   (2.39)

The main difference with the discrete case lies in the definition $E[e^{-zX}]$ (continuous) versus $E[z^X]$ (discrete). Since the exponential is an entire function¹⁰ with power series around $z = 0$, $e^{-zX} = \sum_{k=0}^{\infty} \frac{(-1)^k (zX)^k}{k!}$, the expectation and summation can be reversed, leading to

E[e^{-zX}] = \sum_{k=0}^{\infty} \frac{(-1)^k E[X^k]}{k!} z^k   (2.40)

provided¹¹ $E[X^k] = O(k!)$, which is a necessary condition for the summation to converge for $z \ne 0$. Assuming convergence¹², the Taylor series of $E[e^{-zX}]$ around $z = 0$ is expressed as a function of the moments of $X$, whereas in the discrete case the Taylor series of $E[z^X]$ around $z = 0$, given by (2.18), is expressed in terms of probabilities of $X$. This observation has led to calling $E[e^{-zX}]$ sometimes the moment generating function, while $E[z^X]$ is the probability generating function of the random variable $X$. On the other hand, series expansion of $E[z^X]$ around $z = 1$,

\varphi_X(z) = \sum_{k=0}^{\infty} \Pr[X = k] (z - 1 + 1)^k = \sum_{k=0}^{\infty} \Pr[X = k] \sum_{j=0}^{k} \binom{k}{j} (z - 1)^j

= \sum_{j=0}^{\infty} (z - 1)^j \sum_{k=j}^{\infty} \binom{k}{j} \Pr[X = k]

shows with (2.22) that

E\Big[\binom{X}{j}\Big] = \sum_{k=j}^{\infty} \binom{k}{j} \Pr[X = k]

If moments are desired, the substitution $z = e^t$ in $E[z^X]$ is appropriate.

¹⁰ An entire (or integral) function is a complex function without singularities in the finite complex plane. Hence, a power series around any finite point has infinite radius of convergence. In other words, it exists for all finite complex values.

¹¹ The Landau big-$O$ notation specifies the order of a function when the argument tends to some limit. Most often the limit is to infinity, but the $O$-notation can also be used to characterize the behavior of a function around some finite point. Formally, $f(x) = O(g(x))$ for $x \to \infty$ means that there exist positive numbers $M$ and $x_0$ for which $|f(x)| \le M |g(x)|$ for $x \ge x_0$.

¹² The lognormal distribution defined by (3.43) is an example where the summation (2.40) diverges for any $z \ne 0$.

2.3.4 The logarithm of the probability generating function

The logarithm of the probability generating function is defined as

L_X(z) = \log(\varphi_X(z)) = \log E[e^{-zX}]   (2.41)

from which $L_X(0) = 0$ because $\varphi_X(0) = 1$. Further, analogous to the discrete case, we see that $L_X'(0) = \varphi_X'(0)$, $L_X''(0) = \varphi_X''(0) - (\varphi_X'(0))^2$ and

E[X] = -\varphi_X'(0) = -L_X'(0)

However, the difference with the discrete case lies in the higher moments,

E[X^k] = (-1)^k \frac{d^k \varphi_X(z)}{dz^k}\Big|_{z=0}   (2.42)

because, with $E[X^2] = \varphi_X''(0)$,

\mathrm{Var}[X] = \varphi_X''(0) - (\varphi_X'(0))^2 = L_X''(0)   (2.43)

The latter expression makes $L_X(z)$ for a continuous random variable particularly useful. Since the variance is always positive, it demonstrates that $L_X(z)$ is convex (see Section 5.5) around $z = 0$. Finally, we mention that

E\big[(X - E[X])^3\big] = -L_X'''(0)

2.4 The conditional probability

The conditional probability of the event $A$ given the event $B$ (or on the hypothesis $B$) is defined as

\Pr[A|B] = \frac{\Pr[A \cap B]}{\Pr[B]}   (2.44)

The definition implicitly assumes that the event $B$ has positive probability; otherwise the conditional probability remains undefined. We quote Feller (1970, p. 116):

Taking conditional probabilities of various events with respect to a particular hypothesis $H$ amounts to choosing $H$ as a new sample space with probabilities proportional to the original ones; the proportionality factor $\Pr[H]$ is necessary in order to reduce the total probability of the new sample space to unity. This formulation shows that all general theorems on probabilities are valid for conditional probabilities with respect to any particular hypothesis $H$. For example, the law $\Pr[A \cup B] = \Pr[A] + \Pr[B] - \Pr[A \cap B]$ takes the form

\Pr[A \cup B | H] = \Pr[A|H] + \Pr[B|H] - \Pr[A \cap B|H]

The formula (2.44) is often rewritten in the form

\Pr[A \cap B] = \Pr[A|B] \Pr[B]   (2.45)

which easily generalizes to more events. For example, denote $A = A_1$ and $B = A_2 \cap A_3$; then

\Pr[A_1 \cap A_2 \cap A_3] = \Pr[A_1 | A_2 \cap A_3] \Pr[A_2 \cap A_3] = \Pr[A_1 | A_2 \cap A_3] \Pr[A_2 | A_3] \Pr[A_3]

Another application of the conditional probability occurs when a partitioning of the sample space $\Omega$ is known: $\Omega = \cup_k B_k$ and all $B_k$ are mutually exclusive, which means that $B_k \cap B_m = \emptyset$ for any $k$ and $m \ne k$. Then, with (2.45),

\sum_k \Pr[A \cap B_k] = \sum_k \Pr[A|B_k] \Pr[B_k]

The event $A \cap B_k$ is a decomposition (or projection) of the event $A$ in the basis event $B_k$, analogous to the decomposition of a vector in terms of a set of orthogonal basis vectors that span the total state space. Indeed, using the associative property, $\{A \cap B_k\} \cap \{A \cap B_m\} = A \cap B_k \cap B_m$ and, since $B_k \cap B_m = \emptyset$, the intersection $\{A \cap B_k\} \cap \{A \cap B_m\} = A \cap \emptyset = \emptyset$, which implies mutual exclusivity (or orthogonality). Using the distributive property, $\cup_k \{A \cap B_k\} = A \cap \{\cup_k B_k\}$, we observe that

\cup_k \{A \cap B_k\} = A \cap \{\cup_k B_k\} = A \cap \Omega = A

Finally, since all events $A \cap B_k$ are mutually exclusive, $\Pr[A] = \Pr[\cup_k \{A \cap B_k\}] = \sum_k \Pr[A \cap B_k]$. Thus, if $\Omega = \cup_k B_k$ and, in addition, for any pair $k \ne m$ it holds that $B_k \cap B_m = \emptyset$, we have proved the law of total probability or decomposability,

\Pr[A] = \sum_k \Pr[A|B_k] \Pr[B_k]   (2.46)

Conditioning on events is a powerful tool that will be used frequently. If the conditional probability $\Pr[A|B_x]$ is known as a function $h(x)$ of $x$, the law of total probability can also be written in terms of the expectation operator defined in (2.12) as

\Pr[A] = E[h(X)]   (2.47)

Also the important memoryless property of the exponential distribution (see Section 3.2.2) is an example of the application of the conditional probability. Another classical example is Bayes' rule. Consider again the events $B_k$ defined above. Using the definition (2.44) followed by (2.45),

\Pr[B_k|A] = \frac{\Pr[B_k \cap A]}{\Pr[A]} = \frac{\Pr[A \cap B_k]}{\Pr[A]} = \frac{\Pr[A|B_k] \Pr[B_k]}{\Pr[A]}   (2.48)

Using (2.46), we arrive at Bayes' rule

\Pr[B_k|A] = \frac{\Pr[A|B_k] \Pr[B_k]}{\sum_m \Pr[A|B_m] \Pr[B_m]}   (2.49)

where $\Pr[B_k]$ are called the a-priori probabilities, while $\Pr[B_k|A]$ are the a-posteriori probabilities.

The conditional distribution function of the random variable $X$ given $Y$ is defined by

F_{X|Y}(x|y) = \Pr[X \le x | Y = y]   (2.50)

for any $y$ provided $\Pr[Y = y] > 0$. This condition follows from the definition (2.44) of the conditional probability. The conditional probability density function of $X$ given $Y$ is defined by

f_{X|Y}(x|y) = \Pr[X = x | Y = y] = \frac{\Pr[X = x, Y = y]}{\Pr[Y = y]} = \frac{f_{X,Y}(x, y)}{f_Y(y)}   (2.51)

for any $y$ such that $\Pr[Y = y] > 0$ (and similarly, for continuous random variables, $f_Y(y) > 0$) and where $f_{X,Y}(x, y)$ is the joint probability density function defined below in (2.59).
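The law of total probability (2.46) and Bayes' rule (2.49) can be illustrated with a small scenario of our own (the two-urn setting below is a standard textbook toy, not an example from this chapter):

```python
from fractions import Fraction

# Hypothetical two-urn setting: urn B1 holds 3 red and 1 blue ball, urn B2
# holds 1 red and 3 blue. An urn is picked uniformly (the a-priori
# probabilities), then a ball is drawn; the event A is "the ball is red".
prior = {"B1": Fraction(1, 2), "B2": Fraction(1, 2)}          # Pr[B_k]
likelihood = {"B1": Fraction(3, 4), "B2": Fraction(1, 4)}     # Pr[A | B_k]

# law of total probability (2.46)
pr_A = sum(likelihood[k] * prior[k] for k in prior)

# Bayes' rule (2.49): the a-posteriori probabilities Pr[B_k | A]
posterior = {k: likelihood[k] * prior[k] / pr_A for k in prior}

print(pr_A)               # 1/2
print(posterior["B1"])    # 3/4
```

Observing the red ball shifts the probability of urn B1 from the a-priori 1/2 to the a-posteriori 3/4, and the posterior probabilities sum to one by construction.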

2.5 Several random variables and independence

2.5.1 Discrete random variables

Two events $A$ and $B$ are independent if

\Pr[A \cap B] = \Pr[A] \Pr[B]   (2.52)

Similarly, we define two discrete random variables $X$ and $Y$ to be independent if

\Pr[X = x, Y = y] = \Pr[X = x] \Pr[Y = y]   (2.53)

If $Z = g(X, Y)$, then $Z$ is a discrete random variable with

\Pr[Z = z] = \sum_{(x,y):\, g(x,y) = z} \Pr[X = x, Y = y]

Applying the expectation operator (2.11) to both sides yields

E[g(X, Y)] = \sum_x \sum_y g(x, y) \Pr[X = x, Y = y]   (2.54)

If $X$ and $Y$ are independent and $g$ is separable, $g(x, y) = g_1(x) g_2(y)$, then the expectation (2.54) reduces to

E[g(X, Y)] = \sum_x g_1(x) \Pr[X = x] \sum_y g_2(y) \Pr[Y = y] = E[g_1(X)] E[g_2(Y)]   (2.55)

The simplest example of the general function $g$ is $Z = X + Y$. In that case, the sum is over all $x$ and $y$ that satisfy $x + y = z$. Thus,

\Pr[X + Y = z] = \sum_x \Pr[X = x, Y = z - x] = \sum_y \Pr[X = z - y, Y = y]

If $X$ and $Y$ are independent, we obtain the convolution,

\Pr[X + Y = z] = \sum_x \Pr[X = x] \Pr[Y = z - x] = \sum_y \Pr[X = z - y] \Pr[Y = y]

2.5.2 The covariance

The covariance of $X$ and $Y$ is defined as

\mathrm{Cov}[X, Y] = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - \mu_X \mu_Y   (2.56)

If $\mathrm{Cov}[X, Y] = 0$, then the variables $X$ and $Y$ are uncorrelated. If $X$ and $Y$ are independent, then $\mathrm{Cov}[X, Y] = 0$. Hence, independence implies uncorrelatedness, but the converse is not necessarily true. The classical example¹³ is $Y = X^2$, where $X$ has a normal distribution $N(0, 1)$ (Section 3.2.3), because $E[X] = 0$ and $E[XY] = E[X^3] = 0$, as follows from (3.23). Although $X$ and $Y$ are perfectly dependent, they are uncorrelated. Thus, independence is a stronger property than uncorrelatedness. The covariance $\mathrm{Cov}[X, Y]$ measures the degree of dependence between two (or generally more) random variables. If $X$ and $Y$ are positively (negatively) correlated, large values of $X$ tend to be associated with large (small) values of $Y$.

As an application of the covariance, consider the problem of computing the variance of a sum of random variables $X_1, X_2, \ldots, X_n$. Let $\mu_k = E[X_k]$ and $S = \sum_{k=1}^{n} X_k$; then $E[S] = \sum_{k=1}^{n} \mu_k$ and

\mathrm{Var}[S] = E\big[(S - E[S])^2\big] = E\Big[\Big(\sum_{k=1}^{n} (X_k - \mu_k)\Big)^2\Big]

= \sum_{k=1}^{n} \sum_{j=1}^{n} E[(X_k - \mu_k)(X_j - \mu_j)]

= \sum_{k=1}^{n} E[(X_k - \mu_k)^2] + 2 \sum_{k=1}^{n} \sum_{j=k+1}^{n} E[(X_k - \mu_k)(X_j - \mu_j)]

Using the linearity of the expectation operator and the definition of the covariance (2.56) yields

\mathrm{Var}[S] = \sum_{k=1}^{n} \mathrm{Var}[X_k] + 2 \sum_{k=1}^{n} \sum_{j=k+1}^{n} \mathrm{Cov}[X_k, X_j]   (2.57)

Observe that for a set of independent random variables $\{X_k\}$ the double sum with covariances vanishes.

The Cauchy-Schwarz inequality (5.17) derived in Chapter 5 indicates that

(E[(X - \mu_X)(Y - \mu_Y)])^2 \le E[(X - \mu_X)^2]\, E[(Y - \mu_Y)^2]

such that the covariance is always bounded by

|\mathrm{Cov}[X, Y]| \le \sigma_X \sigma_Y

¹³ Another example: let $U$ be uniform on $[0, 1]$ and $X = \cos(2\pi U)$ and $Y = \sin(2\pi U)$. Using (2.34),

E[XY] = \int_0^1 \cos(2\pi u) \sin(2\pi u)\, du = 0

as well as $E[X] = E[Y] = 0$. Thus, $\mathrm{Cov}[X, Y] = 0$, but $X$ and $Y$ are perfectly dependent because $X = \cos(\arcsin Y) = \sqrt{1 - Y^2}$.
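The variance decomposition (2.57) can be verified exactly on a tiny dependent example of our own making, a joint pmf on three points:

```python
from fractions import Fraction

# A small joint pmf for (X1, X2) with built-in dependence: Pr[X1 = i, X2 = j]
joint = {(0, 0): Fraction(1, 2), (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 4)}

def E(f):
    """Expectation of f(X1, X2) under the joint pmf, cf. (2.54)."""
    return sum(f(i, j) * p for (i, j), p in joint.items())

m1, m2 = E(lambda i, j: i), E(lambda i, j: j)
var1 = E(lambda i, j: (i - m1) ** 2)
var2 = E(lambda i, j: (j - m2) ** 2)
cov = E(lambda i, j: (i - m1) * (j - m2))

# (2.57) for n = 2: Var[X1 + X2] = Var[X1] + Var[X2] + 2 Cov[X1, X2]
var_sum = E(lambda i, j: (i + j - m1 - m2) ** 2)
assert var_sum == var1 + var2 + 2 * cov
print(cov)   # 1/8
```

The positive covariance reflects that $X_2 = 1$ only occurs together with $X_1 = 1$, and the double-sum term in (2.57) is what separates `var_sum` from `var1 + var2`.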

2.5.3 The linear correlation coefficient

Since the covariance is not dimensionless, the linear correlation coefficient defined as

\rho(X, Y) = \frac{\mathrm{Cov}[X, Y]}{\sigma_X \sigma_Y}   (2.58)

is often convenient to relate two (or more) different physical quantities expressed in different units. The linear correlation coefficient remains invariant (possibly apart from the sign) under a linear transformation because

\rho(aX + b, cY + d) = \mathrm{sign}(ac)\, \rho(X, Y)

This transform shows that the linear correlation coefficient $\rho(X, Y)$ is independent of the value of the mean $\mu$ and the variance $\sigma^2$ provided $\sigma^2 > 0$. Therefore, many computations simplify if we normalize the random variables properly. Let us introduce the concept of a normalized random variable $X^* = \frac{X - \mu_X}{\sigma_X}$. The normalized random variable has a zero mean and a variance equal to one. By the invariance under a linear transform, the correlation coefficient $\rho(X^*, Y^*) = \rho(X, Y)$ and also $\rho(X, Y) = \mathrm{Cov}[X^*, Y^*]$. The variance of $X^* \mp Y^*$ follows from (2.57) as

\mathrm{Var}[X^* \mp Y^*] = \mathrm{Var}[X^*] + \mathrm{Var}[Y^*] \mp 2\,\mathrm{Cov}[X^*, Y^*] = 2(1 \mp \rho(X, Y))

Since the variance is always positive, it follows that $-1 \le \rho(X, Y) \le 1$. The extremes $\rho(X, Y) = \pm 1$ imply a linear relation between $X$ and $Y$. Indeed, $\rho(X, Y) = 1$ implies that $\mathrm{Var}[X^* - Y^*] = 0$, which is only possible if $X^* - Y^* = c$, where $c$ is a constant. Hence, $Y = aX + b$ with $a > 0$. A similar argument applies for the case $\rho(X, Y) = -1$. For example, in curve fitting, the goodness of the fit is often expressed in terms of the correlation coefficient. A perfect fit has correlation coefficient equal to $\pm 1$. In particular, in linear regression where $y = ax + b$, the regression coefficients $a$ and $b$ are the minimizers of the square distance $E\big[(Y - (aX + b))^2\big]$ and given by

a = \frac{\mathrm{Cov}[X, Y]}{\sigma_X^2}, \qquad b = E[Y] - a E[X]

Since a correlation coefficient $\rho(X, Y) = 1$ implies $\mathrm{Cov}[X, Y] = \sigma_X \sigma_Y$, we see that $a = \frac{\sigma_Y}{\sigma_X}$, as derived above with normalized random variables.

Although the linear correlation coefficient is a natural measure of the dependence between random variables, it has some disadvantages. First, the variances of $X$ and $Y$ must exist, which may cause problems with heavy-tailed distributions. Second, as illustrated above, dependence can lead to uncorrelatedness, which is awkward. Third, linear correlation is not invariant under non-linear strictly increasing transformations $T$, such that $\rho(T(X), T(Y)) \ne \rho(X, Y)$. Common intuition expects that dependence measures should be invariant under these transforms $T$. This leads to the definition of rank correlation, which satisfies that invariance property. Here, we merely mention Spearman's rank correlation coefficient, which is defined as

\rho_S(X, Y) = \rho(F_X(X), F_Y(Y))

where $\rho$ is the linear correlation coefficient and where the non-linear strictly increasing transform is the probability distribution. More details are found in Embrechts et al. (2001b) and in Chapter 4.

2.5.4 Continuous random variables

We define the joint distribution function by $F_{X,Y}(x, y) = \Pr[X \le x, Y \le y]$ and the joint probability density function by

f_{X,Y}(x, y) = \frac{\partial^2 F_{X,Y}(x, y)}{\partial x\, \partial y}   (2.59)

Hence,

F_{X,Y}(x, y) = \Pr[X \le x, Y \le y] = \int_{-\infty}^{x} \int_{-\infty}^{y} f_{X,Y}(u, v)\, dv\, du   (2.60)

The analogon of (2.54) is

E[g(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y) f_{X,Y}(x, y)\, dx\, dy   (2.61)

Most of the difficulties occur in the evaluation of the multiple integrals. The change of variables in multiple dimensions involves the Jacobian. Consider the transformed random variables $U = g_1(X, Y)$ and $V = g_2(X, Y)$ and denote the inverse transform by $X = h_1(U, V)$ and $Y = h_2(U, V)$; then

f_{U,V}(u, v) = f_{X,Y}(h_1(u, v), h_2(u, v))\, |J(u, v)|

where the Jacobian $J(u, v)$ is

J(u, v) = \det \begin{bmatrix} \frac{\partial h_1}{\partial u} & \frac{\partial h_1}{\partial v} \\ \frac{\partial h_2}{\partial u} & \frac{\partial h_2}{\partial v} \end{bmatrix}

If $X$ and $Y$ are independent and $Z = X + Y$, we obtain the convolution,

f_Z(z) = \int_{-\infty}^{\infty} f_X(z - y) f_Y(y)\, dy = \int_{-\infty}^{\infty} f_X(x) f_Y(z - x)\, dx   (2.62)

which is often denoted by $f_Z(z) = (f_X * f_Y)(z)$. If both $f_X(x) = 0$ and $f_Y(x) = 0$ for $x < 0$, then the definition (2.62) of the convolution reduces to

(f_X * f_Y)(z) = \int_0^{z} f_X(z - y) f_Y(y)\, dy

    2.5.5 The sum of independent random variables

    Let =P

    =1 , where the random variables are all independent.We first concentrate on the case where = is a (fixed) integer. Since

    = 1 + , direct application of (2.62) yields the recursion

    ( ) =

    Z1( ) ( ) (2.63)

  • 2.5 Several random variables and independence 33

    which, when written out explicitly, leads to the -fold integral

    ( ) =

    Z( )

    Z1( 1) 0( 1) 1 (2.64)

    In many cases, convolutions are more e ciently computed via generatingfunctions. The generating function of equals

    ( ) =

    =h

    =1

    i=

    "Y=1

    #

    Since all are independent, (2.55) can be applied,

    ( ) =Y=1

    or, in terms of generating functions,

    ( ) =Y=1

    ( ) (2.65)

    Hence, we arrive at the important result that the generating function ofa sum of independent random variables equals the product of the gener-ating functions of the individual random variables. We also note that thecondition of independence is crucial in that it allows the product and expec-tation operator to be reversed, leading to the useful result (2.65). Often, therandom variables all possess the same distribution. In this case of in-dependent identically distributed (i.i.d.) random variables with generatingfunction ( ), the relation (2.65) further simplifies to

    ( ) = ( ( )) (2.66)

    In the case where the number of terms in the sum is a randomvariable with generating function ( ), independent of the , we use thegeneral definition of expectation (2.54) for two random variables,

    ( ) =

    =X=0

    XPr [ = = ]

    =X=0

    XPr [ = | = ]Pr [ = ]

    where the conditional probability (2.45) is used. Since the value of

  • 34 Random variables

    depends on the number of terms in the sum, we have Pr [ = | = ] =Pr [ = ]. Further, withX

    Pr [ = | = ] = ( )

    we have

    ( ) =X=0

    ( ) Pr [ = ] (2.67)

The average E[S_N] follows from (2.26) as

    E[S_N] = \sum_{n=0}^{\infty} φ'_{S_n}(1) Pr[N = n] = \sum_{n=0}^{\infty} E[S_n] Pr[N = n]    (2.68)

Since E[S_n] = E[\sum_{k=1}^{n} X_k] = \sum_{k=1}^{n} E[X_k], and assuming that all random variables X_k have equal mean E[X_k] = E[X], we have

    E[S_N] = \sum_{n=0}^{\infty} n E[X] Pr[N = n]

or

    E[S_N] = E[X] E[N]    (2.69)

This relation (2.69) is commonly called Wald's identity. Wald's identity holds for any random sum of (possibly dependent) random variables X_k, provided the number N of those random variables is independent of the X_k. In the case of i.i.d. random variables X_k, we apply (2.66) in (2.67) so that

    φ_{S_N}(z) = \sum_{n=0}^{\infty} (φ_X(z))^n Pr[N = n] = φ_N(φ_X(z))    (2.70)

This expression is a generalization of (2.66).
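These relations lend themselves to a direct numerical check. The sketch below (in Python; the pmf of N and the Bernoulli parameter p are arbitrary choices for illustration, not from the text) constructs the exact pmf of the random sum S_N by conditioning on N, and verifies Wald's identity (2.69) together with the composition rule (2.70).

```python
def pgf(pmf, z):
    """Evaluate the probability generating function E[z^K] of a pmf dict."""
    return sum(p * z**k for k, p in pmf.items())

pmf_N = {0: 0.2, 1: 0.3, 2: 0.5}      # number of terms N (illustrative)
p = 0.4                               # X_k ~ Bernoulli(p) (illustrative)
pmf_X = {0: 1 - p, 1: p}

# Exact pmf of S_N = X_1 + ... + X_N by conditioning on N = n.
pmf_S = {}
for n, pn in pmf_N.items():
    conv = {0: 1.0}                   # n-fold convolution of pmf_X
    for _ in range(n):
        nxt = {}
        for k, pk in conv.items():
            for x, px in pmf_X.items():
                nxt[k + x] = nxt.get(k + x, 0.0) + pk * px
        conv = nxt
    for k, pk in conv.items():
        pmf_S[k] = pmf_S.get(k, 0.0) + pn * pk

E_N = sum(n * pn for n, pn in pmf_N.items())
E_X = p
E_S = sum(k * pk for k, pk in pmf_S.items())
assert abs(E_S - E_N * E_X) < 1e-12   # Wald's identity (2.69)

z = 0.7                               # (2.70): pgf of S_N is the composition
assert abs(pgf(pmf_S, z) - pgf(pmf_N, pgf(pmf_X, z))) < 1e-12
```

The same check works for any pair of discrete distributions with finite support.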

    2.6 Conditional expectation

The generating function (2.67) of a random sum of independent random variables can be derived using the conditional expectation E[Y | X = x] of two random variables X and Y. We will first define the conditional expectation and derive an interesting property. Suppose that we know that X = x; the conditional density function f_{Y|X}(y|x) defined by (2.51) of the random variable Y given X = x can be regarded as a function of y only. Using the definition of the expectation (2.33) for continuous random variables (the discrete case is analogous), we have

    E[Y | X = x] = \int_{-\infty}^{\infty} y f_{Y|X}(y|x) dy    (2.71)

Since this expression holds for any value x that the random variable X can take, we see that E[Y | X = x] = g(x) is a function of x and, in addition, replacing x by X, E[Y | X = x] = g(X) can be regarded as a random variable that is a function of the random variable X. Having identified the conditional expectation E[Y | X = x] as a random variable, let us compute its expectation, or the expectation of the slightly more general random variable h(X) g(X) with g(x) = E[Y | X = x]. From the general definition (2.34) of the expectation, it follows that

    E[h(X) g(X)] = \int_{-\infty}^{\infty} h(x) g(x) f_X(x) dx = \int_{-\infty}^{\infty} h(x) E[Y | X = x] f_X(x) dx

Substituting (2.71) yields

    E[h(X) g(X)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(x) y f_{Y|X}(y|x) f_X(x) dy dx
                 = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(x) y f_{X,Y}(x,y) dx dy = E[h(X) Y]

where we have used (2.51) and (2.61). Thus, we find the interesting relation

    E[h(X) E[Y | X = x]] = E[h(X) Y]    (2.72)

As a special case where h(X) = 1, the expectation of the conditional expectation follows as

    E[Y] = E_X[E[Y | X = x]]

where the index X in E_X clarifies that the expectation is over the random variable X. Applying this relation to Y = z^{S_N}, where S_N = \sum_{k=1}^{N} X_k and all X_k are independent, yields

    φ_{S_N}(z) = E[z^{S_N}] = E_N[E[z^{S_N} | N = n]]

Since E[z^{S_N} | N = n] = φ_{S_n}(z), as specified in (2.65), we end up with

    φ_{S_N}(z) = E_N[φ_{S_n}(z)] = \sum_{n=0}^{\infty} φ_{S_n}(z) Pr[N = n]

which is (2.67).
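Relation (2.72) can be illustrated with a small discrete example; the joint pmf and the function h below are arbitrary choices made purely for the check, not taken from the text.

```python
# Discrete check of (2.72): E[h(X) E[Y|X=x]] = E[h(X) Y].
joint = {(0, 0): 0.1, (0, 1): 0.2, (0, 2): 0.1,
         (1, 0): 0.25, (1, 1): 0.05, (1, 2): 0.3}   # Pr[X=x, Y=y]

def h(x):                      # any bounded function of X
    return 3 * x + 1

# marginal of X and conditional expectation g(x) = E[Y | X = x]
fx = {}
for (x, y), p in joint.items():
    fx[x] = fx.get(x, 0.0) + p
g = {x: sum(y * p for (xx, y), p in joint.items() if xx == x) / fx[x]
     for x in fx}

lhs = sum(h(x) * g[x] * fx[x] for x in fx)              # E[h(X) E[Y|X]]
rhs = sum(h(x) * y * p for (x, y), p in joint.items())  # E[h(X) Y]
assert abs(lhs - rhs) < 1e-12
```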

3 Basic distributions

This chapter concentrates on the most basic probability distributions and their properties. From these basic distributions, other useful distributions are derived.

    3.1 Discrete random variables

    3.1.1 The Bernoulli distribution

A Bernoulli random variable X can only take two values: either 1 with probability p or 0 with probability q = 1 - p. The standard example of a Bernoulli random variable is the outcome of tossing a biased coin and, more generally, the outcome of a trial with only two possibilities, either success or failure. The sample space is Ω = {0, 1} and Pr[X = 1] = p, while Pr[X = 0] = q. From this definition, the pgf follows from (2.17) as

    φ_X(z) = E[z^X] = z^0 Pr[X = 0] + z^1 Pr[X = 1]

or

    φ_X(z) = q + pz    (3.1)

From (2.23) or (2.14), the k-th moment is

    E[X^k] = p

which shows that μ = E[X] = p. From (2.24), we find E[(X - p)^k] = (1-p)(-p)^k + p(1-p)^k, such that the moments centered around the mean are

    E[(X - p)^k] = q(-p)^k + p q^k = pq[(-1)^k p^{k-1} + q^{k-1}]

Explicitly, with p + q = 1, Var[X] = pq and E[(X - p)^3] = pq(q - p).



    3.1.2 The binomial distribution

A binomial random variable X is the sum of n independent Bernoulli random variables. The sample space is Ω = {0, 1, ..., n}. For example, X may represent the number of successes in n independent Bernoulli trials, such as the number of heads after n times tossing a (biased) coin. Application of (2.66) with (3.1) gives

    φ_X(z) = (q + pz)^n    (3.2)

Expanding the binomial pgf in powers of z, which justifies the name binomial,

    φ_X(z) = \sum_{k=0}^{n} \binom{n}{k} p^k q^{n-k} z^k

and comparing to (2.18) yields

    Pr[X = k] = \binom{n}{k} p^k q^{n-k}    (3.3)

The alternative, probabilistic approach starts with (3.3). Indeed, the probability that X has k successes out of n trials consists of precisely k successes (an event with probability p^k) and n - k failures (with probability equal to q^{n-k}). The total number of ways in which k successes out of n trials can be obtained is precisely \binom{n}{k}.

The mean follows from (2.23) or from the definition X = \sum_{k=1}^{n} X_k with Bernoulli random variables X_k and the linearity of the expectation as E[X] = np. Higher-order moments around the mean can be derived from (2.24) as

    E[(X - np)^k] = \frac{d^k}{du^k} [e^{-npu} (q + p e^u)^n]_{u=0}

which, when the differentiations are carried out, leads to a multiple sum over binomial coefficients. In general, this form seems difficult to express more elegantly. It illustrates that, even for simple random variables, computations may rapidly become unattractive. For k = 2, the above differentiation leads to Var[X] = npq. But this result is more economically obtained from (2.27), since L(z) = \log φ_X(z) = n \log(q + pz), L'(z) = \frac{np}{q + pz} and L''(z) = -\frac{np^2}{(q + pz)^2}. Thus,

    Var[X] = L''(1) + L'(1) = -np^2 + np = npq    (3.4)
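As a quick sanity check (a sketch; n and p are arbitrary illustrative values), the expansion (3.3) of the pgf and the moments E[X] = np and Var[X] = npq can be verified numerically:

```python
from math import comb

n, p = 10, 0.3
q = 1 - p

# pmf (3.3) obtained from the expansion of (q + pz)^n
pmf = [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]
assert abs(sum(pmf) - 1) < 1e-12

mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum((k - mean)**2 * pk for k, pk in enumerate(pmf))
assert abs(mean - n * p) < 1e-9          # E[X] = np
assert abs(var - n * p * q) < 1e-9       # Var[X] = npq (3.4)

z = 0.8                                  # pgf check at an arbitrary point
assert abs(sum(pk * z**k for k, pk in enumerate(pmf)) - (q + p * z)**n) < 1e-12
```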


    3.1.3 The geometric distribution

The geometric random variable X returns the number of independent Bernoulli trials needed to achieve the first success. Here the sample space is the infinite set of integers. The probability density function is

    Pr[X = k] = q^{k-1} p    (3.5)

because a first success (with probability p) obtained in the k-th trial is preceded by k - 1 failures (each having probability q = 1 - p). Clearly, Pr[X = 0] = 0. The series expansion of the probability generating function,

    φ_X(z) = \sum_{k=1}^{\infty} q^{k-1} p z^k = \frac{pz}{1 - qz}    (3.6)

justifies the name geometric. The mean E[X] = φ'_X(1) equals E[X] = 1/p. The higher-order moments can be deduced from (2.24) as

    E[(X - 1/p)^k] = \frac{d^k}{du^k} [e^{-u/p} \frac{p e^u}{1 - q e^u}]_{u=0}

Similarly as for the binomial random variable, the variance most easily follows from (2.27) with L(z) = \log p + \log z - \log(1 - qz), L'(z) = \frac{1}{z} + \frac{q}{1 - qz} and L''(z) = -\frac{1}{z^2} + \frac{q^2}{(1 - qz)^2}. Thus,

    Var[X] = \frac{q^2}{p^2} - 1 + \frac{1}{p} = \frac{q}{p^2}    (3.7)

The distribution function F_X(k) = Pr[X ≤ k] = \sum_{j=1}^{k} Pr[X = j] is obtained as

    Pr[X ≤ k] = p \sum_{j=0}^{k-1} q^j = p \frac{1 - q^k}{1 - q} = 1 - q^k

The tail probability is

    Pr[X > k] = q^k    (3.8)

Hence, the probability that the number of trials until the first success is larger than k decreases geometrically in k with rate q. Let us now consider an important application of the conditional probability. The probability that, given the success is not found in the first n trials, success does not occur within the next k trials, is with (2.44)

    Pr[X > n + k | X > n] = \frac{Pr[\{X > n + k\} \cap \{X > n\}]}{Pr[X > n]} = \frac{Pr[X > n + k]}{Pr[X > n]}


and with (3.8),

    Pr[X > n + k | X > n] = q^k = Pr[X > k]

This conditional probability turns out to be independent of the hypothesis, the event {X > n}, and reflects the famous memoryless property. Only because Pr[X > k] obeys the functional equation g(n + k) = g(n) g(k) does the hypothesis or initial knowledge not matter. It is precisely as if past failures have never occurred or are forgotten and as if, after a failure, the number of trials is reset to 0. Furthermore, the only solution to the functional equation is an exponential function. Thus, the geometric distribution is the only discrete distribution that possesses the memoryless property.
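The tail (3.8) and the memoryless property can be verified directly; the parameter p and the indices k, n below are arbitrary illustrative choices.

```python
p = 0.25
q = 1 - p

def tail(k):
    """Pr[X > k] computed from the pmf (3.5); truncation error is negligible."""
    return sum(q**(j - 1) * p for j in range(k + 1, 2000))

k, n = 4, 7
assert abs(tail(k) - q**k) < 1e-12                 # (3.8)
# memoryless: Pr[X > n+k | X > n] = Pr[X > n+k] / Pr[X > n] = Pr[X > k]
assert abs(tail(n + k) / tail(n) - tail(k)) < 1e-12
```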

    3.1.4 The Poisson distribution

Often we are interested to count the number of occurrences of an event in a certain time interval such as, for example, the number of IP packets during a time slot or the number of telephony calls that arrive at a telephone exchange per unit time. The Poisson random variable X with probability density function

    Pr[X = k] = \frac{λ^k}{k!} e^{-λ}    (3.9)

turns out to model many of these counting phenomena well, as shown in Chapter 7. The corresponding generating function is

    φ_X(z) = \sum_{k=0}^{\infty} \frac{λ^k e^{-λ}}{k!} z^k = e^{λ(z-1)}    (3.10)

and the average number of occurrences in that time interval is

    E[X] = λ    (3.11)

This average determines the complete distribution. In applications it is convenient to replace the unit interval by an interval of arbitrary length t such that

    Pr[X = k] = \frac{(λt)^k}{k!} e^{-λt}

equals the probability that precisely k events occur in the interval with duration t. The probability that no events occur during t time units is Pr[X = 0] = e^{-λt} and the probability that at least one event (i.e. one or more) occurs is Pr[X > 0] = 1 - e^{-λt}. The latter is equal to the exponential distribution. We will also see later in Theorem 7.3.2 that the Poisson


counting process and the exponential distribution are intimately connected. The sum of n independent Poisson random variables, each with mean λ_k, is again a Poisson random variable with mean \sum_{k=1}^{n} λ_k, as follows from (2.65) and (3.10). The higher-order moments can be deduced from (2.24) as

    E[(X - λ)^k] = \frac{d^k}{du^k} [e^{λ(e^u - 1) - λu}]_{u=0}

from which

    E[X] = Var[X] = E[(X - λ)^3] = λ

The Poisson tail distribution equals

    Pr[X > k] = 1 - \sum_{j=0}^{k} \frac{λ^j}{j!} e^{-λ}

which precisely equals the distribution function of a sum of exponentially distributed random variables, as demonstrated below in Section 3.3.1. The Poisson density approximates the binomial density (3.3) if n → ∞ but the mean np = λ remains fixed. This phenomenon is often referred to as the law of rare events: in an arbitrarily large number n of independent trials, each with arbitrarily small success probability p = λ/n, the total number of successes will approximately be Poisson distributed.

The classical argument is to consider the binomial density (3.3) with p = λ/n,

    Pr[X = k] = \frac{n!}{k!(n-k)!} (\frac{λ}{n})^k (1 - \frac{λ}{n})^{n-k}
              = \frac{λ^k}{k!} (1 - \frac{λ}{n})^{n-k} \prod_{j=1}^{k-1} (1 - \frac{j}{n})

or

    \log Pr[X = k] = \log \frac{λ^k}{k!} + (n-k) \log(1 - \frac{λ}{n}) + \sum_{j=1}^{k-1} \log(1 - \frac{j}{n})

For large n, we use the Taylor expansion \log(1 - x) = -x - \frac{x^2}{2} + O(x^3) to obtain, up to order n^{-2},

    \log Pr[X = k] = \log \frac{λ^k}{k!} - λ - \frac{1}{2n} (λ^2 - 2kλ + k(k-1)) + O(n^{-2})
                   = \log \frac{λ^k}{k!} - λ - \frac{1}{2n} ((k - λ)^2 - k) + O(n^{-2})

With e^x = 1 + x + O(x^2), we finally obtain the approximation for large n,

    Pr[X = k] = \frac{λ^k e^{-λ}}{k!} (1 - \frac{(k - λ)^2 - k}{2n} + O(n^{-2}))


The correction term -\frac{(k-λ)^2 - k}{2n} is positive if λ + \frac{1}{2} - \sqrt{λ + \frac{1}{4}} ≤ k ≤ λ + \frac{1}{2} + \sqrt{λ + \frac{1}{4}}. In that k-interval, the Poisson density is a lower bound for the binomial density for large n and p = λ/n. The reverse holds for values of k outside that interval. Since for the Poisson density \frac{Pr[X = k]}{Pr[X = k-1]} = \frac{λ}{k}, we see that Pr[X = k] increases as k < λ and decreases as k > λ. Thus, the maximum of the Poisson density lies around k = λ = E[X]. In conclusion, we can say that the Poisson density approximates the binomial density for large n and p = λ/n from below in the region of about the standard deviation \sqrt{λ} around the mean E[X] = λ and from above outside this region (in the tails of the distribution).
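This under/over-approximation is easy to observe numerically. The sketch below (n and λ are arbitrary illustrative values) compares the binomial density with p = λ/n against its Poisson limit near the mean and in a tail.

```python
from math import comb, exp, factorial, sqrt

n, lam = 1000, 10.0
p = lam / n

def binom_pmf(k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k):
    return lam**k * exp(-lam) / factorial(k)

k_mean = int(lam)                    # near the mean: (k - lam)^2 < k
assert poisson_pmf(k_mean) < binom_pmf(k_mean)   # Poisson is a lower bound

k_tail = int(lam + 3 * sqrt(lam))    # in the tail: (k - lam)^2 > k
assert poisson_pmf(k_tail) > binom_pmf(k_tail)   # Poisson is an upper bound
```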

A much shorter derivation anticipates results of Chapter 6 and starts from the probability generating function (3.2) of the binomial distribution after substitution of p = λ/n,

    \lim_{n\to\infty} φ_X(z) = \lim_{n\to\infty} (1 + \frac{λ(z-1)}{n})^n = e^{λ(z-1)}

Invoking the Continuity Theorem 6.1.3, comparison with (3.10) shows that the limit probability generating function corresponds to a Poisson distribution. The Stein-Chen (1975) Theorem¹ generalizes the law of rare events: this law even holds when the Bernoulli trials are weakly dependent. As a final remark, let S_n be the sum of n i.i.d. Bernoulli trials, each with

mean p; then S_n is binomially distributed, as shown in Section 3.1.2. If p is a constant, independent of the number of trials n, the Central Limit Theorem 6.3.1 states that \frac{S_n - np}{\sqrt{np(1-p)}} tends to a Gaussian distribution. In summary, the limit distribution of a sum of Bernoulli trials depends on how the mean np varies with the number of trials n when n → ∞:

    if np = λ, then S_n → Poisson(λ)

    if p is constant, then \frac{S_n - np}{\sqrt{np(1-p)}} → N(0, 1)

¹ The proof (see e.g. Grimmett and Stirzaker (2001, pp. 130-132)) involves coupling theory of stochastic random variables. The degree of dependence is expressed in terms of the total variation distance. The total variation distance between two discrete random variables X and Y is defined as

    d_{TV}(X, Y) = \sum_{x} |Pr[X = x] - Pr[Y = x]|

and satisfies

    d_{TV}(X, Y) = 2 \sup_{A \subset \mathbb{Z}} |Pr[X \in A] - Pr[Y \in A]|


    3.2 Continuous random variables

    3.2.1 The uniform distribution

A uniform random variable X has equal probability to attain any value in the interval [a, b], such that the probability density function is a constant. Since Pr[a ≤ X ≤ b] = \int_a^b f_X(x) dx = 1, the constant value equals

    f_X(x) = \frac{1}{b-a} 1_{x \in [a,b]}    (3.12)

where 1_A is the indicator function defined in Section 2.2.1. The distribution function then follows as

    Pr[X ≤ x] = \frac{x-a}{b-a} 1_{x \in [a,b]} + 1_{x > b}

The Laplace transform (2.37) is²

    φ_X(s) = \int_a^b \frac{e^{-sx}}{b-a} dx = \frac{e^{-as} - e^{-bs}}{s(b-a)}    (3.13)

while the mean μ = E[X] most easily follows from

    E[X] = \int_a^b \frac{x}{b-a} dx = \frac{a+b}{2}

The centered moments are obtained from (2.39) as

    E[(X - μ)^n] = (-1)^n \frac{d^n}{ds^n} [e^{μs} \frac{e^{-as} - e^{-bs}}{s(b-a)}]_{s=0} = (-1)^n \frac{d^n}{ds^n} [\frac{2 \sinh(\frac{(b-a)s}{2})}{s(b-a)}]_{s=0}

Using the power series

    \sinh(\frac{(b-a)s}{2}) = \sum_{k=0}^{\infty} \frac{((b-a)s/2)^{2k+1}}{(2k+1)!}

leads to

    E[(X - μ)^{2k}] = \frac{(b-a)^{2k}}{(2k+1) 2^{2k}}    (3.14)

    E[(X - μ)^{2k+1}] = 0

² Notice that φ_X(s) equals the convolution of two exponential densities with rates a and b, respectively.


Let us define U as the uniform random variable on the interval [0, 1]. If V = 1 - U with U a uniform random variable on [0, 1], then U and V have the same distribution, denoted as V =_d U, because Pr[V ≤ x] = Pr[1 - U ≤ x] = Pr[U ≥ 1 - x] = 1 - (1 - x) = x = Pr[U ≤ x].

Any probability distribution function F(x) = Pr[X ≤ x] whose inverse exists can be written as a function of F_U(x) = x 1_{x \in [0,1]}. Let X = F^{-1}(U). Since the distribution function is non-decreasing, this also holds for the inverse F^{-1}(u). Applying (2.32) yields, with x = F^{-1}(u),

    F_X(x) = Pr[F^{-1}(U) ≤ x] = Pr[U ≤ F(x)] = F_U(F(x)) = F(x)

For instance, F^{-1}(u) = -\frac{\ln(1-u)}{α} =_d -\frac{\ln u}{α} are exponential random variables (3.17) with parameter α; F^{-1}(u) = u^{1/β} are polynomially distributed random variables with distribution Pr[X ≤ x] = x^β; F^{-1}(u) = \cot(πu) is a Cauchy random variable defined in (3.38) below. In addition, we observe that U = F(F^{-1}(U)) = F(X), which means that any random variable X is transformed into a uniform random variable on [0, 1] by its own distribution function.

The numbers x_n that satisfy congruent recursions of the form x_{n+1} = (a x_n + c) mod m, where m is a large prime number (e.g. m = 2^{31} - 1) and a, c are integers (e.g. a = 397 204 094 and c = 0), are to a good approximation uniformly distributed. The scaled numbers u_n = x_n m^{-1} are nearly uniformly distributed on [0, 1]. Since these recursions with initial value or seed x_0 are easy to generate with computers (Press et al., 1992), the above property is very useful to generate arbitrary random variables X = F^{-1}(U) from the uniform random variable U.
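The two ideas combine into a complete sampling recipe, sketched below: a multiplicative congruential generator (with the modulus and multiplier quoted in the text; the seed and rate α are arbitrary illustrative choices) produces nearly uniform numbers u, and X = -ln(u)/α then follows the exponential distribution (3.17).

```python
from math import exp, log

m, a = 2**31 - 1, 397204094    # parameters from the text
x = 123456789                  # seed (arbitrary, nonzero)
alpha = 2.0                    # target exponential rate (illustrative)

samples = []
for _ in range(100_000):
    x = (a * x) % m            # congruential recursion x_{n+1} = a x_n mod m
    u = x / m                  # nearly uniform on (0, 1)
    samples.append(-log(u) / alpha)   # inverse-transform step

mean = sum(samples) / len(samples)
assert abs(mean - 1 / alpha) < 0.01   # E[X] = 1/alpha

xval = 0.5                     # empirical cdf vs F(x) = 1 - e^{-alpha x}
emp = sum(s <= xval for s in samples) / len(samples)
assert abs(emp - (1 - exp(-alpha * xval))) < 0.01
```

The same loop with a different F^{-1} yields samples of any distribution with a computable inverse.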

    3.2.2 The exponential distribution

An exponential random variable X satisfies the probability density function

    f_X(x) = α e^{-αx},  x ≥ 0    (3.15)

where α is the rate at which events occur. The corresponding Laplace transform is

    φ_X(s) = \int_0^{\infty} e^{-sx} α e^{-αx} dx = \frac{α}{α + s}    (3.16)

and the probability distribution is, for x ≥ 0,

    F_X(x) = 1 - e^{-αx}    (3.17)


The mean or average follows from (2.33) or from E[X] = -φ'_X(0) as μ = E[X] = 1/α. The centered moments are obtained from (2.39) as

    E[(X - \frac{1}{α})^n] = (-1)^n \frac{d^n}{ds^n} [e^{s/α} \frac{α}{α + s}]_{s=0}

Since the Taylor expansion of e^{s/α} \frac{α}{α+s} around s = 0 is

    e^{s/α} \frac{α}{α+s} = \sum_{j=0}^{\infty} \frac{1}{j!} (\frac{s}{α})^j \sum_{m=0}^{\infty} (-1)^m (\frac{s}{α})^m = \sum_{n=0}^{\infty} (-1)^n (\frac{s}{α})^n \sum_{k=0}^{n} \frac{(-1)^k}{k!}

we find that

    E[(X - \frac{1}{α})^n] = \frac{n!}{α^n} \sum_{k=0}^{n} \frac{(-1)^k}{k!}    (3.18)

For large n, the centered moments are well approximated by

    E[(X - \frac{1}{α})^n] ≃ \frac{n!}{e α^n}
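Formula (3.18) can be cross-checked (a sketch; the rate α is an arbitrary illustrative value) against the raw moments E[X^j] = j!/α^j via the binomial theorem:

```python
from math import comb, factorial

def centered_moment(n, alpha):
    """E[(X - 1/alpha)^n] from raw moments of Exp(alpha) via the binomial theorem."""
    return sum(comb(n, j) * factorial(j) / alpha**j * (-1 / alpha)**(n - j)
               for j in range(n + 1))

def formula(n, alpha):
    """Right-hand side of (3.18)."""
    return factorial(n) / alpha**n * sum((-1)**k / factorial(k)
                                         for k in range(n + 1))

for n in range(1, 9):
    assert abs(centered_moment(n, 1.5) - formula(n, 1.5)) < 1e-9

# for growing n, the bracketed sum tends to 1/e, i.e. moments ~ n!/(e alpha^n)
assert abs(formula(8, 1.5) * 1.5**8 / factorial(8) - 0.36787944) < 1e-4
```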

The exponential random variable possesses, just as its discrete counterpart the geometric random variable, the memoryless property. Indeed, analogous to Section 3.1.3, consider

    Pr[X > t + s | X > s] = \frac{Pr[\{X > t + s\} \cap \{X > s\}]}{Pr[X > s]} = \frac{Pr[X > t + s]}{Pr[X > s]}

and since Pr[X > t] = e^{-αt}, the memoryless property

    Pr[X > t + s | X > s] = Pr[X > t]

is established. Since the only non-zero solution (proved in Feller (1970, p. 459)) to the functional equation g(t + s) = g(t) g(s), which implies the memoryless property, is of the form e^{-αt}, it shows that the exponential distribution is the only continuous distribution that has the memoryless property. As we will see later, this memoryless property is a fundamental property in Markov processes.

It is instructive to show the close relation between the geometric and exponential random variable (see Feller (1971, p. 1)). Consider the waiting time X (measured in integer units of Δt) for the first success in a sequence of Bernoulli trials where only one trial occurs in a timeslot Δt. Hence, X/Δt is a (dimensionless) geometric random variable. From (3.8), Pr[X > nΔt] = (1 - p)^n and the average waiting time is E[X] = Δt E[X/Δt] = Δt/p. The


transition from the discrete to the continuous space involves the limit process Δt → 0 subject to a fixed average waiting time E[X]. Let x = nΔt; then

    \lim_{Δt \to 0} Pr[X > x] = \lim_{Δt \to 0} (1 - \frac{Δt}{E[X]})^{x/Δt} = e^{-x/E[X]}

For arbitrarily small time units, the waiting time for the first success and with average E[X] turns out to be an exponential random variable.
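The limit can be observed numerically; the mean E[X] and the point x below are arbitrary illustrative values.

```python
from math import exp

EX, x = 2.0, 3.0                       # illustrative mean and evaluation point
target = exp(-x / EX)                  # exponential limit of the tail

prev_err = float("inf")
for dt in [0.1, 0.01, 0.001, 0.0001]:
    approx = (1 - dt / EX) ** (x / dt)  # geometric tail (1-p)^n with p = dt/EX
    err = abs(approx - target)
    assert err < prev_err              # error shrinks as dt -> 0
    prev_err = err
assert prev_err < 1e-4
```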

    3.2.3 The Gaussian or normal distribution

The Gaussian random variable X is defined for all x by the probability density function

    f_X(x) = \frac{1}{σ\sqrt{2π}} \exp(-\frac{(x-μ)^2}{2σ^2})    (3.19)

which explicitly shows its dependence on the average μ and variance σ². The importance of the Gaussian random variables stems from the Central Limit Theorem 6.3.1. Often a Gaussian, also called normal, random variable with average μ and variance σ² is denoted by N(μ, σ²). The distribution function is

    F_X(x) = \frac{1}{σ\sqrt{2π}} \int_{-\infty}^{x} \exp(-\frac{(t-μ)^2}{2σ^2}) dt = Φ(\frac{x-μ}{σ})    (3.20)

where³ Φ(x) = \frac{1}{\sqrt{2π}} \int_{-\infty}^{x} e^{-t^2/2} dt is the normalized Gaussian distribution corresponding to μ = 0 and σ = 1. The double-sided Laplace transform is

    φ_X(s) = \frac{1}{σ\sqrt{2π}} \int_{-\infty}^{\infty} e^{-sx} \exp(-\frac{(x-μ)^2}{2σ^2}) dx = e^{-μs + \frac{σ^2 s^2}{2}}    (3.22)

³ Abramowitz and Stegun (1968, Section 7.1.1) define the error function as

    erf(x) = \frac{2}{\sqrt{π}} \int_0^x e^{-t^2} dt    (3.21)

such that (Abramowitz and Stegun, 1968, Section 7.1.22)

    \frac{1}{σ\sqrt{2π}} \int_{-\infty}^{x} \exp(-\frac{(t-μ)^2}{2σ^2}) dt = \frac{1}{2} [1 + erf(\frac{x-μ}{σ\sqrt{2}})]


and the centered moments (2.39) are

    E[(X - μ)^{2k}] = \frac{d^{2k}}{ds^{2k}} [e^{\frac{σ^2 s^2}{2}}]_{s=0} = \frac{(2k)!}{k!} (\frac{σ^2}{2})^k

    E[(X - μ)^{2k+1}] = 0    (3.23)

We note from (2.65) that a sum of independent Gaussian random variables N(μ_k, σ_k^2) is again a Gaussian random variable N(\sum_{k=1}^{n} μ_k, \sum_{k=1}^{n} σ_k^2). If X = N(μ, σ²), then the scaled random variable Y = aX is a N(aμ, (aσ)²) random variable, which is verified by computing Pr[Y ≤ x] = Pr[X ≤ \frac{x}{a}]. Similarly for a translation Y = X + b, then Y = N(μ + b, σ²). Hence, a linear combination of Gaussian random variables is again a Gaussian random variable,

    \sum_{k=1}^{n} a_k N(μ_k, σ_k^2) + b = N(\sum_{k=1}^{n} a_k μ_k + b, \sum_{k=1}^{n} a_k^2 σ_k^2)
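A simulation sketch can confirm the parameters of such a linear combination; all numerical values below are arbitrary illustrative choices.

```python
import random
from statistics import fmean, pvariance

# a1*N(mu1, s1^2) + a2*N(mu2, s2^2) + b should have
# mean a1*mu1 + a2*mu2 + b and variance a1^2 s1^2 + a2^2 s2^2.
random.seed(42)
a1, mu1, s1 = 2.0, 1.0, 0.5
a2, mu2, s2 = -1.0, 3.0, 2.0
b = 4.0

xs = [a1 * random.gauss(mu1, s1) + a2 * random.gauss(mu2, s2) + b
      for _ in range(200_000)]

assert abs(fmean(xs) - (a1 * mu1 + a2 * mu2 + b)) < 0.03
assert abs(pvariance(xs) - (a1**2 * s1**2 + a2**2 * s2**2)) < 0.1
```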

    3.3 Derived distributions

From the basic distributions, a large number of other distributions can be derived, as illustrated here.

    3.3.1 The sum of independent exponential random variables

By applying (2.65) and (2.38), a substantial number of practical problems can be solved. For example, the sum S_n of n independent exponential random variables, each with different rate α_k > 0, has the generating function

    φ_{S_n}(s) = \prod_{k=1}^{n} \frac{α_k}{α_k + s}

and probability density function

    f_{S_n}(t) = (\prod_{k=1}^{n} α_k) \frac{1}{2πi} \int_{c-i\infty}^{c+i\infty} \frac{e^{ts} ds}{\prod_{k=1}^{n} (α_k + s)}

The contour can be closed over the negative half plane for t > 0, where the integral has simple poles at s = -α_k. From the Cauchy integral theorem, we obtain

    f_{S_n}(t) = (\prod_{k=1}^{n} α_k) \sum_{k=1}^{n} \frac{e^{-α_k t}}{\prod_{j=1; j \neq k}^{n} (α_j - α_k)}
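The partial-fraction density above can be sanity-checked numerically (a sketch; the rates are arbitrary distinct illustrative values): it should integrate to 1 and have mean \sum_k 1/α_k.

```python
from math import exp, prod

alphas = [1.0, 2.0, 5.0]          # distinct rates (illustrative)
C = prod(alphas)

def density(t):
    """Partial-fraction density of the sum of independent exponentials."""
    return C * sum(exp(-ak * t) / prod(aj - ak for aj in alphas if aj != ak)
                   for ak in alphas)

dt, T = 1e-3, 40.0                # crude Riemann integration grid
ts = [i * dt for i in range(int(T / dt))]
vals = [density(t) for t in ts]
total = sum(vals) * dt
mean = sum(t * v for t, v in zip(ts, vals)) * dt

assert abs(total - 1.0) < 1e-3
assert abs(mean - sum(1 / a for a in alphas)) < 1e-3
```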


If all rates are equal, α_k = α, the case reduces to φ_{S_n}(s) = (\frac{α}{α+s})^n with E[S_n] = \frac{n}{α} and with probability density

    f_{S_n}(t) = \frac{α^n}{2πi} \int_{c-i\infty}^{c+i\infty} \frac{e^{ts}}{(α + s)^n} ds

Again, the contour can be closed over the negative half plane and the n-th order poles are deduced from Cauchy's relation for the n-th derivative of a complex function,

    \frac{1}{n!} \frac{d^n f(z)}{dz^n}|_{z=z_0} = \frac{1}{2πi} \oint_{C(z_0)} \frac{f(z)}{(z - z_0)^{n+1}} dz

as

    f_{S_n}(t) = \frac{α^n}{(n-1)!} \frac{d^{n-1} e^{ts}}{ds^{n-1}}|_{s=-α} = \frac{α (αt)^{n-1} e^{-αt}}{(n-1)!}    (3.24)

For integer n, this density corresponds to the Erlang random variable. When extended to real values of n = β,

    f_X(x; β) = \frac{α (αx)^{β-1} e^{-αx}}{Γ(β)}    (3.25)

it is called the Gamma probability density function, with corresponding pgf

    φ_X(s; β) = (\frac{α}{α + s})^β = (1 + \frac{s}{α})^{-β}    (3.26)

and distribution

    F_X(x; β) = \frac{α^β}{Γ(β)} \int_0^x t^{β-1} e^{-αt} dt    (3.27)

This integral, the incomplete Gamma function, can only be expressed in closed analytic form if β is an integer. Hence, for the n-Erlang random variable S_n, the distribution follows after repeated partial integration as

    F_{S_n}(x) = \frac{α^n}{(n-1)!} \int_0^x t^{n-1} e^{-αt} dt = 1 - e^{-αx} \sum_{k=0}^{n-1} \frac{(αx)^k}{k!}    (3.28)

We observe that Pr[S_n > x] = \sum_{k=0}^{n-1} \frac{(αx)^k e^{-αx}}{k!}, which equals Pr[Y ≤ n - 1], where Y is a Poisson random variable with mean λ = αx. Further, Pr[S_n ≤ x] = Pr[S_n^* ≤ αx], where S_n^* is the corresponding sum with unit rates: the distribution of the sum of n i.i.d. exponential random variables, each with rate α, follows by scaling from the distribution of the sum of n i.i.d. exponential random variables each with unit rate (or mean 1). Moreover, (2.65) and (3.26) show that a


sum of n independent Gamma random variables specified by β_k (but with the same α) is again a Gamma random variable with β = \sum_{k=1}^{n} β_k.
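The Erlang tail in (3.28) and its Poisson interpretation can be checked against a simulation; the parameters α, n and x below are arbitrary illustrative values.

```python
import random
from math import exp, factorial

alpha, n, x = 2.0, 4, 1.7      # illustrative rate, order, and evaluation point

# (3.28): Pr[S_n > x] = sum_{k=0}^{n-1} e^{-alpha x} (alpha x)^k / k!,
# i.e. the tail of the n-Erlang equals a Poisson cdf with mean alpha*x.
erlang_tail = sum(exp(-alpha * x) * (alpha * x)**k / factorial(k)
                  for k in range(n))

# simulate S_n as a sum of n independent exponentials with rate alpha
random.seed(1)
N = 200_000
hits = sum(sum(random.expovariate(alpha) for _ in range(n)) > x
           for _ in range(N))
assert abs(hits / N - erlang_tail) < 0.01
```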

Finally, all centered moments follow from (2.39) by series expansion around s = 0 as

    E[(X - μ)^n] = (-1)^n \frac{d^n}{ds^n} [e^{μs} (1 + \frac{s}{α})^{-β}]_{s=0}

In particular, since μ = E[X] = \frac{β}{α}, expanding e^{βs/α} and (1 + \frac{s}{α})^{-β} in powers of \frac{s}{α} and collecting the coefficient of s^n gives

    E[(X - μ)^n] = \frac{(-1)^n n!}{α^n} \sum_{k=0}^{n} \frac{(-1)^k Γ(β + k)}{Γ(β) k!} \frac{β^{n-k}}{(n-k)!}

a finite sum that can be written in terms of the confluent hypergeometric function U(a, b, z) (Abramowitz and Stegun, 1968, Chapter 13). For example, if n = 2, the variance equals σ² = \frac{β}{α^2} and, further,

    E[(X - μ)^3] = \frac{2β}{α^3},  E[(X - μ)^4] = \frac{3β(β + 2)}{α^4},  E[(X - μ)^5] = \frac{4β(5β + 6)}{α^5}

    3.3.2 The sum of independent uniform random variables

The sum S_n = \sum_{k=1}^{n} U_k of n i.i.d. uniform random variables on [0, 1] has as distribution function the n-fold convolution of the uniform density function f_U(x) = 1_{0 ≤ x ≤ 1}, denoted by f_U^{(n)}(x). The distribution function equals

    Pr[S_n ≤ x] = \sum_{k=0}^{[x]} \frac{(-1)^k}{k!(n-k)!} (x - k)^n    (3.29)

Indeed, from (2.66) and (3.13) the Laplace transform of S_n is

    φ_{S_n}(s) = (\frac{1 - e^{-s}}{s})^n

The inverse Laplace transform determines, for x ≥ 0,

    f_U^{(n)}(x) = \frac{d}{dx} Pr[S_n ≤ x] = \frac{1}{2πi} \int_{c-i\infty}^{c+i\infty} e^{xs} (\frac{1 - e^{-s}}{s})^n ds

Using (1 - e^{-s})^n = \sum_{k=0}^{n} \binom{n}{k} (-1)^k e^{-ks} and the integral \frac{1}{2πi} \int_{c-i\infty}^{c+i\infty} \frac{e^{zs}}{s^n} ds = \frac{z^{n-1}}{(n-1)!} 1_{z > 0}, valid for Re(c) > 0, yields

    f_U^{(n)}(x) = \sum_{k=0}^{n} \binom{n}{k} (-1)^k \frac{(x - k)^{n-1}}{(n-1)!} 1_{x - k > 0}    (3.30)

from which (3.29) follows by integration.
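Formula (3.29) is easy to test, both at a point where the answer is known by symmetry and against a simulation; n, x and the sample size are arbitrary illustrative choices.

```python
import random
from math import comb, factorial

def irwin_hall_cdf(x, n):
    """Pr[S_n <= x] from (3.29): (1/n!) sum_{k=0}^{[x]} (-1)^k C(n,k)(x-k)^n."""
    return sum((-1)**k * comb(n, k) * (x - k)**n
               for k in range(int(x) + 1)) / factorial(n)

# by symmetry of the density about n/2, the cdf at n/2 equals 1/2
assert abs(irwin_hall_cdf(1.0, 2) - 0.5) < 1e-12

random.seed(3)
n, x, N = 3, 1.2, 200_000
emp = sum(sum(random.random() for _ in range(n)) <= x
          for _ in range(N)) / N
assert abs(emp - irwin_hall_cdf(x, n)) < 0.01
```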

    3.3.3 The chi-square distribution

Suppose that the total error of n independent measurements y_k, each perturbed by Gaussian noise, has to be determined. In order to prevent that errors may cancel out, the sum of the squared errors χ = \sum_{k=1}^{n} e_k^2 is preferred rather than \sum_{k=1}^{n} |e_k|. For simplicity, we assume that all errors e_k = y_k - μ_k, where μ_k is the exact value of quantity k, have zero mean and unit variance. The corresponding distribution of χ is known as the chi-square distribution. From the χ²-distribution, the χ²-test in statistics is deduced, which determines the goodness of fit of a model of a distribution to a set of measurements. We refer for a discussion of the χ²-test to Leon-Garcia (1994, Section 3.8) or Allen (1978, Section 8.4).

We first deduce the distribution of the square T = X² of a random variable X and note that, if X and Y are independent, so are the random variables g(X) and h(Y). The event {T ≤ t} or {X² ≤ t} is equivalent to {-\sqrt{t} ≤ X ≤ \sqrt{t}} and non-existent if t < 0. With (2.29) and t ≥ 0,

    Pr[T ≤ t] = Pr[-\sqrt{t} ≤ X ≤ \sqrt{t}] = F_X(\sqrt{t}) - F_X(-\sqrt{t})

and, after differentiation,

    f_{X^2}(t) = \frac{f_X(\sqrt{t}) + f_X(-\sqrt{t})}{2\sqrt{t}}

If X is a Gaussian random variable N(μ, σ²), then f_{X^2} is, for t ≥ 0,

    f_{X^2}(t) = \frac{\exp[-(t + μ^2)/(2σ^2)]}{σ\sqrt{2πt}} \cosh(\frac{μ\sqrt{t}}{σ^2})

In particular, for N(0, 1) random variables where μ = 0 and σ = 1, f_{X^2}(t) = \frac{e^{-t/2}}{\sqrt{2πt}} reduces to a Gamma density (3.25) with α = \frac{1}{2} and β = \frac{1}{2}. Since the sum of n independent Gamma random variables with (α, β_k) is again a Gamma random variable (α, \sum_k β_k), we arrive at the chi-square χ² probability density function,

    f_{χ^2}(x) = \frac{x^{n/2 - 1} e^{-x/2}}{2^{n/2} Γ(n/2)}    (3.31)

3.4 Functions of random variables

3.4.1 The maximum and minimum of a set of independent random variables

The minimum of n i.i.d. random variables {X_k}_{1 ≤ k ≤ n} possesses the distribution⁴

    Pr[\min_{1 ≤ k ≤ n} X_k ≤ x] = Pr[at least one X_k ≤ x] = Pr[not all X_k > x]

or

    Pr[\min_{1 ≤ k ≤ n} X_k ≤ x] = 1 - \prod_{k=1}^{n} Pr[X_k > x]    (3.32)

whereas for the maximum,

    Pr[\max_{1 ≤ k ≤ n} X_k > x] = Pr[not all X_k ≤ x] = 1 - \prod_{k=1}^{n} Pr[X_k ≤ x]

or

    Pr[\max_{1 ≤ k ≤ n} X_k ≤ x] = \prod_{k=1}^{n} Pr[X_k ≤ x]
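A classical application of (3.32): for independent exponentials with rates α_k, the minimum is again exponential with rate \sum_k α_k, since Pr[min > x] = \prod_k e^{-α_k x}. The sketch below (rates, x and sample size are arbitrary illustrative values) verifies this by simulation.

```python
import random
from math import exp

rates = [0.5, 1.0, 2.5]              # illustrative exponential rates
x = 0.3
# (3.32): Pr[min <= x] = 1 - prod_k Pr[X_k > x] = 1 - exp(-x * sum(rates))
exact = 1 - exp(-x * sum(rates))

random.seed(11)
N = 200_000
hits = sum(min(random.expovariate(a) for a in rates) <= x
           for _ in range(N))
assert abs(hits / N - exact) < 0.01
```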