
    Handbook

    Statistical foundations of machine learning

    Gianluca Bontempi

    Machine Learning Group

    Computer Science Department

Université Libre de Bruxelles (ULB), Belgique


Contents

1 Introduction
   1.1 Notations

2 Foundations of probability
   2.1 The random model of uncertainty
      2.1.1 Axiomatic definition of probability
      2.1.2 Symmetrical definition of probability
      2.1.3 Frequentist definition of probability
      2.1.4 The Law of Large Numbers
      2.1.5 Independence and conditional probability
      2.1.6 Combined experiments
      2.1.7 The law of total probability and the Bayes theorem
      2.1.8 Array of joint/marginal probabilities
   2.2 Random variables
   2.3 Discrete random variables
      2.3.1 Parametric probability function
      2.3.2 Expected value, variance and standard deviation of a discrete r.v.
      2.3.3 Moments of a discrete r.v.
      2.3.4 Entropy and relative entropy
   2.4 Continuous random variable
      2.4.1 Mean, variance, moments of a continuous r.v.
   2.5 Joint probability
      2.5.1 Marginal and conditional probability
   2.6 Common discrete probability functions
      2.6.1 The Bernoulli trial
      2.6.2 The Binomial probability function
      2.6.3 The Geometric probability function
      2.6.4 The Poisson probability function
   2.7 Common continuous distributions
      2.7.1 Uniform distribution
      2.7.2 Exponential distribution
      2.7.3 The Gamma distribution
      2.7.4 Normal distribution: the scalar case
      2.7.5 The chi-squared distribution
      2.7.6 Student's t-distribution
      2.7.7 F-distribution
      2.7.8 Bivariate continuous distribution
      2.7.9 Correlation
   2.8 Normal distribution: the multivariate case
      2.8.1 Bivariate normal distribution
   2.9 Linear combinations of r.v.
      2.9.1 The sum of i.i.d. random variables
   2.10 Transformation of random variables
   2.11 The central limit theorem
   2.12 The Chebyshev's inequality

3 Classical parametric estimation
   3.1 Classical approach
      3.1.1 Point estimation
   3.2 Empirical distributions
   3.3 Plug-in principle to define an estimator
      3.3.1 Sample average
      3.3.2 Sample variance
   3.4 Sampling distribution
   3.5 The assessment of an estimator
      3.5.1 Bias and variance
      3.5.2 Bias and variance of μ̂
      3.5.3 Bias of the estimator σ̂²
      3.5.4 Bias/variance decomposition of MSE
      3.5.5 Consistency
      3.5.6 Efficiency
      3.5.7 Sufficiency
   3.6 The Hoeffding's inequality
   3.7 Sampling distributions for Gaussian r.v.s
   3.8 The principle of maximum likelihood
      3.8.1 Maximum likelihood computation
      3.8.2 Properties of m.l. estimators
      3.8.3 Cramer-Rao lower bound
   3.9 Interval estimation
      3.9.1 Confidence interval of μ
   3.10 Combination of two estimators
      3.10.1 Combination of m estimators
   3.11 Testing hypothesis
      3.11.1 Types of hypothesis
      3.11.2 Types of statistical test
      3.11.3 Pure significance test
      3.11.4 Tests of significance
      3.11.5 Hypothesis testing
      3.11.6 Receiver Operating Characteristic curve
      3.11.7 Choice of test
      3.11.8 UMP level-α test
      3.11.9 Likelihood ratio test
   3.12 Parametric tests
      3.12.1 z-test (single and one-sided)
      3.12.2 t-test: single sample and two-sided
      3.12.3 χ²-test: single sample and two-sided
      3.12.4 t-test: two samples, two sided
      3.12.5 F-test: two samples, two sided

4 Nonparametric estimation and testing
   4.1 Nonparametric methods
   4.2 Estimation of arbitrary statistics
   4.3 Jacknife
      4.3.1 Jacknife estimation
   4.4 Bootstrap
      4.4.1 Bootstrap sampling
      4.4.2 Bootstrap estimate of the variance
      4.4.3 Bootstrap estimate of bias
   4.5 Bootstrap confidence interval
      4.5.1 The bootstrap principle
   4.6 Randomization tests
      4.6.1 Randomization and bootstrap
   4.7 Permutation test
   4.8 Considerations on nonparametric tests

5 A statistical framework of supervised learning
   5.1 Introduction
   5.2 Estimating dependencies
   5.3 The problem of classification
      5.3.1 Inverse conditional distribution
   5.4 The problem of regression estimation
      5.4.1 An illustrative example
   5.5 Generalization error
      5.5.1 The decomposition of the generalization error in regression
      5.5.2 The decomposition of the generalization error in classification
   5.6 The supervised learning procedure
   5.7 Validation techniques
      5.7.1 The resampling methods
   5.8 Concluding remarks

6 The machine learning procedure
   6.1 Introduction
   6.2 Problem formulation
   6.3 Experimental design
   6.4 Data pre-processing
   6.5 The dataset
   6.6 Parametric identification
      6.6.1 Error functions
      6.6.2 Parameter estimation
   6.7 Structural identification
      6.7.1 Model generation
      6.7.2 Validation
      6.7.3 Model selection criteria
   6.8 Concluding remarks

7 Linear approaches
   7.1 Linear regression
      7.1.1 The univariate linear model
      7.1.2 Least-squares estimation
      7.1.3 Maximum likelihood estimation
      7.1.4 Partitioning the variability
      7.1.5 Test of hypotheses on the regression model
      7.1.6 Interval of confidence
      7.1.7 Variance of the response
      7.1.8 Coefficient of determination
      7.1.9 Multiple linear dependence
      7.1.10 The multiple linear regression model
      7.1.11 The least-squares solution
      7.1.12 Variance of the prediction
      7.1.13 The HAT matrix
      7.1.14 Generalization error of the linear model
      7.1.15 The expected empirical error
      7.1.16 The PSE and the FPE
   7.2 The PRESS statistic
   7.3 The weighted least-squares
      7.3.1 Recursive least-squares
   7.4 Discriminant functions for classification
      7.4.1 Perceptrons
      7.4.2 Support vector machines

8 Nonlinear approaches
   8.1 Nonlinear regression
      8.1.1 Artificial neural networks
      8.1.2 From global modeling to divide-and-conquer
      8.1.3 Classification and Regression Trees
      8.1.4 Basis Function Networks
      8.1.5 Radial Basis Functions
      8.1.6 Local Model Networks
      8.1.7 Neuro-Fuzzy Inference Systems
      8.1.8 Learning in Basis Function Networks
      8.1.9 From modular techniques to local modeling
      8.1.10 Local modeling
   8.2 Nonlinear classification
      8.2.1 Naive Bayes classifier
      8.2.2 SVM for nonlinear classification

9 Model averaging approaches
   9.1 Stacked regression
   9.2 Bagging
   9.3 Boosting
      9.3.1 The Ada Boost algorithm
      9.3.2 The arcing algorithm
      9.3.3 Bagging and boosting

10 Conclusions
   10.1 Causality and dependencies

A Unsupervised learning
   A.1 Probability density estimation
      A.1.1 Nonparametric density estimation
      A.1.2 Semi-parametric density estimation
   A.2 K-means clustering
   A.3 Fuzzy clustering
   A.4 Fuzzy c-ellyptotypes

B Some statistical notions
   B.1 Useful relations
   B.2 Convergence of random variables
   B.3 Limits and probability
   B.4 Expected value of a quadratic form
   B.5 The matrix inversion formula
   B.6 Proof of Eq. (5.20)
   B.7 Biasedness of the quadratic empirical risk

C Kernel functions

D Datasets
   D.1 USPS dataset
   D.2 Golub dataset

E R scripts


    Chapter 1

    Introduction

In recent years, a growing number of organizations have been allocating vast amounts of resources to construct and maintain databases and data warehouses. In scientific endeavours, data refers to carefully collected observations about some phenomenon under study. In business, data capture information about economic trends, critical markets, competitors and customers. In manufacturing, data record machinery performances and production rates in different conditions. There are essentially two reasons why people gather increasing volumes of data: first, they think some valuable assets are implicitly coded within them, and, second, computer technology enables effective data storage at reduced costs.

The idea of extracting useful knowledge from volumes of data is common to many disciplines, from statistics to physics, from econometrics to system identification and adaptive control. The procedure for finding useful patterns in data is known by different names in different communities, viz., knowledge extraction, pattern analysis, data processing. More recently, the set of computational techniques and tools to support the modelling of large amounts of data is being grouped under the more general label of machine learning [47].

The need for programs that can learn was stressed by Alan Turing, who argued that it may be too ambitious to write from scratch programs for tasks that even humans must learn to perform. This handbook aims to present the statistical foundations of machine learning, intended as the discipline which deals with the automatic design of models from data. In particular, we focus on supervised learning problems (Figure 1.1), where the goal is to model the relation between a set of input variables and one or more output variables, which are considered to be dependent on the inputs in some manner.

Since the handbook deals with artificial learning methods, we do not take into consideration any argument of biological or cognitive plausibility of the learning methods we present. Learning is postulated here as a problem of statistical estimation of the dependencies between variables on the basis of empirical data.

The relevance of statistical analysis arises as soon as there is a need to extract useful information from data records obtained by repeatedly measuring an observed phenomenon. Suppose we are interested in learning about the relationship between two variables x (e.g. the height of a child) and y (e.g. the weight of a child) which are quantitative observations of some phenomenon of interest (e.g. obesity during childhood). Sometimes, the a priori knowledge that describes the relation between x and y is available. In other cases, no satisfactory theory exists and all that we can use are repeated measurements of x and y. In this book our focus is the second situation, where we assume that only a set of observed data is available. The reasons for addressing this problem are essentially two. First, the more complex the input/output relation is, the less effective will be the contribution of a human expert in extracting a model of the relation. Second, data-driven modelling may be a valuable support for the designer also in modelling tasks where he can take advantage of existing knowledge.

Figure 1.1: The supervised learning setting. Machine learning aims to infer from observed data the best model of the stochastic input/output dependency.

    Modelling from data

Modelling from data is often viewed as an art, mixing an expert's insight with the information contained in the observations. A typical modelling process cannot be considered as a sequential process but is better represented as a loop with many feedback paths and interactions with the model designer. Various steps are repeated several times aiming to reach, through continuous refinements, a good description of the phenomenon underlying the data.

The process of modelling consists of a preliminary phase, which brings the data from their original form to a structured configuration, and a learning phase, which aims to select the model, or hypothesis, that best approximates the data (Fig. 1.2).

    The preliminary phase can be decomposed in the following steps:

Problem formulation. Here the model designer chooses a particular application domain, a phenomenon to be studied, and hypothesizes the existence of a (stochastic) relation (or dependency) between the measurable variables.

Experimental design. This step aims to return a dataset which, ideally, should be made of samples that are well-representative of the phenomenon in order to maximize the performance of the modelling process [35].

Pre-processing. In this step, raw data are cleaned to make learning easier. Pre-processing includes a large set of actions on the observed data, such as noise filtering, outlier removal, missing data treatment [76], feature selection, and so on.

Figure 1.2: The modelling process and its decomposition in preliminary phase and learning phase.

Once the preliminary phase has returned the dataset in a structured input/output form (e.g. a two-column table), called training set, the learning phase begins. A graphical representation of a training set for a simple learning problem with one input variable x and one output variable y is given in Figure 1.3.

Figure 1.3: A training set for a simple learning problem with one input variable x and one output variable y. The dots represent the observed samples.

Figure 1.4: A second realization of the training set for the same phenomenon observed in Figure 1.3. The dots represent the observed samples.

This manuscript will focus exclusively on this second phase, assuming that the preliminary steps have already been performed by the model designer.

Suppose that, on the basis of the collected data, we wish to learn the unknown dependency existing between the x variable and the y variable. In practical terms, the knowledge of this dependency could shed light on the observed phenomenon and allow us to predict the value of the output y for a given input (e.g. what is the expected weight of a child who is 120 cm tall?). What is difficult and tricky in this task is the finiteness and the random nature of the data. For instance, a second set of observations of the same pair of variables could produce a dataset (Figure 1.4) which is not identical to the one in Figure 1.3, though both originate from the same measurable phenomenon. This simple fact suggests that a simple interpolation of the observed data would not produce an accurate model of the data.

The goal of machine learning is to formalize and optimize the procedure which leads from data to model and, consequently, from data to predictions. A learning procedure can be concisely defined as a search, in a space of possible model configurations, for the model which best represents the phenomenon underlying the data. As a consequence, a learning procedure requires both a search space, where possible solutions may be found, and an assessment criterion which measures the quality of the solutions in order to select the best one.

The search space is defined by the designer using a set of nested classes with increasing complexity. For our introductory purposes, it is sufficient to consider here a class as a set of input/output models (e.g. the set of polynomial models) with the same model structure (e.g. second-order degree) and the complexity of the class as a measure of the set of input/output mappings which can be approximated by the models belonging to the class.

Figure 1.5: Training set and three parametric models which belong to the class of first-order polynomials.

Figure 1.6: Training set and three parametric models which belong to the class of second-order polynomials.

Figure 1.5 shows the training set of Figure 1.3 together with three parametric models which belong to the class of first-order polynomials. Figure 1.6 shows the same training set with three parametric models which belong to the class of second-order polynomials.

The reader could visually decide whether the class of second-order models is more adequate than the first-order class to model the dataset. At the same time she could guess which among the three plotted models is the one which produces the best fit.
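As a concrete illustration of the two classes of models just discussed, the following R sketch fits a first-order and a second-order polynomial to a small training set and compares their residual sums of squares. This is a minimal sketch with invented data (the true underlying dependency below is an assumption made purely for illustration, not an example taken from the handbook).

# Minimal sketch (invented data): fit first- and second-order polynomial
# models to a small training set and compare their fit on the training data.
set.seed(0)
N <- 20
x <- runif(N, -2, 2)                                # one input variable
y <- 1 + 0.5 * x - 0.8 * x^2 + rnorm(N, sd = 0.3)   # hypothetical dependency plus noise

mod1 <- lm(y ~ x)             # a model from the class of first-order polynomials
mod2 <- lm(y ~ x + I(x^2))    # a model from the class of second-order polynomials

# Residual sum of squares of the two fitted parametric models
c(RSS.order1 = sum(resid(mod1)^2), RSS.order2 = sum(resid(mod2)^2))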

In real high-dimensional settings, however, a visual assessment of the quality of a model is not sufficient. Data-driven quantitative criteria are therefore required. We will assume that the goal of learning is to attain a good statistical generalization. This means that the selected model is expected to return an accurate prediction of the dependent (output) variable when values of the independent (input) variables, which are not part of the training set, are presented.

Once the classes of models and the assessment criteria are fixed, the goal of a learning algorithm is to search i) for the best class of models and ii) for the best parametric model within such a class. Any supervised learning algorithm is then made of two nested loops denoted as the structural identification loop and the parametric identification loop.

Structural identification is the outer loop which seeks the model structure which is expected to have the best performance. It is composed of a validation phase, which assesses each model structure on the basis of the chosen assessment criterion, and a selection phase, which returns the best model structure on the basis of the validation output. Parametric identification is the inner loop which returns the best model for a fixed model structure. We will show that the two procedures are intertwined, since the structural identification requires the outcome of the parametric step in order to assess the goodness of a class.
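The two nested loops can be sketched in R as follows. This is only an illustrative scheme under simplifying assumptions (polynomial degree as the model structure, least squares as parametric identification, a single held-out set as the validation phase); it is not the procedure prescribed by the handbook.

# Illustrative sketch of the two nested loops: structural identification (outer)
# and parametric identification (inner), with a held-out validation set.
set.seed(1)
N <- 30
x <- runif(N, -2, 2)
y <- sin(2 * x) + rnorm(N, sd = 0.2)      # hypothetical phenomenon

idx.tr  <- 1:20                           # training portion
idx.val <- 21:30                          # validation portion

val.error <- numeric(4)
for (degree in 1:4) {                     # outer loop: candidate model structures
  # inner loop: parametric identification for the fixed structure
  fit  <- lm(y ~ poly(x, degree),
             data = data.frame(x = x[idx.tr], y = y[idx.tr]))
  pred <- predict(fit, newdata = data.frame(x = x[idx.val]))
  val.error[degree] <- mean((y[idx.val] - pred)^2)   # validation phase
}
best.degree <- which.min(val.error)       # selection phase
best.degree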

    Statistical machine learning

On the basis of the previous section we could argue that learning is nothing more than a standard problem of optimization. Unfortunately, reality is far more complex. In fact, because of the finite amount of data and their random nature, there exists a strong correlation between the parametric and structural identification steps, which makes non-trivial the problem of assessing and, finally, choosing the prediction model. In fact, the random nature of the data demands a definition of the problem in stochastic terms and the adoption of statistical procedures to choose and assess the quality of a prediction model. In this context a challenging issue is how to determine the class of models most appropriate to our problem. Since the result of a learning procedure is found to be sensitive to the class of models chosen to fit the data, statisticians and machine learning researchers have proposed over the years a number of machine learning algorithms. Well-known examples are linear models, neural networks, local modelling techniques, support vector machines and regression trees. The aim of such learning algorithms, many of which are presented in this book, is to combine high generalization with an effective learning procedure.

However, the ambition of this handbook is to present machine learning as a scientific domain which goes beyond the mere collection of computational procedures. Since machine learning is deeply rooted in conventional statistics, any introduction to this topic must include some introductory chapters on the foundations of probability, statistics and estimation theory. At the same time, we intend to show that machine learning widens the scope of conventional statistics by focusing on a number of topics often overlooked by the statistical literature, like nonlinearity, large dimensionality, adaptivity, optimization and analysis of massive datasets.

This manuscript aims to find a good balance between theory and practice by situating most of the theoretical notions in a real context with the help of practical examples and real datasets. All the examples will be implemented in the statistical programming language R [98]. For an introduction to R we refer the reader to [34, 113]. This practical connotation is particularly important since machine learning techniques are nowadays more and more embedded in plenty of technological domains, like bioinformatics, robotics, intelligent control, speech and image recognition, multimedia, web and data mining, computational finance, business intelligence.

    Outline

The outline of the book is as follows. Chapter 2 summarizes the relevant background material in probability. Chapter 3 introduces the classical approach to parametric estimation and hypothesis testing.

Chapter 4 presents some nonparametric alternatives to the parametric techniques discussed in Chapter 3.

Chapter 5 introduces supervised learning as the statistical problem of assessing and selecting a hypothesis function on the basis of input/output observations.

Chapter 6 reviews the steps which lead from raw observations to a final model. This is a methodological chapter which introduces some algorithmic procedures underlying most machine learning techniques.

Chapter 7 presents conventional linear approaches to regression and classification.

Chapter 8 introduces some machine learning techniques which deal with nonlinear regression and classification tasks.

Chapter 9 presents the model averaging approach, a recent and powerful way for obtaining improved generalization accuracy by combining several learning machines.

Although the book focuses on supervised learning, some related notions of unsupervised learning and density estimation are given in Appendix A.

    1.1 Notations

Throughout this manuscript, boldface denotes random variables and normal font is used for instances (realizations) of random variables. Strictly speaking, one should always distinguish in notation between a random variable and its realization. However, we will adopt this extra notational burden only when the meaning is not clear from the context.

As far as variables are concerned, lowercase letters denote scalars or vectors of observables, Greek letters denote parameter vectors and uppercase denotes matrices. Uppercase in italics denotes generic sets, while uppercase Greek letters denote sets of parameters.

    Generic notation

θ: Parameter vector.
θ (boldface): Random parameter vector.
M: Matrix.
[N × n]: Dimensionality of a matrix with N rows and n columns.
M^T: Transpose of the matrix M.
diag[m1, . . . , mN]: Diagonal matrix with diagonal [m1, . . . , mN].
M (boldface): Random matrix.
θ̂: Estimate of θ.
θ̂ (boldface): Estimator of θ.
τ: Index in an iterative algorithm.


    Probability Theory notation

Ω: Set of possible outcomes.
ω: Outcome (or elementary event).
{E}: Set of possible events.
E: Event.
Prob{E}: Probability of the event E.
(Ω, {E}, Prob{·}): Probabilistic model of an experiment.
Z: Domain of the random variable z.
P(z): Probability distribution of a discrete random variable z. Also Pz(z).
F(z) = Prob{z ≤ z}: Distribution function of a continuous random variable z. Also Fz(z).
p(z): Probability density of a continuous r.v. Also pz(z).
E[z]: Expected value of the random variable z.
E_x[z] = ∫_X z(x, y) p(x) dx: Expected value of the random variable z averaged over x.
Var[z]: Variance of the random variable z.
L(θ): Likelihood of a parameter θ.
l(θ): Log-likelihood of a parameter θ.
l_emp(θ): Empirical log-likelihood of a parameter θ.

    Learning Theory notation

x: Multidimensional input variable.
X ⊂ R^n: Input space.
y: Multidimensional output variable.
Y: Output space.
xi: ith realization of the random vector x.
f(x): Target regression function.
w: Random noise variable.
zi = ⟨xi, yi⟩: Input-output sample: ith case in the training set.
N: Number of available samples.
DN = {z1, z2, . . . , zN}: Training set.
Λ: Class of hypothesis.
α: Hypothesis parameter vector.
h(x, α): Hypothesis function.
Λs: Hypothesis class of complexity s.
L(y, f(x, α)): Loss function.
R(α): Functional risk.
α0: arg min R(α).
Remp(α): Empirical functional risk.
αN: arg min Remp(α).
GN: Mean integrated squared error (MISE).
l: Number of folds in cross-validation.
Ĝcv: Cross-validation estimate of GN.
Ĝloo: Leave-one-out estimate of GN.
Ntr: Number of samples used for training in cross-validation.
Nts: Number of samples used for test in cross-validation.
α^i_Ntr, i = 1, . . . , l: Parameter which minimizes the empirical risk of D_Ntr.
D(i): Training set with the ith sample set aside.
α_N(i): Parameter which minimizes the empirical risk of D(i).
Ĝboot: Bootstrap estimate of GN.
D*_N: Bootstrap training set of size N generated by DN with replacement.
Nb: Number of bootstrap samples.


    Data analysis notation

xij: jth element of vector xi.
Xij: ijth element of matrix X.
q: Query point (point in the input space where a prediction is required).
ŷq: Prediction in the query point.
ŷi^(-j): Leave-one-out prediction in xi with the jth sample set aside.
e_loo(j) = yj - ŷj^(-j): Leave-one-out error with the jth sample set aside.
K(·): Kernel function.
B: Bandwidth.
β: Linear coefficients vector.
β̂: Least-squares parameters vector.
β̂^(-j): Least-squares parameters vector with the jth sample set aside.
hj(x, αj): jth, j = 1, . . . , m, local model in a modular architecture.
Activation or basis function (indexed by j).
Set of parameters of the activation function (indexed by j).


    Chapter 2

    Foundations of probability

Probability theory is the discipline concerned with the study of uncertain (or random) phenomena, and probability is the mathematical language adopted for quantifying uncertainty. Such phenomena, although not predictable in a deterministic fashion, may present some regularities and consequently be described mathematically by idealized probabilistic models. These models consist of a list of all possible outcomes together with the respective probabilities. The theory of probability makes it possible to infer from these models the patterns of future behaviour.

This chapter presents the basic notions of probability which serve as a necessary background to understand the statistical aspects of machine learning. We ask the reader to become acquainted with two aspects: the notion of random variable as a compact representation of uncertain knowledge and the use of probability as an effective formal tool to manipulate and process uncertain information. In particular, we suggest the reader give special attention to the notions of conditional and joint probability. As we will see in the following, these two related notions are extensively used by statistical modelling and machine learning to define the dependence and the relationships between random variables.

    2.1 The random model of uncertainty

We define a random experiment as any action or process which generates results or observations which cannot be predicted with certainty. Uncertainty stems from the existence of alternatives. In other words, each uncertain phenomenon is characterized by a multiplicity of possible configurations or outcomes. Weather is uncertain since it can take multiple forms (e.g. sunny, rainy, cloudy, ...). Other examples of random experiments are tossing a coin, rolling dice, or measuring the time to reach home.

A random experiment is then characterized by a sample space Ω, that is, a (finite or infinite) set of all the possible outcomes (or configurations) of the experiment. The elements ω of the set Ω are called experimental outcomes or realizations. For example, in the die experiment, Ω = {ω1, ω2, . . . , ω6} and ωi stands for the outcome corresponding to getting the face with the number i. If ω is the outcome of a measurement of some physical quantity, e.g. pressure, then we could have Ω = R+.

Representing uncertainty is a modelling activity and as such it is not unique. In other terms, the sample space of a random experiment is not necessarily unique. In the die experiment, we could define an alternative sample space made of two sole outcomes: numbers equal to and different from 1. Also, we could be interested in representing the uncertainty of two consecutive tosses. In that case the outcome would be the pair (ωi, ωj).


Uncertainty stems from variability. Each time we observe a random phenomenon, we may observe different configurations. In the probability jargon, observing a random phenomenon is interpreted as the realization of a random experiment. A single performance of a random experiment is called a trial. This means that after each trial we observe one outcome ωi ∈ Ω.

A subset of experimental outcomes is called an event. Consider a trial which generated the outcome ωi: we say that an event E occurred during the trial if the set E contains the element ωi. For example, in the die experiment, an event is the set of even values E = {ω2, ω4, ω6}. This means that when we observe the outcome ω4 the event "even number" takes place.

An event composed of a single outcome, e.g. E = {ω1}, is called an elementary event.

Note that since events E are subsets, we can apply to them the terminology of set theory:

Ω refers to the certain event, i.e. the event that occurs in every trial.

The notation E^c = {ω : ω ∉ E} denotes the complement of E.

The notation E1 ∪ E2 = {ω : ω ∈ E1 OR ω ∈ E2} refers to the event that occurs when E1 or E2 or both occur.

The notation E1 ∩ E2 = {ω : ω ∈ E1 AND ω ∈ E2} refers to the event that occurs when both E1 and E2 occur.

Two events E1 and E2 are mutually exclusive or disjoint if E1 ∩ E2 = ∅, that is, each time that E1 occurs, E2 does not occur.

A partition of Ω is a set of disjoint sets Ej, j = 1, . . . , n such that ∪_{j=1}^{n} Ej = Ω.

Given an event E, we define the indicator function of E by

I_E(ω) = 1 if ω ∈ E, 0 if ω ∉ E    (2.1)
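Since events are just subsets of Ω, the set-theoretic operations above can be mirrored directly with R's set functions. The following is a small sketch for the die experiment, where the integer labels 1:6 simply index the outcomes ω1, . . . , ω6 and the chosen events are arbitrary examples.

# Events as subsets of the sample space of the die experiment.
Omega <- 1:6                  # outcomes indexed by the face value
E1 <- c(2, 4, 6)              # event "even number"
E2 <- c(4, 5, 6)              # event "number greater than or equal to 4"

union(E1, E2)                 # E1 union E2
intersect(E1, E2)             # E1 intersection E2
setdiff(Omega, E1)            # complement of E1
length(intersect(E1, c(1, 3, 5))) == 0   # E1 and "odd number" are disjoint

indicator <- function(omega, E) as.integer(omega %in% E)   # indicator function (2.1)
indicator(4, E1)              # returns 1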

Let us consider now the notion of class of events. An arbitrary collection of subsets of Ω is not a class of events. We require that if E1 and E2 are events, the same holds also for the intersection E1 ∩ E2 and the union E1 ∪ E2. A set of events that satisfies these conditions is called, in mathematical terms, a Borel field [89]. We will consider only Borel fields, since we want to deal not only with the probabilities of single events, but also with the probabilities of their unions and intersections.


    2.1.1 Axiomatic definition of probability

Probability is a measure of uncertainty. Once a random experiment is defined, we call probability of the event E the real number Prob{E} ∈ [0, 1] assigned to each event E. The function Prob{·} : {E} → [0, 1] is called probability measure or probability distribution and must satisfy the following three axioms:

1. Prob{E} ≥ 0 for any E.

2. Prob{Ω} = 1.

3. Prob{E1 + E2} = Prob{E1} + Prob{E2} if E1 and E2 are mutually exclusive.

These conditions are known as the axioms of the theory of probability [74]. The first axiom states that all the probabilities are nonnegative real numbers. The second axiom attributes a probability of unity to the universal event Ω, thus providing a normalization of the probability measure. The third axiom states that the probability function must be additive, consistently with the intuitive idea of how probabilities behave.

All probabilistic results are based directly or indirectly on the axioms and only the axioms, for instance

E1 ⊂ E2 ⇒ Prob{E1} ≤ Prob{E2}    (2.2)

Prob{E^c} = 1 - Prob{E}    (2.3)
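As an illustration of how such results follow from the axioms alone, here is a one-line derivation of (2.3) (written in LaTeX notation), using the fact that E and E^c are mutually exclusive and that their union is the certain event Ω:

\[
1 = \operatorname{Prob}\{\Omega\} = \operatorname{Prob}\{E + E^{c}\}
  = \operatorname{Prob}\{E\} + \operatorname{Prob}\{E^{c}\}
  \;\Rightarrow\;
  \operatorname{Prob}\{E^{c}\} = 1 - \operatorname{Prob}\{E\}.
\]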

There are many interpretations and justifications of these axioms and we discuss briefly the frequentist and the Bayesian interpretations in Section 2.1.3. What is relevant here is that the probability function is a formalization of uncertainty and that most of its properties and results appear to be coherent with the human perception of uncertainty [67].

However, a major question still remains open: how do we compute the probability value Prob{E} for a generic event E? The assignment of probabilities is perhaps the most difficult aspect of constructing probabilistic models. Although the theory of probability is neutral, that is, it can make inferences regardless of the probability assignments, its results will be strongly affected by the choice of a particular assignment. This means that, if the assignments are inaccurate, the predictions of the model will be misleading and will not reflect the real behaviour of the modelled phenomenon. In the following sections we present some procedures which are typically adopted in practice.

    2.1.2 Symmetrical definition of probability

Consider a random experiment where the sample space is made of M symmetric outcomes (i.e., they are equally likely to occur). Let the number of outcomes which are favourable to the event E (i.e. if they occur then the event E takes place) be ME.

An intuitive definition of the probability (also known as the classical definition) of the event E that adheres to the axioms is

Prob{E} = ME / M    (2.4)

In other words, the probability of an event equals the ratio of its favourable outcomes to the total number of outcomes, provided that all outcomes are equally likely [89].

This is typically the approach adopted when the sample space is made of the six faces of a fair die. Also, in most cases the symmetric hypothesis is accepted as self-evident: if a ball is selected at random from a bowl containing W white balls and B black balls, the probability that we select a white one is W/(W + B). Note that this number is determined without any experimentation and is based on symmetrical assumptions. But how can we be sure that the symmetrical hypothesis holds? Think, for instance, of the probability that a newborn is a boy. Is this a symmetrical case? Moreover, how would one define the probability of an event if the symmetrical hypothesis does not hold?

    2.1.3 Frequentist definition of probability

Let us consider a random experiment and an event E. Suppose we repeat the experiment N times and that we record the number of times NE that the event E occurs. The quantity

NE / N    (2.5)

comprised between 0 and 1 is known as the relative frequency of E. It can be observed that if the experiment is carried out a large number of times under exactly the same conditions, the frequency converges to a fixed value for increasing N. This observation led von Mises to use the notion of frequency as a foundation for the notion of probability.

Definition 1 (von Mises). The probability Prob{E} of an event E is the limit

Prob{E} = lim_{N→∞} NE / N

where N is the number of observations and NE is the number of times that E occurred.

This definition appears reasonable and it is compatible with the axioms in Section 2.1.1. However, in practice, in any physical experiment the number N is finite and the limit has to be accepted as a hypothesis, not as a number that can be determined experimentally [89]. Notwithstanding, the frequentist interpretation is very important to show the links between theory and application. At the same time, it appears inadequate to represent probability when it is used to model a degree of belief. Think, for instance, of the probability that your professor wins a Nobel Prize: how would one define a number N of repetitions?

An important alternative interpretation of the probability measure comes then from the Bayesian approach. This approach proposes a degree-of-belief interpretation of probability, according to which Prob{E} measures an observer's strength of belief that E is or will be true. This manuscript will not cover the Bayesian approach to statistics and data analysis. Interested readers are referred to [51].

2.1.4 The Law of Large Numbers

A well-known justification of the frequentist approach is provided by the Weak Law of Large Numbers, proposed by Bernoulli.

Theorem 1. Let Prob{E} = p and suppose that the event E occurs NE times in N trials. Then NE/N converges to p in probability, that is, for any ε > 0,

Prob{|NE/N - p| ≤ ε} → 1  as N → ∞

According to this theorem, the ratio NE/N is close to p in the sense that, for any ε > 0, the probability that |NE/N - p| ≤ ε tends to 1 as N → ∞. This result justifies the widespread use of Monte Carlo simulation to solve probability problems numerically.
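A minimal R sketch of such a Monte Carlo computation (the two-dice example is invented here purely for illustration): estimating by simulation the probability that the sum of two fair dice equals 7, whose exact value is 6/36 = 1/6.

# Monte Carlo estimate of Prob{sum of two fair dice = 7}.
set.seed(0)
N <- 1e5
d1 <- sample(1:6, N, replace = TRUE)
d2 <- sample(1:6, N, replace = TRUE)
N.E <- sum(d1 + d2 == 7)      # number of trials in which the event occurred
N.E / N                       # relative frequency, close to 1/6 for large N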


Figure 2.1: Fair coin tossing random experiment: evolution of the relative frequency (left) and of the absolute difference between the number of heads and tails (R script freq.R).

Note that this does NOT mean that NE will be close to Np. In fact,

Prob{NE = Np} ≈ 1 / √(2πNp(1 - p)) → 0,  as N → ∞    (2.6)

For instance, in a fair coin-tossing game this law does not imply that the absolute difference between the number of heads and tails should oscillate close to zero [109] (Figure 2.1). On the contrary, it could happen that the absolute difference keeps growing (though more slowly than the number of tosses).
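The experiment of Figure 2.1 can be sketched as follows in the spirit of the freq.R script mentioned in the caption; this is a rewritten illustration, not the handbook's own script.

# Fair coin tossing: relative frequency vs absolute difference
# (illustration in the spirit of freq.R, rewritten here).
set.seed(0)
N <- 6e5
tosses <- sample(c(0, 1), N, replace = TRUE)   # 1 = head, 0 = tail
n.heads <- cumsum(tosses)
trials <- 1:N
rel.freq <- n.heads / trials                   # converges to 0.5
abs.diff <- abs(n.heads - (trials - n.heads))  # |#heads - #tails|, may keep growing

par(mfrow = c(1, 2))
plot(trials, rel.freq, type = "l", xlab = "Number of trials",
     ylab = "Relative frequency")
plot(trials, abs.diff, type = "l", xlab = "Number of trials",
     ylab = "Absolute difference (no. heads and tails)")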

    2.1.5 Independence and conditional probability

    Let us introduce the definition of independent events.

Definition 2 (Independent events). Two events E1 and E2 are independent if and only if

Prob{E1 ∩ E2} = Prob{E1} Prob{E2}    (2.7)

and we write E1 ⊥ E2.

Note that the quantity Prob{E1 ∩ E2} is often referred to as the joint probability and denoted by Prob{E1, E2}.

As an example of two independent events, think of two outcomes of a roulette wheel or of two coins tossed simultaneously.

    Exercise

Suppose that a fair die is rolled and that the number x appears. Let E1 be the event that the number x is even, E2 be the event that the number x is greater than or equal to 3, and E3 be the event that the number x is a 4, 5 or 6.

Are the events E1 and E2 independent? Are the events E1 and E3 independent?


    Example

Let E1 and E2 be two disjoint events with positive probability. Can they be independent? The answer is no, since

Prob{E1 ∩ E2} = 0 ≠ Prob{E1} Prob{E2} > 0

Let E1 be an event such that Prob{E1} > 0 and E2 another event. We define the conditional probability of E2 given that E1 has occurred as follows:

Definition 3 (Conditional probability). If Prob{E1} > 0 then the conditional probability of E2 given E1 is

Prob{E2|E1} = Prob{E1 ∩ E2} / Prob{E1}    (2.8)

The following properties hold:

If E1 ⊂ E2 ⇒ Prob{E2|E1} = 1    (2.9)

If E2 ⊂ E1 ⇒ Prob{E2|E1} ≥ Prob{E2}    (2.10)

Note that for any fixed E1, the quantity Prob{·|E1} satisfies the axioms of probability. For instance, if E2, E3 and E4 are disjoint events we have that

Prob{E2 ∪ E3 ∪ E4|E1} = Prob{E2|E1} + Prob{E3|E1} + Prob{E4|E1}

However, this does NOT generally hold for Prob{E1|·}, that is, when we fix the term E1 on the left of the conditional bar. For two disjoint events E2 and E3, in general

Prob{E1|E2 ∪ E3} ≠ Prob{E1|E2} + Prob{E1|E3}

Also, it is generally NOT the case that Prob{E2|E1} = Prob{E1|E2}.

The following result derives from the definition of conditional probability.

Lemma 1. If E1 and E2 are independent events then

Prob{E1|E2} = Prob{E1}    (2.11)
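Definition 2 and Lemma 1 can be checked numerically by exhaustive enumeration of a combined experiment. A small R sketch with two fair dice (an example chosen here for illustration), where E1 depends only on the first die and E2 only on the second:

# Check of (2.7) and Lemma 1 by enumeration of two fair dice.
grid <- expand.grid(d1 = 1:6, d2 = 1:6)   # 36 equally likely outcomes
E1 <- grid$d1 %% 2 == 0                   # first die shows an even number
E2 <- grid$d2 >= 5                        # second die shows 5 or 6

p.E1 <- mean(E1)
p.E2 <- mean(E2)
p.joint <- mean(E1 & E2)

isTRUE(all.equal(p.joint, p.E1 * p.E2))   # independence: (2.7) holds
isTRUE(all.equal(mean(E1[E2]), p.E1))     # Prob{E1|E2} = Prob{E1}, Lemma 1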

    2.1.6 Combined experiments

Note that so far we have assumed that all the events belong to the same sample space. The most interesting use of probability concerns, however, combined random experiments whose sample space

Ω = Ω1 × Ω2 × · · · × Ωn

is the Cartesian product of several spaces Ωi, i = 1, . . . , n. For instance, if we want to study the probabilistic dependence between the height and the weight of a child, we have to define a joint sample space

Ω = {(w, h) : w ∈ Ωw, h ∈ Ωh}

made of all pairs (w, h), where Ωw is the sample space of the random experiment describing the weight and Ωh is the sample space of the random experiment describing the height.

Note that all the properties studied so far hold also for events which do not belong to the same sample space. For instance, given a combined experiment Ω = Ω1 × Ω2, two events E1 ⊂ Ω1 and E2 ⊂ Ω2 are independent if Prob{E1|E2} = Prob{E1}.

Some examples of real problems modelled by random combined experiments are presented in the following.


Gambler's fallacy

Consider a fair coin-tossing game. The outcomes of two consecutive tosses can be considered as independent. Now, suppose that we have just witnessed a sequence of 10 consecutive tails. We could be tempted to think that the chances that the next toss will be a head are now very large. This is known as the gambler's fallacy [109]. In fact, having witnessed a very rare event does not imply that the probability of the next event will change or, rather, that it will be dependent on the past.

Example [115]

Let us consider a medical study about the relationship between the outcome of a medical test and the presence of a disease. We model this study as the combination of two random experiments:

1. the random experiment which models the state of the patient. Its sample space is Ωs = {H, S}, where H and S stand for healthy and sick patient, respectively.

2. the random experiment which models the outcome of the medical test. Its sample space is Ωo = {+, -}, where + and - stand for positive and negative outcome of the test, respectively.

The dependency between the state of the patient and the outcome of the test can be studied in terms of conditional probability.

Suppose that out of 1000 patients, 108 respond positively to the test and that among them 9 turn out to be affected by the disease. Also, among the 892 patients who responded negatively to the test, only 1 is sick. According to the frequentist interpretation, the probabilities of the joint events Prob{Es, Eo} can be approximated according to (2.5) by

            Es = S             Es = H
Eo = +      9/1000 = .009      99/1000 = .099
Eo = -      1/1000 = .001      891/1000 = .891

    Doctors are interested in answering the following questions. What is the proba-bility of having a positive (negative) test outcome when the patient is sick (healthy)?What is the probability of being in front of a sick (healthy) patient when a positive(negative) outcome is obtained? From the definition of conditional probability wederive

Prob {Eo = +|Es = S} = Prob {Eo = +, Es = S} / Prob {Es = S} = .009 / (.009 + .001) = .9

Prob {Eo = −|Es = H} = Prob {Eo = −, Es = H} / Prob {Es = H} = .891 / (.891 + .099) = .9

According to these figures, the test appears to be accurate. Do we have to expect a high probability of being sick when the test is positive? The answer is NO as shown by

Prob {Es = S|Eo = +} = Prob {Eo = +, Es = S} / Prob {Eo = +} = .009 / (.009 + .099) ≈ .08

This example shows that sometimes humans tend to confound Prob {Es|Eo} with Prob {Eo|Es} and that the most intuitive response is not always the right one.
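For readers who like to check the arithmetic numerically, the following short Python sketch (not part of the original text; the dictionary layout and function names are illustrative) reproduces the three conditional probabilities above from the joint table.

    # Minimal sketch: the medical-test joint probabilities indexed by (state, outcome).
    joint = {("S", "+"): 0.009, ("H", "+"): 0.099,
             ("S", "-"): 0.001, ("H", "-"): 0.891}

    def marginal(var, value):
        # Sum the joint probabilities over the other variable.
        idx = 0 if var == "state" else 1
        return sum(p for k, p in joint.items() if k[idx] == value)

    # Sensitivity and specificity of the test are both 0.9 ...
    print(joint[("S", "+")] / marginal("state", "S"))    # Prob{Eo=+|Es=S} = 0.9
    print(joint[("H", "-")] / marginal("state", "H"))    # Prob{Eo=-|Es=H} = 0.9
    # ... but the probability of being sick given a positive outcome stays low.
    print(joint[("S", "+")] / marginal("outcome", "+"))  # Prob{Es=S|Eo=+} ~ 0.083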


    2.1.7 The law of total probability and the Bayes theorem

Let us consider an indeterminate practical situation where a set of events E1, E2, . . . , Ek may occur. Suppose that no two such events may occur simultaneously, but at least one of them must occur. This means that E1, E2, . . . , Ek are mutually exclusive and exhaustive or, in other terms, that they form a partition of Ω. The following two theorems can be shown.

Theorem 2 (Law of total probability). Let Prob {Ei}, i = 1, . . . , k denote the probabilities of the ith event Ei and Prob {E|Ei}, i = 1, . . . , k the conditional probability of a generic event E given that Ei has occurred. It can be shown that

Prob {E} = Σ_{i=1}^{k} Prob {E|Ei} Prob {Ei} = Σ_{i=1}^{k} Prob {E ∩ Ei}     (2.12)

In this case the quantity Prob {E} is referred to as marginal probability.

Theorem 3 (Bayes theorem). The conditional (inverse) probability of any Ei, i = 1, . . . , k given that E has occurred is given by

Prob {Ei|E} = Prob {E|Ei} Prob {Ei} / Σ_{j=1}^{k} Prob {E|Ej} Prob {Ej} = Prob {E, Ei} / Prob {E},     i = 1, . . . , k     (2.13)

    Example

    Suppose that k = 2 and

    E1 is the event: Tomorrow is going to rain.

    E2 is the event: Tomorrow is not going to rain.

E is the event: Tonight is chilly and windy.

The knowledge of Prob {E1}, Prob {E2} and Prob {E|Ek}, k = 1, 2 makes possible the computation of Prob {Ek|E}.
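As a concrete numerical illustration of (2.12) and (2.13), the sketch below uses purely hypothetical figures (the prior and the likelihoods are invented for the example, not taken from the text) to compute the posterior probability of rain given the chilly, windy evening.

    # Hypothetical figures, for illustration only.
    prior = {"rain": 0.3, "no_rain": 0.7}          # Prob{E1}, Prob{E2}
    likelihood = {"rain": 0.8, "no_rain": 0.4}     # Prob{E|Ei}: chilly and windy tonight

    # Law of total probability (2.12): marginal probability of E.
    prob_E = sum(likelihood[k] * prior[k] for k in prior)

    # Bayes theorem (2.13): posterior probability of rain tomorrow given E.
    posterior_rain = likelihood["rain"] * prior["rain"] / prob_E
    print(prob_E, posterior_rain)   # 0.52 and ~0.46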

    2.1.8 Array of joint/marginal probabilities

Let us consider the combination of two random experiments whose sample spaces are ΩA = {A1, . . . , An} and ΩB = {B1, . . . , Bm} respectively. Assume that for each pair of events (Ai, Bj), i = 1, . . . , n, j = 1, . . . , m we know the joint probability value Prob {Ai, Bj}. The joint probability array contains all the necessary information for computing all marginal and conditional probabilities by means of (2.12) and (2.8).

            B1               B2               . . .   Bm               Marginal
A1          Prob {A1, B1}    Prob {A1, B2}    . . .   Prob {A1, Bm}    Prob {A1}
A2          Prob {A2, B1}    Prob {A2, B2}    . . .   Prob {A2, Bm}    Prob {A2}
...         ...              ...              . . .   ...              ...
An          Prob {An, B1}    Prob {An, B2}    . . .   Prob {An, Bm}    Prob {An}
Marginal    Prob {B1}        Prob {B2}        . . .   Prob {Bm}        Sum = 1

where Prob {Ai} = Σ_{j=1,...,m} Prob {Ai, Bj} and Prob {Bj} = Σ_{i=1,...,n} Prob {Ai, Bj}. Using an entry of the joint probability matrix and the sum of the corresponding row/column, we may use (2.8) to compute the conditional probability as shown in the following example.


    Example: dependent/independent variables

Let us model the commute time to go back home for an ULB student living in St. Gilles as a random experiment. Suppose that its sample space is Ω = {LOW, MEDIUM, HIGH}. Consider also an (extremely :-) random experiment representing the weather in Brussels, whose sample space is Ω = {G=GOOD, B=BAD}. Suppose that the array of joint probabilities is

            G (in Bxl)       B (in Bxl)       Marginal
LOW         0.15             0.05             Prob {LOW} = 0.2
MEDIUM      0.1              0.4              Prob {MEDIUM} = 0.5
HIGH        0.05             0.25             Prob {HIGH} = 0.3
Marginal    Prob {G} = 0.3   Prob {B} = 0.7   Sum = 1

According to the above probability function, is the commute time dependent on the weather in Bxl? Note that if the weather is good

             LOW               MEDIUM            HIGH
Prob {·|G}   0.15/0.3 = 0.5    0.1/0.3 ≈ 0.33    0.05/0.3 ≈ 0.17

Else if the weather is bad

             LOW               MEDIUM            HIGH
Prob {·|B}   0.05/0.7 ≈ 0.07   0.4/0.7 ≈ 0.57    0.25/0.7 ≈ 0.36

Since Prob {·|G} ≠ Prob {·|B}, i.e. the probability of having a certain commute time changes according to the value of the weather, the relation (2.11) is not satisfied.

Consider now the dependency between an event representing the commute time and an event describing the weather in Rome.

            G (in Rome)      B (in Rome)      Marginal
LOW         0.18             0.02             Prob {LOW} = 0.2
MEDIUM      0.45             0.05             Prob {MEDIUM} = 0.5
HIGH        0.27             0.03             Prob {HIGH} = 0.3
Marginal    Prob {G} = 0.9   Prob {B} = 0.1   Sum = 1

Our question now is: is the commute time dependent on the weather in Rome? If the weather in Rome is good we obtain

             LOW               MEDIUM            HIGH
Prob {·|G}   0.18/0.9 = 0.2    0.45/0.9 = 0.5    0.27/0.9 = 0.3

while if the weather in Rome is bad

             LOW               MEDIUM            HIGH
Prob {·|B}   0.02/0.1 = 0.2    0.05/0.1 = 0.5    0.03/0.1 = 0.3

Note that the probability of a commute time event does NOT change according to the value of the weather in Rome, e.g. Prob {LOW|B} = Prob {LOW}.

Try to answer now the following question: if you would like to predict the commute time in Brussels, which event would return more information on it: the weather in Rome or in Brussels?
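A compact way to carry out this check in code is sketched below (a hypothetical helper of my own, not from the handbook): it compares every conditional probability of the joint array with the corresponding marginal, as required by (2.11).

    # Joint probability arrays: rows = commute time, columns = weather.
    brussels = {"LOW": {"G": 0.15, "B": 0.05},
                "MEDIUM": {"G": 0.10, "B": 0.40},
                "HIGH": {"G": 0.05, "B": 0.25}}
    rome = {"LOW": {"G": 0.18, "B": 0.02},
            "MEDIUM": {"G": 0.45, "B": 0.05},
            "HIGH": {"G": 0.27, "B": 0.03}}

    def independent(joint, tol=1e-9):
        # Marginals obtained by summing rows and columns.
        p_time = {t: sum(cols.values()) for t, cols in joint.items()}
        weather = next(iter(joint.values())).keys()
        p_weather = {w: sum(joint[t][w] for t in joint) for w in weather}
        # Independence (2.11): Prob{t|w} must equal Prob{t} for every t and w.
        return all(abs(joint[t][w] / p_weather[w] - p_time[t]) < tol
                   for t in joint for w in weather)

    print(independent(brussels))  # False: commute time depends on the Bxl weather
    print(independent(rome))      # True: commute time independent of the Rome weather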

    Example: Marginal/conditional probability function

Consider a probabilistic model of the day's weather based on the combination of the following random descriptors where

1. the first represents the sky condition and its sample space is Ω = {CLEAR, CLOUDY}.


2. the second represents the barometer trend and its sample space is Ω = {RISING, FALLING}.

3. the third represents the humidity in the afternoon and its sample space is Ω = {DRY, WET}.

    Let the joint probability values be given by the table

E1        E2         E3     P(E1, E2, E3)
CLEAR     RISING     DRY    0.40
CLEAR     RISING     WET    0.07
CLEAR     FALLING    DRY    0.08
CLEAR     FALLING    WET    0.10
CLOUDY    RISING     DRY    0.09
CLOUDY    RISING     WET    0.11
CLOUDY    FALLING    DRY    0.03
CLOUDY    FALLING    WET    0.12

From the joint values we can calculate the probabilities P(CLEAR, RISING) = 0.47 and P(CLOUDY) = 0.35 and the conditional probability value

P(DRY|CLEAR, RISING) = P(DRY, CLEAR, RISING) / P(CLEAR, RISING) = 0.40 / 0.47 ≈ 0.85
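The same marginalisation and conditioning can be done mechanically; the sketch below (illustrative code, with the table hard-coded as a list of tuples) reproduces the three figures above.

    # Joint probability table of the three weather descriptors.
    joint = [("CLEAR", "RISING", "DRY", 0.40), ("CLEAR", "RISING", "WET", 0.07),
             ("CLEAR", "FALLING", "DRY", 0.08), ("CLEAR", "FALLING", "WET", 0.10),
             ("CLOUDY", "RISING", "DRY", 0.09), ("CLOUDY", "RISING", "WET", 0.11),
             ("CLOUDY", "FALLING", "DRY", 0.03), ("CLOUDY", "FALLING", "WET", 0.12)]

    # Marginals: sum the joint values over the unspecified descriptors.
    p_clear_rising = sum(p for e1, e2, e3, p in joint if e1 == "CLEAR" and e2 == "RISING")
    p_cloudy = sum(p for e1, e2, e3, p in joint if e1 == "CLOUDY")

    # Conditional: joint value divided by the marginal of the conditioning event.
    p_dry_clear_rising = sum(p for e1, e2, e3, p in joint
                             if (e1, e2, e3) == ("CLEAR", "RISING", "DRY"))
    print(p_clear_rising, p_cloudy, p_dry_clear_rising / p_clear_rising)
    # 0.47  0.35  ~0.85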

    2.2 Random variables

Machine learning and statistics are concerned with data. What then is the link between the notion of random experiment and data? The answer is provided by the concept of random variable.

Consider a random experiment and the associated triple (Ω, {E}, Prob {·}). Suppose that we have a mapping rule Ω → Z ⊂ R such that we can associate with each experimental outcome ω a real value z(ω) in the domain Z. We say that z is the value taken by the random variable z when the outcome of the random experiment is ω. Henceforth, in order to clarify the distinction between a random variable and its value, we will use the boldface notation for denoting a random variable (as in z) and the normal face notation for the eventually observed value (as in z = 11).

Since there is a probability associated to each event E and we have a mapping from events to real values, a probability distribution can be associated to z.

Definition 4 (Random variable). Given a random experiment (Ω, {E}, Prob {·}), a random variable z is the result of a mapping Ω → Z that assigns a number z to every outcome ω ∈ Ω. This mapping must satisfy the following two conditions:

• the set {z ≤ z} is an event for every z.

• the probabilities

Prob {z = ∞} = 0,     Prob {z = −∞} = 0

Given a random variable z ∈ Z and a subset I ⊂ Z we define the inverse mapping

z⁻¹(I) = {ω ∈ Ω | z(ω) ∈ I}     (2.14)


where z⁻¹(I) ∈ {E} is an event. On the basis of the above relation we can associate a probability measure to z according to

Prob {z ∈ I} = Prob {z⁻¹(I)} = Prob {ω ∈ Ω | z(ω) ∈ I}     (2.15)
Prob {z = z} = Prob {z⁻¹(z)} = Prob {ω ∈ Ω | z(ω) = z}     (2.16)

In other words, a random variable is a numerical quantity, linked to some experiment involving some degree of randomness, which takes its value from some set Z of possible real values. For example the experiment might be the rolling of two six-sided dice and the r.v. z might be the sum (or the maximum) of the two numbers showing on the dice. In this case, the set of possible values is Z = {2, . . . , 12} (or Z = {1, . . . , 6}).

    Example

Suppose that we want to decide when to leave ULB to go home to watch our football team playing the Champions League final match. In order to take a decision, a quantity of interest is the (random) commute time z for getting from ULB to home. Our personal experience is that this time is a positive number which is not constant: for example z1 = 10 minutes, z2 = 23 minutes, z3 = 17 minutes, where zi is the time taken on the ith day of the week. The variability of this quantity is related to a complex random process with a large sample space Ω (depending for example on the weather condition, the weekday, the sport events in town, and so on). The probabilistic approach proposes to use a random variable to represent this uncertainty and to assume each measure zi to be the realization of a random outcome ωi and the result of a mapping zi = z(ωi). The use of a random variable z to represent our commute time becomes then a compact (and approximate) way of modelling the disparate set of causes underlying the uncertainty of this phenomenon. For example, this representation will allow us to compute when to leave ULB if we want the probability of missing the beginning of the match to be less than 5 percent.

    2.3 Discrete random variables

The probability (mass) function of a discrete r.v. z is the combination of

1. the discrete set Z of values that the r.v. can take (also called range),

2. the set of probabilities associated to each value of Z.

This means that we can attach to the random variable some specific mathematical function Pz(z) that gives for each z ∈ Z the probability that z assumes the value z

Pz(z) = Prob {z = z}     (2.17)

This function is called probability function or probability mass function.

As depicted in the following example, the probability function can be tabulated for a few sample values of z. If we toss a coin twice, and the random variable z is the number of heads that eventually turn up, the probability function can be tabulated as follows

Values of the random variable z     0       1       2
Associated probabilities            0.25    0.50    0.25


    2.3.1 Parametric probability function

Sometimes the probability function is not precisely known but can be expressed as a function of z and a quantity θ. An example is the discrete r.v. z that takes its value from Z = {1, 2, 3} and whose probability function is

Pz(z) = θ^(2z) / (θ² + θ⁴ + θ⁶)

where θ is some fixed nonzero real number.

Whatever the value of θ, Pz(z) > 0 for z = 1, 2, 3 and Pz(1) + Pz(2) + Pz(3) = 1.

Therefore z is a well-defined random variable, even if the value of θ is unknown. We call θ a parameter, that is some constant, usually unknown, involved in the analytical expression of a probability function. We will see in the following that the parametric form is a convenient way to formalize a family of probabilistic models and that the problem of estimation can be seen as a parameter identification task.
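Under the reconstruction of the formula above (the exact analytical form is recovered from a garbled source, so treat it as an assumption), a quick numerical check that the function is a valid probability function for any nonzero θ might look as follows.

    # Parametric probability function on Z = {1, 2, 3} (reconstructed form, see text).
    def pmf(z, theta):
        norm = theta**2 + theta**4 + theta**6
        return theta**(2 * z) / norm

    for theta in (0.5, 1.0, 2.0, -3.0):
        probs = [pmf(z, theta) for z in (1, 2, 3)]
        # All values are positive and they sum to one, whatever the nonzero theta.
        print(theta, probs, sum(probs))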

    2.3.2 Expected value, variance and standard deviation of adiscrete r.v.

Though the probability function Pz provides a complete description of the uncertainty of z, it is often not practical to use since it demands the definition of a set of values equal to the size of Z. Therefore, it is often more convenient to deal with some compact representation of Pz obtained by computing a functional (i.e. a function of a function) of Pz.

The most common one-number summary of the distribution Pz is the expected value, which is a measure of central tendency.

Definition 5 (Expected value). The expected value of a discrete random variable z is defined to be

E[z] = μ = Σ_{z∈Z} z Pz(z)     (2.18)

assuming that the sum is well-defined.

Note that the expected value is not necessarily a value that belongs to the domain Z of the random variable. It is also important to remark that while the term mean is used as a synonym of expected value, this is not the case for the term average. We will add more on this difference in Section 3.3.2.

The concept of expected value was first introduced in the 17th century by C. Huygens in order to study games of chance.

    Example [109]

Let us consider a European roulette player who places a 1$ bet on a single number, where the roulette uses the numbers 0, 1, . . . , 36 and the number 0 is considered as winning for the house. The gain of the player is a random variable z whose sample space is Z = {−1, 35}. In other words only two outcomes are possible: either the player wins z1 = −1$ (i.e. loses 1$) with probability p1 = 36/37 or he wins z2 = 35$ with probability p2 = 1/37. The expected gain is then

E[z] = p1 z1 + p2 z2 = p1 · (−1) + p2 · 35 = −36/37 + 35/37 = −1/37 ≈ −0.027

In other words, while casinos gain on average 2.7 cents for every staked dollar, players on average are giving away 2.7 cents (however sophisticated their betting strategy is).
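A quick Monte Carlo sanity check of this figure (illustrative code, not part of the text) is to simulate a large number of single-number bets and average the gains; the empirical average should hover around −1/37.

    import random

    # Simulate N single-number bets on a European roulette (numbers 0..36).
    random.seed(0)
    N = 1_000_000
    gains = 0
    for _ in range(N):
        spin = random.randint(0, 36)        # outcome of the wheel
        gains += 35 if spin == 17 else -1   # win 35$ on our number, lose 1$ otherwise
    print(gains / N, -1 / 37)               # both close to -0.027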


Figure 2.2: Two discrete probability functions with the same mean and different variance.

A common way to summarize the spread of a distribution is provided by the variance.

Definition 6 (Variance). The variance of a discrete random variable z is defined as

Var [z] = σ² = E[(z − E[z])²] = Σ_{z∈Z} (z − E[z])² Pz(z)     (2.19)

    The variance is a measure of the dispersion of the probability function of the

    random variable around its mean. Note that the following relation holds

σ² = E[(z − E[z])²] = E[z² − 2zE[z] + (E[z])²] = E[z²] − (E[z])² = E[z²] − μ²     (2.20)

whatever the probability function of z is. Figure 2.2 illustrates two example discrete r.v. probability functions which have the same mean but different variance.

Note that an alternative measure of spread could be represented by E[|z − μ|]. However this quantity is much more difficult to manipulate analytically than the variance.

The variance Var [z] does not have the same dimension as the values of z. For instance if z is measured in m, Var [z] is expressed in m². A measure for the spread that has the same dimension as the r.v. z is the standard deviation.

Definition 7 (Standard deviation). The standard deviation of a discrete random variable z is defined as the positive square root of the variance:

Std [z] = √(Var [z]) = σ

    Example

Let us consider a random variable z ∈ Z = {0, 1} where Pz(1) = p and Pz(0) = 1 − p. In this case

E[z] = p · 1 + 0 · (1 − p) = p     (2.21)
E[z²] = p · 1 + 0 · (1 − p) = p     (2.22)
Var [z] = E[z²] − (E[z])² = p − p² = p(1 − p)     (2.23)
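These summaries are easy to compute for any tabulated probability function; the helper functions below are a minimal sketch (names and layout are my own) that reproduces (2.21)-(2.23) for p = 0.3.

    # Expected value, variance and standard deviation of a tabulated discrete pmf.
    def expectation(pmf):
        return sum(z * p for z, p in pmf.items())

    def variance(pmf):
        mu = expectation(pmf)
        return sum((z - mu)**2 * p for z, p in pmf.items())

    p = 0.3
    bernoulli = {0: 1 - p, 1: p}
    print(expectation(bernoulli), variance(bernoulli), variance(bernoulli)**0.5)
    # 0.3, 0.21 = p(1-p), and its square root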


Figure 2.3: A discrete probability function with positive skewness (left) and one with a negative skewness (right).

    2.3.3 Moments of a discrete r.v.

Definition 8 (Moment). For any positive integer r, the rth moment of the probability function is

μr = E[z^r] = Σ_{z∈Z} z^r Pz(z)     (2.24)

Note that the first moment coincides with the mean μ, while the second moment is related to the variance according to Equation (2.20). Higher-order moments provide additional information, other than the mean and the spread, about the shape of the probability function.

Definition 9 (Skewness). The skewness of a discrete random variable z is defined as

γ = E[(z − μ)³] / σ³     (2.25)

Skewness is a parameter that describes asymmetry in a random variable's probability function. Probability functions with positive skewness have long tails to the right, and functions with negative skewness have long tails to the left (Figure 2.3).
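Continuing the tabulated-pmf helpers above, a sketch of the third-moment computation (illustrative code, not from the handbook) shows how the sign of the skewness reflects which tail is longer.

    # Skewness of a tabulated discrete pmf, following (2.25).
    def skewness(pmf):
        mu = sum(z * p for z, p in pmf.items())
        var = sum((z - mu)**2 * p for z, p in pmf.items())
        return sum((z - mu)**3 * p for z, p in pmf.items()) / var**1.5

    right_tailed = {0: 0.7, 1: 0.2, 5: 0.1}    # long tail to the right
    left_tailed = {-5: 0.1, -1: 0.2, 0: 0.7}   # long tail to the left
    print(skewness(right_tailed) > 0, skewness(left_tailed) < 0)  # True True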

2.3.4 Entropy and relative entropy

Definition 10 (Entropy). Given a discrete r.v. z, the entropy of the probability function Pz(z) is defined by

H(z) = − Σ_{z∈Z} Pz(z) log Pz(z)

H(z) is a measure of the unpredictability of the r.v. z. Suppose that there are M possible values for the r.v. z. The entropy is maximized (and takes the value log M) if Pz(z) = 1/M for all z. It is minimized iff P(z) = 1 for one value of z while all the others are zero.

Although entropy, like variance, measures the uncertainty of a r.v., it differs from the variance since it depends only on the probabilities of the different values


and not on the values themselves. In other terms, H can be thought of as a function of the probability function Pz rather than of z.

Let us now consider two different discrete probability functions on the same set of values

P0 = Pz0(z),     P1 = Pz1(z)

where P0(z) > 0 if and only if P1(z) > 0. The relative entropies (or the Kullback-Leibler divergences) associated with these two functions are

H(P0||P1) = Σ_z P0(z) log (P0(z)/P1(z)),     H(P1||P0) = Σ_z P1(z) log (P1(z)/P0(z))

These asymmetric quantities measure the dissimilarity between the two functions. A symmetric formulation of the dissimilarity is provided by the divergence quantity

J(P0, P1) = H(P0||P1) + H(P1||P0).
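The sketch below (illustrative helper functions of my own, natural logarithm assumed) computes the entropy of a tabulated pmf and the two Kullback-Leibler divergences, together with the symmetric J divergence.

    import math

    def entropy(p):
        # H(z) = -sum_z P(z) log P(z); terms with P(z) = 0 contribute 0.
        return -sum(pi * math.log(pi) for pi in p.values() if pi > 0)

    def kl(p, q):
        # H(P||Q) = sum_z P(z) log(P(z)/Q(z)); requires q(z) > 0 wherever p(z) > 0.
        return sum(p[z] * math.log(p[z] / q[z]) for z in p if p[z] > 0)

    P0 = {0: 0.5, 1: 0.5}
    P1 = {0: 0.9, 1: 0.1}
    print(entropy(P0), entropy(P1))    # log 2 ~ 0.693 versus a smaller value
    print(kl(P0, P1), kl(P1, P0))      # the two relative entropies are not equal
    print(kl(P0, P1) + kl(P1, P0))     # symmetric divergence J(P0, P1)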

    2.4 Continuous random variable

An r.v. z is said to be a continuous random variable if it can assume any of the infinite values within a range of real numbers. The following quantities can be defined:

Definition 11 (Cumulative distribution function). The (cumulative) distribution function of z is the function

Fz(z) = Prob {z ≤ z}     (2.26)

This function satisfies the following two conditions:

1. it is non-decreasing: z1 < z2 implies Fz(z1) ≤ Fz(z2).

2. it is normalized, i.e.

lim_{z→−∞} Fz(z) = 0,     lim_{z→+∞} Fz(z) = 1

Definition 12 (Density function). The density function of a real random variable z is the derivative of the distribution function

pz(z) = dFz(z)/dz     (2.27)

at all points where Fz(·) is differentiable.

Probabilities of continuous r.v. are not allocated to specific values but rather to intervals of values. Specifically

Prob {a < z < b} = ∫_a^b pz(z) dz,     ∫_Z pz(z) dz = 1

Some considerations about continuous r.v. are worth mentioning:

• the quantity Prob {z = z} = 0 for all z;

• the quantity pz(z) can be bigger than one and also unbounded;

• two r.v. z and x are equal in distribution if Fz(z) = Fx(z) for all z.


Figure 2.4: Cumulative distribution function and upper critical point.

    2.4.1 Mean, variance, moments of a continuous r.v.

Consider a continuous scalar r.v. whose range is Z = (l, h) and whose density function is p(z). We have the following definitions.

Definition 13 (Expectation or mean). The mean of a continuous scalar r.v. z is the scalar value

μ = E[z] = ∫_l^h z p(z) dz     (2.28)

Definition 14 (Variance). The variance of a continuous scalar r.v. z is the scalar value

σ² = E[(z − μ)²] = ∫_l^h (z − μ)² p(z) dz     (2.29)

Definition 15 (Moments). The r-th moment of a continuous scalar r.v. z is the scalar value

μr = E[z^r] = ∫_l^h z^r p(z) dz     (2.30)

Note that the moment of order r = 1 coincides with the mean of z.

Definition 16 (Upper critical point). For a given 0 ≤ α ≤ 1 the upper critical point of a continuous r.v. z is the number zα such that

1 − α = Prob {z ≤ zα} = F(zα)  ⇔  zα = F⁻¹(1 − α)

Figure 2.4 shows an example of cumulative distribution together with the upper critical point.
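As a concrete illustration (my own example, using an exponential distribution with rate λ, for which F(z) = 1 − e^(−λz) can be inverted in closed form), the upper critical point follows directly from F⁻¹(1 − α):

    import math

    # Upper critical point of an exponential r.v. with rate lam:
    # F(z) = 1 - exp(-lam*z), so z_alpha = F^{-1}(1 - alpha) = -log(alpha)/lam.
    def upper_critical_exponential(alpha, lam):
        return -math.log(alpha) / lam

    lam, alpha = 0.5, 0.05
    z_alpha = upper_critical_exponential(alpha, lam)
    # Check: Prob{z <= z_alpha} = F(z_alpha) should equal 1 - alpha.
    print(z_alpha, 1 - math.exp(-lam * z_alpha))   # ~5.99 and 0.95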

    2.5 Joint probability

So far, we considered scalar random variables only. However, the most interesting probabilistic applications are multivariate, i.e. they concern a number of variables bigger than one.

Let us consider a probabilistic model described by n discrete random variables. A fully-specified probabilistic model gives the joint probability for every combination of the values of the n r.v..

In the discrete case the model is specified by the values of the probabilities

Prob {z1 = z1, z2 = z2, . . . , zn = zn} = P(z1, z2, . . . , zn)     (2.31)

    for every possible assignment of values z1, . . . , zn to the variables.


    Spam mail example

Let us consider a bivariate probabilistic model describing the relation between the validity of a received email and the presence of the word Viagra in the text. Let z1 be the random variable describing the validity of the email (z1 = 0 for spam and z1 = 1 for not-spam) and z2 the r.v. describing the presence (z2 = 1) or the absence (z2 = 0) of the word Viagra. The stochastic relationship between these two variables can be defined by the joint probability distribution

          z1 = 0    z1 = 1    Pz2
z2 = 0    0.1       0.3       0.4
z2 = 1    0.5       0.1       0.6
Pz1       0.6       0.4       1

In the case of n continuous random variables, the model is specified by the joint distribution function

Prob {z1 ≤ z1, z2 ≤ z2, . . . , zn ≤ zn} = F(z1, z2, . . . , zn)

which returns a value for every possible assignment of values z1, . . . , zn to the variables.

Having defined the joint probability, we can now go on to define the case when two random variables are independent.

Definition 17 (Independent variables). Let x and y be two discrete random variables. Two variables x and y are defined to be statistically independent (written as x ⊥ y) if the joint probability

Prob {x = x, y = y} = Prob {x = x} Prob {y = y}     (2.32)

The definition can be easily extended to the continuous case. In qualitative terms, this means that we do not expect that the observed outcome of one variable will affect the other. Note that if x and y are independent and y and z are independent, then x and z need not be independent. In other words, the relation of independence is not a transitive relation.

    Exercise

Check whether the variables z1 and z2 of the spam mail example are independent.
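One possible way to verify the answer (a sketch of my own; readers may want to try the exercise by hand first) is to compare each joint entry with the product of the corresponding marginals, as required by (2.32):

    # Joint probability of the spam example: keys are (z1, z2).
    joint = {(0, 0): 0.1, (1, 0): 0.3, (0, 1): 0.5, (1, 1): 0.1}

    p_z1 = {v: sum(p for (a, b), p in joint.items() if a == v) for v in (0, 1)}
    p_z2 = {v: sum(p for (a, b), p in joint.items() if b == v) for v in (0, 1)}

    # Independence (2.32): the joint must factorize into the product of the marginals.
    independent = all(abs(joint[(a, b)] - p_z1[a] * p_z2[b]) < 1e-9
                      for (a, b) in joint)
    print(p_z1, p_z2, independent)   # {0: 0.6, 1: 0.4} {0: 0.4, 1: 0.6} False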

    2.5.1 Marginal and conditional probability

Let {z1, . . . , zm} be a subset of size m of the n discrete r.v. for which a joint probability function (2.31) is defined. The marginal probabilities for the subset can be derived from (2.31) by summing over all possible combinations of values for the remaining variables:

P(z1, . . . , zm) = Σ_{zm+1} · · · Σ_{zn} P(z1, . . . , zm, zm+1, . . . , zn)


    Exercise

Compute the marginal probabilities P(z1 = 0) and P(z1 = 1) from the joint probability of the spam mail example.

    For continuous random variables the marginal density is

p(z1, . . . , zm) = ∫ · · · ∫ p(z1, . . . , zm, zm+1, . . . , zn) dzm+1 . . . dzn     (2.33)

    The following definition for r.v. derives directly from (2.8).

Definition 18 (Conditional probability function). The conditional probability function for one subset of discrete variables {zi : i ∈ S1} given values for another disjoint subset {zj : j ∈ S2}, where S1 ∩ S2 = ∅, is defined as the ratio

P({zi : i ∈ S1}|{zj : j ∈ S2}) = P({zi : i ∈ S1}, {zj : j ∈ S2}) / P({zj : j ∈ S2})

Note that if x and y are independent then

Px(x|y = y) = Px(x)     (2.34)

Also, since independence is symmetric, this is equivalent to saying that Py(y|x = x) = Py(y).

Definition 19 (Conditional density function). The conditional density function for one subset of continuous variables {zi : i ∈ S1} given values for another disjoint subset {zj : j ∈ S2}, where S1 ∩ S2 = ∅, is defined as the ratio

p({zi : i ∈ S1}|{zj : j ∈ S2}) = p({zi : i ∈ S1}, {zj : j ∈ S2}) / p({zj : j ∈ S2})

where p({zj : j ∈ S2}) is the marginal density of the set S2 of variables.

    2.6 Common discrete probability functions

    2.6.1 The Bernoulli trial

A Bernoulli trial is a random experiment with two possible outcomes, often called success and failure. The probability of success is denoted by p and the probability of failure by (1 − p). A Bernoulli random variable z is a discrete r.v. associated with the Bernoulli trial. It takes z = 0 with probability (1 − p) and z = 1 with probability p.

The probability function of z can be written in the form

Prob {z = z} = Pz(z) = p^z (1 − p)^(1−z),     z = 0, 1

Note that E[z] = p and Var [z] = p(1 − p).

    2.6.2 The Binomial probability function

A binomial random variable represents the number of successes in a fixed number N of independent Bernoulli trials with the same probability of success for each trial. A typical example is the number z of heads in N tosses of a coin.

The probability function of z ∼ Bin(N, p) is given by

Prob {z = z} = Pz(z) = C(N, z) p^z (1 − p)^(N−z),     z = 0, 1, . . . , N     (2.35)

where C(N, z) = N!/(z!(N − z)!) denotes the binomial coefficient.

The mean of the probability function is μ = Np. Note that:


• the Bernoulli probability function is a special case (N = 1) of the binomial function,

• for small p, the probability of having at least 1 success in N trials is proportional to N, as long as Np is small,

• if z1 ∼ Bin(N1, p) and z2 ∼ Bin(N2, p) are independent then z1 + z2 ∼ Bin(N1 + N2, p).
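As a small numerical illustration of (2.35) (my own sketch, using the math.comb binomial coefficient available in Python 3.8+), the code below tabulates the pmf of the number of heads in N = 4 fair coin tosses and checks that it sums to one with mean Np.

    from math import comb

    # Binomial probability function Bin(N, p), equation (2.35).
    def binom_pmf(z, N, p):
        return comb(N, z) * p**z * (1 - p)**(N - z)

    N, p = 4, 0.5
    pmf = [binom_pmf(z, N, p) for z in range(N + 1)]
    print(pmf)                                      # [0.0625, 0.25, 0.375, 0.25, 0.0625]
    print(sum(pmf))                                 # 1.0
    print(sum(z * pz for z, pz in enumerate(pmf)))  # mean = N*p = 2.0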

    2.6.3 The Geometric probability function

A r.v. z has a geometric probability function if it represents the number of successes before the first failure in a sequence of independent Bernoulli trials with probability of success p. Its probability function is

Pz(z) = (1 − p) p^z,     z = 0, 1, 2, . . .

The geometric probability function has an important property, known as the memoryless or Markov property. According to this property, given two integers z1 ≥ 0, z2 ≥ 0,

Pz(z = z1 + z2|z ≥ z1) = Pz(z2)

Note that it is the only discrete probability function with this property.

A r.v. z has a generalized geometric probability function if it represents the number of Bernoulli trials preceding but not including the (k + 1)th failure. Its function is

Pz(z) = C(z, k) p^(z−k) (1 − p)^(k+1),     z = k, k + 1, k + 2, . . .

where k is supposed to be given and C(z, k) denotes the binomial coefficient. The reader may verify that these functions are, indeed, probability functions by showing that Σ_{z∈Z} Pz(z) = 1 by using the formula for the sum of a geometric series.

The geometric (and generalized geometric) probability functions are encountered in some problems in queuing theory and risk analysis.

    2.6.4 The Poisson probability function

A r.v. z has a Poisson probability function with parameter λ > 0 and is denoted by z ∼ Poisson(λ) if

Pz(z) = e^(−λ) λ^z / z!,     z = 0, 1, . . .     (2.36)

The Poisson probability function is a limiting form of the binomial function. If the number of trials N is large, the probability of success p of each trial is small and the product Np = λ is moderate, then the probability of z successes according to the Binomial function is very close to the probability that a Poisson random variable with