NEWCOMB-BENFORD’S LAW APPLICATIONS TO
ELECTORAL PROCESSES, BIOINFORMATICS, AND
THE STOCK INDEX
By
David A. Torres Nunez
SUBMITTED IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
AT
UNIVERSITY OF PUERTO RICO
RIO PIEDRAS, PUERTO RICO
MAY 2006
© Copyright by David A. Torres Nunez, 2006
UNIVERSITY OF PUERTO RICO
DEPARTMENT OF
MATHEMATICS
The undersigned hereby certify that they have read and
recommend to the Faculty of Graduate Studies for acceptance
a thesis entitled “Newcomb-Benford’s Law Applications to
Electoral Processes, Bioinformatics, and the Stock Index”
by David A. Torres Nunez in partial fulfillment of the requirements
for the degree of Master of Science.
Dated: May 2006
Supervisor: Dr. Luis Raúl Pericchi Guerra
Readers: Dr. María Eglée Pérez
Dr. Dieter Reetz
UNIVERSITY OF PUERTO RICO
Date: May 2006
Author: David A. Torres Nunez
Title: Newcomb-Benford’s Law Applications to Electoral
Processes, Bioinformatics, and the Stock Index
Department: Mathematics
Degree: M.Sc. Convocation: May Year: 2006
Permission is herewith granted to University of Puerto Rico to circulate
and to have copied for non-commercial purposes, at its discretion, the above
title upon the request of individuals or institutions.
Signature of Author
THE AUTHOR RESERVES OTHER PUBLICATION RIGHTS, AND NEITHER THE THESIS NOR EXTENSIVE EXTRACTS FROM IT MAY BE PRINTED OR OTHERWISE REPRODUCED WITHOUT THE AUTHOR'S WRITTEN PERMISSION.
THE AUTHOR ATTESTS THAT PERMISSION HAS BEEN OBTAINED FOR THE USE OF ANY COPYRIGHTED MATERIAL APPEARING IN THIS THESIS (OTHER THAN BRIEF EXCERPTS REQUIRING ONLY PROPER ACKNOWLEDGEMENT IN SCHOLARLY WRITING) AND THAT ALL SUCH USE IS CLEARLY ACKNOWLEDGED.
To my family, and the extended family that always kept faith in me.
Table of Contents

Table of Contents
List of Tables
List of Figures
Abstract
Acknowledgements
Introduction

1 Basic Notation and Derivations
1.1 Introduction
1.2 Derivations
1.2.1 A Differential Equation Approach
1.2.2 The Float Point Notation Scheme. Knuth
1.2.3 In the Float Point Notation Scheme. Hamming
1.2.4 The Brownian Model Scheme. Pietronero
1.3 A Statistical Derivation of the N-B L
1.3.1 Mantissa
1.3.2 A Natural Probability Space
1.3.3 Mantissa σ-algebra Properties
1.3.4 Scale and Base Invariance
1.4 Mean and Variance of the D_b^k
1.5 Simulation
1.5.1 Generating the r Significant Digits' Distribution Base b
1.5.2 Effects of Bounds in the Newcomb-Benford Generated Values

2 Empirical Analysis
2.1 Introduction
2.2 Changing P-Values in Null Hypothesis Probabilities H0
2.2.1 Posterior Probabilities with Uniform Priors
2.3 Multinomial Model Proposal
2.4 Examples
2.5 Conclusions of the Examples

3 Stock Indexes' Digits
3.1 Introduction
3.2 Statistical Analysis
3.3 Results

4 On Image Analysis in the Microarray Intensity Spot
4.1 Introduction
4.2 Experiment
4.2.1 Microarray Measurements and Image Processing
4.3 Results

5 Electoral Processes in a Newcomb Benford Law Context
5.1 Introduction
5.2 General Democratic Election Model
5.3 Empirical Data
5.4 Conclusions

6 Appendix: MATLAB Programs
6.1 Matlab Codes
List of Tables

1 Newcomb Benford Law for the First Significant Digit
1.1 Mean, Variance, Standard Deviation and Variation Coefficient for the First and Second Significant Digit Distributions
2.1 p-values in terms of hypothesis probabilities
2.2 Summary of the results of the above examples
3.1 N-Benford's for 1st and 2nd digit: p-values, probability null bound and approximate probability for the different increments
3.2 N-Benford's for 1st and 2nd digit: the probability of the null hypothesis given the data and the length of the data
4.1 N-Benford's for 1st and 2nd digit: P(H0|data), P(Approx), P(Frac) and Pr(BIC)
4.2 N-Benford's for 1st and 2nd digit: the number of observations and p-values
5.1 The second digit proportions analysis of the winner for the set of historical elections
5.2 The second digit proportions analysis of the loser for the set of historical elections
5.3 The first digit proportions of the distance between the winner and the loser for the set of historical elections
5.4 The second digit proportions of the distance between the winner and the loser for the set of historical elections
5.5 The second digit proportions of the sum of the winner and the loser for the set of historical elections
5.6 The Newcomb Benford's for 1st and 2nd digit for the United States Presidential Elections 2004. Note how close the posterior probabilities given the data are to 1.0
5.7 The second digit proportions analysis of the winner for the set of historical elections. Number of observed values, p-value and probability null bound are shown
5.8 The second digit proportions analysis of the loser for the set of historical elections. Number of observed values, p-value and probability null bound are shown. Note that p-values should be smaller than 1/e for the bound to be valid
5.9 The first digit proportions of the distance between the winner and the loser for the set of historical elections. Number of observed values, p-value and probability null bound are shown. Note that p-values should be smaller than 1/e for the bound to be valid
5.10 The second digit proportions of the distance between the winner and the loser for the set of historical elections. Number of observed values, p-value and probability null bound are shown. Note that p-values should be smaller than 1/e for the bound to be valid
5.11 The second digit proportions of the sum of the winner and the loser for the set of historical elections. Number of observed values, p-value and probability null bound are shown. Note that p-values should be smaller than 1/e for the bound to be valid
List of Figures

1 Newcomb-Benford Law theoretical frequencies for the first and second significant digit
1.1 Constrained Newcomb Benford Law compared with a restricted bound of digits K ≤ 99, from numbers between 1 and 99. Here there is no restriction
1.2 Constrained Newcomb Benford Law compared with a restricted bound of digits K ≤ 50, from numbers between 1 and 99
1.3 Constrained Newcomb Benford Law compared with a restricted bound of digits K ≤ 20, from numbers between 1 and 99
2.1 Posterior intervals for the first digit using symmetric boxplots
2.2 Newcomb-Benford Law theoretical frequencies for the first significant digit
2.3 Newcomb-Benford Law theoretical frequencies for the first significant digit. This represents the Example 1 simulation results
2.4 Newcomb-Benford Law theoretical frequencies for the first significant digit. This represents the Example 2 simulation results
2.5 Newcomb-Benford Law theoretical frequencies for the first significant digit. This represents the multinomial example simulation results
4.1 Histograms of the intensities and the adjustments
4.2 N-Benford's Law compared with intensity microarray spots without adjustment
4.3 N-Benford's Law compared with intensity microarray spots with adjustment
5.1 Presidential election analysis using electoral college votes compared with the N-B Law for the 1st digit
5.2 Presidential election analysis using electoral college votes compared with the N-B Law for the 2nd digit
5.3 Puerto Rico 1996 elections compared with the Newcomb Benford Law for the second digit
5.4 Puerto Rico 2000 elections compared with the Newcomb Benford Law for the second digit
5.5 Puerto Rico 2004 elections compared with the Newcomb Benford Law for the first digit
5.6 Venezuela revocatory referendum manual vote proportions compared with the Newcomb Benford Law proportions for the second digit
5.7 Venezuela revocatory referendum manual vote proportions compared with the Newcomb Benford Law proportions for the second digit
5.8 Venezuela revocatory referendum electronic and manual vote proportions compared with the Newcomb Benford Law second digit proportions
5.9 Venezuela revocatory referendum manual distance between the winner and loser, proportions compared with the Newcomb Benford Law proportions
Abstract
Since this rather amazing fact was discovered in 1881 by the American astronomer Newcomb (1881), many scientists have searched for members of the family of "outlaw" numbers. Newcomb noticed that the pages of logarithm books containing numbers starting with 1 were much more worn than the other pages. After analyzing several sets of naturally occurring data, he derived what later became known as Benford's law. As a tribute to Newcomb, we call this phenomenon the Newcomb-Benford Law.

We start by establishing a connection between the microarray and stock index data sets, which can be seen as an extension of the work of Hoyle (2002) and Ley (1996). Most of the analyses use both classical and Bayesian statistics, and we explain the differences between these two approaches to hypothesis testing between models, following Berger and Pericchi (2001). Finally, we apply these concepts to several types of data, including microarray, stock index and electoral process data.

We also present several results on constrained data, the most relevant being the Constrained Newcomb-Benford Law, to which most of the Bayesian analysis covered is applied.
Acknowledgements
I wish to express my gratitude to everyone who contributed to making this work possible. I would like to thank God first, and also Dr. L. R. Pericchi, my supervisor, for his many suggestions and constant support during this research. I am also thankful to the whole Mathematics faculty for their guidance through the early years of chaos and confusion.
Dr. Pericchi expressed his interest in my work and supplied me with preprints of some of his recent joint work with J. O. Berger, which gave me a better perspective on the results. I thank him for being more than a supervisor: a father and a friend. I am indebted to Dr. María Eglée Pérez, Prof. Z. Rodriguez, and Humberto Ortiz Zuazaga for providing data and insights during the drafting process.
I would also like to thank my parents for providing me with the opportunity to be where I am. Without them, none of this would even be possible. You have always been my biggest fans, and I appreciate that. To my father: thanks for the support, even if you are not here anymore. To my mother: you are my hero, always.
I would also like to thank my special friends, because you have been my biggest critics throughout my entire personal life and professional career. Your encouragement, input and constructive criticism have been priceless. For that, thanks to Ricardo Ortiz, Ariel Cintron, Antonio Gonzales, Tania Yuisa Arroyo, Erika Migliza, Wally Rivera, Raquel Torres, Dr. Pedro Rodriguez Esquerdo, Dr. Punchin, Lourdes Vazquez (sister), Chungseng Yang (brother), Lihai Song and all the extended family.
Finally, I would like to thank my soulmate, Ana Tereza Rodriguez, for keeping me grounded and for providing me with some memorable experiences.
Rio Piedras, Puerto Rico David Torres Nunez
May 15, 2006
Introduction
The first known person to explain the anomalous distribution of the digits was the astronomer and mathematician Simon Newcomb, in the American Journal of Mathematics, who stated:

"The law of probability of the occurrence of numbers is such that all mantissae of their logarithms are equally probable."
Since then many mathematicians have been enthusiastic about finding data sets that exhibit this phenomenon; there has been a century of theorems, definitions, conjectures and discoveries around the first digit phenomenon. The discovery goes back to 1881, when the American astronomer Simon Newcomb noticed that the first pages of logarithm books (used at that time to perform calculations), the ones containing numbers that started with 1, were much more worn than the other pages. However, it has been argued that any book that is used from the beginning would show more wear and tear on the earlier pages. This story might thus be apocryphal, just like Isaac Newton's supposed discovery of gravity from observation of a falling apple. The phenomenon was rediscovered in 1938 by the physicist Benford (1938), who checked it on a wide variety of data sets and was credited for it. In 1996, Hill (1996) proved the result about random mixtures of distributions of random variables, which generalizes the Law for dimensionless quantities.

Table 1: Newcomb Benford Law for the First Significant Digit

Digit        1     2     3     4     5     6     7     8     9
Probability  .301  .176  .125  .097  .079  .067  .058  .051  .046

Some mathematical series and other data sets that satisfy Newcomb Benford's Law are:
• Prime numbers
• Series distributions
• The Fibonacci series
• Factorial numbers
• Sequences of powers
• Numbers in Pascal's Triangle
• Demographic statistics
• Other social science data
• Numbers that appear in magazines and newspapers
Intuitively, most people assume that in a string of numbers sampled randomly from some body of data, the first non-zero digit could be any number from 1 through 9, with all nine digits regarded as equally probable. As the figure below shows, the equal-probability assumption for the digits is very different from the Newcomb-Benford distribution. As an example, for the first digit we have the following discrete probability distribution function.
Figure 1: Newcomb-Benford Law theoretical frequencies for the first and second significant digit, compared with the uniform alternatives y = 1/9 (first digit) and y = 1/10 (second digit).
We will present two different situations with different derivations. The first covers data with units, like dollars or meters. The second involves unit-free data, like counts of votes. Applications presented here include:

1. Puerto Rico's Stock Index.
2. Microarray data in Bioinformatics.
3. Voting counts in Venezuela, the United States and Puerto Rico.

The potential uses are the detection of fraud, of corruption of data, or of a lack of proper scaling. We analyze the statistical properties of the Benford distribution and illustrate Benford's law with many data sets of both theoretical and real-life origins. Applications of Benford's Law such as fraud detection are summarized and explored to reveal the power of this mathematical principle.
Most of this work has been typeset in LaTeX, Lamport (1986) and Knuth (1984).
Chapter 1
Basic Notation and Derivations
1.1 Introduction
The data sets of the family of outlaw numbers come in two different kinds. The first type consists of numbers that carry units, like money. The other type consists of numbers that do not have units, like votes; this last type of data can be found in electoral processes and mathematical series. This chapter introduces some basic concepts and notation consistent with Hill (1996). This formulation helps to understand in a deeper way the probabilistic basis of the Newcomb Benford Law. There are several slightly different derivations, most of them not as statistically general as the one presented by Hill. Another example in this discussion is a base-invariance argument similar to the one presented by L. Pietronero (2001). The aim of this work is to generalize the Newcomb Benford Law in order to apply it to wider classes of data sets, and to verify its fit to different sets of data with modern Bayesian statistical methods.
1.2 Derivations
We present here some derivations; first of all we use a heuristic approach based on invariance. In this first section, Benford's law applies to data that are not dimensionless, so the numerical values of the data depend on the units.
1.2.1 A Differential Equation Approach.
If there exists a universal probability distribution $P(x)$ over such numbers, then it must be invariant under a change of scale, Hill (1995a), so

$$P(kx) = f(k)P(x) \tag{1.2.1}$$
Integrating with respect to $x$,

$$\int P(kx)\,dx = f(k)\int P(x)\,dx.$$

Since $\int P(x)\,dx = 1$, the substitution $y = kx$ (so that $dy = k\,dx$) gives

$$\int P(kx)\,dx = \frac{1}{k}\int P(y)\,dy = \frac{1}{k},$$

hence $f(k) = 1/k$.
Differentiating $P(kx) = f(k)P(x)$ with respect to $k$,

$$\frac{\partial P(kx)}{\partial k} = x P'(kx) = P(x)f'(k) = -\frac{P(x)}{k^2}.$$

Setting $k = 1$ gives

$$x P'(x) = -P(x). \tag{1.2.2}$$

One checks directly that $P(x) = 1/x$ satisfies this equation:

$$x\left(\frac{1}{x}\right)' = -x\,\frac{1}{x^2} = -\frac{1}{x} = -P(x).$$
This equation has solution $P(x) = 1/x$. Although this is not a proper probability distribution (since it diverges), both the laws of physics and human convention impose cutoffs. If many powers of 10 lie between the cutoffs, then the probability that the first (decimal) digit is $D$ is given by the logarithmic distribution
$$P_D = \frac{\int_D^{D+1} P(x)\,dx}{\int_1^{10} P(x)\,dx} = \frac{\int_D^{D+1} \frac{1}{x}\,dx}{\int_1^{10} \frac{1}{x}\,dx} = \frac{\log_{10} x \,\big|_D^{D+1}}{\log_{10} x \,\big|_1^{10}} = \log_{10}\left(1 + \frac{1}{D}\right)$$
The last expression is called the Newcomb-Benford Law (NBL).
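The formula above can be verified numerically. The following short sketch (in Python for illustration; the thesis's own programs, listed in the Appendix, are in MATLAB) computes the nine first-digit probabilities, checks that they form a proper distribution, and reproduces Table 1:

```python
import math

# First-digit probabilities P_D = log10(1 + 1/D), D = 1..9, from the
# derivation above.
probs = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# The sum telescopes: sum_D log10((D+1)/D) = log10(10) = 1.
assert abs(sum(probs.values()) - 1.0) < 1e-12

# Rounded to three places these match Table 1.
print({d: round(p, 3) for d, p in probs.items()})
# {1: 0.301, 2: 0.176, 3: 0.125, 4: 0.097, 5: 0.079, 6: 0.067, 7: 0.058, 8: 0.051, 9: 0.046}
```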
However, Benford's law applies not only to scale-invariant data, but also to numbers chosen from a variety of different sources. Explaining this fact requires a more rigorous investigation of central-limit-like theorems for the mantissas of random variables under multiplication. As the number of variables increases, the density function approaches that of a logarithmic distribution. Hill (1996) rigorously demonstrated that the "mixture of distributions" given by random samples taken from a variety of different distributions is, in fact, Newcomb Benford's law. We present below those results that explain the properties of the NBL.
1.2.2 The Float Point Notation Scheme. Knuth
Knuth (1981) gives conditions for the leading digit. He noticed that in order to account for the leading digit law it is important to observe the way numbers are written in floating point notation. The leading digit of $u$ is determined by $(\log_{10} u) \bmod 1$; the operator $r \bmod 1$ represents the fractional part of the number $r$, and $f_u$ is the normalized fraction part of $u$. Let $u$ be a non-negative number. Note that the leading digit of $u$ is less than $d$ if and only if

$$(\log_{10} u) \bmod 1 < \log_{10} d \tag{1.2.3}$$

since $10^{f_u} = 10^{(\log_{10} u) \bmod 1}$. Taking a random number $W$ from a random distribution that may occur in nature, following Knuth, we may expect that $(\log_{10} W) \bmod 1 \sim \mathrm{Unif}(0, 1)$, at least to a very good approximation. Similarly, any transformation of $W$ is expected to be distributed in the same manner. Therefore by (1.2.3) the leading digit will be 1 with probability $\log_{10}(1 + \frac{1}{1}) \approx 30.103\%$; it will be 2 with probability $\log_{10} 3 - \log_{10} 2 \approx 17.609\%$; and in general, if $r$ is any real value in $[1, 10)$, we ought to have $10^{f_u} \le r$ approximately $\log_{10} r$ of the time. This gives a rough picture of why the leading digits behave the way they do.
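As an informal check of this heuristic (our own sketch, not from the thesis), the Fibonacci series, listed earlier among the Benford data sets, has leading-digit frequencies very close to $\log_{10}(1 + 1/d)$:

```python
import math

def leading_digit(n: int) -> int:
    # First significant digit of n, i.e. the integer part of 10^((log10 n) mod 1).
    return int(str(n)[0])

# Tally the first digits of the first 1000 Fibonacci numbers.
counts = {d: 0 for d in range(1, 10)}
a, b = 1, 1
for _ in range(1000):
    counts[leading_digit(a)] += 1
    a, b = b, a + b

# Empirical frequency versus the logarithmic law, digit by digit.
for d in range(1, 10):
    print(d, counts[d] / 1000, round(math.log10(1 + 1 / d), 3))
```

The empirical frequencies agree with the logarithmic law to within about one percent, as the heuristic predicts.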
1.2.3 In the Float Point Notation Scheme. Hamming
Another approach was suggested by Hamming (1970). Let $p(r)$ be the probability that $10^{f_U} \le r$, where $r$ lies between 1 and 10 ($1 \le r \le 10$) and $f_U$ is the normalized fraction part of a random normalized floating point number $U$. Taking into account that this distribution is scale invariant, suppose that every constant of our universe of random floating point numbers is multiplied by a constant factor $c$; this should not affect $p(r)$. Multiplication by $c$ transforms $(\log_{10} U) \bmod 1$ into $(\log_{10} U + \log_{10} c) \bmod 1$. Let $\Pr(\cdot)$ be the usual probability function. Then by definition,

$$p(r) = \Pr\big((\log_{10} U) \bmod 1 \le \log_{10} r\big).$$

Under the assumptions of (1.2.3), it follows that

$$p(r) = \begin{cases} p\left(\dfrac{r}{c}\right) + 1 - p\left(\dfrac{10}{c}\right), & c \le r; \\[6pt] p\left(\dfrac{10r}{c}\right) + 1 - p\left(\dfrac{10}{c}\right), & c \ge r. \end{cases}$$
Until now the values of $r$ are contained in the closed interval $[1, 10]$. To be methodical it is important to extend $r$ to values outside this interval; for this we define $p(10^n r) = p(r) + n$ for positive integers $n$. If we replace $10/c$ by $d$, the relation above can be written as

$$p(rd) = p(r) + p(d) \tag{1.2.4}$$

Under the hypothesis that the distribution is invariant under multiplication by a constant, (1.2.4) holds for all $r > 0$ and $d \in [1, 10]$. Since $p(1) = 0$ and $p(10) = 1$,

$$1 = p(10) = p\big((\sqrt[n]{10})^n\big) = p(\sqrt[n]{10}) + p\big((\sqrt[n]{10})^{n-1}\big) = \cdots = n\,p(\sqrt[n]{10}),$$

hence $p(\sqrt[n]{10^m}) = m/n$ for all positive integers $m$ and $n$. Assuming the continuity of $p$, it is required that

$$p(r) = \log_{10} r. \tag{1.2.5}$$
Knuth suggested that, to be more rigorous, it is important to assume that there is some underlying distribution of numbers $F(u)$; then the desired probability will be

$$p(r) = \sum_m \big(F(10^m r) - F(10^m)\big),$$

obtained by summing over $-\infty < m < \infty$. The invariance hypothesis and the continuity assumption then lead to (1.2.5), which is the desired distribution.
1.2.4 The Brownian Model Scheme. Pietronero
From another point of view, note that this can be viewed as a model for the oscillations of the stock market or any complex process in nature. A Brownian model is acceptable for this type of "nature process", L. Pietronero (2001). Brownian motion can be seen as a natural event that involves a change in the position or location of something. They propose $N(t+1) = \xi N(t)$, where $\xi$ is a positive stochastic variable (just for simplicity). With a logarithmic transformation a Gaussian process is found:

$$\log N(t+1) = \log \xi + \log N(t)$$

If $\log \xi$ is considered a stochastic variable, then $\log N(t+1)$ is a Brownian motion, and for $t \to \infty$ the fractional part of $\log N$ is approximately $\mathrm{Unif}(0, 1)$. Transforming the problem back to the original variable,

$$\int P(\log_{10} N)\, d(\log_{10} N) = C \int \frac{1}{N}\, dN,$$

where $C$ is the normalization factor; it is obtained that $P(N) \sim N^{-1}$. This suggests that the distribution of the first digit $n$ is the First Digit Law distribution. On the other hand, equation (1.2.5) can be derived for any base $b$. The proposal states that for $b > 0$,

$$\operatorname{Prob}(n) = \frac{\int_n^{n+1} N^{-1}\,dN}{\int_1^b N^{-1}\,dN} = \frac{\log_{10}\frac{n+1}{n}}{\log_{10} b} \tag{1.2.6}$$

Finally, using logarithm properties we get

$$\operatorname{Prob}(n) = \log_b\left(1 + \frac{1}{n}\right) \tag{1.2.7}$$

which is a generalization of the Newcomb-Benford Law.
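A quick simulation of this multiplicative scheme supports the argument (our own sketch; the choice $\log_{10}\xi \sim N(0,1)$ and 50 steps are convenient assumptions, not from the thesis):

```python
import math
import random

random.seed(0)

# Multiplicative Brownian scheme N(t+1) = xi * N(t) with log10(xi) Gaussian.
# After many steps, (log10 N) mod 1 is very nearly uniform, so the first
# digit of N should follow the Newcomb-Benford frequencies.
counts = {d: 0 for d in range(1, 10)}
trials = 20000
for _ in range(trials):
    log_n = sum(random.gauss(0.0, 1.0) for _ in range(50))  # log10 N after 50 steps
    counts[int(10 ** (log_n % 1))] += 1                     # first digit of N

# Empirical frequency versus log10(1 + 1/d).
for d in range(1, 10):
    print(d, round(counts[d] / trials, 3), round(math.log10(1 + 1 / d), 3))
```

The empirical and theoretical columns agree to within sampling error.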
1.3 A Statistical Derivation of N-B L
Theodore Hill has given a more general argument for dimensionless data. He explains the central-limit-like theorem for significant digits as follows:

Remark 1.3.1. "Roughly speaking, this law says that if probability distributions are selected at random and random samples are then taken from each of these distributions in any way so that the overall process is scale or (base) neutral, then the significant-digit frequencies of the combined sample will converge to the logarithmic distribution." Hill (1996)

In order to understand this explanation, we present here a brief introduction to measure theory. A fundamental concept in the development of the theory behind the family of outlaw numbers is the mantissa, which permits the isolation of the groups of significant digits.
Let $D_b^{(1)}, D_b^{(2)}, D_b^{(3)}, \ldots$ denote the significant digit functions (base $b$).

Example 1.3.1. As an example note that $D_{10}^{(1)}(25.4) = 2$, $D_{10}^{(2)}(25.4) = 5$ and $D_{10}^{(3)}(25.4) = 4$.
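Example 1.3.1 can be reproduced with a small helper (the function name is ours, not the thesis's; a decimal-string sketch is used to avoid floating point trouble with $10^{(\log_{10} x) \bmod 1}$):

```python
def significant_digit(k: int, x: float) -> int:
    """k-th significant decimal digit of x, i.e. D_10^(k)(x)."""
    s = f"{abs(x):.12e}"                       # e.g. 25.4 -> '2.540000000000e+01'
    digits = s.split("e")[0].replace(".", "")  # the significant digits, in order
    return int(digits[k - 1])

print(significant_digit(1, 25.4), significant_digit(2, 25.4), significant_digit(3, 25.4))
# 2 5 4
```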
The exact laws given by Newcomb (1881), in terms of the significant digits base 10, are:

$$\operatorname{Prob}\big(D_{10}^{(1)} = d\big) = \log_{10}\left(1 + \frac{1}{d}\right); \quad d = 1, 2, \ldots, 9. \tag{1.3.1}$$

$$\operatorname{Prob}\big(D_{10}^{(2)} = d\big) = \sum_{k=1}^{9} \log_{10}\left(1 + \frac{1}{10k + d}\right); \quad d = 0, 1, \ldots, 9. \tag{1.3.2}$$

These equations express the NBL in terms of the significant digits.
1.3.1 Mantissa
As we have mentioned, a way to formalize writing numbers in terms of their digits is through the mantissa. The aim of defining the mantissa is to put the NBL in a proper countably additive probability framework; basically the NBL is a statement about the significant digit functions.

Definition 1.3.1. The mantissa (base 10) of a positive real number $x$ is the unique number $r$ in $(\frac{1}{10}, 1]$ with $x = r \cdot 10^n$ for some $n \in \mathbb{Z}$.

To become more familiar with the mantissa definition, look at scientific notation.

Definition 1.3.2. A number is in scientific notation if it is in the form

$$\text{mantissa} \times 10^{\text{characteristic}},$$

where the mantissa (Latin for "makeweight") is a number from 1 up to (but not including) 10, and the characteristic is an integer indicating the number of places the decimal point moved.
A more general definition of the mantissa, for any base $b > 1$, is as follows.

Definition 1.3.3. For each integer $b > 1$, the (base $b$) mantissa function $M_b$ is the function $M_b : \mathbb{R}^+ \to [1, b)$ such that $M_b(x) = r$, where $r$ is the unique number in $[1, b)$ with $x = r \cdot b^n$ for some $n \in \mathbb{Z}$. For $E \subseteq [1, b)$, let

$$\langle E \rangle_b = M_b^{-1}(E) = \biguplus_{n \in \mathbb{Z}} b^n E \subset \mathbb{R}^+.$$

The sets $\langle E \rangle_b$ generate the (base $b$) mantissa $\sigma$-algebra on $\mathbb{R}^+$.
Example 1.3.2. Using the function $M_b$ defined above, we can verify that 9 has the same mantissa image for the bases 10, 100 and 1000. Note that $M_{10}(9) = 9$, since $9 = r \cdot 10^n = 9 \cdot 10^0$, with $n = 0$ and $r = 9$; the same holds for base $b = 100$, with $n = 0$ and $r = 9$ again.
Moreover, note that for $b = 2$ we have $M_2(9) = \frac{9}{8} = 1.001$ (base 2), since $9 = \frac{9}{8} \cdot 2^3$; this agrees with the definition, since $\frac{9}{8} \in [1, 2)$.

Remark 1.3.2. The mantissa function $M_b$ assigns each $x$ a unique value, hence it is well defined.
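Example 1.3.2 can be checked with a direct sketch of $M_b$ (our own helper, using the integer exponent $n = \lfloor \log_b x \rfloor$):

```python
import math

def mantissa(x: float, b: int = 10) -> float:
    """Base-b mantissa M_b(x): the unique r in [1, b) with x = r * b**n."""
    n = math.floor(math.log(x, b))  # integer exponent of x in base b
    return x / b ** n               # r lies in [1, b)

print(mantissa(9, 10), mantissa(9, 100), mantissa(9, 1000))  # 9.0 9.0 9.0
print(mantissa(9, 2))  # 1.125, i.e. 9/8 = 1.001 in base 2
```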
An observation: if $E = [1, b)$ then $\langle E \rangle_b = M_b^{-1}(E) = \biguplus_{n \in \mathbb{Z}} b^n E = \mathbb{R}^+$, and $\langle \{1\} \rangle_{10} = \{10^n : n \in \mathbb{Z}\}$.
Lemma 1.3.3. For all $b \in \mathbb{N} \setminus \{1\}$:

(i) $\langle E \rangle_b = \biguplus_{k=0}^{n-1} \langle b^k E \rangle_{b^n}$

(ii) $\Lambda_b = \{\langle E \rangle_b : E \in \mathcal{B}[1, b)\}$

(iii) $\Lambda_b \subset \Lambda_{b^n} \subset \mathcal{B}$ for all $n \in \mathbb{N}$

(iv) $\Lambda_b$ is closed under scalar multiplication.

Proof. Part (i) of the lemma follows directly from the definition of $\langle \cdot \rangle_b$; (ii) follows from the fact that if $E$ is a Borel set in $[1, b)$ then $\Lambda_b$ denotes the set of $\biguplus_{k=0}^{n-1} \langle b^k E \rangle_{b^n}$, $E \in \mathcal{B}[1, b)$. Points (i) and (ii) together give (iii). The last point of the lemma follows from point (ii), since $\Lambda_b$ is the $\sigma$-algebra generated by $D_b^{(1)}, D_b^{(2)}, D_b^{(3)}, \ldots$
For the more general case of the $s$-digit law we have:

$$\operatorname{Prob}\left(\text{mantissa} \le \frac{t}{10}\right) = \log_{10} t; \quad t \in [1, 10). \tag{1.3.3}$$

Since we can write (1.3.3) using the digits, we have:

Definition 1.3.4 (General Significant Digit Law). For every positive integer $k$, every $d^{(1)} \in \{1, 2, \ldots, 9\}$ and $d^{(j)} \in \{0, 1, \ldots, 9\}$, $j = 2, \ldots, k$:

$$\operatorname{Prob}\big(D_{10}^{(1)} = d^{(1)}, D_{10}^{(2)} = d^{(2)}, \ldots, D_{10}^{(k)} = d^{(k)}\big) = \log_{10}\left[1 + \frac{1}{\sum_{i=1}^{k} d^{(i)} \cdot 10^{k-i}}\right] \tag{1.3.4}$$
Corollary 1.3.4. The significant digits are dependent.

This corollary can be proved by giving a counterexample, which is the way Hill works it out. It is now important to state a natural probability space in which every detail of each Newcomb Benford Law scheme can be described properly. At this point strong measure-theoretic tools are needed, such as the $\sigma$-fields generated by the sets of the $r$ significant digits.
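The dependence asserted by Corollary 1.3.4 can also be seen numerically (our own sketch): by (1.3.2) the unconditional probability of a second digit 2 differs from the probability of a second digit 2 conditioned, via (1.3.4) and (1.3.1), on the first digit being 1.

```python
import math

log10 = math.log10

# Unconditional second-digit law (1.3.2): P(D2 = 2).
p_second = sum(log10(1 + 1 / (10 * k + 2)) for k in range(1, 10))

# Joint law (1.3.4): P(D1 = 1, D2 = 2) = log10(1 + 1/12); the first-digit
# law gives P(D1 = 1) = log10(2), hence the conditional probability.
p_cond = log10(1 + 1 / 12) / log10(2)

print(round(p_second, 4), round(p_cond, 4))  # 0.1088 0.1155 -- they differ
```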
1.3.2 A Natural Probability Space
Let the sample space be $\mathbb{R}^+$, the set of positive real numbers, and let the sigma algebra of events simply be the $\sigma$-field generated by $D_{10}^{(1)}, D_{10}^{(2)}, D_{10}^{(3)}, \ldots$, or equivalently, generated by the mantissa function $x \mapsto \text{mantissa}(x)$. This $\sigma$-algebra is denoted by $\Lambda$ and will be called the decimal mantissa $\sigma$-algebra. It is a sub-$\sigma$-field of the Borel sets, and

$$S \in \Lambda \iff S = \bigcup_{n=-\infty}^{\infty} B \cdot 10^n \tag{1.3.5}$$

for some Borel $B \subseteq [1, 10)$. This generalizes $\{D_1 = 1\} = \bigcup_{n=-\infty}^{\infty} [1, 2) \cdot 10^n$, the set of positive numbers whose first digit is 1.
1.3.3 Mantissa σ-algebra Properties
The mantissa σ-algebra has several properties:
1. Every nonempty set in Λ is infinite, with accumulation points at 0 and at +∞.
2. Λ is closed under scalar multiplication.
3. Λ is closed under integral roots, but not under powers.
4. Λ is self-similar in the sense that if S ∈ Λ, then 10^m · S = S for every integer m.
Here aS and S^a represent, respectively, {as : s ∈ S} and {s^a : s ∈ S}. The first property implies that finite intervals are not included in Λ; they are not expressible in terms of the significant digits. Note that significant digits alone cannot distinguish between the numbers 10 and 100, and thus the countable-additivity contradiction associated with scale invariance disappears. Properties 1, 2 and 4 follow directly from 1.3.5, but the closure under integral roots needs more detail: the square root of a set in Λ may need two parts, and similarly for higher roots.
Example 1.3.5. If
S = D1 = ∪_{n=−∞}^{∞} [1, 2) · 10^n,
then
S^{1/2} = ∪_{n=−∞}^{∞} [1, √2) · 10^n ∪ ∪_{n=−∞}^{∞} [√10, √20) · 10^n ∈ Λ
but
S^2 = ∪_{n=−∞}^{∞} [1, 4) · 10^{2n} ∉ Λ,
since it has gaps (which are too large) and thus cannot be written in terms of the digits.
Just as property 2 is the key to the hypotheses of scale invariance, property 4 is the key for base invariance as well.
1.3.4 Scale and Base Invariance
The mantissa σ-algebra Λ represents a proper measurability structure. To be rigorous, it is time to state a proper definition of a scale-invariant measure.
Definition 1.3.5. A probability measure P on (R+, Λ) is scale invariant if P(S) = P(sS) for all s > 0 and all S ∈ Λ.
The NB Law 1.3.3, 1.3.4 is characterized by the scale invariance property.
Theorem 1.3.6. Hill (1995a) A probability measure P on (R+, Λ) is scale invariant if and only if
P(∪_{n=−∞}^{∞} [1, t) · 10^n) = log10 t (1.3.6)
for all t ∈ [1, 10).
Definition 1.3.6. A probability measure P on (R+, Λ) is base invariant if P(S) = P(S^{1/n}) for all positive integers n and all S ∈ Λ.
Observe that the set of numbers
S_1 = {x : D^{(1)}(x) = 1, D^{(j)}(x) = 0 for all j > 1}
= {. . . , 0.01, 0.1, 1, 10, 100, . . .}
= ∪_{n=−∞}^{∞} {1} · 10^n
has by 1.3.5 no nonempty proper Λ-measurable subsets. Recall the definition of a Dirac measure:
Definition 1.3.7. The Dirac measure δ_x associated to a point x is defined as follows: δ_x(S) = 1 if x ∈ S and δ_x(S) = 0 if x ∉ S.
Using the above definition, and letting P_L denote the logarithmic probability distribution on (R+, Λ) given in 1.3.3, a complete characterization of base-invariant significant-digit probability measures can now be given.
Theorem 1.3.7. Hill (1995a) A probability measure P on (R+, Λ) is base invariant if and only if
P = qP_L + (1 − q)δ_1
for some q ∈ [0, 1].
Note that P is a convex combination of the two measures P_L and δ_1. Using theorems 1.3.6 and 1.3.7, T. Hill states that scale invariance implies base invariance, but not conversely: δ_1 is base invariant but not scale invariant. The proofs of those theorems are not reproduced here, but they are important as a summary of the statistical derivation presented by T. Hill. Recall that a (real Borel) random probability measure (r.p.m.) M is a random vector (on an underlying probability space (Ω, F, P)) taking values which are Borel probability measures on R, and which is regular in the sense that for each Borel set B ⊂ R, M(B) is a random variable.
Definition 1.3.8. The expected distribution measure of a r.p.m. M is the probability measure EM (on the Borel subsets of R) defined by
(EM)(B) = E(M(B)) for all Borel B ⊂ R (1.3.7)
(where here and throughout, E(·) denotes expectation with respect to P on the underlying probability space).
The next definition plays a central role in this section, and formalizes the concept
of the following natural process which mimics Benford’s data-collection procedure:
pick a distribution at random and take a sample of size k from this distribution; then
pick a second distribution at random, and take a sample of size k from this second
distribution, and so forth.
Definition 1.3.9. For a r.p.m. M and positive integer k, a sequence of M-random k-samples is a sequence of random variables X1, X2, . . . on (Ω, F, P) such that, for some i.i.d. sequence M1, M2, . . . of r.p.m.'s with the same distribution as M, and for each j = 1, 2, . . ., given Mj = P the random variables X_{(j−1)k+1}, . . . , X_{jk} are i.i.d. with d.f. P; and X_{(j−1)k+1}, . . . , X_{jk} are independent of {Mi; X_{(i−1)k+1}, . . . , X_{ik}} for all i ≠ j. The following lemma shows the somewhat curious structure of such sequences.
Lemma 1.3.8. Hill (1995a) Let X1, X2, . . . be a sequence of M-random k-samples for some k and some r.p.m. M. Then
(i) the Xn are a.s. identically distributed with distribution EM, but are not in general independent, and
(ii) given M1, M2, . . ., the Xn are a.s. independent, but are not in general identically distributed.
As Hill states in his paper:
Remark 1.3.3. In general, sequences of M-random k-samples are not independent, not exchangeable, not Markov, not martingale, and not stationary sequences.
Example 1.3.9. Let M be a random measure which is the Dirac probability measure δ(1) at 1 with probability 1/2, and which is (δ(1) + δ(2))/2 otherwise, and let k = 3. Then each Mj is δ(1) with probability 1/2 and (δ(1) + δ(2))/2 otherwise.
(i) Since
P(X2 = 2) = 0 · P(M1 = δ(1)) + (1/2) · P(M1 = (δ(1) + δ(2))/2) = 0 + (1/2)(1/2) = 1/4,
but P(X2 = 2 | X1 = 2) = 1/2, X1 and X2 are not independent.
(ii) Since P((X1, X2, X3, X4) = (1, 1, 1, 2)) = 9/64 > 3/64 = P((X1, X2, X3, X4) = (2, 1, 1, 1)), the (Xn) are not exchangeable;
(iii) since
P(X3 = 1 | X1 = X2 = 1) = 9/10 > 5/6 = P(X3 = 1 | X2 = 1),
the (Xn) are not Markov;
(iv) since
E(X2 | X1 = 2) = 3/2 ≠ 2,
the (Xn) are not a martingale;
(v) and since
P((X1, X2, X3) = (1, 1, 1)) = 9/16 > 15/32 = P((X2, X3, X4) = (1, 1, 1)),
the (Xn) are not stationary.
The next lemma is simply the statement of the intuitive fact that the empirical distribution of M-random k-samples converges to the expected distribution of M.
Lemma 1.3.10. Hill (1995a) Let M be a r.p.m., and let X1, X2, . . . be a sequence of M-random k-samples for some k. Then
lim_{n→∞} ♯{i ≤ n : Xi ∈ B}/n = E[M(B)] (1.3.8)
a.s. for all Borel B ⊂ R.
Note that if we choose k = 1, fix B and j ∈ N, and let
Yj = I_B(Xj),
then
lim_{n→∞} ♯{i ≤ n : Xi ∈ B}/n = lim_{n→∞} (Σ_{j=1}^{n} Yj)/n. (1.3.9)
Given Mj, Yj is Bernoulli with parameter Mj(B), so by 1.3.7
EYj = E(E(Yj | Mj)) = E[M(B)] (1.3.10)
for all j, since Mj has the same distribution as M. With k = 1 each observation carries its own independent Mj, so the Yj are independent. Since by 1.3.10 they have identical means E[M(B)] and are uniformly bounded, it follows (Loeve, 1977) that
lim_{n→∞} (Σ_{j=1}^{n} Yj)/n = E[M(B)] (1.3.11)
a.s. This is just the Bernoulli case of the strong law of large numbers.
Remark 1.3.4. Roughly speaking, this law says that if probability distributions are
selected at random, and random samples are then taken from each of these distri-
butions in any way so that the overall process is scale (or base) neutral, then the
significant digit frequencies of the combined sample will converge to the logarithmic
distribution.
At this point a proper definition of a random sequence in terms of the mantissa can be expressed.
Definition 1.3.10. A sequence of random variables X1, X2, . . . has scale-neutral mantissa frequency if
|♯{i ≤ n : Xi ∈ S} − ♯{i ≤ n : Xi ∈ sS}|/n → 0 a.s.
for all s > 0 and all S ∈ M, and has base-neutral mantissa frequency if
|♯{i ≤ n : Xi ∈ S} − ♯{i ≤ n : Xi ∈ S^{1/m}}|/n → 0 a.s.
for all m ∈ N and all S ∈ M.
Definition 1.3.11. A r.p.m. M is scale-unbiased if its expected distribution EM is scale invariant on (R+, M), and is base-unbiased if EM is base invariant on (R+, M). (Recall that M is a sub-σ-algebra of the Borel sets, so every Borel probability on R (such as EM) induces a unique probability on (R+, M).)
For the main new statistical result, M(t) here denotes the random variable M(Dt), where
Dt = ∪_{n=−∞}^{∞} [1, t) × 10^n
is the set of positive numbers with mantissa in [1/10, t/10). M(t) may be viewed as the random cumulative distribution function for the mantissa of the r.p.m. M.
Theorem 1.3.11. (Log-limit law for significant digits). Let M be a r.p.m. on (R+, M). The following are equivalent:
(i) M is scale-unbiased;
(ii) M is base-unbiased and EM is atomless;
(iii) E[M(t)] = log10 t for all t ∈ [1, 10);
(iv) every M-random k-sample has scale-neutral mantissa frequency;
(v) EM is atomless, and every M-random k-sample has base-neutral mantissa frequency;
(vi) for every M-random k-sample X1, X2, . . .,
♯{i ≤ n : mantissa(Xi) ∈ [1/10, t/10)}/n → log10 t a.s. for all t ∈ [1, 10).
The statistical log-limit significant-digit law helps justify some of the recent applications of Newcomb-Benford's Law, several of which will now be described. Remember that most of the results in this section are transcribed and commented from Hill (1996); the proofs are included by reference to each lemma and theorem.
1.4 Mean and Variance of the D_b^{(k)}
The numerical values of the Significant Digit Law for the kth digit can be computed numerically using these expressions:
E(D_b^{(k)}) = Σ_n n · Prob(D_b^{(k)} = n) (1.4.1)
Var(D_b^{(k)}) = Σ_n n² · Prob(D_b^{(k)} = n) − E(D_b^{(k)})² (1.4.2)
where n ranges over 1, . . . , 9 for the first digit and over 0, . . . , 9 from the second digit onward. As an example, suppose as usual that b = 10; then we already know the theoretical values for the distributions of the first and second significant digits, and some statistics for these distributions follow. The standard deviation is the well-known distance of a point to the mean, and the variation coefficient is the ratio between the standard deviation and the mean of the distribution. These are the central-tendency and dispersion measures most used by researchers.
Table 1.1: Mean, Variance, Standard Deviation and Variation Coefficient for the First and Second Significant Digit Distributions.

         Mean     Variance  STDEV    Variation
First    3.44024  6.05651   2.46099  0.71536
Second   4.18739  8.25378   2.87294  0.68609
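The first-digit entries in Table 1.1 can be reproduced directly from equations (1.4.1) and (1.4.2); a minimal sketch in Python (the function names are ours, not part of the thesis):

```python
import math

def benford_first_digit_pmf():
    """First significant digit probabilities under the Newcomb-Benford Law (base 10)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def mean_and_variance(pmf):
    """Mean and variance of a discrete distribution given as {value: probability}."""
    mean = sum(d * p for d, p in pmf.items())
    var = sum(d * d * p for d, p in pmf.items()) - mean ** 2
    return mean, var

pmf = benford_first_digit_pmf()
mean, var = mean_and_variance(pmf)
std = math.sqrt(var)
print(mean, var, std, std / mean)  # ≈ 3.44024, 6.05651, 2.46099, 0.71536
```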
1.5 Simulation
Let X, as usual, be a random variable having Benford's first-digit distribution. Using 1.3.6, X can be generated via
X ← ⌊10^U⌋ (1.5.1)
where U ∼ Unif(0, 1). Note that the operator ⌊·⌋ represents the integer part of the number between the symbols. The expression above is for the first significant digit only. The interesting case is how to generate random values from each of the marginals of the Generalized Newcomb-Benford distribution for all digits, not only the first. Moreover, if there is some bound on the maximum number N (as in elections), what would a "Newcomb-Benford Law under a restriction" be? How do bounds affect the generated sample?
1.5.1 Generating the r Significant Digits' Distribution in Base b
For this, remember that the Significant Digit Law can be stated through the probability mass function
Pr(X = x) = log10(1 + 1/x) (1.5.2)
for x = 10^{r−1}, 10^{r−1} + 1, . . . , 10^r − 1, where X is the number formed by the first r significant digits. Then, going directly to the definition of probability, the cumulative distribution can be written as:
F_X(x) = Pr(X ≤ x)
= Σ_{i=10^{r−1}}^{x} log10(1 + 1/i)
= Σ_{i=10^{r−1}}^{x} (log10(i + 1) − log10 i)
= log10(x + 1) − log10 10^{r−1} (by telescoping)
= log10(x + 1) − r + 1
(1.5.3)
Hence the cumulative distribution function can be stated as:
F_X(x) = log10(x + 1) − r + 1. (1.5.4)
Note that the same derivation can be done using an arbitrary base b. In order to generate values from this distribution, suppose that u ∼ Unif(0, 1) and, as usual, set u = F_X(x). Substituting in 1.5.4 and solving for x we get:
x = 10^{u+r−1} − 1;
using the floor function to get a closed-form expression,
X ← ⌊10^{u+r−1}⌋. (1.5.5)
Moreover, this can be generalized to a base b > 1, for which
X ← ⌊b^{u+r−1}⌋ (1.5.6)
where U ∼ Unif(0, 1).
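Equations (1.5.5)-(1.5.6) are an inverse-CDF sampler; a small Python sketch (function and variable names are ours):

```python
import math
import random

def benford_sample(r=1, b=10, rng=random):
    """Inverse-CDF sampler of equations (1.5.5)-(1.5.6): the number formed by
    the first r significant digits in base b is floor(b**(U + r - 1))."""
    u = rng.random()
    return math.floor(b ** (u + r - 1))

random.seed(1)
draws = [benford_sample(r=1) for _ in range(100_000)]
freq_1 = draws.count(1) / len(draws)
# freq_1 should be close to log10(2) ~ 0.30103
```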
1.5.2 Effects of Bounds on the Newcomb-Benford Generated Values
There is an open question: if there is an upper bound on the data values, what effect, if any, does this have on Newcomb-Benford's Law? For this, suppose as above that X ∼ NBenford(r, b), that is, X is a random variable distributed as a Newcomb-Benford Law for digit r and base b > 1. Using equation 1.5.5 we can generate the marginal distribution of the rth digit by applying a modular function base b. That is,
X ← ⌊b^{U+r−1}⌋ mod b (1.5.7)
with U ∼ Unif(0, 1). Note that for b = 10 and r = 2, expression 1.5.5 will generate numbers from the set {10, 11, 12, . . . , 99}. Let K be an upper bound for the experimental observations, and generate
X ← ⌊10^{U+2−1}⌋ (1.5.8)
Then
Z = ⌊10^{U+1}⌋ · I_{(0,K]}(⌊10^{U+1}⌋) (1.5.9)
where I_S is the indicator function defined as I_S(x) = 1 if x ∈ S and 0 otherwise.
When we use r = 2 we are generating from the second-digit law. There are some complications when generating numbers from 1 to 99, since in this case there are two different types of numbers: those from 1 to 9, where the number of digits is one, and those from 10 to 99, where the number of digits is two. Since equation 1.5.7 depends on the number of digits to simulate, there is the need to simulate proportionally from the set of numbers 1 to 9 and from the set of numbers 10 to 99. The proportion for the first set of numbers is 1/9 and 8/9 for the second set. The trick here is to generate 1/9 of the sample size using random numbers from a N-B distribution with r = 1 and the other 8/9 of the desired sample from a
N-B distribution with r = 2. This can be generalized for larger r's. The main topic in this section is to know how the N-B Law acts under bounds; for this, some notation is needed.
(i) p_i^B is the Newcomb-Benford probability for the number i;
(ii) p_i^C is, under the constraint N ≤ K, the proportion of the numbers in the set that will be sampled;
(iii) p_i^U is the proportion of the numbers in the set under no constraints.
As an observation, if there is no bound then p^C = p^U.
Example 1.5.1. Suppose that K = 52; then p_1^C = 11/52 and p_1^U = 1/9.
Definition 1.5.1. The "Constrained N-B Law Distribution" is defined as:
P(D_i = x | N ≤ K) = [p^B(D_i) · p_i^C / p_i^U] / [Σ_k p^B(D_k) · p_k^C / p_k^U] (1.5.10)
Suppose that we want a bound at N = 65; then let us compare how close the theoretical function 1.5.10 is to the simulation using the bound. The following figures present different simulations with different bounds or constraints. The conclusion is that the agreement between the theoretical law under constraints, 1.5.10, and the simulation is excellent. In fact, equation 1.5.10 may be considered the "Constrained Newcomb Benford Law"; to our knowledge this is the first time it has been introduced. Note that 1.5.10 can be adapted for lower bounds also.
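Equation (1.5.10) can be sketched for the first digit on the range 1-99; the `first_digit` helper and the counting of p^C and p^U over integer ranges are our reading of the construction, not code from the thesis:

```python
import math

def first_digit(n):
    """Leading significant digit of a positive integer (hypothetical helper)."""
    while n >= 10:
        n //= 10
    return n

def constrained_benford_pmf(K):
    """Sketch of the 'Constrained N-B Law' of equation (1.5.10) for the first
    digit, with values drawn from 1..99 and an upper bound K.
    p_b: unconstrained Newcomb-Benford probability of digit d;
    p_c: fraction of {1, ..., K} whose first digit is d;
    p_u: fraction of {1, ..., 99} whose first digit is d."""
    weights = {}
    for d in range(1, 10):
        p_b = math.log10(1 + 1 / d)
        p_c = sum(1 for n in range(1, K + 1) if first_digit(n) == d) / K
        p_u = sum(1 for n in range(1, 100) if first_digit(n) == d) / 99
        weights[d] = p_b * p_c / p_u
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()}
```

With K = 99 the constraint is vacuous (p_c = p_u), and the function returns the plain first-digit law; with K = 52 the mass of digit 1 grows, mirroring Example 1.5.1.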
[Figure 1.1: Constrained Newcomb Benford Law compared with the simulation bounded at K ≤ 99, for numbers between 1 and 99 (curves: simulated bound, theory bound, NB law). Here there is effectively no restriction.]
[Figure 1.2: Constrained Newcomb Benford Law compared with the simulation bounded at K ≤ 50, for numbers between 1 and 99 (curves: simulated bound, theory bound, NB law).]
[Figure 1.3: Constrained Newcomb Benford Law compared with the simulation bounded at K ≤ 20, for numbers between 1 and 99 (curves: simulated bound, theory bound, NB law).]
Chapter 2
Empirical Analysis
2.1 Introduction
The analysis of uncertainty is as old as civilization itself, and there are several interpretations of the phenomena that rule Nature in the most general case. In modern times the basis of this theory comes from the lectures of Bernoulli, Laplace and Thomas Bayes. Characterizing knowledge about chance and uncertainty using the measure tools provided by logic is the fundamental baseline for most of the results here. There are distinctions between Classical and Bayesian statistics; we discuss at least the null hypothesis probabilities and the p-value. We then give a concise analysis of Newcomb-Benford sequences in a Bayesian scheme, using state-of-the-art tools to calculate how close the data is to the Law. Finally, we present some examples in order to see how the Newcomb-Benford Law behaves as the mixture of probability random variables gets more complicated.
2.2 Changing P-Values into Null Hypothesis Probabilities H0
The p-value is the probability of getting values of the test statistic as extreme as, or more extreme than, that observed if the null hypothesis is true. For a single sample the χ² statistic is given as
χ² = Σ_{D=1}^{9} (f_D − Prob(D_{10}^{(1)} = D))² / Prob(D_{10}^{(1)} = D) (2.2.1)
where f_D is the observed relative frequency of first digit D among the data entries. This is the basis of a classical test of the null hypothesis that the data follow the Newcomb-Benford Law. If the null hypothesis is accepted, the data "passed" the test; if not, the possibility opens that the data were manipulated. As we have presented, most of the data that we intend to analyze behave as a random mix of models. We have to specify our null hypothesis, H0, and its alternative, H1. In the electoral-process setting, H0 means that there is no intervention in the data, while H1 means that there is intervention in the data-gathering process. It is important to measure the null hypothesis against its evidence: in our case, if the data obey Benford's Law, this suggests that there is no intervention in the electoral votes.
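Equation (2.2.1), as written here in relative-frequency form, can be sketched as follows (the helper name is ours):

```python
import math

def benford_chi_square(digit_freqs):
    """Chi-square distance, in the relative-frequency form of eq. (2.2.1),
    between observed first-digit frequencies and the Newcomb-Benford Law.
    digit_freqs: dict {digit: observed relative frequency}, digits 1..9."""
    chi2 = 0.0
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)
        observed = digit_freqs.get(d, 0.0)
        chi2 += (observed - expected) ** 2 / expected
    return chi2

# A sample that follows the law exactly gives a statistic of zero.
exact = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
print(benford_chi_square(exact))  # 0.0
```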
There is a common misunderstanding between the probability of the null hypothesis and the p-value. For a null hypothesis H0 we have (Berger O. J., 2001):
Pval = Prob(result equal to or more extreme than the data | null hypothesis)
Table 2.1: p-values in terms of hypothesis probabilities.

Pval    P(H0|data)
0.05    0.29
0.01    0.11
0.001   0.0184
If the p-value is small (e.g. p-value < 0.05 or less) there is a significant observation, but p-values are not null hypothesis probabilities. If P(H0) = P(H1) and Pval < e^{-1}, then:
P(H0|data)/P(H1|data) ≥ −e · Pval · loge(Pval), which implies
P(H0|data) ≥ 1/(1 + [−e · Pval · loge(Pval)]^{-1}).
A full discussion of this matter can be found in (Berger O. J., 2001). It is more natural to calculate the p-value with respect to the goodness-of-fit test of the proportions of the observed digits versus the proportions specified by the Newcomb-Benford Law. As we can see in Table 2.1, the correction is quite important for improving the calculations: the table shows how much larger this lower bound is than the p-value. So a small p-value (e.g. Pval = 0.05) implies that the posterior probability of the null hypothesis is at least 0.29, which is not very strong evidence for rejecting the hypothesis. As an alternative procedure we can use the BIC (Bayesian Information Criterion), or Schwarz's criterion (Berger J.O. and Pericchi L. R., 2001), which takes the sample size into account in explicit form:
log[P(H0|data)/P(H1|data)] ≈ log(Likelihood Ratio) + ((k1 − k0)/2) · log(N) (2.2.2)
The likelihood ratio can be calculated from a multinomial density: in the numerator we have the proportions assigned by H0 and in the denominator the data digit proportions. The evidence against the null hypothesis can be measured using the BIC. In this case the null hypothesis represents that the data follow a N-Benford's Law distribution.
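The calibration bound above can be reproduced numerically; a sketch assuming equal prior probabilities and Pval < 1/e (names are ours):

```python
import math

def posterior_null_lower_bound(p_value):
    """Lower bound on P(H0 | data) from a p-value via the -e * p * log(p)
    odds bound, assuming P(H0) = P(H1) = 1/2 and p_value < 1/e."""
    if not 0 < p_value < math.exp(-1):
        raise ValueError("calibration requires 0 < p < 1/e")
    odds_bound = -math.e * p_value * math.log(p_value)
    return 1.0 / (1.0 + 1.0 / odds_bound)

for p in (0.05, 0.01, 0.001):
    print(p, round(posterior_null_lower_bound(p), 4))
# reproduces Table 2.1: 0.05 -> ~0.29, 0.01 -> ~0.11, 0.001 -> ~0.0184
```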
2.2.1 Posterior Probabilities with Uniform Priors
Let Υ1 be the set of integers in [1, 9] and Υ2 the set of integers in [0, 9]. The elements that may appear when the first digit is observed are members of Υ1; those observed at the second or any later digit position are members of Υ2. When the same index can apply to the first or to any other digit, that is, to a member of Υ1 or Υ2, we will refer to it as a member of Υ. Let
Ω = {p1 = p1^0, p2 = p2^0, . . . , pk = pk^0 | Σ_{i=1}^{k} pi^0 = 1}
with k = 9 in the case of the first digit, extended to k = 10 for the other digits. Note that, using the sets defined above, we can rewrite Ω in terms of Υ as follows:
Ω = {pi = pi^0 ∀ i ∈ Υ | Σ_{i∈Υ} pi^0 = 1}.
Then our hypotheses can be written as:
H0 : Ω, H1 : Ω^c (2.2.3)
where Ω^c means the complement of Ω; in other words,
Ω^c = {pi ≠ pi^0 for some i ∈ Υ}.
Assume a uniform prior for the values of the pi's; then
Π_u(p1, p2, . . . , pk) = Γ(k) = (k − 1)! (2.2.4)
We can write the posterior probability of H0 in terms of the Bayes factor. Let x be the data vector; by definition of the Bayes factor we have that:
B01 = [P(H0|x) P(H1)] / [P(H1|x) P(H0)] (2.2.5)
If we have nested models and P(H0) = P(H1) = 1/2, then the Bayes factor reduces to
B01 = P(H0|x)/P(H1|x) = P(H0|x)/(1 − P(H0|x)) (2.2.6)
therefore
P(H0|x) = B01/(B01 + 1) (2.2.7)
For the ith significant digit, form from the data the count vector n = (n1, n2, . . . , nk), where ni is the number of times that digit i appears in the data. Recall that if we observe the first digit then i ∈ Υ1, while for the second digit and onwards i ∈ Υ2; more generally, by the convention above, i ∈ Υ in either case. Using the definition applied to problem 2.2.3, we have
B01 = f(n1, n2, . . . , nk | Ω) / ∫_{Ω^c} f(n1, n2, . . . , nk | p) Π_U(p1, p2, . . . , pk) dp1 dp2 . . . dp_{k−1}
with Σ_{i∈Υ} pi = 1 and pi ≥ 0 for all i ∈ Υ. Substituting in our problem,
B01 = [ (n!/Π_{i=1}^{k} ni!) Π_{i=1}^{k} p_{i0}^{ni} ] / [ (k − 1)! ∫ (n!/Π_{i=1}^{k} ni!) Π_{i=1}^{k} p_i^{ni+1−1} dp ]
where the integral is over the probability simplex. Cancelling several factorial terms and using the Dirichlet integral identity
∫ Π_{i=1}^{k} p_i^{ni+1−1} dp = Π_{i=1}^{k} Γ(ni + 1) / Γ(n + k),
we arrive at a simplified expression for B01:
B01 = [ p_{10}^{n1} p_{20}^{n2} · · · p_{k0}^{nk} ] / [ (k − 1)! Π_{i=1}^{k} Γ(ni + 1) / Γ(n + k) ] (2.2.8)
Then, since we already know how to get the posterior probability from the Bayes factor (using 2.2.7), substituting B01 we have:
P(H0|x) = B01 / (B01 + 1), with B01 given by 2.2.8. (2.2.9)
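Equations (2.2.8)-(2.2.9) can be sketched numerically in log space to avoid overflow in the Gamma terms; the function names and the use of `math.lgamma` are our choices, not the thesis' implementation:

```python
import math

def log_bayes_factor_benford(counts):
    """Sketch of equation (2.2.8): log of the Bayes factor B01 for H0 'the
    digit proportions equal the first-digit Newcomb-Benford probabilities'
    against a uniform Dirichlet(1, ..., 1) alternative.
    counts: observed counts for digits 1..9."""
    k = len(counts)
    n = sum(counts)
    p0 = [math.log10(1 + 1 / d) for d in range(1, k + 1)]
    log_num = sum(ni * math.log(pi) for ni, pi in zip(counts, p0))
    # denominator: (k - 1)! * prod_i Gamma(n_i + 1) / Gamma(n + k)
    log_den = (math.lgamma(k)
               + sum(math.lgamma(ni + 1) for ni in counts)
               - math.lgamma(n + k))
    return log_num - log_den

def posterior_null(counts):
    """P(H0|x) = B01 / (B01 + 1), equation (2.2.7), computed stably."""
    log_b = log_bayes_factor_benford(counts)
    if log_b > 700:  # B01 too large to exponentiate; posterior is ~1
        return 1.0
    b01 = math.exp(log_b)
    return b01 / (b01 + 1)
```

Counts that match the Benford proportions drive the posterior toward 1, while uniform counts drive it toward 0.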
There are different forms for calculating the probability of the null hypothesis given certain data; each one depends on the prior knowledge and on the type of Bayes factor or approximation in use. For instance, P(Frac) is based on the Fractional Bayes Factor (Berger J.O. and Pericchi L. R., 2001):
BF01^FRAC = [ f0(data|p0) / ∫_Ω f1(data|p) π^N(p) dp ] · [ ∫_Ω f1^{r/n}(data|p) π^N(p) dp / f0^{r/n}(data|p0) ] (2.2.10)
where p0 is given by the Newcomb-Benford Law and r is the number of adjustable parameters minus one, that is, r = 8 or r = 9 for the first and second digit respectively. P(Approx) is based on the following approximation to the Bayes factor:
BF01^Approx = ( f0(data|p0) / f1(data|p̂) )^{1 − r/n} · (n/r)^{r/2} (2.2.11)
where p̂ is the maximum likelihood estimator of p. And the GBIC is based on a still unpublished proposal by (Berger J.O., 1991), which in turn is based on the prior in (Berger J.O., 1985).
2.3 Multinomial Model Proposal
In the following, let i denote a digit, i ∈ Υ, as usual. The digit counts can be thought of (Ley, 1996) as a random vector N distributed as a multinomial distribution with parameter vector θ; thus
f(N|θ) = [ (Σ_{j∈Υ} nj)! / Π_{j∈Υ} nj! ] Π_{j∈Υ} θj^{nj} (2.3.1)
As usual we will assume a uniform prior for θ, with mean 1/k for each of the θj, where k is the cardinality of the set Υ; that means k = |Υ1| = 9 if we are working with the first digit and k = |Υ2| = 10 if the observed significant digit is the second or later. The natural conjugate prior is a Dirichlet density, whose general form is
Di_k(θ|α) = c · (1 − Σ_{l=1}^{k} θl)^{α_{k+1}−1} Π_{l=1}^{k} θl^{αl−1} (2.3.2)
where
c = Γ(Σ_{l=1}^{k+1} αl) / Π_{l=1}^{k+1} Γ(αl)
and α = (α1, α2, . . . , α_{k+1}) is such that every αl > 0, and θ = (θ1, θ2, . . . , θk) with 0 < θl < 1 and Σ_{l=1}^{k} θl ≤ 1. For simplicity we will take each αl = α; thus
g(θ) = [ Γ(kα) / Γ(α)^k ] Π_{j∈Υ} θj^{α−1} (2.3.3)
The posterior distribution of θ is then a Dirichlet with parameters α + n1, α + n2, . . . , α + nk, so that
h(θ|n) = [ Γ(kα + Σ_{j∈Υ} nj) / Π_{j∈Υ} Γ(α + nj) ] Π_{j∈Υ} θj^{α+nj−1} (2.3.4)
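The conjugate update behind equations (2.3.3)-(2.3.4) can be sketched as follows; the digit counts shown are hypothetical, not from the thesis data:

```python
def dirichlet_posterior(counts, alpha=1.0):
    """Conjugate update sketch for the model of Section 2.3: multinomial digit
    counts with a symmetric Dirichlet(alpha, ..., alpha) prior. Returns the
    posterior Dirichlet parameters and the posterior mean of each theta_j."""
    params = [alpha + n for n in counts]
    total = sum(params)
    means = [p / total for p in params]
    return params, means

# Hypothetical first-digit counts from some data set:
counts = [30, 18, 12, 10, 8, 7, 6, 5, 4]
params, means = dirichlet_posterior(counts, alpha=1.0)
```

The posterior mean of θ_j is (α + n_j)/(kα + Σ n_j), which shrinks the raw proportions toward the uniform prior mean 1/k.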
2.4 Examples
Our aim now is to show empirically how effective the reasoning 1.3.1 given by (Hill, 1996) can be. Our first example uses a distribution from the exponential family. Most applications that involve a multilevel analysis are called hierarchical models. This type of model allows a more "objective" approach to inference by estimating the parameters of prior distributions from data, rather than requiring them to be specified using subjective information (Gelman A., 1995, Carlin Bradley P., 2000).
Example 2.4.1. The simplest model that we present here is a Poisson model with a fixed parameter λ. For this first case, 500 values were simulated with λ = 100. The resulting P(H0|data) = 0, which indicates how poorly this model simulates a Benford process. As Hill stated, and as we discussed in earlier chapters, the NB Law can be satisfied if there is a random mixture of distributions. Figure 2.2 shows how poorly the first-digit frequencies of the simulated values compare with the N-B Law for the first digit.
Remember that this model is the simplest one; it does not have a hierarchical structure.
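The failure in Example 2.4.1 can be reproduced with a quick Monte Carlo sketch; the seed, sample size, and helper names below are our choices, so the exact frequencies will differ from the thesis run:

```python
import math
import random

def first_digit(n):
    """Leading significant digit of a positive integer."""
    while n >= 10:
        n //= 10
    return n

def poisson_sample(lam, rng):
    """Knuth's simple Poisson sampler (fine for moderate lambda)."""
    l, k, p = math.exp(-lam), 0, 1.0
    while p > l:
        k += 1
        p *= rng.random()
    return k - 1

rng = random.Random(0)
draws = [poisson_sample(100, rng) for _ in range(5000)]
freqs = {d: sum(1 for x in draws if first_digit(x) == d) / len(draws)
         for d in range(1, 10)}
# Digit 1 appears far more often than log10(2) because Poisson(100)
# concentrates near 100; digits 2-6 are essentially absent.
```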
Example 2.4.2. The following is a simple hierarchical model with two stages, where some of the parameters are fixed; it is a model frequently used in actuarial sciences and quality control.

[Figure 2.1: Posterior intervals for the first and second digits using symmetric boxplots. (a) First digit boxplot; (b) second digit boxplot.]
n ∼ Pois(λν)
λ ∼ G(α, β) (2.4.1)
The marginal probability distribution is given by
Pg(n|α, β, ν) = ∫_0^∞ Pois(n|λν) G(λ|α, β) dλ
The resulting expression is known as a generalization of the negative binomial distribution, Nb(n | α, β/(β+ν)):
Pg(n|α, β, ν) = ∫_0^∞ [e^{−λν}(λν)^n / Γ(n + 1)] · [β^α λ^{α−1} e^{−βλ} / Γ(α)] dλ
= [β^α ν^n / (Γ(α)Γ(n + 1))] ∫_0^∞ λ^{n+α−1} e^{−(β+ν)λ} dλ
= [Γ(n + α) / (Γ(n + 1)Γ(α))] (β/(β + ν))^α (ν/(β + ν))^n
First we will treat the Gamma part of the model above as a mixture of different distributions of the parameter λ in the Poisson distribution function. The set of λ values will be {10, 20, 30, 50, 70}; each block of the overall simulated data corresponds to the Poisson model, with blocks of length 50. Performing the Benford analysis we get P(H0|data) = 0.878719187. Here we can note that even for this small example of mixtures the Newcomb-Benford Law works: the graph in Figure 2.3 shows how close the real Law is to the simulated values.
In Model 2.4.1, instead of using a discrete set of values for the λ distribution, here we simulate using a uniform prior on the parameters of the Gamma distribution function. The model implemented in this example goes as follows:
n ∼ Pois(λν)
λ ∼ G(α, β)
α, β ∼ Unif(1, 500) (2.4.2)
This simulation is an extension of model 2.4.1. In general this is a Negative Binomial family of distributions; indeed, it is a mixture of distributions itself. In Figure 2.4 we can appreciate the histogram of the simulated values and the proportions of the significant digits against the N-B first-digit law proportions. Here the probability of the null hypothesis given the data is essentially 1. Table 2.2 shows a summary of the overall results.
Example 2.4.3. The Multinomial Model is a rich source of mixtures: if you are observing an electoral process, you can see different parameters for the probability of each candidate per region in a country. As a little experiment, suppose that you have two candidates and that some of the persons in an electoral college of a particular country do not want to vote; then for that particular region you will have a parameter vector p = [p1, p2, p3] with p1 + p2 < 1 and p3 = 1 − p1 − p2. Recall that p3 is the probability that a person votes for neither of the candidates. For this particular simulation, 1000 electoral colleges are simulated in 10 regions with, as we said, two candidates. The joint density of all the data is presented in Figure 2.5. The P(H0|data) = 1 for the 29058 simulated data points. A summary of the examples is presented in Table 2.2.
Table 2.2: Summary of the results of the above examples.

Example               Simulated length of data  P(H0|data)  p-value
Poisson Model         500                       0           0
Pois-Gamma Discrete   250                       0.991       0.008
Neg-Binomial          500                       0.989       0.001
Multinomial           29058                     0.999       0.002
2.5 Conclusions of the examples
Note that as the hierarchy of each model is made more complicated, an approach to the N-B Law frequencies in the first digit is found more easily: the more complicated the model, the closer the approach to the N-B Law. Restrictions in the parameters affect the statistical closeness to the Benford Law.
[Figure 2.2: Histogram of the simplest Poisson model with λ = 100, and first-digit proportions of the simulation against the theoretical Newcomb-Benford first significant digit frequencies.]
[Figure 2.3: Histogram of the discrete Gamma-Poisson model (partitioned according to the different λ parameters), and first-digit proportions of the simulation against the theoretical Newcomb-Benford first significant digit frequencies.]
[Figure 2.4: Histogram of the Negative Binomial model, and first-digit proportions of the simulation against the theoretical Newcomb-Benford first significant digit frequencies.]
[Figure 2.5: Histogram of the hierarchical Multinomial model, and first-digit proportions of the simulation against the theoretical Newcomb-Benford first significant digit frequencies.]
Chapter 3
Stock Indexes’ Digits
3.1 Introduction
Several studies have focused on patterns in the digits of closely followed stock market indexes. Here we present an analysis similar to that of Ley (1996), in which the series of 1-day returns on the Dow Jones Industrial Average Index (DJIA) and the Standard and Poor's Index (S&P) is shown to agree reasonably well with the Newcomb-Benford Law. In this chapter we present the case of the Puerto Rican Stock Index (PRSI) return levels as part of the anomalous family of numbers, focusing on patterns in the digits of the PRSI one-day return levels.
3.2 Statistical Analysis
Let pt be the closing value of the PRSI at time t. The one-day return on the index, rt, is defined as
rt = [ln(p_{t+1}) − ln(p_t)] / dt · 100 (3.2.1)
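Equation (3.2.1) and the digit extraction can be sketched as follows; the price values below are hypothetical, not actual PRSI closings, and dt (the number of days between trading days) is taken as 1:

```python
import math

def one_day_returns(prices, dt=1):
    """Sketch of equation (3.2.1): r_t = (ln p_{t+1} - ln p_t) / d_t * 100."""
    return [(math.log(prices[t + 1]) - math.log(prices[t])) / dt * 100
            for t in range(len(prices) - 1)]

def first_digit(x):
    """First significant digit of a nonzero number (sign ignored)."""
    m = abs(x)
    while m >= 10:
        m /= 10
    while m < 1:
        m *= 10
    return int(m)

# Hypothetical closing values, not actual PRSI data:
prices = [1000.0, 1012.0, 1005.0, 1021.0, 1018.0]
returns = one_day_returns(prices)
digits = [first_digit(r) for r in returns if r != 0]
```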
[Figure: Puerto Rico Stock Index timeline from 12/29/1995 to 2/27/2004, with rt computed at dt = 1, and histogram of the rt.]
where d_t is the number of days between trading days t and t + 1. Looking only at the first significant digit of r_t for d_t = 1, we obtain a vector x = (x_1, x_2, ..., x_9), where x_i is the number of observations whose first significant digit is i ∈ {1, 2, ..., 9}. In the case that we study the second digit, i ∈ {0, 1, ..., 9}. This can be thought of as a random variable X following a multinomial distribution with parameter vector θ; thus

f(x \mid \theta) = \frac{\left(\sum_{j=1}^{9} x_j\right)!}{\prod_{j=1}^{9} x_j!} \prod_{j=1}^{9} \theta_j^{x_j} \qquad (3.2.2)
As usual we will assume a uniform prior for θ, with mean 1/9 for each component of θ. The natural conjugate prior is a Dirichlet density, which has the following general form:

\mathrm{Di}_k(x \mid \alpha) = c \left(1 - \sum_{l=1}^{k} x_l\right)^{\alpha_{k+1} - 1} \prod_{l=1}^{k} x_l^{\alpha_l - 1} \qquad (3.2.3)
where

c = \frac{\Gamma\left(\sum_{l=1}^{k+1} \alpha_l\right)}{\prod_{l=1}^{k+1} \Gamma(\alpha_l)}

and α = (α_1, α_2, ..., α_{k+1}) with every α_l > 0, and x = (x_1, x_2, ..., x_k) with 0 < x_l < 1 and \sum_{l=1}^{k} x_l less than unity. For simplicity we will take each α_i = α; thus
g(\theta) = \frac{\Gamma(9\alpha)}{\Gamma(\alpha)^{9}} \prod_{j=1}^{9} \theta_j^{\alpha - 1} \qquad (3.2.4)
The posterior distribution of θ is a Dirichlet with parameters α + x_1, α + x_2, ..., α + x_9. Then we have that

h(\theta \mid x) = \frac{\Gamma\left(9\alpha + \sum_{j=1}^{9} x_j\right)}{\prod_{j=1}^{9} \Gamma(\alpha + x_j)} \prod_{j=1}^{9} \theta_j^{\alpha + x_j - 1} \qquad (3.2.5)
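A minimal numerical sketch of this conjugate update, assuming the uniform case α = 1; the first-digit counts below are invented for illustration.

```python
# Dirichlet-multinomial update: with a Dirichlet(alpha, ..., alpha) prior on theta,
# the posterior after observing digit counts x is Dirichlet(alpha + x_1, ..., alpha + x_9),
# whose mean for digit j is (alpha + x_j) / (9*alpha + sum(x)).
alpha = 1.0                             # uniform prior over the 9-simplex
x = [30, 18, 12, 10, 8, 7, 6, 5, 4]    # illustrative first-digit counts, not real data

post_params = [alpha + xj for xj in x]
total = 9 * alpha + sum(x)
post_mean = [a / total for a in post_params]
```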
3.3 Results
By definition in the tables, Diff-pt represents the difference of consecutive prices, and rt is the one-day return defined in (3.2.1) above.
Data       P-val    P(H0|data)   P(Approx)
1st digit
Diff-pt    0.0000   0.0000       0.0000
rt         0.1404   0.4283       0.8969
2nd digit
Diff-pt    0.4322   0.4964       0.9853
rt         0.3498   0.4997       0.9798

Table 3.1: N-Benford's for 1st and 2nd digit: p-values, probability null bound and approximate probability for the different increments.
Data       P(H0|data)   num. observ.
1st digit
Diff-pt    0.0000       109
rt         0.9935       109
2nd digit
Diff-pt    0.9996       108
rt         1.0000       108

Table 3.2: N-Benford's for 1st and 2nd digit: the probability of the null hypothesis given the data and the length of the data.
As in Ley (1996), we found that:

1. Puerto Rico's Stock Market Index is part of the anomalous family of numbers.

2. Small changes, such as 1%, are more common than larger ones.

3. Since the PRSI obeys the Newcomb-Benford Law, this stock index is part of a mixture of random distributions.
Chapter 4
On Image Analysis in the Microarray Intensity Spot
An application of the results of Chapter 2 is part of the image analysis of the microarray intensity spot. This result immediately implies a relation between the use of a normalization criterion and a well-fitted response to the Newcomb-Benford Law. It is already known that microarray intensity spots obey the Newcomb-Benford Law (Hoyle David C. (2002)). This chapter can be viewed as an extension of some of the results obtained by Hoyle David C.
4.1 Introduction
Image analysis is an important aspect of the microarray tool. DNA microarrays represent an important new method for determining the complete expression profile of a cell. In "spotted" microarrays, slides carrying spots of target DNA are hybridized to fluorescently labeled cDNA from experimental and control cells, and the arrays are imaged at two or more wavelengths. The aim of the Newcomb-Benford Law approach is a general pixel-by-pixel analysis of individual spots that can be used to estimate these sources of error and establish the precision and accuracy with which gene expression ratios are determined. A well-established filtering step is effective in significantly improving the reliability of databases containing information from multiple expression experiments. For this, a normalization criterion that includes background spot intensity measures is used.
The goal is to standardize the removal of sources of systematic variation in microarray experiments which affect the measured gene expression levels; this will allow cross-comparison of different experiments. The types of variation to normalize are:

1. Differences in labeling efficiency between the two dyes.

2. Differences in the power of the two lasers.

3. Differing amounts of RNA labeled between the two channels.

4. Spatial biases in ratios across the surface of the microarray.
The N-B Law can give us tools to identify excess noise in the intensity spots of gene expression data. We analyze data provided by Y. Robles (2003). A brief description of the experiment proceeds as follows:
4.2 Experiment
In general, a microarray calibration proceeds as described below. Several levels of replication are embedded in the design of the calibration experiments, and the resulting data provide information on the relative importance of variations due to spots, labels, and slides. Based on this information, we formulate an approach to the analysis of comparative experiments. The main components are as follows:
1. Extract intensities from the scanned images of both dyes.

2. Detect and filter poor-quality genes on a slide using measurements from multiple spots. This procedure is not applicable in singly spotted designs.

3. Perform slide-dependent nonlinear normalization of the log-ratios of the two channels.

4. Use hierarchical model-based analysis on the normalized log-ratio scale, where assessment of the significance of gene effects is aided by statistical information obtained from calibration experiments, if available.
After hybridization and washing, slides are scanned by a laser or CCD scanner (Speed T. P. (2003)). The scanner then produces green Cy3 and red Cy5 16-bit TIFF image files. The intensity of each pixel in these images thus ranges from 0 to 2^16 - 1 (= 65,535). Image analysis in microarray experiments is a set of processes to extract meaningful intensities of each spot from the raw image for further analysis.
The major components usually include:
1. Locating the spots. We need to first locate spot positions. Information like
number of spots and prior rough positions are known from the arrayer (spotting
machine), but an algorithm is needed to search for the exact location in the
neighborhood. Usually some manual adjustments are needed.
2. Segmentation. This consists of deciding the shape of the spots and identifying
foreground and background pixels. Some algorithms use only fixed diameters
and round spot regions for each spot, some allow flexible diameters but use
only round shapes, and some others allow both flexible diameters and irregular
shapes. Background and foreground regions are then determined.
3. Intensity extraction. Local background intensities of each spot are then estimated and subtracted from the foreground intensities to account for cross-hybridization of non-target genes and fluorescence emitted from other chemicals.
Various statistics including mean intensities, median intensities, and standard devia-
tion of the background and foreground of each dye are reported. Some of the statistics
are used to provide intensity extraction, and others are used for quality control. The
spot summary information is very useful for the automation of quality filtering and
further analysis.
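The extraction step above can be sketched in a few lines; the median-minus-median rule is only one of the conventions mentioned, and the pixel values are toy numbers, not real scanner output.

```python
from statistics import median

def spot_intensity(foreground, background):
    """Background-corrected spot intensity: the median foreground pixel value
    minus the median local-background pixel value (one common convention)."""
    return median(foreground) - median(background)

# Toy pixel values for one spot in one channel (16-bit scanners report 0..65535):
fg = [1200, 1350, 1280, 1500, 1100]
bg = [200, 210, 190, 205]
corrected = spot_intensity(fg, bg)
```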
4.2.1 Microarray measurements and image processing
Although the cDNA microarray experiment has been developed over several years, its image analysis is still an active area of research (Yang Y. H. et al. 2002). It has some
major difficulties. First of all, each cDNA clone usually contains several hundreds of
pixels, and the locations and shapes of these spots may vary depending on the quality
of the experiment. No fully automated algorithm can perfectly locate the spots and
identify the regions on every slide, and most current software provides easy interface
for manually adjusting spots that are wrongly identified by its algorithm. For some
bad quality slides, these corrections may require tremendous labor. Second, a fast
algorithm is necessary to deal with large data sets. Finally, many statistics are pro-
posed to serve as quality indices. They are very useful in the case of misidentifying
spot locations, local slide contamination, and poor spot quality. Some statistics are
useful only to test some specific artifacts, however, and a good method to combine
these statistics for correctly filtering all kinds of defective genes is not available yet.
Researchers often use a "log ratio" between the expression values of a gene in two arrays as the criterion to identify differentially expressed genes. Between duplicate arrays, we expect these log ratios of expression values, based on a good expression index, to be close to zero.
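A quick sketch of that criterion; the expression values for the duplicate arrays are invented for illustration.

```python
import math

def log_ratios(expr_a, expr_b, base=2):
    """Log-ratios of each gene's expression value between two arrays;
    for duplicate arrays these should be close to zero."""
    return [math.log(a / b, base) for a, b in zip(expr_a, expr_b)]

# Illustrative expression values for three genes on duplicate arrays:
dup1 = [1050.0, 980.0, 2010.0]
dup2 = [1000.0, 1000.0, 2000.0]
ratios = log_ratios(dup1, dup2)
```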
4.3 Results
The following tables present some of the indexes used to verify the evidence against the Newcomb-Benford Law (NBL). Recall that in this example we use intensity values provided by the microarray data. It is important to remark how high the posterior probability given the data is in these tests. As we have said, this shows how poorly the p-value can measure and discriminate closeness to the Significant Digit Law.

Note from Table 4.1 that for the first-digit test the raw intensities (Int 1 and Int 2) and Adjusted Intensity 2 do not pass the NBL test, while Adjusted Intensity 1 does; Intensity 1 fails at the first digit but passes at the second. For the second digit every intensity except Intensity 2 passes the test. The first-digit law thus holds for almost none of the intensities.
Some of the results are summarized as follows:

1. The adjusted intensity spots (after the normalization transformation) are richer in mixtures of distributions than the raw-data intensities.

2. It is possible to improve the quality of the procedures with just this simple test; in this particular case a Newcomb-Benford test can be done in order to get an overall assessment of the effectiveness of the normalization transformation of the raw spot intensity data.

Spot       P(H0|data)   P(Approx)   P(Frac)   Pr(BIC)
1st Digit
Int 1      0.000        0.000       0.000     0.000
Int 2      0.000        0.000       0.000     0.000
Adj Int 1  0.9999       0.9999      0.9999    0.9999
Adj Int 2  0.0034       0.0001      0.0001    0.7869
2nd Digit
Int 1      0.9999       0.9999      0.9999    1
Int 2      0.0000       0.0000      0.0000    0.0361
Adj Int 1  0.9999       0.9999      0.9999    1
Adj Int 2  0.9999       0.9999      0.9999    1

Table 4.1: N-Benford's for 1st and 2nd digit: P(H0|data), P(Approx), P(Frac) and Pr(BIC).

We suggest that further research in this direction is likely to reveal additional properties of how the Newcomb-Benford Law acts on the underlying spot intensities of the microarray.
Spot       Observed   p-values      P(H0|data)
1st Digit
Int 1      1185       0             0
Int 2      1185       0             0
Adj Int 1  1166       0.02996       0.22222
Adj Int 2  1137       0.00000       0.00000
2nd Digit
Int 1      1185       0.220847098   0.475522952
Int 2      1185       0.00000       0.00000
Adj Int 1  1162       0.331673888   0.49874
Adj Int 2  1137       0.77854       0.34632

Table 4.2: N-Benford's for 1st and 2nd digit; the number of observations and p-values.
[Figure: histograms of Microarray Intensity 1, Adjusted Intensity 1, Intensity 2, and Adjusted Intensity 2 (values vs. frequencies).]

Figure 4.1: Histograms of the intensities and the adjustments.
[Figure: Newcomb-Benford first- and second-digit law proportions compared with the digit proportions of Microarray Intensities 1 and 2.]

Figure 4.2: N-Benford's Law compared with intensity microarray spots without adjustment.
[Figure: Newcomb-Benford first- and second-digit law proportions compared with the digit proportions of Microarray Adjusted Intensities 1 and 2.]

Figure 4.3: N-Benford's Law compared with intensity microarray spots with adjustment.
Chapter 5
Electoral Processes in a Newcomb-Benford Law Context
5.1 Introduction
As we specified in the introduction (see Chapter 1), the electoral process is part of the non-dimensional data. All the examples presented below are part of a democratic system; in order to be rigorous, a democratic electoral process is defined here.
Definition 5.1.1 (Democratic Electoral Process). A system is democratic if:

(i) it permits only eligible voters to vote (e.g., registered citizens);

(ii) it ensures that each eligible voter can vote only once and that each vote is equally weighted (equality).
Some of the principles that rule a democratic election are:

1. The doorkeeper principle (only the population, not outsiders). Each person desirous of voting must be personally and positively identified as an eligible voter and permitted to complete no more than the correct number of ballot papers.
2. The secrecy principle. Admitted voters must be permitted to vote in secret.
3. The verification, tally and audit principle. There must be some mecha-
nism to ensure that valid votes, and only valid votes, are received and counted.
This system must be sufficiently open and transparent to allow scrutiny of the votes and, subsequently, of the working of the political process. Our attention will be focused on organizing the description of a democratic election.
5.2 General Democratic Election Model
We assume in particular that there are voting centers C_i for i = 1, ..., K. Let M be a random variable equal to the number of different votes in a particular election. We need a level that describes each of the centers, terminals, or tables. Moreover, we assume that the electronic vote is different from the manual (traditional) vote. An extension of the work of Katz Jonathan (1999) will be presented here. As part of the electoral process modeling, a statistical model for multiparty electoral data is needed. There are several structures that in practice precede the electoral scheme. Our aim is to present a model that simulates electoral polls close to a real democratic election; there is literature that develops this topic (Gelman A. (1995)). Our basic goal is not to predict electoral polls but to simulate a real electoral process. Suppose there is a set of electoral colleges P = {p_1, p_2, ..., p_k}; note that the cardinality of P is k. Let C = {c_1, c_2, ..., c_j} be the set of candidates in the election.
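A toy simulation in this spirit (the lognormal precinct sizes are an illustrative choice, not the model developed here): generate vote counts, take their second significant digits, and compare the observed proportions with the Newcomb-Benford second-digit probabilities.

```python
import math
import random

# Newcomb-Benford second-digit probabilities:
# P(d2 = i) = sum over j = 1..9 of log10(1 + 1/(10*j + i)).
P2 = [sum(math.log10(1 + 1 / (10 * j + i)) for j in range(1, 10)) for i in range(10)]

random.seed(1)
# Illustrative precinct-level vote counts spanning several orders of magnitude.
votes = [int(random.lognormvariate(6, 1.5)) for _ in range(20000)]

# Second significant digit of each count (counts below 10 have none).
seconds = [int(str(v)[1]) for v in votes if v >= 10]
props = [seconds.count(i) / len(seconds) for i in range(10)]
```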
Winner Elections       Observed   P(H0|data)   P(Approx)   P(Frac)
Puerto Rico 1992       104        0.99990      0.99759     0.99782
Puerto Rico 1996       1836       1.00000      1.00000     1.00000
Puerto Rico 2000       1823       1.00000      1.00000     1.00000
Puerto Rico 2004       1924       1.00000      1.00000     1.00000
Venezuela 2004 Audit   192        0.99981      0.99590     0.99568
Venezuela 2004 AUTO    19064      0.00000      0.00000     0.00000
Venezuela 2000 AUTO    6876       0.12879      0.00648     0.00587
Venezuela 1998 AUTO    16646      0.00000      0.00000     0.00000
Venezuela 2004 MAN     4556       1.00000      1.00000     1.00000
Venezuela 2000 MAN     3540       1.00000      1.00000     1.00000
Venezuela 1998 MAN     3410       1.00000      1.00000     1.00000

Table 5.1: The second-digit proportions analysis of the winner for the set of historical elections.
5.3 Empirical Data
Several electoral processes are analyzed here. All of them agree with Definition 5.1.1, and some of them have different types of data-collection process. Elections from years ranging between 1992 and 2004 will be part of the following discussion. Puerto Rico, Venezuela, and the United States of America are the scenarios for those events. In the Venezuelan and Puerto Rican elections every citizen older than 18 years can participate. In the Venezuelan cases there are two voting methods, electronic and manual; as we will show, there are differences between the electronic and the manual votes. The prefix AUTO means electronic polls, and the prefix Audit refers to the Carter Center audit results (Carter Center (2005), Perichi L.R. and Torres D. (2004a)).
Loser Elections        Observed   P(H0|data)   P(Approx)   P(Frac)
Puerto Rico 1992       104        0.99956      0.99012     0.99048
Puerto Rico 1996       1839       1.00000      1.00000     1.00000
Puerto Rico 2000       1878       1.00000      1.00000     1.00000
Puerto Rico 2004       1917       1.00000      1.00000     1.00000
Venezuela 2004 Audit   192        0.99998      0.99947     0.99947
Venezuela 2004 AUTO    19063      1.00000      1.00000     1.00000
Venezuela 2000 AUTO    6872       1.00000      1.00000     1.00000
Venezuela 1998 AUTO    16638      0.00000      0.00000     0.00000
Venezuela 2004 MAN     4379       1.00000      1.00000     1.00000
Venezuela 2000 MAN     3219       1.00000      0.99999     0.99999
Venezuela 1998 MAN     3388       1.00000      1.00000     1.00000

Table 5.2: The second-digit proportions analysis of the loser for the set of historical elections.
Diff 1 Elections       Observed   P(H0|data)   P(Approx)   P(Frac)
Puerto Rico 1992       104        0.99807      0.94288     0.95386
Puerto Rico 1996       1870       1.00000      0.99992     0.99993
Puerto Rico 2000       1907       1.00000      1.00000     1.00000
Puerto Rico 2004       1992       1.00000      0.99986     0.99988
Venezuela 2004 Audit   192        0.99788      0.96492     0.96272
Venezuela 2004 AUTO    19017      0.00000      0.00000     0.00000
Venezuela 2000 AUTO    6873       0.00000      0.00000     0.00000
Venezuela 1998 AUTO    16606      0.00000      0.00000     0.00000
Venezuela 2004 MAN     4604       0.00000      0.00000     0.00000
Venezuela 2000 MAN     3611       0.99999      0.99974     0.99977
Venezuela 1998 MAN     3495       0.99397      0.84245     0.85592

Table 5.3: The first-digit proportions of the distance between the winner and the loser for the set of historical elections.
Diff 2 Elections       Observed   P(H0|data)   P(Approx)   P(Frac)
Puerto Rico 1992       104        0.99994      0.99824     0.99850
Puerto Rico 1996       1660       1.00000      1.00000     1.00000
Puerto Rico 2000       1720       1.00000      1.00000     1.00000
Puerto Rico 2004       1652       1.00000      1.00000     1.00000
Venezuela 2004 Audit   189        0.99978      0.99515     0.99487
Venezuela 2004 AUTO    18321      1.00000      1.00000     1.00000
Venezuela 2000 AUTO    6745       1.00000      1.00000     1.00000
Venezuela 1998 AUTO    15696      0.99939      0.98491     0.98399
Venezuela 2004 MAN     4377       1.00000      1.00000     1.00000
Venezuela 2000 MAN     3219       1.00000      1.00000     1.00000
Venezuela 1998 MAN     2954       1.00000      1.00000     1.00000

Table 5.4: The second-digit proportions of the distance between the winner and the loser for the set of historical elections.
Total Elections        Observed   P(H0|data)   P(Approx)   P(Frac)
Puerto Rico 1992       104        0.99326      0.88875     0.87834
Puerto Rico 1996       1867       1.00000      1.00000     1.00000
Puerto Rico 2000       1898       1.00000      1.00000     1.00000
Puerto Rico 2004       1981       1.00000      1.00000     1.00000
Venezuela 2004 Audit   192        0.00000      0.00000     0.00000
Venezuela 2004 AUTO    19064      0.00000      0.00000     0.00000
Venezuela 2000 AUTO    6877       0.00000      0.00000     0.00000
Venezuela 1998 AUTO    16647      0.00000      0.00000     0.00000
Venezuela 2004 MAN     4599       1.00000      1.00000     1.00000
Venezuela 2000 MAN     3589       1.00000      1.00000     1.00000
Venezuela 1998 MAN     4597       1.00000      1.00000     1.00000

Table 5.5: The second-digit proportions of the sum between the winner and the loser for the set of historical elections.
Votes    P(H0|data)   P(Approx)   P(Frac)
First Digit
Bush     1.00000      0.99999     0.99999
Kerry    1.00000      0.99998     0.99998
Nader    1.00000      1.00000     1.00000
Second Digit
Bush     1.00000      1.00000     1.00000
Kerry    1.00000      1.00000     1.00000
Nader    1.00000      1.00000     1.00000

Table 5.6: The Newcomb-Benford's for 1st and 2nd digit for the United States of America Presidential Elections 2004. Note how close the values of the posterior probability given the data are to 1.0.
Winner Elections       Observed   p-values   P(H0|data)
Puerto Rico 1992       104        0.81715    0.30965
Puerto Rico 1996       1836       0.55428    0.47064
Puerto Rico 2000       1823       0.97930    0.05275
Puerto Rico 2004       1924       0.15372    0.43899
Venezuela 2004 Audit   192        0.25674    0.48689
Venezuela 2004 AUTO    19064      0.00000    0.00000
Venezuela 2000 AUTO    6876       0.00000    0.00000
Venezuela 1998 AUTO    16646      0.00000    0.00000
Venezuela 2004 MAN     4556       0.15527    0.44013
Venezuela 2000 MAN     3540       0.36603    0.50000
Venezuela 1998 MAN     3410       0.01614    0.15327

Table 5.7: The second-digit proportions analysis of the winner for the set of historical elections. The number of observed values, p-value, and probability null bound are shown.
Loser Elections        Observed   p-values   P(H0|data)
Puerto Rico 1992       104        0.56878    0.46593
Puerto Rico 1996       1839       0.13775    0.42604
Puerto Rico 2000       1878       0.43630    0.49589
Puerto Rico 2004       1917       0.53800    0.47550
Venezuela 2004 Audit   192        0.59723    0.45558
Venezuela 2004 AUTO    19063      0.02401    0.19575
Venezuela 2000 AUTO    6872       0.01731    0.16025
Venezuela 1998 AUTO    16638      0.00000    0.00000
Venezuela 2004 MAN     4379       0.00319    0.04746
Venezuela 2000 MAN     3219       0.00644    0.08111
Venezuela 1998 MAN     3388       0.23056    0.47905

Table 5.8: The second-digit proportions analysis of the loser for the set of historical elections. The number of observed values, p-value, and probability null bound are shown. Note that p-values should be smaller than 1/e for the bound to be valid.
Diff 1 Elections       Observed   p-values   P(H0|data)
Puerto Rico 1992       104        0.17462    0.45306
Puerto Rico 1996       1870       0.01084    0.11761
Puerto Rico 2000       1907       0.28767    0.49349
Puerto Rico 2004       1992       0.00592    0.07627
Venezuela 2004 Audit   192        0.03596    0.24531
Venezuela 2004 AUTO    19017      0.00000    0.00000
Venezuela 2000 AUTO    6873       0.00000    0.00000
Venezuela 1998 AUTO    16606      0.00000    0.00000
Venezuela 2004 MAN     4604       0.00000    0.00000
Venezuela 2000 MAN     3611       0.00063    0.01247
Venezuela 1998 MAN     3495       0.00000    0.00007

Table 5.9: The first-digit proportions of the distance between the winner and the loser for the set of historical elections. The number of observed values, p-value, and probability null bound are shown. Note that p-values should be smaller than 1/e for the bound to be valid.
Diff 2 Elections       Observed   p-values   P(H0|data)
Puerto Rico 1992       104        0.90511    0.19697
Puerto Rico 1996       1660       0.34637    0.49956
Puerto Rico 2000       1720       0.16067    0.44399
Puerto Rico 2004       1652       0.49828    0.48547
Venezuela 2004 Audit   189        0.21312    0.47245
Venezuela 2004 AUTO    18321      0.16820    0.44904
Venezuela 2000 AUTO    6745       0.00150    0.02579
Venezuela 1998 AUTO    15696      0.00000    0.00000
Venezuela 2004 MAN     4377       0.01819    0.16533
Venezuela 2000 MAN     3219       0.03547    0.24353
Venezuela 1998 MAN     2954       0.12831    0.41730

Table 5.10: The second-digit proportions of the distance between the winner and the loser for the set of historical elections. The number of observed values, p-value, and probability null bound are shown. Note that p-values should be smaller than 1/e for the bound to be valid.
Total Elections        Observed   p-values   P(H0|data)
Puerto Rico 1992       104        0.12573    0.41476
Puerto Rico 1996       1867       0.13460    0.42322
Puerto Rico 2000       1898       0.48927    0.48737
Puerto Rico 2004       1981       0.42806    0.49680
Venezuela 2004 Audit   192        0.00000    0.00000
Venezuela 2004 AUTO    19064      0.00000    0.00000
Venezuela 2000 AUTO    6877       0.00000    0.00000
Venezuela 1998 AUTO    16647      0.00000    0.00000
Venezuela 2004 MAN     4599       0.11997    0.40882
Venezuela 2000 MAN     3589       0.17612    0.45396
Venezuela 1998 MAN     4597       0.01235    0.12853

Table 5.11: The second-digit proportions of the sum between the winner and the loser for the set of historical elections. The number of observed values, p-value, and probability null bound are shown. Note that p-values should be smaller than 1/e for the bound to be valid.
[Figure panels: (a) Bush's digit proportions vs. the N-B Law for the 1st digit; (b) Kerry's digit proportions vs. the N-B Law for the 1st digit; (c) Nader's digit proportions vs. the N-B Law for the 1st digit.]

Figure 5.1: Presidential election analysis using electoral college votes compared with the N-B Law for the 1st digit.
[Figure panels: (a) Bush's digit proportions vs. the N-B Law for the 2nd digit; (b) Kerry's digit proportions vs. the N-B Law for the 2nd digit; (c) Nader's digit proportions vs. the N-B Law for the 2nd digit.]

Figure 5.2: Presidential election analysis using electoral college votes compared with the N-B Law for the 2nd digit.
[Figure panels: (a) Puerto Rico Elections 1996, PNP Party; (b) Puerto Rico Elections 1996, PPD Party; (c) Puerto Rico Elections 1996, PIP Party.]

Figure 5.3: Puerto Rico 1996 elections compared with the Newcomb-Benford Law for the second digit.
[Figure panels: (a) Puerto Rico Elections 2000, PNP Party; (b) Puerto Rico Elections 2000, PPD Party; (c) Puerto Rico Elections 2000, PIP Party.]

Figure 5.4: Puerto Rico 2000 elections compared with the Newcomb-Benford Law for the second digit.
[Figure panels: (a) Puerto Rico Elections 2004, PNP Party; (b) Puerto Rico Elections 2004, PPD Party; (c) Puerto Rico Elections 2004, PIP Party.]

Figure 5.5: Puerto Rico 2004 elections compared with the Newcomb-Benford Law for the second digit.
[Figure panels: (a) Venezuela Revocatory Referendum manual SI votes proportions; (b) Venezuela Revocatory Referendum manual NO votes proportions.]

Figure 5.6: Venezuela Revocatory Referendum manual votes proportions compared with the Newcomb-Benford Law's proportions for the second digit.
[Figure panels: (a) Venezuela Revocatory Referendum electronic SI votes proportions; (b) Venezuela Revocatory Referendum electronic NO votes proportions.]

Figure 5.7: Venezuela Revocatory Referendum electronic votes proportions compared with the Newcomb-Benford Law's proportions for the second digit.
5.4 Conclusions
Some conclusions about the topics discussed here, classified by region:

USA: The agreement with the Newcomb-Benford Law is outstanding. See figures 5.1 and 5.2; Table 5.6 presents a summary of those results.

Puerto Rico: The agreement between the Newcomb-Benford Law and the results of each of the elections is also impressive. Results are shown in Tables 5.1-5.5 and 5.7-5.11.

Venezuela: The situation is more complex: most of the elections have two types of vote, the electronic mode and the manual mode. As the results show, agreement with the Newcomb-Benford Law is present in the manual polls, and there is a plausible disagreement between the electronic and the manual vote results. Results are shown in Tables 5.1-5.5 and 5.7-5.11.

There is some discordance between the electronic voting system and the Newcomb-Benford Law. More studies have to be done on the influence of bounds in the electronic voting system on the Newcomb-Benford Law. The discrepancies cast doubt on electronic voting, particularly when there is no universal verification after the polling stations close and prior to the sending of results.
[Figure panels: (a) Venezuela Revocatory Referendum total electronic votes proportions; (b) Venezuela Revocatory Referendum total manual votes proportions.]

Figure 5.8: Venezuela Revocatory Referendum electronic and manual votes proportions compared with the Newcomb-Benford Law for the second-digit proportions.
[Figure panels: (a) Venezuela Revocatory Referendum electronic votes, second-digit proportions; (b) first-digit proportions of the difference between the electronic SI and NO votes.]

Figure 5.9: Venezuela Revocatory Referendum electronic votes and the distance between winner and loser, compared with the Newcomb-Benford Law proportions.
Chapter 6
Appendix: MATLAB Programs
6.1 MATLAB Codes
The code used to obtain the first- and second-digit proportions is given below:
function [x, c, p] = benford(b, dig)
% Phase 1: build a matrix that separates each digit
% of a given vector of data.
b = b(find(b > 0));
count = 1;
while max(mod(b, 10)) > 0
    tmp(count, :) = mod(b, 10);
    b = (b - tmp(count, :)) / 10;
    count = count + 1;
end
C = tmp';
[n, m] = size(C);
% Phase 2: compute the digit proportions in the data (dig = 1 or 2).
switch dig
    case 1  % Newcomb-Benford for the first digit.
        i = 1;
        while i <= n
            temp(i, 1) = C(i, max(find(C(i, :))));
            i = i + 1;
        end
        p = temp';
        c = hist(temp, 1:1:9);  % frequencies of the first digit
        x = c ./ sum(c);        % proportions of the first digit
    case 2  % Newcomb-Benford for the second digit.
        i = 1;
        while i <= n
            temp_0 = max(find(C(i, :))) - 1;
            if temp_0 > 0
                temp_1(i, 1) = C(i, temp_0);
            else
                temp_1(i, 1) = -1;  % no second digit (value below 10)
            end
            i = i + 1;
        end
        h = 1; j = 1;
        while h <= n
            if temp_1(h) ~= -1
                p(j) = temp_1(h);
                j = j + 1;
            end
            h = h + 1;
        end
        c = hist(p, 0:1:9);  % frequencies
        x = c ./ sum(c);     % proportions
    otherwise
        disp('Error. Please verify your data.')
end
The following code performs the calculations of the hypothesis tests:
function [H] = newbenchi2(LLL)
%Seting the temporary variables
k = length(LLL); p0=zeros(1,k); kk=2*(k-1); nn=zeros(1,k);
LgL=zeros(1,k); nn = LLL; NN=sum(nn);
%Choose if the test will be for the first digit or the second digit.
if k==9
%First Significant Digit Newcomb-Benford’s Law
j = 1:9;
p0 = (log10(1+1./j))’;
else
for i=1:1:10
SS=0.0;
for j=1:1:9
SS=SS+log10(1+1/(10*j+(i-1)));
end
78
p0(i)=SS;
end
p0=p0’;
end LgL = nn’.*(log(p0)-log(nn./NN))’;
LgBIC=sum(LgL)+((k-1)/2)*log(NN); PrBIC=1/(1+exp(-LgBIC)); LgL10 =
gammaln(nn + 1) - nn.*log(p0); LgB10 = sum(LgL10’)- gammaln(NN +
k)+ gammaln(k); PrH0data = 1/(exp(LgB10)+1); SSQ =
sum(((nn-NN.*p0).^2)./(NN.*p0)); Chicuadrado=SSQ;
Pvalue=1-chi2cdf(SSQ,k-1);
%If the p-value is too small the the choice is cero.
if (Pvalue < 10^-15)
ProbabilityNullBound = 0;
elseif (Pvalue >= 10^-15)
ProbabilityNullBound=exp(1)*Pvalue*log(Pvalue)/(exp(1)*
Pvalue*log(Pvalue)-1.);
end
LgAPPROX=((NN-kk+1)/NN)*sum(LgL)+((k-1)/2)*log(NN/(kk-1));
PrApprox=1/(1+exp(-LgAPPROX)); Lgft = nn.*log(p0); Lgst =
gammaln((nn + 1))- gammaln(((nn*(k-1)+ NN)/NN));
LgBF01=((NN-k+1)/NN)*sum(Lgft)-sum(Lgst)+gammaln((NN+k))-gammaln((2*k-1));
PrFRAC=1/(1+exp(-LgBF01));
%Output
H = [NN PrH0data PrApprox PrFRAC PrBIC Chicuadrado Pvalue ...
     ProbabilityNullBound];
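A Python sketch of the same hypothesis-test quantities may help readers without MATLAB. The names `benford_test` and `pvalue_calibration` are illustrative; the BIC approximation to the Bayes factor and the Sellke-Bayarri-Berger calibration of the p-value follow the formulas used in the MATLAB function above.

```python
import math
from scipy.stats import chi2   # analogue of MATLAB's chi2cdf

def benford_test(counts, p0):
    """Chi-square test and BIC approximation to P(H0 | data) for Benford's law.

    counts -- observed digit frequencies (length k)
    p0     -- null (Newcomb-Benford) probabilities (length k)
    """
    k = len(counts)
    N = sum(counts)
    # Pearson chi-square statistic with k-1 degrees of freedom.
    chisq = sum((n - N*p)**2 / (N*p) for n, p in zip(counts, p0))
    pvalue = 1 - chi2.cdf(chisq, k - 1)
    # Schwarz (BIC) approximation to the log Bayes factor in favour of H0.
    log_b01 = sum(n * (math.log(p) - math.log(n / N))
                  for n, p in zip(counts, p0) if n > 0)
    log_b01 += ((k - 1) / 2) * math.log(N)
    prob_h0 = 1 / (1 + math.exp(-log_b01))   # posterior P(H0), equal prior odds
    return chisq, pvalue, prob_h0

def pvalue_calibration(p):
    """Sellke-Bayarri-Berger lower bound on P(H0 | data) from a p-value."""
    if p <= 0:
        return 0.0
    b = -math.e * p * math.log(p)            # lower bound on the Bayes factor B01
    return b / (1 + b)
```

For data that follow the first-digit law closely the chi-square statistic is small and the posterior probability of the null is near one; the calibration reproduces the familiar result that a p-value of 0.05 bounds P(H0 | data) below only by about 0.29.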
Bibliography
Sellke T., Bayarri M. J., and Berger J. O. Calibration of p-values for testing precise null hypotheses. The American Statistician, 55:62–71, 2001.
Carlin Bradley P. and Louis Thomas A. Bayes and Empirical Bayes Methods for Data Analysis. Texts in Statistical Science Series. Chapman & Hall, New York, 2000.
Knuth Donald E. The Art of Computer Programming, volume 2 of Addison-Wesley Series in Computer Science and Information Processing. Addison-Wesley Publishing Company, Philippines, 1981.
Knuth Donald E. The TEXbook. Addison-Wesley, 1984.
Ley Eduardo. On the Peculiar Distribution of the U.S. Stock Indexes’ Digits. The
American Statistician, 50(4):311–313, Nov 1996.
Benford Frank. The law of anomalous numbers. Proceedings of the American Philosophical Society, 78:551–572, 1938.
Carter Center. Observing the Venezuela Presidential Recall Referendum: Comprehensive Report. 2005.
Gelman A., Carlin J. B., Stern H. S., and Rubin D. B. Bayesian Data Analysis. Chapman & Hall Ltd, 1995.
R. Hamming. On the distribution of numbers. Bell System Technical Journal, 49:1609–1625, 1970.
Hoyle David C., Rattray Magnus, Jupp Ray, and Brass Andrew. Making sense of microarray data distributions. Bioinformatics, 18(4):576–584, 2002.
Berger J.O. The generalized intrinsic Bayes factor. Technical Report, SAMSI, Department of Mathematics, Statistics, & Computing Science, 1991.
Berger J.O. Statistical Decision Theory and Bayesian Analysis, page 237. Springer, New York, second edition, 1985.
Berger J.O. and Pericchi L. R. Objective Bayesian methods for model selection: Introduction and comparison (with discussion), pages 135–207. Institute of Mathematical Statistics, Monographs, Beachwood, OH, 2001.
Katz Jonathan and King Gary. A statistical model for multiparty electoral data. American Political Science Review, 93(1):15–32, 1999.
Pietronero L., Tosatti E., and Vespignani A. Explaining the uneven distribution of numbers in nature: the laws of Benford and Zipf. Physica A, 293:297–304, 2001.
Lamport Leslie. LATEX: A Document Preparation System. Addison-Wesley, 1986.
Loeve M. Probability Theory, volume 1. Springer, New York, fourth edition, 1977.
Speed T. P. Statistical Analysis of Gene Expression Microarray Data. Chapman &
Hall/CRC, Boca Raton, Florida, 2003.
Newcomb Simon. Note on the Frequency of Use of the Different Digits in Natural Numbers. American Journal of Mathematics, 4(1):39–40, 1881.
Hill Theodore. Base-invariance implies Benford's law. Proceedings of the American Mathematical Society, 123(3):887–895, 1995.
Hill Theodore. A statistical derivation of the Significant-Digit Law. Statistical Science,
10(4):354–363, 1996.
Pericchi L. R. and Torres David. La Ley de Newcomb-Benford y sus aplicaciones al Referendum Revocatorio en Venezuela (The Newcomb-Benford Law and its applications to the Recall Referendum in Venezuela). Preliminary Technical Report 2a, October 1, 2004.
Robles Y., Vivas P. E., Ortiz-Zuazaga H. G., Felix Y., and Peña de Ortiz S. Hippocampal gene expression profiling in spatial learning. Neurobiology of Learning and Memory, 80(1):80–95, 2003.
Yang Y. H., Buckley M. J., Dudoit S., and Speed T. P. Comparison of methods for image analysis on cDNA microarray data. Journal of Computational and Graphical Statistics, 11:1–29, 2002.