package ‘rebmix’ · demix 11 arguments x a vector, a matrix or a data frame containing...

42
Package ‘rebmix’ February 15, 2013 Version 2.5.0 Title The Rebmix Package Author Marko Nagode Maintainer Marko Nagode <[email protected]> Description R functions for random univariate and multivariate finite mixture generation, number of components, component weights and component parameters estimation, printing and plotting of finite mixtures, bootstrapping and class membership prediction. Variables can be either continuous or discrete, may follow normal, lognormal, Weibull, gamma, binomial, Poisson or Dirac parametric families and should be independent within components. License GPL (>= 2) Suggests mixtools, mclust Depends R (>= 2.10) Repository CRAN Date/Publication 2013-01-25 15:57:57 NeedsCompilation yes R topics documented: adult ............................................. 2 AIC ............................................. 5 AWE ............................................. 6 BIC ............................................. 6 boot ............................................. 7 CLC ............................................. 8 coef.REBMIX ........................................ 9 demix ............................................ 10 dfmix ............................................ 12 1

Upload: others

Post on 01-Jan-2021

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

Package ‘rebmix’February 15, 2013

Version 2.5.0

Title The Rebmix Package

Author Marko Nagode

Maintainer Marko Nagode <[email protected]>

Description R functions for random univariate and multivariate finitemixture generation, number of components, component weights andcomponent parameters estimation, printing and plotting offinite mixtures, bootstrapping and class membership prediction.Variables can be either continuous or discrete, may follownormal, lognormal, Weibull, gamma, binomial, Poisson or Diracparametric families and should be independent within components.

License GPL (>= 2)

Suggests mixtools, mclust

Depends R (>= 2.10)

Repository CRAN

Date/Publication 2013-01-25 15:57:57

NeedsCompilation yes

R topics documented:adult . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2AIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5AWE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6BIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6boot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7CLC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8coef.REBMIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9demix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10dfmix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

1

Page 2: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

2 adult

galaxy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14HQC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15ICL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15ICLBIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16logL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16MDL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17pemix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17pfmix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18plot.REBMIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20PRD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22predict.list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23print.boot.REBMIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27print.REBMIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28print.RNGMIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29REBMIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30RNGMIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33SSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36summary.boot.REBMIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36summary.REBMIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37weibull . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38wine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

Index 41

adult Adult Dataset

Description

The adult dataset containing 48842 instances with 16 continuous, binary and discrete variableswas extracted from the census bureau database http://www.census.gov/. Extraction was doneby Barry Becker from the 1994 census bureau database.

Usage

adult

Format

adult is a data frame with 48842 cases (rows) and 16 variables (columns) named:

1. Type binary train or test.

2. Age continuous.

3. Workclass one of the 8 discrete values private, self-emp-not-inc, self-emp-inc, federal-gov,local-gov, state-gov, without-pay or never-worked.

4. Fnlwgt stands for continuous final weight.

Page 3: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

adult 3

5. Education one of the 16 discrete values bachelors, some-college, 11th, hs-grad, prof-school,assoc-acdm, assoc-voc, 9th, 7th-8th, 12th, masters, 1st-4th, 10th, doctorate, 5th-6thor preschool.

6. Education.Num continuous.

7. Marital.Status one of the 7 discrete values married-civ-spouse, divorced, never-married,separated, widowed, married-spouse-absent or married-af-spouse.

8. Occupation one of the 14 discrete values tech-support, craft-repair, other-service,sales, exec-managerial, prof-specialty, handlers-cleaners, machine-op-inspct, adm-clerical,farming-fishing, transport-moving, priv-house-serv, protective-serv or armed-forces.

9. Relationship one of the 6 discrete values wife, own-child, husband, not-in-family,other-relative or unmarried.

10. Race one of the 5 discrete values white, asian-pac-islander, amer-indian-eskimo, otheror black.

11. Sex binary female or male.

12. Capital.Gain continuous.

13. Capital.Loss continuous.

14. Hours.Per.Week continuous.

15. Native.Country one of the 41 discrete values united-states, cambodia, england, puerto-rico,canada, germany, outlying-us(guam-usvi-etc), india, japan, greece, south, china,cuba, iran, honduras, philippines, italy, poland, jamaica, vietnam, mexico, portugal,ireland, france, dominican-republic, laos, ecuador, taiwan, haiti, columbia, hungary,guatemala, nicaragua, scotland, thailand, yugoslavia, el-salvador, trinadad&tobago,peru, hong or holand-netherlands.

16. Income binary <=50k or >50k.

Source

Frank A, Asuncion A (2010). UCI Machine Learning Repository. http://archive.ics.uci.edu/ml.

References

Frank A, Asuncion A (2010). UCI Machine Learning Repository. http://archive.ics.uci.edu/ml.

Examples

data("adult")

## Find complete cases.

adult <- adult[complete.cases(adult), ]

## Map metric attributes.

adult[["Capital.Loss"]] <- ordered(cut(adult[["Capital.Loss"]], 2000))adult[["Capital.Gain"]] <- ordered(cut(adult[["Capital.Gain"]], 2000))

Page 4: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

4 adult

## Show level attributes for binary and discrete variables.

levels(adult[["Type"]])levels(adult[["Workclass"]])levels(adult[["Education"]])levels(adult[["Marital.Status"]])levels(adult[["Occupation"]])levels(adult[["Relationship"]])levels(adult[["Race"]])levels(adult[["Sex"]])levels(adult[["Native.Country"]])levels(adult[["Income"]])

## Replace levels with numbers.

adult <- as.data.frame(data.matrix(adult))

## Levels should start with 0 for discrete distributions except for the## Dirac distribution.

f <- c("Type", "Workclass", "Education", "Marital.Status", "Occupation","Relationship", "Race", "Sex", "Native.Country", "Income")

adult[, f] <- adult[, f] - 1

## Split adult dataset into two train subsets for the two Incomes## and remove Type and Income columns.

trainle50k <- subset(adult, subset = (Type == 1) & (Income == 0),select = c(-Type, -Income))

traingt50k <- subset(adult, subset = (Type == 1) & (Income == 1),select = c(-Type, -Income))

trainall <- subset(adult, subset = Type == 1, select = c(-Type, -Income))

train <- as.factor(subset(adult, subset = Type == 1, select = c(Income))[, 1])

## Extract test dataset form adult dataset and remove Type## and Income columns.

testle50k <- subset(adult, subset = (Type == 0) & (Income == 0),select = c(-Type, -Income))

testgt50k <- subset(adult, subset = (Type == 0) & (Income == 1),select = c(-Type, -Income))

testall <- subset(adult, subset = Type == 0, select = c(-Type, -Income))

test <- as.factor(subset(adult, subset = Type == 0, select = c(Income))[, 1])

save(trainall, file = "trainall.rda")save(testall, file = "testall.rda")

Page 5: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

AIC 5

AIC Akaike Information Criterion

Description

Returns the Akaike information criterion at pos for class REBMIX.

Usage

## S3 method for class ’REBMIX’AIC(object, pos = 1, ...)## S3 method for class ’REBMIX’AIC3(object, pos = 1, ...)## S3 method for class ’REBMIX’AIC4(object, pos = 1, ...)## S3 method for class ’REBMIX’AICc(object, pos = 1, ...)## S3 method for class ’REBMIX’CAIC(object, pos = 1, ...)

Arguments

object an object of class REBMIX.

pos a desired row number in summary for which the information criterion is calcu-lated. The default value is 1.

... currently not used.

References

Akaike H (1974). A New Look at the Statistical Model Identification. IEEE Transactions on Auto-matic Control, 19, 716-723.

Smith AFM, Spiegelhalter DJ (1980). Bayes Factors and Choice Criteria for Linear Models. Jour-nal of the Royal Statistical Society B, 42, 213-220.

Hurvich CM, Tsai CL (1989). Regression and Time Series Model Selection in Small Samples.Biometrika, 76, 297-307.

Bozdogan H (1987). Model Selection and Akaike’s Information Criterion (AIC): The GeneralTheory and its Analytical Extensions. Psychometrika, 52, 345-370.

Page 6: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

6 BIC

AWE Approximate Weight of Evidence Criterion

Description

Returns the approximate weight of evidence criterion at pos for class REBMIX.

Usage

## S3 method for class ’REBMIX’AWE(object, pos = 1, ...)

Arguments

object an object of class REBMIX.pos a desired row number in summary for which the information criterion is calcu-

lated. The default value is 1.... currently not used.

References

Banfield JD, Raftery AE (1993). Model-Based Gaussian and Non-Gaussian Clustering. Biometrics,49, 803-821.

BIC Bayesian Information Criterion

Description

Returns the Bayesian information criterion at pos for class REBMIX.

Usage

## S3 method for class ’REBMIX’BIC(object, pos = 1, ...)

Arguments

object an object of class REBMIX.pos a desired row number in summary for which the information criterion is calcu-

lated. The default value is 1.... currently not used.

References

Schwarz G (1978). Estimating the Dimension of a Model. Annals of Statistics, 6, 461-464.

Page 7: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

boot 7

boot Parametric or Nonparametric Bootstrap for Standard Error and Coef-ficient of Variation Estimation

Description

Returns the boot.REBMIX output for mixtures of conditionally independent normal, lognormal,Weibull, gamma, binomial, Poisson or Dirac component densities.

Usage

## S3 method for class ’REBMIX’boot(x, pos = 1, Bootstrap = "parametric",

B = 100, n = NULL, replace = TRUE, prob = NULL, ...)

Arguments

x an object of class REBMIX.

pos a desired row number in x$summary to be bootstrapped. The default value is 1.

Bootstrap a character vector giving the bootstrap type. One of default "parametric" or"nonparametric".

B number of bootstrap samples. The default value is 100.

n number of observations. The default value is NULL.

replace logical. The sampling is with replacement if TRUE, see also sample. The defaultvalue is TRUE.

prob a vector of length n containing probability weights, see also sample. The defaultvalue is NULL.

... further arguments to sample.

Value

c a data frame containing numbers of components for B bootstrap samples.

c.se standard error of numbers of components c.

c.cv coefficient of variation of numbers of components c.

c.mode mode of numbers of components c.

c.prob probability of mode c.mode.

w a data frame containing component weights for ≤ B bootstrap samples.

w.se a vector containing standard errors of component weights w.

w.cv a vector containing coefficients of variation of component weights w.

theta1.i data frames containing component parameters theta1.i for ≤ B bootstrapsamples.

theta1.i.se vectors containing standard errors of component parameters theta1.i.

Page 8: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

8 CLC

theta1.i.cv vectors containing coefficients of variation of component parameters theta1.i.

theta2.i data frames containing component parameters theta2.i for ≤ B bootstrapsamples.

theta2.i.se vectors containing standard errors of component parameters theta2.i.

theta2.i.cv vectors containing coefficients of variation of component parameters theta2.i.

References

McLachlan GJ, Peel D (2000). Finite Mixture Models. John Wiley & Sons, New York.

Examples

data("weibull")

n <- nrow(weibull)

## Number of classes or nearest neighbours to be processed.

K <- c(as.integer(1 + log2(sum(n))), ## Minimum v follows the Sturges rule.as.integer(10 * log10(n))) ## Maximum v follows the log10 rule.

## Estimate number of components, component weights and component parameters.

weibullest <- REBMIX(Dataset = list(weibull),Preprocessing = "Parzen window",D = 0.025,cmax = 4,Criterion = "BIC",Variables = "continuous",pdf = "Weibull",K = K[1]:K[2],b = 1.0,Restraints = "loose")

## Plot finite mixture.

plot(weibullest, what = c("density", "distribution", "c", "IC", "logL", "D"),nrow = 3, ncol = 2, npts = 1000)

## Bootstrap finite mixture.

weibullboot <- boot(x = weibullest, Bootstrap = "nonparametric")

weibullboot

CLC Classification Likelihood Criterion

Page 9: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

coef.REBMIX 9

Description

Returns the classification likelihood criterion at pos for class REBMIX.

Usage

## S3 method for class ’REBMIX’CLC(object, pos = 1, ...)

Arguments

object an object of class REBMIX.

pos a desired row number in summary for which the information criterion is calcu-lated. The default value is 1.

... currently not used.

References

Biernacki C, Govaert G (1997). Using the Classification Likelihood to Choose the Number ofClusters. Computing Science and Statistics, 29, 451-457.

coef.REBMIX Prints Univariate or Multivariate REBMIX Coefficients

Description

Returns the w and Theta output at pos for class REBMIX.

Usage

## S3 method for class ’REBMIX’coef(object, pos = 1, ...)

Arguments

object an object of class REBMIX.

pos a desired row number in summary to be printed. The default value is 1.

... further arguments to print.

Page 10: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

10 demix

Examples

## Generate and print simulated dataset.

Theta <- rbind(pdf = rep("Weibull", 3),theta1 = c(2.0, 10.0, 30),theta2 = c(2.0, 4.0, 7.0))

simulated <- RNGMIX(Dataset = "simulated",rseed = -1,n = c(40, 60, 50),Theta = Theta)

simulated

## Estimate number of components, component weights and component parameters.

simulatedest <- REBMIX(Dataset = simulated$Dataset,Preprocessing = "histogram",D = 0.025,cmax = 6,Criterion = "AIC",Variables = "continuous",pdf = "Weibull",K = 8:25,b = 1.0)

## Print coefficients and plot finite mixture.

coef(simulatedest)

plot(simulatedest)

demix Empirical Density Calculation

Description

demix returns (invisibly) the data frame containing observations x1, . . . ,xn and empirical densi-ties f1, . . . , fn for the Parzen window or k-nearest neighbour or bin means x1, . . . , xv and em-pirical densities f1, . . . , fv for the histogram preprocessing. Vectors x and x are subvectors ofy = (y1, . . . , yd)> and y = (y1, . . . , yd)>.

Usage

demix(x = NULL, Preprocessing = NULL, Variables = NULL,k = NULL, xmin = NULL, xmax = NULL, ...)

Page 11: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

demix 11

Arguments

x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x.

Preprocessing a preprocessing type. One of "histogram", "Parzen window" or "k-nearest neighbour".

Variables a character vector of length≤ d containing types of variables. One of "continuous"or "discrete".

k a number of bins v for the histogram and the Parzen window or number ofnearest neighbours k for the k-nearest neighbour.

xmin a vector of length ≤ d containing minimum observations. The default value isNULL.

xmax a vector of length ≤ d containing maximum observations. The default value isNULL.

... currently not used.

References

Nagode M, Fajdiga M (2011). The REBMIX Algorithm for the Univariate Finite Mixture Estima-tion. Communications in Statistics - Theory and Methods, 40(5), 876-892.

Nagode M, Fajdiga M (2011). The REBMIX Algorithm for the Multivariate Finite Mixture Es-timation. Communications in Statistics - Theory and Methods, 40(11), 2022-2034.

Examples

## Generate simulated dataset.

Theta <- rbind(pdf1 = rep("normal", 2),theta1.1 = c(10, 20),theta2.1 = c(3.0, 2.0),pdf1 = rep("normal", 2),theta1.1 = c(3, 2),theta2.1 = c(20, 10))

simulated <- RNGMIX(Dataset = "simulated",rseed = -1,n = c(15, 15),Theta = Theta)

## Preprocess simulated dataset.

y1y2f <- demix(x = simulated$Dataset[[1]],Preprocessing = "histogram",Variables = c("continuous", "continuous"),k = 6)

y1y2f

Page 12: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

12 dfmix

dfmix Predictive Density Calculation

Description

dfmix returns (invisibly) the vector containing predictive densities f(x|c,w,Θ), where x standsfor a subvector of y = (y1, . . . , yd)>.

Usage

dfmix(x = NULL, w = NULL, Theta = NULL, ...)

Arguments

x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x.

w a vector or a data frame containing c component weights wl summing to 1.

Theta a matrix or a data frame containing c parametric family types pdfi. One of"normal", "lognormal", "Weibull", "gamma", "binomial", "Poisson" or"Dirac". Component parameters theta1.i follow the parametric family types.One of µil for normal and lognormal distributions and θil for Weibull, gamma,binomial, Poisson and Dirac distributions. Component parameters theta2.ifollow theta1.i. One of σil for normal and lognormal distributions, βil forWeibull and gamma distributions and pil for binomial distribution.

... currently not used.

References

Nagode M, Fajdiga M (2011). The REBMIX Algorithm for the Univariate Finite Mixture Estima-tion. Communications in Statistics - Theory and Methods, 40(5), 876-892.

Nagode M, Fajdiga M (2011). The REBMIX Algorithm for the Multivariate Finite Mixture Es-timation. Communications in Statistics - Theory and Methods, 40(11), 2022-2034.

Examples

## Generate simulated dataset.

Theta <- rbind(pdf1 = rep("normal", 2),theta1.1 = c(10, 20),theta2.1 = c(3.0, 2.0),pdf1 = rep("Weibull", 2),theta1.1 = c(3, 2),theta2.1 = c(20, 10))

simulated <- RNGMIX(Dataset = "simulated",rseed = -1,

Page 13: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

dfmix 13

n = c(15, 25),Theta = Theta)

## Estimate number of components, component weights and component parameters.

simulatedest <- REBMIX(Dataset = simulated$Dataset,Preprocessing = "Parzen window",D = 0.025,cmax = 4,Criterion = "BIC",Variables = c("continuous", "continuous"),pdf = c("normal", "Weibull"),K = 8,b = 0.0)

## Preprocess and plot finite mixtures.

opar <- plot(simulatedest)

par(opar)

y1f <- demix(x = simulatedest$Dataset[[1]][, 1, drop = FALSE],Preprocessing = "Parzen window",Variables = "continuous",k = 8)

y1 <- seq(from = min(y1f[, 1]), to = max(y1f[, 1]), length.out = 200)

f1 <- dfmix(x = cbind(y1), w = simulatedest$w[[1]], simulatedest$Theta[[1]][1:3,])

y2f <- demix(x = simulatedest$Dataset[[1]][, 2, drop = FALSE],Preprocessing = "Parzen window",Variables = "continuous",k = 8)

y2 <- seq(from = min(y2f[, 1]), to = max(y2f[, 1]), length.out = 200)

f2 <- dfmix(x = cbind(y2), w = simulatedest$w[[1]], simulatedest$Theta[[1]][4:6,])

opar <- par(mfrow = c(1, 2))

plot(y1, f1, xlab = bquote(y[1]), ylab = bquote(f(y[1])), type = "l", col = "blue")

points(y1f, pch = 3, col = "red")

plot(y2, f2, xlab = bquote(y[2]), ylab = bquote(f(y[2])), type = "l", col = "blue")

points(y2f, pch = 3, col = "red")

par(opar)

Page 14: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

14 galaxy

galaxy Galaxy Dataset

Description

The unfilled survey of the Corona Borealis region contains the velocities of 82 galaxies from 6 wellseparated conic sections of space.

Usage

galaxy

Format

galaxy is a data frame with 82 cases (rows) and 1 continuous variable (columns) called Velocity.

Source

Roeder K (1990). Density Estimation with Confidence Sets Exemplified by Superclusters and Voidsin the Galaxies. Journal of American Statistical Association, 85, 617-624.

References

Richardson S, Green PJ (1997). On Bayesian Analysis of Mixtures with an Unknown Number ofComponents. Journal of the Royal Statistical Society B, 59, 731-792.

McLachlan GJ, Peel D (1997). Contribution to the Discussion of Paper by S. Richardson andP.J. Green. Journal of the Royal Statistical Society B, 59, 779-780.

Stephens M (2000). Bayesian Analysis of Mixture Models with an Unknown Number of Com-ponents - An Alternative to Reversible Jump Methods. The Annals of Statistics, 28, 40-74.

Examples

data("galaxy")

colnames(galaxy)

## Write dataset into tab delimited ASCII file.

write.table(galaxy, file = "galaxy.txt", sep = "\t",eol = "\n", row.names = FALSE, col.names = FALSE)

## Write dataset into rda file.

save(galaxy, file = "galaxy.rda")

Page 15: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

HQC 15

HQC Hannan-Quinn Information Criterion

Description

Returns the Hannan-Quinn information criterion at pos for class REBMIX.

Usage

## S3 method for class ’REBMIX’HQC(object, pos = 1, ...)

Arguments

object an object of class REBMIX.pos a desired row number in summary for which the information criterion is calcu-

lated. The default value is 1.... currently not used.

References

Hannan EJ, Quinn BG (1979). The Determination of the Order of an Autoregression. Journal ofthe Royal Statistical Society B, 41, 190-195.

ICL Integrated Classification Likelihood Criterion

Description

Returns the integrated classification likelihood criterion at pos for class REBMIX.

Usage

## S3 method for class ’REBMIX’ICL(object, pos = 1, ...)

Arguments

object an object of class REBMIX.pos a desired row number in summary for which the information criterion is calcu-

lated. The default value is 1.... currently not used.

References

Biernacki C, Celeux G, Govaert G (1998). Assessing a Mixture Model for Clustering with theIntegrated Classification Likelihood. Technical Report 3521, INRIA, Rhone-Alpes.

Page 16: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

16 logL

ICLBIC Approximate Integrated Classification Likelihood Criterion

Description

Returns the approximate integrated classification likelihood criterion at pos for class REBMIX.

Usage

## S3 method for class ’REBMIX’ICLBIC(object, pos = 1, ...)

Arguments

object an object of class REBMIX.pos a desired row number in summary for which the information criterion is calcu-

lated. The default value is 1.... currently not used.

References

Biernacki C, Celeux G, Govaert G (1998). Assessing a Mixture Model for Clustering with theIntegrated Classification Likelihood. Technical Report 3521, INRIA, Rhone-Alpes.

logL Log Likelihood

Description

Returns the log likelihood at pos for class REBMIX.

Usage

## S3 method for class ’REBMIX’logL(object, pos = 1, ...)

Arguments

object an object of class REBMIX.pos a desired row number in summary for which the information criterion is calcu-

lated. The default value is 1.... currently not used.

References

McLachlan GJ, Peel D (2000). Finite Mixture Models. John Wiley & Sons, New York.

Page 17: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

MDL 17

MDL Minimum Description Length

Description

Returns the minimum desription length at pos for class REBMIX.

Usage

## S3 method for class ’REBMIX’MDL2(object, pos = 1, ...)## S3 method for class ’REBMIX’MDL5(object, pos = 1, ...)

Arguments

object an object of class REBMIX.pos a desired row number in summary for which the information criterion is calcu-

lated. The default value is 1.... currently not used.

References

Liang Z (1992). Parameter Estimation of Finite Mixture Using the EM-Algorithm and InformationCriteria with Applications to Medical Image Processing. IEEE: Nuclear Science, 39(4), 11-26.

pemix Empirical Distribution Function Calculation

Description

pemix returns (invisibly) the data frame containing observations x1, . . . ,xn and empirical distribu-tion functions F1, . . . , Fn. Vectors x are subvectors of y = (y1, . . . , yd)>.

Usage

pemix(x = NULL, lower.tail = TRUE, log.p = FALSE, ...)

Arguments

x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x.

lower.tail logical. If TRUE (default), probabilities are P [X ≤ x], otherwise, P [X > x].log.p logical. if TRUE, probabilities p are given as log(p).... currently not used.

Page 18: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

18 pfmix

References

Maia C (2011). Multivariate Empirical Cumulative Distribution Functions. http://CRAN.R-project.org/package=mecdf.

Examples

## Generate simulated dataset.

Theta <- rbind(pdf1 = rep("normal", 2),theta1.1 = c(10, 20),theta2.1 = c(3.0, 2.0),pdf1 = rep("normal", 2),theta1.1 = c(3, 2),theta2.1 = c(20, 10))

simulated <- RNGMIX(Dataset = "simulated",rseed = -1,n = c(15, 15),Theta = Theta)

## Preprocess simulated dataset.

y1y2F <- pemix(x = simulated$Dataset[[1]])

y1y2F

pfmix Predictive Distribution Function Calculation

Description

pfmix returns (invisibly) the vector containing predictive distribution functionsF (x|c,w,Θ), wherex stands for a subvector of y = (y1, . . . , yd)>.

Usage

pfmix(x = NULL, w = NULL, Theta = NULL, lower.tail = TRUE, log.p = FALSE, ...)

Arguments

x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x.

w a vector or a data frame containing c component weights wl summing to 1.

Theta a matrix or a data frame containing c parametric family types pdfi. One of"normal", "lognormal", "Weibull", "gamma", "binomial", "Poisson" or"Dirac". Component parameters theta1.i follow the parametric family types.One of µil for normal and lognormal distributions and θil for Weibull, gamma,binomial, Poisson and Dirac distributions. Component parameters theta2.i

Page 19: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

pfmix 19

follow theta1.i. One of σil for normal and lognormal distributions, βil forWeibull and gamma distributions and pil for binomial distribution.

lower.tail logical. If TRUE (default), probabilities are P [X ≤ x], otherwise, P [X > x].

log.p logical. if TRUE, probabilities p are given as log(p).

... currently not used.

References

Nagode M, Fajdiga M (2011). The REBMIX Algorithm for the Univariate Finite Mixture Estima-tion. Communications in Statistics - Theory and Methods, 40(5), 876-892.

Nagode M, Fajdiga M (2011). The REBMIX Algorithm for the Multivariate Finite Mixture Es-timation. Communications in Statistics - Theory and Methods, 40(11), 2022-2034.

Examples

## Generate simulated dataset.

Theta <- rbind(pdf1 = rep("normal", 2),theta1.1 = c(10, 20),theta2.1 = c(3.0, 2.0),pdf1 = rep("Weibull", 2),theta1.1 = c(3, 2),theta2.1 = c(20, 10))

simulated <- RNGMIX(Dataset = "simulated",rseed = -1,n = c(15, 25),Theta = Theta)

## Estimate number of components, component weights and component parameters.

simulatedest <- REBMIX(Dataset = simulated$Dataset,Preprocessing = "Parzen window",D = 0.025,cmax = 4,Criterion = "BIC",Variables = c("continuous", "continuous"),pdf = c("normal", "Weibull"),K = 8,b = 0.0)

## Preprocess and plot finite mixtures.

opar <- plot(simulatedest)

par(opar)

y1f <- pemix(x = simulatedest$Dataset[[1]][, 1, drop = FALSE],Preprocessing = "Parzen window",Variables = "continuous",

Page 20: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

20 plot.REBMIX

k = 8)

y1 <- seq(from = min(y1f[, 1]), to = max(y1f[, 1]), length.out = 200)

f1 <- pfmix(x = cbind(y1), w = simulatedest$w[[1]], simulatedest$Theta[[1]][1:3,])

y2f <- pemix(x = simulatedest$Dataset[[1]][, 2, drop = FALSE],Preprocessing = "Parzen window",Variables = "continuous",k = 8)

y2 <- seq(from = min(y2f[, 1]), to = max(y2f[, 1]), length.out = 200)

f2 <- pfmix(x = cbind(y2), w = simulatedest$w[[1]], simulatedest$Theta[[1]][4:6,])

opar <- par(mfrow = c(1, 2))

plot(y1, f1, xlab = bquote(y[1]), ylab = bquote(F(y[1])), type = "l", col = "blue")

points(y1f, pch = 3, col = "red")

plot(y2, f2, xlab = bquote(y[2]), ylab = bquote(F(y[2])), type = "l", col = "blue")

points(y2f, pch = 3, col = "red")

par(opar)

plot.REBMIX Plots Univariate or Multivariate REBMIX Output

Description

Takes an object of class REBMIX and plots empirical and predictive univariate d = 1 mixture den-sities or pairs of empirical and predictive bivariate marginal mixture densities for d > 1. Theempirical densities follow the histogram, Parzen window or k-nearest neighbour approach.

Usage

## S3 method for class ’REBMIX’plot(x, pos = 1, what = c("density"),

nrow = 1, ncol = 1, npts = 200, cex = 0.8, fg = "black",lty = "solid", lwd = 1, pty = "m", tcl = 0.5,plot.cex = 0.8, plot.pch = 19, contour.drawlabels = FALSE,contour.labcex = 0.8, contour.method = "flattest",contour.nlevels = 12, ...)

Arguments

x an object of class REBMIX.

Page 21: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

plot.REBMIX 21

pos a desired row number in x$summary to be ploted. The default value is 1.

what a character vector, giving the plot types. One of "density", "marginal", "c","IC", "logL", "D" or "distribution". The default value is "density".

nrow a desired number of rows in which the empirical and predictive densities are tobe ploted. The default value is 1.

ncol a desired number of columns in which the empirical and predictive densities areto be ploted. The default value is 1.

npts a number of points at which the predictive densities are to be ploted. The defaultvalue is 200.

cex a numerical value giving the amount by which the plotting text and symbolsshould be magnified relative to the default, see also par. The default value is0.8.

fg a colour used for things like axes and boxes around plots, see also par. Thedefault value is "black".

lty a line type, see also par. The default value is "solid".

lwd a line width, see also par. The default value is 1.

pty a character specifying the type of the plot region to be used. One of "s" gen-erating a square plotting region or "m" generating the maximal plotting region.The default value is "m".

tcl a length of tick marks as a fraction of the height of a line of the text, see alsopar. The default value is 0.5.

plot.cex a numerical vector giving the amount by which plotting characters and symbolsshould be scaled relative to the default. It works as a multiple of par("cex").NULL and NA are equivalent to 1.0. Note that this does not affect annotation, seealso plot.default. The default value is 0.8.

plot.pch a vector of plotting characters or symbols, see also points. The default value is19.

contour.drawlabels

logical. The contours are labelled if TRUE. The default value is FALSE.

contour.labcex cex for contour labelling. The default value is 0.8. This is an absolute size, nota multiple of par("cex").

contour.method a character string specifying where the labels will be located. The possible val-ues are "simple", "edge" and default "flattest", see also contour.

contour.nlevels

a number of desired contour levels. The default value is 12.

... further arguments to par.

Value

par is returned in an invisible named list. Such a list can be passed as an argument to par to restorethe parameter values.

References

Bishop CM (1995). Neural Networks for Pattern Recognition. Clarendon Press, Oxford.

Page 22: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

22 PRD

Examples

data("wine")

colnames(wine)

## Remove Cultivar column from wine dataset.

winecolnames <- !(colnames(wine) %in% "Cultivar")

wine <- wine[ , winecolnames]

## Determine number of dimensions d and wine dataset size n.

d <- ncol(wine)n <- nrow(wine)

## Estimate number of components, component weights and component parameters.

v <- c(as.integer(1 + log2(n)), ## Minimum v follows the Sturges rule.as.integer(2 * n^0.5)) ## Maximum v follows the RootN rule.

## Number of classes or nearest neighbours to be processed.

N <- as.integer(log(v[2] / (v[1] + 1)) / log(1 + 1 / v[1]))

K <- c(v[1], as.integer((v[1] + 1) * (1 + 1 / v[1])^(0:N)), v[2])

wineest <- REBMIX(Dataset = list(wine = wine),Preprocessing = "Parzen window",Criterion = "ICL-BIC",Variables = rep("continuous", d),pdf = rep("normal", d),K = K)

## Plot finite mixture.

plot(wineest, what = c("density", "marginal", "c", "IC", "logL", "D","distribution"), nrow = 4, ncol = 5, npts = 500, pty = "s")

PRD Total of Positive Relative Deviations

Description

Returns the total of positive relative deviations D at pos for class REBMIX.

Usage

## S3 method for class ’REBMIX’PRD(object, pos = 1, ...)

Page 23: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

predict.list 23

Arguments

object an object of class REBMIX.

pos a desired row number in summary for which the information criterion is calcu-lated. The default value is 1.

... currently not used.

References

Nagode M, Fajdiga M (2011). The REBMIX Algorithm for the Univariate Finite Mixture Estima-tion. Communications in Statistics - Theory and Methods, 40(5), 876-892.

Nagode M, Fajdiga M (2011). The REBMIX Algorithm for the Multivariate Finite Mixture Es-timation. Communications in Statistics - Theory and Methods, 40(11), 2022-2034.

predict.list Predicts Class Membership Based Upon a Model Trained by REBMIX

Description

Takes a list of objects of class REBMIX, prior probabilities P and data frame Dataset with test classesand returns predictive classes.

Usage

## S3 method for class ’list’predict(object,P = NULL,Dataset = NULL, ...)

Arguments

object a list of objects of class REBMIX containing outputs for a finite set of size s ofclasses Ωg .

P a vector of length s containing prior probabilites.

Dataset a data frame containing test classes.

... further arguments to predict.

Value

predictive classes.

References

Duda RO, Hart PE (1973). Pattern Classification and Scene Analysis. John Wiley & Sons, NewYork.

Page 24: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

24 predict.list

Examples

## Not run:data("adult")

## Find complete cases.

adult <- adult[complete.cases(adult), ]

## Map metric attributes.

adult[["Capital.Loss"]] <- ordered(cut(adult[["Capital.Loss"]], 2000))adult[["Capital.Gain"]] <- ordered(cut(adult[["Capital.Gain"]], 2000))

## Show level attributes for binary and discrete variables.

levels(adult[["Type"]])levels(adult[["Workclass"]])levels(adult[["Education"]])levels(adult[["Marital.Status"]])levels(adult[["Occupation"]])levels(adult[["Relationship"]])levels(adult[["Race"]])levels(adult[["Sex"]])levels(adult[["Native.Country"]])levels(adult[["Income"]])

## Replace levels with numbers.

adult <- as.data.frame(data.matrix(adult))

## Levels should start with 0 for discrete distributions except for the## Dirac distribution.

f <- c("Type", "Workclass", "Education", "Marital.Status", "Occupation","Relationship", "Race", "Sex", "Native.Country", "Income")

adult[, f] <- adult[, f] - 1

## Split adult dataset into two train subsets for the two Incomes## and remove Type and Income columns.

trainle50k <- subset(adult, subset = (Type == 1) & (Income == 0),select = c(-Type, -Income))

traingt50k <- subset(adult, subset = (Type == 1) & (Income == 1),select = c(-Type, -Income))

trainall <- subset(adult, subset = Type == 1, select = c(-Type, -Income))

train <- as.factor(subset(adult, subset = Type == 1, select = c(Income))[, 1])

## Extract test dataset form adult dataset and remove Type## and Income columns.

Page 25: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

predict.list 25

testle50k <- subset(adult, subset = (Type == 0) & (Income == 0),select = c(-Type, -Income))

testgt50k <- subset(adult, subset = (Type == 0) & (Income == 1),select = c(-Type, -Income))

testall <- subset(adult, subset = Type == 0, select = c(-Type, -Income))

test <- as.factor(subset(adult, subset = Type == 0, select = c(Income))[, 1])

## Calculate prior probabilities.

P <- c(nrow(trainle50k), nrow(traingt50k))

P <- P / sum(P)

## Estimate number of components, component weights and component## parameters for Naive Bayes.

Variables <- c("continuous", "discrete", "continuous", "discrete","discrete", "discrete", "discrete", "discrete", "discrete","discrete", "discrete", "discrete", "discrete", "discrete")

pdf <- c("normal", "Dirac", "normal", "Dirac","Dirac", "Dirac", "Dirac", "Dirac", "Dirac","Dirac", "Dirac", "Dirac", "Dirac", "Dirac")

K <- list(13:44, 1, 13:44, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)

ymin <- as.numeric(apply(trainall, 2, min))ymax <- as.numeric(apply(trainall, 2, max))

## In case of zero occurrences for discrete and integer variables## Laplace smoothing is applied.

LS <- list(NA, 0:7, NA, 0:15, ymin[5]:ymax[5], 0:6, 0:13,0:5, 0:4, 0:1, NA, NA, ymin[13]:ymax[13], 0:40)

adultest <- list()

for (i in c(1:14)) adultest[[i]] <- REBMIX(Dataset = list(as.data.frame(trainle50k[, i]),as.data.frame(traingt50k[, i])),Preprocessing = "histogram",D = if (Variables[i] == "continuous") 0.025 else 0.0,cmax = if (Variables[i] == "continuous") 15 else 100,Criterion = "D",Variables = Variables[i],pdf = pdf[i],K = K[[i]],ymin = ymin[i],ymax = ymax[i],b = 0.0)

Page 26: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

26 predict.list

plot(adultest[[i]], pos = 1, nrow = 1, ncol = 1)plot(adultest[[i]], pos = 2, nrow = 1, ncol = 1)

## Best-first feature subset selection.

c <- NULL; rvs <- 1:14; Error <- 1.0

for (i in 1:14) k <- NA

for (j in rvs) predictive <- predict(adultest[c(c, j)],

P = P,Dataset = trainall[, c(c, j)])

CM <- table(train, predictive)

Accuracy <- (CM[1, 1] + CM[2, 2]) / sum(CM)

if (1.0 - Accuracy < Error) Error <- 1.0 - Accuracy; k <- j

if (is.na(k)) break

else

c <- c(c, k); rvs <- rvs[-which(rvs == k)]

## Error on train dataset.

Error

## Selected features.

c

predictive <- predict(adultest[c],P = P,Dataset = testall[, c])

CM <- table(test, predictive)

Accuracy <- (CM[1, 1] + CM[2, 2]) / sum(CM)

Error <- 1.0 - Accuracy

## Error on test dataset.

Page 27: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

print.boot.REBMIX 27

Error

## End(Not run)

print.boot.REBMIX Prints Univariate or Multivariate boot.REBMIX Output

Description

Returns the print output for class boot.REBMIX.

Usage

## S3 method for class ’boot.REBMIX’print(x, ...)

Arguments

x an object of class boot.REBMIX.

... further arguments to print.

Value

print.boot.REBMIX returns (invisibly) the full value of x itself.

Examples

## Generate and print simulated dataset.

Theta <- rbind(pdf = rep("Weibull", 2),theta1 = c(10, 20),theta2 = c(2.0, 6.0))

simulated <- RNGMIX(Dataset = "simulated",rseed = -1,n = c(30, 70),Theta = Theta)

## Estimate number of components, component weights and component parameters.

simulatedest <- REBMIX(Dataset = simulated$Dataset,Preprocessing = "histogram",D = 0.025,cmax = 4,Criterion = "AIC",Variables = "continuous",pdf = "Weibull",K = 20:40,

Page 28: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

28 print.REBMIX

b = 1.0)

## Bootstrap finite mixture.

simulatedboot <- boot.REBMIX(x = simulatedest, Bootstrap = "nonparametric", B = 20)

simulatedboot

print.REBMIX Prints Univariate or Multivariate REBMIX Output

Description

Returns the print output at pos for class REBMIX.

Usage

## S3 method for class ’REBMIX’print(x, pos = 1, ...)

Arguments

x an object of class REBMIX.

pos a desired row number in summary to be printed. The default value is 1.

... further arguments to print.

Value

print.REBMIX returns (invisibly) the full value of x itself.

Examples

## Generate and print simulated dataset.

Theta <- rbind(pdf = rep("binomial", 3),theta1 = c(10, 10, 10),theta2 = c(0.1, 0.4, 0.9))

simulated <- RNGMIX(Dataset = "simulated",rseed = -1,n = c(40, 60, 50),Theta = Theta)

simulated

## Estimate number of components, component weights and component parameters.

simulatedest <- REBMIX(Dataset = simulated$Dataset,Preprocessing = "histogram",

Page 29: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

print.RNGMIX 29

D = 0.025,cmax = 6,Criterion = "AIC",Variables = "discrete",pdf = "binomial",Theta1 = 10,K = 1,b = 1.0)

## Print and plot finite mixture.

simulatedest

plot(simulatedest)

print.RNGMIX Prints Univariate or Multivariate RNGMIX Output

Description

Returns the print output for class RNGMIX.

Usage

## S3 method for class ’RNGMIX’print(x, ...)

Arguments

x an object of class RNGMIX.

... further arguments to print.

Value

print.RNGMIX returns (invisibly) the full value of x itself.

Examples

## Generate and print simulated dataset.

Theta <- rbind(pdf = rep("Poisson", 2),theta1 = c(1.0, 10.0))

simulated <- RNGMIX(Dataset = paste("simulated_", 1:5, sep = ""),rseed = -1,n = c(100, 50),Theta = Theta)

print(simulated, pos = 3)

Page 30: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

30 REBMIX

## Estimate number of components, component weights and component parameters.

simulatedest <- REBMIX(Dataset = simulated$Dataset,Preprocessing = "histogram",D = 0.025,cmax = 4,Criterion = "AIC",Variables = "discrete",pdf = "Poisson",K = 1,b = 1.0)

## Print and plot finite mixture.

print(simulatedest, pos = 2)

plot(simulatedest, pos = 3)

REBMIX REBMIX Algorithm for Univariate or Multivariate Finite Mixture Es-timation

Description

Returns the REBMIX algorithm output for mixtures of conditionally independent normal, lognor-mal, Weibull, gamma, binomial, Poisson or Dirac component densities.

Usage

REBMIX(Dataset = NULL, Preprocessing = NULL, D = 0.025, cmax = 15,Criterion = "AIC", Variables = NULL, pdf = NULL,Theta1 = NULL, Theta2 = NULL, K = NULL, ymin = NULL,ymax = NULL, b = 1.0, ar = 0.1, Restraints = "loose")

Arguments

Dataset a list of data frames of size n×d containing d-dimensional datasets. Each of thed columns represents one random variable. Number of observations n equalsthe number of rows in the datasets.

Preprocessing a character vector, giving the preprocessing types. One of "histogram", "Parzen window"or "k-nearest neighbour".

D a total of positive relative deviations standing for the maximum acceptable mea-sure of distance between predictive and empirical densities. It satisfies the re-lation 0 ≤ D ≤ 1. The default value is 0.025. However, if components witha low probability of occurrence are expected, it has to decrease. If D = 0 andb = 0 the mixture tends to c = cmax components.

cmax maximum number of components cmax > 0. The default value is 15.

Page 31: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

REBMIX 31

Criterion a character vector giving the infromation criterion types. One of default Akaike"AIC", "AIC3", "AIC4" or "AICc", Bayesian "BIC", consistent Akaike "CAIC",Hannan-Quinn "HQC", minimum description length "MDL2" or "MDL5", approxi-mate weight of evidence "AWE", classification likelihood "CLC", integrated clas-sification likelihood "ICL" or "ICL-BIC", partition coefficient "PC", total ofpositive relative deviations "D" or sum of squares error "SSE".

Variables a character vector of length d containing types of variables. One of "continuous"or "discrete".

pdf a character vector of length d containing continuous or discrete parametric fam-ily types. One of "normal", "lognormal", "Weibull", "gamma", "binomial","Poisson" or "Dirac".

Theta1 a vector of length d containing initial component parameters. One of nil =Number of categories− 1 for "binomial" distribution or "NA" otherwise.

Theta2 a vector of length d containing initial component parameters. The value is NULL.

K a vector or a list of vectors containing numbers of bins v for the histogram andthe Parzen window or numbers of nearest neighbours k for the k-nearest neigh-bour. There is no genuine rule to identify v or k. Consequently, the REBMIXalgorithm identifies them from the set K of input values by minimizing the infor-mation criterion. The Sturges rule v = 1 + log2(n), Log10 rule v = 10log10(n)or RootN rule v = 2

√n can be applied to estimate the limiting numbers of

bins or the rule of thumb k =√n to guess the intermediate number of nearest

neighbours.

ymin a vector of length d containing minimum observations. The default value isNULL.

ymax a vector of length d containing maximum observations. The default value isNULL.

b minimum weight multiplier 0 ≤ b ≤ 1 influences the number of componentswmin = 2Dmin((l − 1)b+ 1). The default value is 1.0.

ar acceleration rate 0 < ar ≤ 1. The default value is 0.1 and in most cases doesnot have to be altered.

Restraints a character string giving the restraints type. One of "rigid" or default "loose".The rigid restraints are obsolete and applicable for well separated componentsonly.

Value

Dataset a list of data frames of size n×d containing d-dimensional datasets. Each of thed columns represents one random variable. Number of observations n equalsthe number of rows in the datasets.

w a list of data frames each containing c component weights wl summing to 1.

Theta a list of data frames each containing c parametric family types pdfi. One of"normal", "lognormal", "Weibull", "gamma", "binomial", "Poisson" or"Dirac". Component parameters theta1.i follow the parametric family types.One of µil for normal and lognormal distributions and θil for Weibull, gamma,binomial, Poisson and Dirac distributions. Component parameters theta2.i

Page 32: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

32 REBMIX

follow theta1.i. One of σil for normal and lognormal distributions, βil forWeibull and gamma distributions and pil for binomial distribution.

Variables a character vector containing types of variables. One of "continuous" or "discrete".

pdf a character vector containing continuous or discrete parametric family types.One of "normal", "lognormal", "Weibull", "gamma", "binomial", "Poisson"or "Dirac".

Theta1 a vector containing initial component parameters. One of nil = Number of categories−1 for "binomial" distribution or "NA" otherwise.

Theta2 a vector containing initial component parameters. The value is NULL.

summary a data frame with additional information about dataset, preprocessing, D, cmax,information criterion type, b, ar, restraints type, optimal c, optimal v or k, yi0,optimal hi, information criterion IC, log likelihood logL and degrees of free-dom M .

pos position in the summary data frame at which log likelihood logL attains its max-imum.

all.Imax a list of all numbers of iterations.

all.c a list of all numbers of components.

all.IC a list of all information criteria.

all.logL a list of all log lekelihoods.

all.D a list of all totals of positive relative deviations.

References

Sturges HA (1926). The choice of a class interval. Journal of American Statistical Association, 21,65-66.

Nagode M, Fajdiga M (1998). A General Multi-Modal Probability Density Function Suitable forthe Rainflow Ranges of Stationary Random Processes. International Journal of Fatigue, 20, 211-223.

Nagode M, Fajdiga M (2000). An Improved Algorithm for Parameter Estimation Suitable for MixedWeibull Distributions. International Journal of Fatigue, 22, 75-80.

Nagode M, Klemenc J, Fajdiga M (2001). Parametric Modelling and Scatter Prediction of RainflowMatrices. International Journal of Fatigue, 23, 525-532.

Nagode M, Fajdiga M (2006). An Alternative Perspective on the Mixture Estimation Problem.Reliability Engineering & System Safety, 91, 388-397.

Nagode M, Fajdiga M (2011). The REBMIX Algorithm for the Univariate Finite Mixture Esti-mation. Communications in Statistics - Theory and Methods, 40(5), 876-892.

Nagode M, Fajdiga M (2011). The REBMIX Algorithm for the Multivariate Finite Mixture Es-timation. Communications in Statistics - Theory and Methods. 40(11), 2022-2034.

Page 33: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

RNGMIX 33

Examples

## Generate the complex 1 dataset.

n <- c(998, 263, 1086, 487, 213, 1076, 232,784, 840, 461, 773, 24, 811, 1091, 861)

Theta <- rbind(pdf = "normal",theta1 = c(688.4, 265.1, 30.8, 934, 561.6, 854.9, 883.7,758.3, 189.3, 919.3, 98, 143, 202.5, 628, 977),theta2 = c(12.4, 14.6, 14.8, 8.4, 11.7, 9.2, 6.3, 10.2,9.5, 8.1, 14.7, 11.7, 7.4, 10.1, 14.6))

complex1 <- RNGMIX(Dataset = "complex1",rseed = -1,n = n,Theta = Theta)

complex1

complex1$Dataset[[1]][1:20, ]

## Estimate number of components, component weights and component parameters.

v <- c(as.integer(1 + log2(sum(n))), ## Minimum v follows the Sturges rule.as.integer(2 * sum(n)^0.5)) ## Maximum v follows the RootN rule.

## Number of classes or nearest neighbours to be processed.

N <- as.integer(log(v[2] / (v[1] + 1)) / log(1 + 1 / v[1]))

K <- c(v[1], as.integer((v[1] + 1) * (1 + 1 / v[1])^(0:N)))

complex1est <- REBMIX(Dataset = complex1$Dataset,Preprocessing = "histogram",D = 0.0025,cmax = 30,Criterion = "BIC",Variables = "continuous",pdf = "normal",K = K,b = 0.0)

complex1est

## Plot the finite mixture.

plot(complex1est, npts = 1000)

RNGMIX Random Univariate or Multivariate Finite Mixture Generation

Page 34: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

34 RNGMIX

Description

Returns univariate or multivariate random datasets for mixtures of conditionally independent nor-mal, lognormal, Weibull, gamma, binomial, Poisson or Dirac component densities.

Usage

RNGMIX(Dataset = NULL, rseed = -1, n = NULL, Theta = NULL)

Arguments

Dataset a character vector containing list names of data frames of size n × d that d-dimensional datasets are written in.

rseed set the random seed to any negative integer value to initialize the sequence. Thefirst file in Dataset corresponds to it. For each next file the random seed isdecremented rseed = rseed − 1. The default value is -1.

n a vector or a data frame containing number of observations in classes nl, wherenumber of observations n =

∑cl=1 nl.

Theta a matrix or a data frame containing c parametric family types pdfi. One of"normal", "lognormal", "Weibull", "gamma", "binomial", "Poisson" or"Dirac". Component parameters theta1.i follow the parametric family types.One of µil for normal and lognormal distributions and θil for Weibull, gamma,binomial, Poisson and Dirac distributions. Component parameters theta2.ifollow theta1.i. One of σil for normal and lognormal distributions, βil forWeibull and gamma distributions and pil for binomial distribution.

Details

RNGMIX is based on the "Minimal" random number generator ran1 of Park and Miller with theBays-Durham shuffle and added safeguards that returns a uniform random deviate between 0.0 and1.0 (exclusive of the endpoint values).

Value

Dataset a list of data frames of size n×d containing d-dimensional datasets. Each of thed columns represents one random variable. Number of observations n equalsthe number of rows in the datasets.

w a data frame containing c component weights wl summing to 1.

Theta a data frame containing c parametric family types pdfi. One of "normal","lognormal", "Weibull", "gamma", "binomial", "Poisson" or "Dirac". Com-ponent parameters theta1.i follow the parametric family types. One of µil fornormal and lognormal distributions and θil for Weibull, gamma, binomial, Pois-son and Dirac distributions. Component parameters theta2.i follow theta1.i.One of σil for normal and lognormal distributions, βil for Weibull and gammadistributions and pil for binomial distribution.

Variables a character vector containing types of variables. One of "continuous" or "discrete".

ymin a vector of length d containing minimum observations.

ymax a vector of length d containing maximum observations.

Page 35: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

RNGMIX 35

References

Press WH, Teukolsky SA, Vetterling VT, Flannery BP (1992). Numerical Recipes in C: The Art ofScientific Computing. Cambridge University Press, Cambridge.

Examples

## Generate and print simulated dataset.

n <- c(75, 100, 125, 150, 175)

Theta <- rbind(pdf1 = rep("normal", 5),theta1.1 = c(10, 8.5, 12, 13, 7),theta2.1 = c(1, 1, 1, 2, 3),pdf2 = rep("normal", 5),theta1.2 = c(12, 10, 14, 15, 9),theta2.2 = c(1, 1, 1, 2, 3),pdf3 = rep("normal", 5),theta1.3 = c(10, 8.5, 12, 7, 13),theta2.3 = c(1, 1, 1, 2, 3),pdf4 = rep("normal", 5),theta1.4 = c(12, 10.5, 14, 9, 15),theta2.4 = c(1, 1, 1, 2, 3))

simulated <- RNGMIX(Dataset = paste("simulated_", 1:25, sep = ""),rseed = -1,n = n,Theta = Theta)

## Generate and print simulated dataset.

Theta <- rbind(pdf1 = rep("normal", 5),theta1.1 = c(10, 8.5, 12, 13, 7),theta2.1 = c(1, 1, 1, 2, 3),pdf2 = rep("Weibull", 5),theta1.2 = c(12, 10, 14, 15, 9),theta2.2 = c(2, 4.1, 3.2, 7.1, 5.3),pdf3 = rep("binomial", 5),theta1.3 = c(10, 8, 12, 7, 13),theta2.3 = c(0.1, 0.5, 0.9, 0.4, 0.2),pdf4 = rep("Dirac", 5),theta1.4 = c(1, 1, 1, 1, 1))

simulated <- RNGMIX(Dataset = paste("simulated_", 1:25, sep = ""),rseed = -1,n = n,Theta = Theta)

simulated

## Generate and print simulated dataset.

Theta <- rbind(pdf = rep("W", 5),

Page 36: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

36 summary.boot.REBMIX

theta1 = c(12, 10, 14, 15, 9),theta2 = c(2, 4.1, 3.2, 7.1, 5.3))

simulated <- RNGMIX(Dataset = paste("simulated_", 1:2, sep = ""),rseed = -1,n = n,Theta = Theta)

simulated

SSE Sum of Squares Error

Description

Returns the sum of squares error at pos for class REBMIX.

Usage

## S3 method for class ’REBMIX’SSE(object, pos = 1, ...)

Arguments

object an object of class REBMIX.

pos a desired row number in summary for which the information criterion is calcu-lated. The default value is 1.

... currently not used.

References

Bishop CM (1995). Neural Networks for Pattern Recognition. Clarendon Press, Oxford.

summary.boot.REBMIX Prints Univariate or Multivariate boot.REBMIX Summary

Description

Returns the boot.summary output for class boot.REBMIX.

Usage

## S3 method for class ’boot.REBMIX’summary(object, ...)

Page 37: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

summary.REBMIX 37

Arguments

object an object of class boot.REBMIX.

... further arguments to print.

Examples

## Generate and print simulated dataset.

Theta <- rbind(pdf = rep("normal", 4),theta1 = c(1, 3, 6, 9),theta2 = c(0.1, 0.2, 0.3, 0.4))

simulated <- RNGMIX(Dataset = "simulated",rseed = -1,n = c(10, 20, 30, 40),Theta = Theta)

## Estimate number of components, component weights and component parameters.

simulatedest <- REBMIX(Dataset = simulated$Dataset,Preprocessing = "histogram",D = 0.025,cmax = 8,Criterion = "AIC",Variables = "continuous",pdf = "normal",K = 30:40,b = 1.0)

## Bootstrap finite mixture and print summary.

simulatedboot <- boot.REBMIX(x = simulatedest, B = 20)

summary(simulatedboot)

summary.REBMIX Prints Univariate or Multivariate REBMIX Summary

Description

Returns the summary output for class REBMIX.

Usage

## S3 method for class ’REBMIX’summary(object, ...)

Page 38: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

38 weibull

Arguments

object an object of class REBMIX.

... further arguments to print.

Examples

## Generate and print simulated dataset.

Theta <- rbind(pdf = rep("Dirac", 4),theta1 = c(1, 3, 4, 8))

simulated <- RNGMIX(Dataset = "simulated",rseed = -1,n = c(40, 20, 50, 10),Theta = Theta)

simulated

## Estimate number of components, component weights and component parameters.

simulatedest <- REBMIX(Dataset = simulated$Dataset,Preprocessing = "histogram",D = 0.025,cmax = 8,Criterion = "D",Variables = "discrete",pdf = "Dirac",K = 1,b = 0.0)

## Print summary and plot finite mixture.

summary(simulatedest)

plot(simulatedest)

weibull Weibull Dataset 8.1

Description

The complete data are the failure times in weeks.

Usage

weibull

Page 39: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

wine 39

Format

weibull is a data frame with 50 cases (rows) and 1 variables (columns) named:

1. Failure.Time continuous.

References

Murthy DNP, Jiang MXR (2004). Weibull Models. John Wiley & Sons, New York.

Examples

data("weibull")

wine Wine Recognition Data

Description

These data are the results of a chemical analysis of wines grown in the same region in Italy but de-rived from three different cultivars (1-3). The analysis determined the quantities of 13 constituents:alcohol, malic acid, ash, alcalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phe-nols, proanthocyanins, colour intensity, hue, OD280/OD315 of diluted wines, and proline found ineach of the three types of the wines. The number of instances in classes 1 to 3 is 59, 71 and 48,respectively.

Usage

wine

Format

wine is a data frame with 178 cases (rows) and 14 variables (columns) named:

1. Alcohol continuous.2. Malic.Acid continuous.3. Ash continuous.4. Alcalinity.of.Ash continuous.5. Magnesium continuous.6. Total.Phenols continuous.7. Flavanoids continuous.8. Nonflavanoid.Phenols continuous.9. Proanthocyanins continuous.

10. Color.Intensity continuous.11. Hue continuous.12. OD280.OD315.of.Diluted.Wines continuous.13. Proline continuous.14. Cultivar continuous.

Page 40: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

40 wine

Source

Frank A, Asuncion A (2010). UCI Machine Learning Repository. http://archive.ics.uci.edu/ml.

References

Roberts SJ, Everson R, Rezek I (2000). Maximum Certainty Data Partitioning. Pattern Recognition,33, 833-839.

Examples

data("wine")

colnames(wine)

## Split wine dataset into three subsets for three Cultivars## and remove Cultivar column from datasets.

winecolnames <- !(colnames(wine) %in% "Cultivar")

wine1 <- wine[wine$Cultivar == 1, winecolnames]wine2 <- wine[wine$Cultivar == 2, winecolnames]wine3 <- wine[wine$Cultivar == 3, winecolnames]

wine <- wine[ , winecolnames]

## Write datasets without Cultivar information into tab## delimited ASCII files.

write.table(wine, file = "wine.txt", sep = "\t",eol = "\n", row.names = FALSE, col.names = FALSE)

write.table(wine1, file = "wine1.txt", sep = "\t",eol = "\n", row.names = FALSE, col.names = FALSE)

write.table(wine2, file = "wine2.txt", sep = "\t",eol = "\n", row.names = FALSE, col.names = FALSE)

write.table(wine3, file = "wine3.txt", sep = "\t",eol = "\n", row.names = FALSE, col.names = FALSE)

Page 41: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

Index

∗Topic bootstrapboot, 7

∗Topic classificationpredict.list, 23

∗Topic datasetsadult, 2galaxy, 14weibull, 38wine, 39

∗Topic distributionsdemix, 10dfmix, 12pemix, 17pfmix, 18

∗Topic information criterionAIC, 5AWE, 6BIC, 6CLC, 8HQC, 15ICL, 15ICLBIC, 16logL, 16MDL, 17PRD, 22SSE, 36

∗Topic parameter estimationREBMIX, 30

∗Topic plotplot.REBMIX, 20

∗Topic printcoef.REBMIX, 9print.boot.REBMIX, 27print.REBMIX, 28print.RNGMIX, 29summary.boot.REBMIX, 36summary.REBMIX, 37

∗Topic random number generationRNGMIX, 33

adult, 2AIC, 5AIC3 (AIC), 5AIC4 (AIC), 5AICc (AIC), 5AWE, 6

BIC, 6boot, 7

CAIC (AIC), 5CLC, 8coef.REBMIX, 9contour, 21

demix, 10dfmix, 12

galaxy, 14

HQC, 15

ICL, 15ICLBIC, 16

logL, 16

MDL, 17MDL2 (MDL), 17MDL5 (MDL), 17

par, 21pemix, 17pfmix, 18plot.default, 21plot.REBMIX, 20points, 21PRD, 22predict, 23predict.list, 23print, 9, 27–29, 37, 38

41

Page 42: Package ‘rebmix’ · demix 11 Arguments x a vector, a matrix or a data frame containing continuous or discrete vector ob-servations x. Preprocessing a preprocessing type. One of

42 INDEX

print.boot.REBMIX, 27print.REBMIX, 28print.RNGMIX, 29

REBMIX, 30RNGMIX, 33

sample, 7SSE, 36summary.boot.REBMIX, 36summary.REBMIX, 37

weibull, 38wine, 39