
Algorithmic Criminology

Richard Berk
Department of Statistics
Department of Criminology
University of Pennsylvania

11/5/2012

Abstract

Computational criminology has been seen primarily as computer-intensive simulations of criminal wrongdoing. But there is a growing menu of computer-intensive applications in criminology that one might call computational, which employ different methods and have different goals. This paper provides an introduction to computer-intensive, tree-based, machine learning as the method of choice, with the goal of forecasting criminal behavior. The approach is "black box," for which no apologies are made. There are now in the criminology literature several such applications that have been favorably evaluated with proper hold-out samples. Peeks into the black box indicate that conventional, causal modeling in criminology is missing significant features of crime etiology.

Keywords: machine learning; forecasting; criminal behavior; classification; random forests; stochastic gradient boosting; Bayesian additive regression trees

    1 Introduction

Computational Criminology is a hybrid of computer science, applied mathematics, and criminology. Procedures from computer science and applied mathematics are used to animate theories about crime and law enforcement [1,2,3]. The primary goal is to learn how underlying mechanisms work; computational criminology is primarily about explanation. Data play a secondary role, either to help tune the simulations or, at least ideally, to evaluate how well the simulations perform [4].

There are other computer-intensive applications in criminology that one might also call computational. Procedures from statistics, computer science, and applied mathematics can be used to develop powerful visualization tools that are as engaging as they are effective [5,6]. These tools have recently been used in a criminology application [7], and with the growing popularity of electronic postings, will eventually become important components of circulated papers and books.

There are also a wide variety of computer-intensive methods used in law enforcement to assemble datasets, provide forensic information, or more broadly to inform administrative activities such as COMPSTAT [8]. Although these methods can be very useful for criminal justice practice, their role in academic criminology has yet to be clearly articulated.

In this paper, yet another form of computational criminology is discussed. Very much in the tradition of exploratory data analysis developed by John Tukey, Frederick Mosteller, and others three decades ago [9], powerful computational tools are being developed to inductively characterize important but elusive structures in a dataset. The computational muscle has grown so rapidly over the past decade that the new applications have the look and feel of dramatic, qualitative advances. Machine learning is probably the poster child for these approaches [10,11,12,13].

There are many new journals specializing in machine learning (e.g., Journal of Machine Learning) and many older journals that are now routinely sprinkled with machine learning papers (e.g., Journal of the American Statistical Association). So far, however, applications in criminology are hard to find. Part of the reason is time; it takes a while for new technology to diffuse. Part of the reason is software; the popular statistical packages can be at least five years behind recent developments. Yet another part of the reason is the need for a dramatic attitude adjustment among criminologists. Empirical research in criminology is thoroughly dominated by a culture of causal modeling in which the intent is to explain in detail the mechanisms by which nature generated the values of a response variable as a particular function of designated predictors and stochastic disturbances.1 Machine learning comes from a different culture characterized by an algorithmic perspective.

The approach is that nature produces data in a black box whose insides are complex, mysterious, and, at least, partly unknowable. What is observed is a set of x's that go in and a subsequent set of y's that come out. The problem is to find an algorithm f(x) such that for future x in a test set, f(x) will be a good predictor of y. [15]

As I discuss at some length elsewhere [16], the most common applications of machine learning in criminology have been to inform decisions about whom to place on probation, the granting of parole, and parole supervision practices. These are basically classification problems that build directly on parole risk assessments dating back to the 1920s. There are related applications informing police decisions in domestic violence incidents, the placement of inmates in different security levels, and the supervision of juveniles already in custody. These can all be seen as successful forecasting exercises, at least in practical terms. Current decisions are informed by projections of subsequent risk. Such criminal justice applications guide the discussion of machine learning undertaken here. We will focus on tree-based, machine learning procedures as an instructive special case.

Four broad points will be made. First, machine learning is computational not just because it is computer-intensive, but because it relies on algorithmic procedures rather than causal models.2 Second, the key activity is data exploration in ways that can be surprisingly thorough. Patterns in the data commonly overlooked by conventional methods can be effectively exploited. Third, the forecasting procedures can be hand-tailored so that the consequences of different kinds of forecasting errors can be properly anticipated. In particular, false positives can be given more or less weight than false negatives. Finally, the forecasting skill can be impressive, at least relative to past efforts.

    2 Conceptual Foundations

It all starts with what some call meta-issues. These represent the conceptual foundation on which any statistical procedure rests. Without a solid conceptual foundation, all that follows will be ad hoc. Moreover, the conceptual foundation provides whatever links there may be between the empirical analyses undertaken and subject-matter theory or policy applications.


2.1 Conventional Regression Models

Conventional causal models are based on a quantitative theory of how the data were generated. Although there can be important differences in detail, the canonical account takes the form of a linear regression model such as

\[ y_i = X_i\beta + \varepsilon_i, \qquad (1) \]

where for each case i, the response y_i is a linear function of fixed predictors X_i (usually including a column of 1s for the intercept), with regression coefficients β, and a disturbance term ε_i ~ NIID(0, σ²). For a given case, nature (1) sets the value of each predictor, (2) combines them in a linear fashion using the regression coefficients as weights, (3) adds the value of the intercept, and (4) adds a random disturbance from a normal distribution with a mean of zero and a given variance. The result is the value of y_i. Nature can repeat these operations a limitless number of times for a given case with the random disturbances drawn independently of one another. The same formulation applies to all cases.
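To make the generative account concrete, the four steps can be sketched in a few lines of code; the coefficient values and sample size here are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

n, sigma = 1000, 2.0
beta = np.array([1.5, -0.8, 0.3])   # intercept and two slopes (illustrative)

# Step (1): nature fixes the predictor values (column of 1s for the intercept).
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])

# Steps (2)-(4): linear combination plus a NIID(0, sigma^2) disturbance.
y = X @ beta + rng.normal(0.0, sigma, size=n)
```

Fitting ordinary least squares to a realization of these data recovers β up to sampling error, which is exactly the setting the conventional model presumes.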

When the response is categorical or a count, there are some differences in how nature generates the data. For example, if the response variable Y is binary,

\[ p_i = \frac{1}{1 + e^{-X_i\beta}}, \qquad (2) \]

where p_i is the probability of some event defined by Y. Suppose that Y is coded 1 if a particular event occurs and 0 otherwise (e.g., a parolee is arrested or not). Nature combines the predictors as before, but now applies a logistic transformation to arrive at a value for p_i. (The cumulative normal is also sometimes used.) That probability leads to the equivalent of a coin flip, with the probability that the coin comes up 1 equal to p_i. The side on which that coin lands determines for case i whether the response is a 1 or a 0. As before, the process can be repeated independently a limitless number of times for each case.
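The binary case can be sketched the same way: nature applies the logistic transformation of Equation 2 and then performs the "coin flip" (again with made-up coefficients):

```python
import numpy as np

rng = np.random.default_rng(1)

n = 5000
beta = np.array([-0.5, 1.2])   # illustrative coefficient values

X = np.column_stack([np.ones(n), rng.normal(size=n)])

# Nature's logistic transformation (Equation 2), then the coin flip.
p = 1.0 / (1.0 + np.exp(-X @ beta))
y = rng.binomial(1, p)         # 1 with probability p_i, else 0
```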

The links to linear regression become clearer when Equation 2 is rewritten as

\[ \ln\!\left(\frac{p_i}{1 - p_i}\right) = X_i\beta, \qquad (3) \]

where p_i is, again, the probability of some binary response whose logit depends linearly on the predictors.3


For either Equation 1, 2, or 3, a causal account can be overlaid by claiming that nature can manipulate the value of any given predictor independently of all other predictors. Conventional statistical inference can also be introduced because the sources of random variation are clearly specified and statistically tractable.

Forecasting would seem to follow naturally. With an estimate of β in hand, new values for X can be inserted to arrive at values for Y that may be used as forecasts. Conventional tests and confidence intervals can then be applied. There are, however, potential conceptual complications. If X is fixed, how does one explain the appearance of new predictor values X* whose outcomes one wants to forecast? For better or worse, such matters are typically ignored in practice.

Powerful critiques of conventional regression have appeared since the 1970s. They are easily summarized: the causal models popular in criminology, and in the social sciences more generally, are laced with far too many untestable assumptions of convenience. The modeling has gotten far out ahead of existing subject-matter knowledge.4 Interested readers should consult the writings of economists such as Leamer, LaLonde, Manski, Imbens, and Angrist, and statisticians such as Rubin, Holland, Breiman, and Freedman. I have written on this too [19].

    2.2 The Machine Learning Model

Machine learning can rest on a rather different model that demands far less of nature and of subject-matter knowledge. For a given case, nature generates data as a random realization from a joint probability distribution for some collection of variables. The variables may be quantitative or categorical. A limitless number of realizations can be independently produced from that joint distribution. The same applies to every case. That's it.

From nature's perspective, there are no predictors or response variables. It follows that there is no such thing as omitted variables or disturbances. Often, however, researchers will use subject-matter considerations to designate one variable as a response Y and other variables as predictors X. It is then sometimes handy to denote the joint probability distribution as Pr(Y, X). One must be clear that the distinction between Y and X has absolutely nothing to do with how the data were generated. It has everything to do with what interests the researcher.

For a quantitative response variable in Pr(Y, X), researchers often want to characterize how the means of the response variable, denoted here by μ, may be related to X. That is, researchers are interested in the conditional distribution μ|X. They may even write down a regression-like expression

\[ y_i = f(X_i) + \varepsilon_i, \qquad (4) \]

where f(X_i) is the unknown relationship in nature's joint probability distribution for which

\[ \varepsilon_i = y_i - f(X_i). \qquad (5) \]

It follows that the mean of ε_i in the joint distribution equals zero.5 Equations 4 and 5 constitute a theory of how the response is related to the predictors in Pr(Y, X). But any relationships between the response and the predictors are merely associations. There is no causal overlay. Equation 4 is not a causal model. Nor is it a representation of how the data were generated; we already have a model for that.

Generalizations to categorical response variables and their conditional distributions can be relatively straightforward. We denote a given outcome class by G_k, with classes k = 1, ..., K (e.g., for K = 3: released on bail, released on recognizance, not released). For nature's joint probability distribution, there can be for any case i interest in the conditional probability of any outcome class: p_ki = f(X_i). There also can be interest in the conditional outcome class itself: g_ki = f(X_i).6

One can get from the conditional probability to the conditional class using the Bayes classifier. The class with the largest probability is the class assigned to a case. For example, if for a given individual under supervision the probability of failing on parole is .35, and the probability of succeeding on parole is .65, the assigned class for that individual is success. It is also possible with some estimation procedures to proceed directly to the outcome class. There is no need to estimate intervening probabilities.
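A minimal sketch of the Bayes classifier in this setting; the probability values below are hypothetical:

```python
import numpy as np

classes = np.array(["fail", "succeed"])

# Estimated conditional probabilities for three hypothetical individuals.
probs = np.array([
    [0.35, 0.65],   # the example from the text
    [0.80, 0.20],
    [0.50, 0.50],   # ties go to the first-listed class
])

# Bayes classifier: assign each case the class with the largest probability.
assigned = classes[np.argmax(probs, axis=1)]
```

For the first individual this reproduces the example in the text: the .65 probability of success wins, so the assigned class is "succeed".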

When the response variable is quantitative, forecasting can be undertaken with the conditional means for the response. If f(X) is known, predictor values are simply inserted. Then μ* = f(X*), where as before, X* represents the predictor values for the cases whose response values are to be forecasted. The same basic rationale applies when the outcome is categorical, either through the predicted probability or directly. That is, p*_k = f(X*) and G*_k = f(X*).

The f(X) is usually unknown. An estimate, f̂(X), then replaces f(X) when forecasts are made. The forecasts become estimates too (e.g., μ* becomes μ̂*). In a machine learning context, there can be difficult complications for which satisfying solutions may not exist. Estimation is considered in more depth shortly.

Just like the conventional regression model, the joint probability distribution model can be wrong too. In particular, the assumption of independent realizations can be problematic for spatial or temporal data, although in principle, adjustments for such difficulties sometimes can be made. A natural question, therefore, is why have any model at all? Why not just treat the data as a population and describe its important features?

Under many circumstances, treating the data as all there is can be a fine approach. But if an important goal of the analysis is to apply the findings beyond the data on hand, the destination for those inferences needs to be clearly defined, and a mathematical road map to the destination provided. A proper model promises both. If there is no model, it is very difficult to generalize any findings in a credible manner.7

A credible model is critical for forecasting applications. Training data used to build a forecasting procedure and subsequent forecasting data for which projections into the future are desired should be realizations of the same data generation process. If they are not, formal justification for any forecasts breaks down, and at an intuitive level, the enterprise seems misguided. Why would one employ a realization from one data generation process to make forecasts about another data generation process?8

In summary, the joint probability distribution model is simple by conventional regression modeling standards. But it nevertheless provides an instructive way of thinking about the data on hand. It is also less restrictive and far more appropriate for an inductive approach to data analysis.

    3 Estimation

Even if one fully accepts the joint probability distribution model, its use in practice depends on estimating some of its key parameters. There are then at least three major complications.

1. In most cases, f(X) is unknown. In conventional linear regression, one assumes that f(X) is linear. The only unknowns are the values of the regression coefficients and the variance of the disturbances. With the joint probability distribution model, any assumed functional form is typically a matter of descriptive convenience [14]. It is not informed by the model. Moreover, the functional form is often arrived at in an inductive manner. As a result, the functional form as well as its parameters usually needs to be estimated.

2. Any analogy to covariance adjustments is also far more demanding. To adjust the fitted values for associations among predictors, one must know the functional forms. But one cannot know those functional forms unless the adjustments are properly in place.

3. X is now a random variable. One key consequence is that estimates of f(X) can depend systematically on which values of the predictors happen to be in the realized data. There is the likely prospect of bias. Because f(X) is allowed to be nonlinear, which parts of the function one can see depends upon which values of the predictors are realized in the data. For example, a key turning point may be systematically missed in some realizations. When f(X) is linear, it will materialize as linear no matter what predictor values are realized.

This is where the computational issues first surface. There are useful responses to all three problems if one has the right algorithms, enough computer memory, and one or more fast CPUs. Large samples are also important.

    3.1 Data Partitions as a Key Idea

We will focus on categorical outcomes because they are far more common than quantitative outcomes in the criminal justice applications emphasized here. Examples of categorical outcomes include whether or not an individual on probation or parole is arrested for a homicide [20], whether there is a repeat incident of domestic violence in a household [21], and different rule infractions for which a prison inmate may have been reported [22].

Consider a 3-dimensional scatter plot of sorts. The response is three color-coded kinds of parole outcomes: an arrest for a violent crime (red), an arrest for a crime that is not violent (yellow), and no arrest at all (green). There are two predictors in this cartoon example: age in years and the number of prior arrests.

The rectangle is a two-dimensional predictor space. In that space, there are concentrations of outcomes by color. For example, there is a concentration of red circles toward the left-hand side of the rectangle, and a concentration of green circles toward the lower right. The clustering of certain colors means that there is structure in the data, and because the predictor space is defined by age and priors, the structure can be given substantive meaning. Younger individuals and individuals with a greater number of priors, for instance, are more likely to be arrested for violent crimes.

[Figure 1: Data Partitions for Three Categorical Outcomes by Age and the Number of Priors. The predictor space has age (younger to older) on one axis and priors (none to many) on the other. Red = violent crime; yellow = nonviolent crime; green = no crime.]

To make use of the structure in the data, a researcher must locate that structure in the predictor space. One way to locate the structure is to partition the predictor space in a manner that tends to isolate important patterns. There will be, for example, regions in which violent offenders are disproportionately found, or regions where nonviolent offenders are disproportionately found.

Suppose the space is partitioned as in Figure 1, where the partitions are defined by the horizontal and vertical lines cutting through the predictor space. Now what? The partitions can be used to assign classes to observations. Applying the Bayes classifier, the partition at the upper right, for example, would be assigned the class of no crime; the vote is 2 to 1. The partition at the upper left would be assigned the class of violent crime; the vote is 4 to 1. The large middle partition would be assigned the class of nonviolent crime; the vote is 7 to 2 to 1. Classes would be assigned to each partition by the same reasoning. The class with the largest estimated probability wins.

The assigned classes can be used for forecasting. Cases with unknown outcomes but predictor values for age and priors can be located in the predictor space. Then, the class of the partition in which the case falls can serve as a forecast. For example, a case falling in the large middle partition would be forecasted to fail through an arrest for a nonviolent crime.

The two predictors function much like longitude and latitude. They locate a case in the predictor space. The partition in which a case falls determines its assigned class. That class can be the forecasted class. But there need be nothing about longitude and latitude beyond their role as map coordinates. One does not need to know that one is a measure of age, and one is a measure of the number of priors. We will see soon, somewhat counter-intuitively, that separating predictors from what they are supposed to measure can improve forecasting accuracy enormously. If the primary goal is to search for structure, how well one searches drives everything else. This is a key feature of the algorithmic approach.

Nevertheless, in some circumstances, the partitions can be used to describe how the predictors and the response are related in subject-matter terms. In this cartoon illustration, younger individuals are much more likely to commit a violent crime, individuals with more priors are much more likely to commit a violent crime, and there looks to be a strong statistical interaction effect between the two. By taking into account the meaning of the predictors that locate the partition lines (e.g., priors more than 2 and age less than 25), the meaning of any associations sometimes can be made more clear. We can learn something about f(X).

How are the partitions determined? The intent is to carve up the space so that overall the partitions are as homogeneous as possible with respect to the outcome. The lower right partition, for instance, has six individuals who were not arrested and one individual arrested for a nonviolent crime. Intuitively, that partition is quite homogeneous. In contrast, the large middle partition has two individuals who were not arrested, seven individuals who were arrested for a crime that was not violent, and one individual arrested for a violent crime. Intuitively, that partition is less homogeneous. One might further intuit that any partition with equal numbers of individuals for each outcome class is the least homogeneous it can be, and that any partition with all cases in a single outcome class is the most homogeneous it can be.

These ideas can be made more rigorous by noting that with greater homogeneity partition by partition, there are fewer classification errors overall. For example, the lower left partition has an assigned class of violent crime. There is, therefore, one classification error in that partition. The large middle category has an assigned class of nonviolent crime, and there are three classification errors in that partition. One can imagine trying to partition the predictor space so that the total number of classification errors is as small as possible. Although for technical reasons this is rarely the criterion used in practice, it provides a good sense of the intent. More details are provided shortly.

At this point, we need computational muscle to get the job done. One option is to try all possible partitions of the data (except the trivial one in which partitions contain a single observation). However, this approach is impractical, especially as the number of predictors grows, even with very powerful computers.

A far more practical and surprisingly effective approach is to employ a "greedy" algorithm. For example, beginning with no partitions, a single partition is constructed that minimizes the sum of the classification errors in the two partitions that result. All possible splits for each predictor are evaluated, and the best split for a single predictor is chosen. The same approach is applied separately to each of the two new partitions. There are now four, and the same approach is applied once again separately to each. This recursive partitioning continues until the number of classification errors cannot be further reduced. The algorithm is called "greedy" because it takes the best result at each step and never looks back; early splits are never revisited.
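The split search for a single predictor can be sketched as follows, using raw classification error as the criterion for simplicity; the helper function and the toy data are hypothetical, not the author's code:

```python
import numpy as np

def best_split(x, y):
    """Exhaustively evaluate splits on one predictor, returning the
    threshold that minimizes total misclassification when each side
    is labeled with its majority (Bayes) class. y holds integer class codes."""
    best = (None, len(y))                     # (threshold, error count)
    for t in np.unique(x)[:-1]:               # every candidate cut point
        left, right = y[x <= t], y[x > t]
        # Errors in a node = cases not in that node's majority class.
        errors = (len(left) - np.bincount(left).max()
                  + len(right) - np.bincount(right).max())
        if errors < best[1]:
            best = (t, errors)
    return best

# Toy data: three young arrestees and three older non-arrestees.
threshold, errors = best_split(
    np.array([18, 19, 20, 40, 41, 42]),   # ages
    np.array([1, 1, 1, 0, 0, 0]),         # 1 = arrested, 0 = not
)
```

In a full tree-growing routine this search would run over every predictor in every current partition, and the winning split overall would be taken; that recursion is what makes the algorithm greedy.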

Actual practice is somewhat more sophisticated. Beginning with the first subsetting of the data, it is common to evaluate a function of the proportion of cases in each outcome class (e.g., .10 for violent crime, .35 for nonviolent crime, and .55 for no crime). Two popular functions of proportions are

\[ \text{Gini Index:} \quad \sum_{k \neq k'} \hat{p}_{mk}\,\hat{p}_{mk'} \qquad (6) \]

and

\[ \text{Cross-Entropy or Deviance:} \quad -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}. \qquad (7) \]

The notation denotes different estimates of proportions p̂_mk over different outcome categories indexed by k, and different partitions of the data indexed by m. The Gini Index and the Cross-Entropy take advantage of the arithmetic fact that when the proportions over classes are more alike, their product is larger (e.g., [.5 × .5] > [.6 × .4]). Intuitively, when the proportions are more alike, there is less homogeneity. For technical reasons we cannot consider here, the Gini Index is probably the preferred measure.

[Figure 2: A Classification Tree for the Three Parole Outcomes and Predictors Age and Prior Record. The root node (full sample) is split on age (less than 25 vs. 25 or older), with further splits on priors (less than 3 vs. 3 or more; less than 5 vs. 5 or more) and age (less than 30 vs. 30 or older); the terminal-node counts by outcome class are 2,1,0; 4,1,0; 1,6,2; 0,1,5; and 0,1,2.]
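Both impurity measures are simple functions of a node's class proportions; a sketch, using the example proportions from the text:

```python
import numpy as np

def gini(p):
    """Gini index of a node's class proportions: sum over k != k' of
    p_k * p_k', which is algebraically equal to 1 - sum(p_k^2)."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def cross_entropy(p):
    """Cross-entropy (deviance): -sum p_k log p_k, with 0 * log 0 = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# The node from the text: .10 violent, .35 nonviolent, .55 no crime.
node = [0.10, 0.35, 0.55]
```

Both functions are zero for a perfectly homogeneous node (all cases in one class) and largest when the proportions are equal, which matches the intuition about homogeneity above.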

Some readers may have already figured out that this form of recursive partitioning is the approach used for classification trees [23]. Indeed, the classification tree shown in Figure 2 is consistent with the partitioning shown in Figure 1.9 The final partitions are color-coded for the class assigned by vote, and the number of cases in each final partition is color-coded for their actual outcome class. In classification tree parlance, the full dataset at the top before any partitioning is called the "root node," and the final partitions at the bottom are called "terminal nodes."

There are a number of ways this relatively simple approach can be extended. For example, the partition boundaries do not have to be linear. There are also procedures called "pruning" that can be used to remove lower nodes having too few cases or that do not sufficiently improve the Gini Index.

Classification trees are rarely used these days as stand-alone procedures. They are well known to be very unstable over realizations of the data, especially if one wants to use the tree structure for explanatory purposes. In addition, implicit in the binary partitions are step functions; a classification tree can be written as a form of regression in which the right-hand side is a linear combination of step functions. However, it will be unusual if step functions provide a good approximation of f(X). Smoother functions are likely to be more appropriate. Nevertheless, machine learning procedures often make use of classification trees as a component of much more effective and computer-intensive algorithms. Refinements that might be used for classification trees themselves are not needed; classification trees are means to an estimation end, not the estimation end itself.

    4 Random Forests

Machine learning methods that build on classification trees have proved very effective in criminal justice classification and forecasting applications. Of those, random forests has been by far the most popular. We consider random forests now, but will provide a brief discussion of two other tree-based methods later. They are both worthy competitors to random forests.

A good place to start is with the basic random forests algorithm, which combines the results from a large ensemble of classification trees [24].

1. From a training dataset with N observations, a random sample of size N is drawn with replacement. A classification tree is grown for the chosen observations. Observations that are not selected are stored as the "out-of-bag" (OOB) data. They serve as test data for that tree and will on average be about a third the size of the original training data.

    2. A small sample of predictors is drawn at random (e.g., 3 predictors).

3. After selecting the best split from among the random subset of predictors, the first partition is determined. There are then two subsets of the data that together maximize the improvement in the Gini Index.

4. Steps 2 and 3 are repeated for all later partitions until the model's fit does not improve or the observations are spread too thinly over terminal nodes.

    5. The Bayes classifier is applied to each terminal node to assign a class.


6. The OOB data are "dropped down" the tree. Each observation is assigned the class associated with the terminal node in which it lands. The result is the predicted class for each observation in the OOB data for that tree.

7. Steps 1 through 6 are repeated a large number of times to produce a large number of classification trees. There are usually several hundred trees or more.

8. For each observation, the final classification is by vote over all trees when that observation is OOB. The class with the most votes is chosen. That class can be used for forecasting when the predictor values are known but the outcome class is not.
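The steps above correspond closely to off-the-shelf implementations. A sketch using scikit-learn's RandomForestClassifier on made-up data; the settings mirror the steps above, but this is not the software used in the applications discussed:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)

# Toy training data standing in for predictors such as age and priors.
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 1] + rng.normal(size=500) > 0).astype(int)

forest = RandomForestClassifier(
    n_estimators=500,     # several hundred trees, as in step 7
    max_features=3,       # predictors sampled at each split, as in step 2
    oob_score=True,       # honest accuracy from the OOB votes, steps 6-8
    random_state=0,
).fit(X, y)

print(forest.oob_score_)  # OOB classification accuracy
```

Because `oob_score_` is computed only from trees for which each observation was out-of-bag, it plays the role of the honest test-sample accuracy described in the text.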

Because random forests is an ensemble of classification trees, many of the benefits from recursive partitioning remain. In particular, nonlinear functions and high-order interaction effects can be found inductively. There is no need to specify them in advance.

But random forests brings its own benefits as well. The sampling of training data for each tree facilitates finding structures that would ordinarily be overlooked. Signals that might be weak in one sample might be strong in another. Each random sample provides a look at the predictor space from a different vantage point.

Sampling predictors at each split results in a wide variety of classification trees. Predictors that might dominate in one tree are excluded at random from others. As a result, predictors that might otherwise be masked can surface. Sampling predictors also means that the number of predictors in the training data can be greater than the number of observations. This is forbidden in conventional regression models. Researchers do not have to be selective in the predictors used. "Kitchen sink" specifications are fine.

The consequences of different forecasting errors are rarely the same, and it follows that their costs can differ too, often dramatically. Random forests accommodates different forecasting-error costs in several ways. Perhaps the best way is to use stratified sampling each time the training data are sampled. Oversampling the less frequent outcomes changes the prior distribution of the response and gives such cases more weight. In effect, one is altering the loss function. Asymmetric loss functions can be built into the algorithm right from the start. An illustration is provided below.
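As a rough sketch of cost-sensitive fitting: scikit-learn exposes asymmetric costs through class weights rather than through the stratified bootstrap sampling described here, but the effect is similar in spirit. The data and the 20-to-1 weighting below are purely illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)

# Imbalanced toy outcome: only a few percent are "failures" (class 1).
X = rng.normal(size=(2000, 4))
y = (X[:, 0] + rng.normal(size=2000) > 2.9).astype(int)

# Weight each failure 20 times as heavily as a non-failure, so that a
# false negative is treated as roughly 20 times as costly as a false
# positive; weaker evidence then suffices to forecast a failure.
forest = RandomForestClassifier(
    n_estimators=300,
    class_weight={0: 1, 1: 20},
    random_state=0,
).fit(X, y)
```

The predictable price of the asymmetry is more false positives, exactly the trade-off discussed in the confusion-table example that follows.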


            Forecasted Not Fail   Forecasted Fail   Accuracy
Not Fail           9972                 987            .91
Fail                 45                 153            .77

Table 1: Confusion Table for Forecasts of Perpetrators and Victims. True negatives are identified with 91% accuracy. True positives are identified with 77% accuracy. (Fail = perpetrator or victim. Not Fail = not a perpetrator or victim.)
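The two accuracy figures in Table 1 follow directly from the cell counts; a quick check:

```python
import numpy as np

# Counts from Table 1: rows are actual outcomes (not fail, fail),
# columns are forecasted outcomes (not fail, fail).
confusion = np.array([
    [9972, 987],
    [  45, 153],
])

# Row-wise accuracy: correct forecasts sit on the diagonal.
accuracy = np.diag(confusion) / confusion.sum(axis=1)
print(accuracy.round(2))   # [0.91 0.77]
```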

Random forests does not overfit [24], even if thousands of trees are grown. The OOB data serve as a test sample to keep the procedure honest. For other popular machine learning procedures, overfitting can be a problem. One important consequence is that random forests can provide consistent estimates of generalization error in nature's joint probability distribution for the particular response and predictors employed.10

    4.1 Confusion Tables

The random forests algorithm can provide several different kinds of output. Most important is the "confusion table." Using the OOB data, actual outcome classes are cross-tabulated against forecasted outcome classes. There is a lot of information in such tables. In this paper, we only hit the highlights.

Illustrative data come from a homicide prevention project for individuals on probation or parole [25]. A "failure" was defined as (1) being arrested for homicide, (2) being arrested for an attempted homicide, (3) being a homicide victim, or (4) being a victim of a non-fatal shooting. Because for this population perpetrators and victims often had the same profiles, no empirical distinction was made between the two. If a homicide was prevented, it did not matter if the intervention was with a prospective perpetrator or prospective victim.

    However, prospective perpetrators or victims had first to be identified. This was done by applying random forests with the usual kinds of predictor variables routinely available (e.g., age, prior record, age at first arrest, history of drug use, and so on). The number of classification trees was set at 500.¹¹

    Table 1 shows a confusion table from that project. The results are broadly representative of recent forecasting performance using random forests [10]. Of those who failed, about 77% were correctly identified by the random forests algorithm. Of those who did not fail, 91% were correctly identified. Because the table is constructed from OOB data, these figures capture true forecasting accuracy.

    But was this good enough for stakeholders? The failure base rate was about 2%. Failure was, thankfully, a rare event. Yet random forests was able to search through a high-dimensional predictor space containing over 10,000 observations and correctly forecast failures about 3 times out of 4 among those who then failed. Stakeholders correctly thought this was impressive.

    How well the procedure would perform in practice is better revealed by the proportion of times that a forecast, once made, is correct. This, in turn, depends on stakeholders' costs of false positives (i.e., individuals incorrectly forecasted to be perpetrators or victims) relative to the costs of false negatives (i.e., individuals incorrectly forecasted to be neither perpetrators nor victims). Because the relative costs associated with failing to correctly identify prospective perpetrators or victims were taken to be very high, a substantial number of false positives were to be tolerated. The cost ratio of false negatives to false positives was set at 20 to 1 a priori and built into the algorithm through the disproportional stratified sampling described earlier. This meant that relatively weak evidence of failure would be sufficient to forecast a failure. The price was necessarily an increase in the number of individuals incorrectly forecasted to be failures.

    The results reflect this policy choice; there are in the confusion table about 6.5 false positives for every true positive (i.e., 987/153). As a result, when a failure is forecasted, it is correct only about 15% of the time. When a success is forecasted, it is correct 99.6% of the time. This too results from the tolerance for false positives: when a success is forecasted, the evidence is very strong. Stakeholders were satisfied with these figures, and the procedures were adopted.
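Continuing with the Table 1 cell counts, the quantities discussed in this paragraph can be computed directly (variable names are illustrative):

```python
tn, fp, fn, tp = 9972, 987, 45, 153

fp_per_tp = fp / tp         # false positives per true positive, about 6.5
when_fail = tp / (tp + fp)  # fraction correct when a failure is forecasted
when_ok = tn / (tn + fn)    # fraction correct when a success is forecasted
print(round(fp_per_tp, 1), round(when_fail, 2), round(when_ok, 3))
```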

    4.2 Variable Importance For Forecasting

    Although forecasting is the main goal, information on the predictive importance of each predictor can also be of interest. Figure 3 is an example from another application [16]. The policy question was whether to release an individual on parole. Each inmate's projected threat to public safety had to be a consideration in the release decision.

    The response variable defined three outcome categories measured over 2 years on parole: being arrested for a violent crime (Level 2), being arrested for a crime but not a violent one (Level 1), and not being arrested at all (Level 0). The goal was to assign one such outcome class to each inmate when a parole was being considered. The set of predictors included nothing unusual except that several were derived from behavior while in prison: Charge Record Count, Recent Report Count, and Recent Category 1 Count refer to misconduct in prison. Category 1 incidents were considered serious.

    Figure 3 shows the predictive importance of each predictor. The baseline is the proportion of times the true outcome is correctly identified (as shown in the rows of a confusion table). Importance is measured by the drop in accuracy when each predictor in turn is precluded from contributing. This is accomplished by randomly shuffling one predictor at a time when forecasts are being made. The set of trees constituting the random forest is not changed. All that changes is the information each predictor brings when a forecast is made.¹²

    Because there are three outcome classes, there are three such figures. Figure 3 shows the results when an arrest for a violent crime is forecasted. Forecasting importance for each predictor is shown. For example, when the number of prison misconduct charges is shuffled, accuracy declines approximately 4 percentage points (e.g., from 60% accurate to 56% accurate).

    This may seem small for the most important predictor, but because of associations between predictors, there is substantial forecasting power that cannot be cleanly attributed to single predictors. Recall that the use of classification trees in random forests means that a large number of interaction terms can be introduced. These are product variables that can be highly correlated with their constituent predictors. In short, the goal of maximizing forecasting accuracy can compromise subject-matter explanation.
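The shuffle-one-predictor-at-a-time idea just described is implemented in scikit-learn as permutation importance. A minimal sketch on simulated data (all names illustrative); note that the forest itself is never regrown, only the predictor columns are shuffled:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle one predictor at a time; the drop in accuracy is its importance.
imp = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
ranked = imp.importances_mean.argsort()[::-1]  # most important first
```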

    Still, many of the usual predictors surface, with perhaps a few surprises. For example, age and gender matter just as one would expect. But behavior in prison is at least as important. Parole risk instruments have in the past largely neglected such measures, perhaps because they may be only indicators, not real causes. Yet for forecasting purposes, behavior in prison looks to be more important by itself than prior record. And the widely used LSIR adds nothing to forecasting accuracy beyond what the other predictors bring.


    [Figure 3 here: a dot chart titled "Forecasting Importance of Each Predictor for Violent Crime (Level 2)." The horizontal axis is the increase in forecasting error (roughly 0.01 to 0.04). The predictors plotted are: LSIR Score, Number of Prior Convictions, IQ Score, Number of Prior Arrests, Age at First Arrest, Prison Programming Compliance, Nominal Sentence Length, Recent Category 1 Count, Prison Work Performance, Violence Indicator, High Crime Neighborhood, Male, Recent Reports Count, Age, and Charge Record Count.]

    Figure 3: Predictor importance measured by proportional reductions in forecasting accuracy for violent crimes committed within 2 years of release on parole.


    4.3 Partial Response Plots

    The partial response plots that one can get from random forests and other machine learning procedures can also be descriptively helpful. The plots show how a given predictor is related to the response with all other predictors held constant. An outline of the algorithm is as follows.

    1. A predictor and a response class are chosen. Suppose the predictor is IQ and the response class is an arrest for a violent crime.

    2. For each case, the value of IQ is set to one of the IQ values in the dataset. All other predictors are fixed at their existing values.

    3. The fitted values are computed for each case, and their mean is calculated.

    4. Steps 2 and 3 are repeated for all other IQ values.

    5. The means are plotted against IQ.

    6. Steps 2 through 5 are repeated for all other response variable classes.

    7. Steps 1 through 6 are repeated for all other predictors.
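The steps above can be sketched directly in Python. This is a hand-rolled version of the calculation on simulated data (all names illustrative); predictor 0 plays the role of IQ:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def partial_response(model, X, j):
    """Steps 2-5: fix predictor j at each observed value for all cases,
    average the fitted probabilities, and return the (value, mean) pairs."""
    grid = np.unique(X[:, j])
    means = []
    for v in grid:
        Xv = X.copy()
        Xv[:, j] = v                       # all other predictors unchanged
        means.append(model.predict_proba(Xv)[:, 1].mean())
    return grid, np.array(means)

grid, means = partial_response(rf, X, 0)   # predictor 0 standing in for IQ
```

Plotting means against grid yields the kind of display shown in Figure 4 (here on the probability scale rather than in centered logits).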

    Figure 4 shows how IQ is related to the commission of a violent crime while on parole, all other predictors held constant. The vertical axis is in centered logits. Logits are used just as in logistic regression. The centering is employed so that when the outcome has more than two classes, no single class need be designated as the baseline.¹³ A larger value indicates a greater probability of failure.

    IQ as measured in prison has a nonlinear relationship with the log odds of being arrested for a violent crime. There is a strong, negative relationship for IQ scores from about 50 to 100. For higher IQ scores, there is no apparent association. Some might have expected a negative association in general, but there seems to be no research anticipating a nonlinear relationship of the kind shown in Figure 4.


    [Figure 4 here: a partial response plot titled "Response to IQ." The horizontal axis is IQ (roughly 60 to 160); the vertical axis is the centered logit for the violent-crime outcome class (roughly −0.30 to 0.00).]

    Figure 4: How inmate IQ is related to whether a violent crime is committed while on parole. The relationship is negative for below-average IQ scores and flat thereafter.

    5 Other Tree-Based Algorithms

    For a variety of reasons, random forests is a particularly effective machine learning procedure for criminal justice forecasting [16]. But there are at least two other tree-based methods that can also perform well: stochastic gradient boosting [26, 27] and Bayesian additive regression trees [28]. Both are computer intensive and algorithmic in conception.

    5.1 Stochastic Gradient Boosting

    The core idea in stochastic gradient boosting is that one applies a weak learner over and over to the data. After each pass, the data are reweighted so that observations that are more difficult to classify accurately are given more weight. The fitted values from each pass through the data are used to update earlier fitted values, so that the weak learner is boosted to become a strong learner. Here is an outline of the algorithm.

    Imagine a training dataset in which the response is binary. Suppose fail is coded as 1 and succeed is coded as 0.


    1. The algorithm is initialized with fitted values for the response. The overall proportion of cases that fail is a popular choice.

    2. A simple random sample of the training data is drawn, with a sample size about half that of the training data.¹⁴

    3. The negative gradient, also called the pseudo residuals, is computed. Just as with conventional residuals, each fitted value is subtracted from its corresponding observed value of 1 or 0. The residuals will be quantitative, not categorical: (1 − p) or −p, where p is the overall proportion coded as 1.

    4. Using the randomly selected observations, a regression tree is fit to the pseudo residuals.¹⁵

    5. The conditional mean in each terminal node serves as an estimate of the probability of failure.

    6. The fitted values are updated by adding to the existing fitted values the new fitted values, weighted to get the best fit.

    7. Steps 2 through 6 are repeated until the fitted values no longer improve a meaningful amount. The number of passes can in practice be quite large (e.g., 10,000), but unlike random forests, stochastic gradient boosting can overfit [27]. Some care is needed because there is formally no convergence.

    8. The fitted probability estimates can be used as is, or transformed into assigned classes with the Bayes classifier.
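A sketch of the algorithm using scikit-learn's GradientBoostingClassifier on simulated data (all parameter values illustrative): subsample=0.5 supplies the half-sample of step 2, and n_iter_no_change stops the passes once the fit no longer improves meaningfully, as in step 7.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2000, random_state=0)

gbm = GradientBoostingClassifier(n_estimators=1000,   # upper bound on passes
                                 learning_rate=0.05,
                                 subsample=0.5,        # stochastic half-sample
                                 validation_fraction=0.2,
                                 n_iter_no_change=10,  # early stopping
                                 random_state=0)
gbm.fit(X, y)
print(gbm.n_estimators_)  # passes actually made before stopping
```

Capping the passes and monitoring a validation set is one practical guard against the overfitting noted in step 7.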

    Stochastic gradient boosting handles a wide range of response variable types, in the spirit of the generalized linear model and more. Its forecasting performance is comparable to that of random forests. A major current liability is the requirement of symmetric loss functions for categorical response variables.

    5.2 Bayesian Additive Regression Trees

    Bayesian additive regression trees [28] is a procedure that capitalizes on an ensemble of classification (or regression) trees in a clever manner. Random forests generates an ensemble of trees by treating the tree parameters as fixed but the data as random: data and predictors are sampled. Stochastic gradient boosting proceeds in an analogous fashion. Bayesian additive regression trees turns this upside down. Consistent with Bayesian methods more generally, the data are treated as fixed once they are realized, and the tree parameters are treated as random: the parameters are sampled. Uncertainty comes from the parameters, not from the data. Another difference is that one needs a model well beyond the joint probability distribution model. That model is essentially a form of linear regression: the outcome is regressed in a special way on a linear additive function of the fitted values from each tree, combined with an additive disturbance term [29].

    The parameter sampling takes four forms:

    1. Whether or not to consider any partition of a node is determined by chance, in a fashion that discourages larger trees;

    2. if there is to be a split, the particular partitioning is determined by chance;

    3. the proportion for each terminal node is selected at random from a distribution of proportions; and

    4. the overall probability for each class is selected at random from a distribution of proportions.

    The result can be a large ensemble of classification trees (e.g., 300) that one might call a Bayesian forest.¹⁶ The forest is intended to be a representative sample of classification trees, constrained so that trees more consistent with prior information are more heavily represented.

    The growing of Bayesian trees is embedded in the algorithm by which the fitted values from the trees are additively combined. The algorithm, a form of backfitting [30], starts out with each tree in its simplest possible form. The algorithm cycles through each tree in turn, making it more complicated as needed to improve the fit while holding all other trees fixed at their current structure. Each tree may be revisited and revised many times. The process continues until there is no meaningful improvement in the fitted values. One can think of the result as a form of nonparametric regression in which a linear combination of fitted values is constructed, one set from each tree. In that sense, it is in the spirit of boosting.¹⁷


    If one takes the model and the Bayesian apparatus seriously, the approach is no longer algorithmic. If one treats the model as a procedure, an algorithmic perspective is maintained. The perspective one takes can make a difference in practice. For example, if the model parameters are treated as tuning parameters, they are of little substantive interest and can be directly manipulated to improve performance. They are a means to an end, not an end in themselves.

    Forecasting performance for Bayesian trees seems to be comparable to that of random forests and stochastic gradient boosting. However, a significant weakness is that currently, categorical outcomes are limited to two classes. There is work in progress to handle the multinomial case. Another weakness is an inability to incorporate asymmetric loss, but here too there may soon be solutions.

    6 Statistical Inference for Tree-Based Machine Learning

    Even when the training data are treated as a random realization from nature's joint probability distribution, conventional statistical tests and confidence intervals are problematic for random forests and stochastic gradient boosting. Bayesian additive regression trees raises different issues, to be briefly addressed shortly.

    Consider statistical inference for forecasts, which figure so centrally in our discussion. In conventional practice, forecasting confidence intervals can be very useful. There is a model representing how outcome probabilities and/or classes are generated. That model specifies the correct functional form and disturbance distribution, and typically treats the predictors as fixed. A 95% confidence interval will cover each true probability 95% of the time over random realizations of the response variable. There can be similar reasoning for the outcome classes themselves.

    The conventional formulation does not apply under the joint probability distribution model. There can be a true probability for every outcome class that one would like to estimate. There can be a true class, also a potential estimation target. But there are no credible claims that estimates of either have their usual convenient properties. In particular, they are not assumed to be unbiased or even consistent.


    For reasons given at the beginning of Section 3, f̂(X) is taken to be some approximation of f(X) that can contain both systematic and random error. Biased estimates are essentially guaranteed. When there is bias, confidence intervals do not have their stated coverage, and test statistics computed under the null hypothesis do not have their stated probabilities.

    In a forecasting setting, there is nevertheless the prospect of appropriate 95% error bands. One takes the machine learning results as fixed, as they would be in a forecasting exercise, and considers bands around the fitted values that would contain 95% of the forecasts. Work is under way on how to construct such bands, and there is no doubt useful information in the residuals or in a bootstrap using those residuals [31]. There are also useful, though less complete, approaches that can be applied to random forests in particular. The votes over trees provide some purchase on the uncertainty associated with a forecasted class [16].
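For random forests, the votes across trees are straightforward to extract. A sketch with scikit-learn on simulated data (names illustrative): a lopsided vote suggests a confident forecast, a near-even split an uncertain one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# Class votes of the individual trees for one new case.
x_new = X[:1]
votes = np.array([tree.predict(x_new)[0] for tree in rf.estimators_])
vote_share = votes.mean()   # fraction of the 500 trees voting class 1
```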

    Bayesian additive regression trees can generate a predictive distribution of the fitted probabilities for either outcome class. These probabilities are treated much like another set of parameters whose values are unknown but can be characterized by a particular distribution. Bayesian forecasting intervals can then be constructed [32]. One can determine, for instance, the range in which the middle 95% of the forecasted probabilities fall. And by placing a threshold through these probabilities at an appropriate location (e.g., .50), probabilities can be transformed into classes. However, one must not forget that the uncertainty being represented comes from uncertainty in the parameters that influence how the trees are grown. These depend on priors that can seem to some like fictions of convenience. In addition, some may not favor the Bayesian approach to begin with. Nevertheless, an algorithmic interpretation can still be appropriate, and then forecasts and forecast uncertainty can be addressed in much the same fashion as for other kinds of tree-based machine learning.

    7 Conclusions

    Random forests, stochastic gradient boosting, and Bayesian additive trees are very different in conception and implementation. Yet in practice, they all can forecast well, and typically much better than conventional regression models. Is there something important these tree-based methods share beyond the use of a large number of classification trees?


    The use of tree ensembles can be viewed more abstractly as a way to effectively search a large predictor space for structure. With a large number of trees, the predictor space is sliced up in many different ways. Some sets of partitions will have stronger associations with the response variable than others and, in the end, will have more weight in the forecasts that result. From this point of view, the subject-matter meaning of the predictors is a secondary concern. The predictors serve as little more than very high-dimensional coordinates for the predictor space.

    Ensembles of classification trees are effective search engines for that space because of the following features that tree-based methods can share.

    1. Using nature's joint probability distribution, rather than a causal regression model, as an account of how the data were generated removes a range of complications that are irrelevant for forecasting and otherwise put unnecessary constraints on the predictor-space search.

    2. The use of step functions as each tree is grown can produce a very large number of new predictors. A single predictor such as age might ultimately be represented by many indicator variables for different break points and many indicators for interaction effects. A search using, for example, 20 identified predictors such as gender and prior record may be implemented with several hundred indicator variables. As a result, information in the initial 20 predictors can be more effectively exploited.

    3. The use of indicator variables means that the search can arrive inductively at highly nonlinear relationships and very high order interactions, neither of which has to be specified in advance. Moreover, because any of the original predictors or sets of predictors can define splits differently over different trees, nonlinear relationships that are not step functions can be smoothed out as needed when the trees are aggregated, to more accurately represent any associations.

    4. When some form of random sampling is part of the algorithm, whether sampling of the data or sampling of the parameters, the content of the predictor space or the predictor space itself will vary [33]. Structure that might be masked for one tree might not be masked for another.

    25

    5. Aggregating fitted values over trees can add stability to forecasts. In effect, noise tends to cancel out.

    6. Each of these assets is most evident in forecasting procedures built from large datasets. The high-dimensional predictor space needs lots of observations to be properly explored, especially because much of the search is for associations that one by one can be small. Their importance for forecasting materializes when the many small associations are allowed to contribute as a group. Consequently, training-data sample sizes in the hundreds create no formal problems, but the power of machine learning may not be fully exploited. Ideally, sample sizes should be at least in the tens of thousands. Sample sizes of 100,000 or more are still better. It is sometimes surprising how much more accurate forecasts can be when the forecasting procedure is developed with massive datasets.

    In summary, when subject-matter theory is well developed and the training data contain the key predictors, conventional regression methods can forecast well. When existing theory is weak or available data are incomplete, conventional regression will likely perform poorly, but tree-based forecasting methods can shine.¹⁸

    There is growing evidence of another benefit from tree-based forecasting methods. Overall forecasting accuracy is usually substantially more than the sum of the accuracies that can be attributed to particular predictors. Tree-based methods are finding structure beyond what the usual predictors can explain.

    Part of the reason is the black-box manner in which a very large number of new predictors are generated as part of the search process. A new linear basis can be defined for each predictor and various sets of predictors. Another part of the reason is that tree-based methods capitalize on regions in which associations with the response are weak. One by one, such regions do not matter much, and conventional regression approaches bent on explanation might properly choose to ignore them. But when a large number of such regions is taken seriously, forecasting accuracy can dramatically improve. There is important predictive information in the collection of regions, not in each region by itself.

    An important implication is that there is structure for a wide variety of criminal justice outcomes that current social science does not see. In conventional regression, these factors are swept into the disturbance term that, in turn, is assumed to be noise. Some will argue that this is a necessary simplification for causal modeling and explanation, but it is at least wasteful for forecasting and means that researchers are neglecting large chunks of the criminal justice phenomena. There are things to be learned from the dark structure that tree-based machine learning shows to be real, but whose nature is unknown.

    Notes

    1. A recent critique of this approach and a discussion of more promising, model-based alternatives can be found in [14].

    2. Usual criminology practice begins with a statistical model of some criminal justice process assumed to have generated the data. The statistical model has parameters whose values need to be estimated. Estimates are produced by conventional numerical methods. At the other extreme are algorithmic approaches found in machine learning and emphasized in this paper. But there can be hybrids. One may have a statistical model that motivates a computer-intensive search of a dataset, but there need be no direct connection between the parameters of the model and the algorithm used in that search. Porter and Brown [17] use this approach to detect simulated terrorist hot spots and actual concentrations of breaking and entering in Richmond, Virginia.

    3. Normal regression, Poisson regression, and logistic regression are all special cases of the generalized linear model. There are other special cases and close cousins such as multinomial logistic regression. And there are relatively straightforward extensions to multiple-equation models, including hierarchical models. But in broad brush strokes, the models are motivated in a similar fashion.

    4. A very instructive illustration is research claiming to show that the death penalty deters crime. A recent National Research Council report on that research [18] is devastating.

    5. Recall that in a sample, the sum of the deviation scores around a mean or proportion (coded 1/0) is zero.

    6. The notation f(X) is meant to represent some function of the predictors that will vary depending on the context.

    7. Sometimes, the training data used to build the model and the forecasting data for which projections are sought are probability samples from the same finite population. There is still a model of how the data were generated, but now that model can be demonstrably correct. The data were generated by a particular (known) form of probability sampling.

    8. One might argue that the two are sufficiently alike. But then one is saying that the two data generation processes are similar enough to be treated as the same.

    9. In the interest of space, we do not consider the order in which the partitions shown in Figure 1 were constructed. One particular order would lead precisely to the classification tree in Figure 2.

    10. Roughly speaking, Breiman's generalization error is the probability that a case will be classified incorrectly in a limitless number of independent realizations of the data. Breiman provides an accessible formal treatment [24] in his classic paper on random forests. One must be clear that this is not generalization error for the "right" response and predictors. There is no such thing. The generalization error is for the particular response and predictors analyzed.

    11. Although the number of trees is a tuning parameter, the precise number of trees does not usually matter as long as there are at least several hundred. Because random forests does not overfit, having more trees than necessary is not a serious problem. 500 trees typically is an appropriate number.

    12. In more conventional language, the model is fixed. It is not reconstructed as each predictor in turn is excluded from the forecasting exercise. This is very different from dropping each predictor in turn and regrowing the forest each time: then both the forest and the effective set of predictors would change, and the two would be confounded. The goal here is to characterize the importance of each predictor for a given random forest.

    13. For any fitted value, the vertical axis units can be expressed as f_k(m) = log p_k(m) − (1/K) Σ_{l=1}^{K} log p_l(m). The function is the difference in log units between the proportion for outcome class k computed at the value m of a given predictor and the average proportion at that value over the K classes for that predictor.

    14. This serves much the same purpose as the sampling with replacement used in random forests. A smaller sample is adequate because when sampling without replacement, no case is selected more than once; there are no duplicates.

    15. The procedure is much the same as for classification trees, but the fitting criterion is the error sum of squares or a closely related measure of fit.

    16. As before, the number of trees is a tuning parameter, but several hundred seems to be a good number.

    17. The procedure is very computationally intensive because each time fitted values are required in the backfitting process, a posterior distribution must be approximated. This leads to the repeated use of an MCMC algorithm that is by itself computer intensive.

    18. Another very good machine learning candidate is support vector machines. There is no ensemble of trees; other means are employed to explore the predictor space effectively. Hastie and his colleagues [12] provide an excellent overview.

    References

    1. Groff E, Mazerolle L (eds.): Special issue: simulated experiments in criminology and criminal justice. Journal of Experimental Criminology 2008, 4(3).

    2. Liu L, Eck J (eds.): Artificial crime analysis: using computer simulations and geographical information systems. IGI Global; 2008.

    3. Brantingham PL: Computational criminology. European Intelligence and Security Informatics Conference 2011.

    4. Berk RA: How you can tell if the simulations in computational criminology are any good. Journal of Experimental Criminology 2008, 3: 289–308.

    5. Cook D, Swayne DF: Interactive and dynamic graphics for data analysis. Springer; 2007.

    6. Wickham H: ggplot: elegant graphics for data analysis. Springer; 2009.

    7. Berk RA, MacDonald J: The dynamics of crime regimes. Criminology 2009, 47(3): 971–1008.

    8. Brown DE, Hagan S: Data association methods with applications to law enforcement. Decision Support Systems 2002, 34: 369–378.

    9. Hoaglin DC, Mosteller F, Tukey J (eds.): Exploring data tables, trends, and shapes. John Wiley; 1985.

    10. Berk RA: Statistical learning from a regression perspective. Springer; 2008.

    11. Bishop C: Pattern recognition and machine learning. Springer; 2006.

    12. Hastie T, Tibshirani R, Friedman J: The elements of statistical learning. Second edition. Springer; 2009.

    13. Marsland S: Machine learning: an algorithmic perspective. CRC Press; 2009.

    14. Berk RA, Brown L, George E, Pitkin E, Traskin M, Zhang K, Zhao L: What you can learn from wrong causal models. In Handbook of causal analysis for social research, S. Morgan (ed.). Springer; 2012.

    15. Breiman L: Statistical modeling: two cultures (with discussion). Statistical Science 2001, 16: 199–231.

    16. Berk RA: Criminal justice forecasts of risk: a machine learning approach. Springer; 2012.

    17. Porter MD, Brown DE: Detecting local regions of change in high-dimensional criminal or terrorist point processes. Computational Statistics & Data Analysis 2007, 51: 2753–2768.

    18. Nagin DS, Pepper JV: Deterrence and the death penalty. Committee on Law and Justice; Division of Behavioral and Social Sciences and Education; National Research Council; 2012.

    19. Berk RA: Regression analysis: a constructive critique. Sage Publications; 2003.

    20. Berk RA, Sherman L, Barnes G, Kurtz E, Ahlman L: Forecasting murder within a population of probationers and parolees: a high stakes application of statistical learning. Journal of the Royal Statistical Society Series A 2009, 172 (Part 1): 191–211.

    21. Berk RA, Sorenson SB, He Y: Developing a practical forecasting screener for domestic violence incidents. Evaluation Review 2005, 29(4): 358–382.

    22. Berk RA, Kriegler B, Baek J-H: Forecasting dangerous inmate misconduct: an application of ensemble statistical procedures. Journal of Quantitative Criminology 2006, 22(2): 135–145.

    23. Breiman L, Friedman JH, Olshen RA, Stone CJ: Classification and regression trees. Wadsworth Press; 1984.

    24. Breiman L: Random forests. Machine Learning 2001, 45: 5–32.

    25. Berk RA: The role of race in forecasts of violent crime. Race and Social Problems 2009, 1: 231–242.

    26. Friedman JH: Stochastic gradient boosting. Computational Statistics and Data Analysis 2002, 38: 367–378.

    27. Mease D, Wyner AJ, Buja A: Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research 2007, 8: 409–439.

    28. Chipman HA, George EI, McCulloch RE: BART: Bayesian additive regression trees. Annals of Applied Statistics 2010, 4(1): 266–298.

    29. Chipman HA, George EI, McCulloch RE: Bayesian ensemble learning. Advances in Neural Information Processing Systems 19, 2007: 265–272.

    30. Hastie TJ, Tibshirani RJ: Generalized additive models. Chapman & Hall; 1990.

    31. Efron BJ, Tibshirani RJ: An introduction to the bootstrap. Chapman & Hall; 1993.

    32. Geweke J, Whiteman C: Bayesian forecasting. In Handbook of Economic Forecasting, Volume 1, G Elliott, CWJ Granger, A Timmermann (eds.). Elsevier; 2006.

    33. Ho TK: The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 1998, 20(8): 832–844.
