Who Nose What Tomorrow Brings?
David J. Weiss
Some Predictions from Swami Weiss
There will be a severe hurricane threat to Florida during the second week of October 2014
The LA Lakers will fail to make the playoffs in the 2013-2014 season
The married folks in the audience will buy an expensive gift for their spouse during December 2014
During 2015, Michael Birnbaum will publish an excellent paper presenting results that cannot be accounted for by SEU or prospect theory
The Expert Forecaster
Weatherperson, sports bookmaker, investment advisor, intelligence analyst, safety engineer, personnel analyst, admissions director, marriage counselor, parole board member, custody judge
The first two professions make predictions about short-term, unitary events.
The others make predictions about events that will play out over a relatively long term. They also usually recommend actions based on their probability estimates.
Everyone Forecasts
Amateurs also make predictions. Are the professionals really expert? How can we tell?
Three General Approaches
Credentials
Experience
Performance-based assessment: scoring outcomes (prediction accuracy)
This talk examines some of the challenges in scoring outcomes (Hammond’s “Correspondence”)
Technical Matters
Specificity of the prediction
Determining whether the predicted event did in fact occur
Duration of the observation period. For a prediction that unfolds over time, the ultimate result can change
These ambiguities can usually be resolved (Tetlock), but in practice are often overlooked
Announcing the prediction can affect the outcome
Scoring Index
Percent correct (batting average) is the easy solution, but scoring over the person’s predictions assumes the events are comparable
(As Moneyball highlighted, baseball’s batting average has a similar shortcoming)
Some of the Swami’s predictions may be more likely to come true than others
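The base-rate problem above can be sketched in a few lines. This is a toy illustration with invented events and numbers, not an analysis of any real forecaster:

```python
# Hypothetical sketch: percent correct ("batting average") rewards easy events.
# Predictions and outcomes are booleans; all data are invented for illustration.
def percent_correct(predictions, outcomes):
    """Fraction of predictions that matched the outcome."""
    hits = sum(p == o for p, o in zip(predictions, outcomes))
    return hits / len(predictions)

# Easy events: "rain in Fullerton tomorrow?" -- almost always no.
easy_preds = [False] * 10            # always predict "no rain"
easy_outcomes = [False] * 9 + [True]

# Hard events: coin-flip playoff games.
hard_preds = [True, False, True, False]
hard_outcomes = [False, False, True, True]

print(percent_correct(easy_preds, easy_outcomes))   # 0.9, from the base rate alone
print(percent_correct(hard_preds, hard_outcomes))   # 0.5, no better than chance
```

The "forecaster" on the easy events shows no skill at all, yet posts the higher batting average, which is exactly why averaging over non-comparable events misleads.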
The Playing Field is Not Level
Base-rate differences: weather is easy to predict in Fullerton
The road not taken: almost every applicant accepted by Princeton graduates; not so at Cal State
Would Princeton’s rejects have graduated from Cal State?
Calibration Saves the Day?
When an expert repeatedly makes predictions of similar events, one can evaluate accuracy at a finer level.
Calibration imposes a different standard from batting average
“Well-calibrated” sounds like “expert”
Is calibration a clever way to discount errors?
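A minimal sketch of what such a calibration check involves: group repeated probabilistic forecasts by the stated probability and compare each group to its observed relative frequency. All forecasts and outcomes here are invented:

```python
# Minimal calibration check for repeated probabilistic forecasts of similar
# events. Data are invented for illustration.
from collections import defaultdict

def calibration_table(forecasts, outcomes):
    """Map each stated probability to the observed frequency of the event."""
    groups = defaultdict(list)
    for p, occurred in zip(forecasts, outcomes):
        groups[p].append(occurred)
    return {p: sum(v) / len(v) for p, v in sorted(groups.items())}

forecasts = [0.2, 0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8, 0.8]
outcomes  = [0,   0,   1,   0,   0,   1,   1,   1,   1,   0  ]

print(calibration_table(forecasts, outcomes))
# {0.2: 0.2, 0.8: 0.8} -- well calibrated on these invented data
```

Note what the index does: the forecaster who says "20%" and is wrong four times in five counts as perfectly calibrated, which is the sense in which calibration can look like a clever way to discount errors.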
Remembering Our Founder
Experts and the public resist forecasts expressed in probabilistic terms
They may be right to do so, because calibration makes sense only when both kinds of errors are equally costly. As Ward kept telling us, utilities are the proper basis for decisions. This limitation applies similarly to sophisticated “skill scores” such as those of Murphy (1988) and Stewart and Lusk (1994).
Predictions also vary in importance
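Edwards’s point that utilities, not probabilities alone, are the proper basis for decisions can be made concrete with a simple expected-cost rule. The cost figures are invented for illustration:

```python
# Sketch of a utility-based decision rule: with asymmetric costs, the rational
# threshold for acting on a forecast is not 0.5. Cost figures are invented.
def should_warn(p_event, cost_miss, cost_false_alarm):
    """Warn when the expected cost of silence exceeds the expected cost of warning."""
    return p_event * cost_miss > (1 - p_event) * cost_false_alarm

# A hurricane miss is far costlier than an unnecessary evacuation,
# so even a 15% forecast justifies a warning.
print(should_warn(0.15, cost_miss=100.0, cost_false_alarm=5.0))  # True
print(should_warn(0.15, cost_miss=1.0, cost_false_alarm=1.0))    # False
```

Two forecasters with identical calibration can thus deserve very different evaluations once the costs of the two kinds of error differ.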
The “Bold Prediction”
If calibration overweights the mundane, then should we judge forecasters by how well they predict the spectacular? We might care more about predicting hurricanes than partly cloudy days
Be sure to address false alarms (crying “wolf” too often leads to being ignored)
Because rare events are, well, rare, they provide little data
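Separating hits from false alarms, in the signal-detection spirit, shows both halves of the "crying wolf" problem and how little data rare events supply. The records below are invented:

```python
# Sketch: hit rate vs. false-alarm rate for a rare event. Data are invented;
# note how few actual events there are to estimate the hit rate from.
def rates(warned, occurred):
    """Return (hit rate, false-alarm rate) from paired boolean records."""
    hits   = sum(w and o for w, o in zip(warned, occurred))
    misses = sum((not w) and o for w, o in zip(warned, occurred))
    fas    = sum(w and (not o) for w, o in zip(warned, occurred))
    crs    = sum((not w) and (not o) for w, o in zip(warned, occurred))
    return hits / (hits + misses), fas / (fas + crs)

# 100 days, only 2 real events: the hit rate rests on just 2 observations.
occurred = [True, True] + [False] * 98
warned   = [True, False] + [True] * 8 + [False] * 90

hit_rate, fa_rate = rates(warned, occurred)
print(hit_rate, round(fa_rate, 3))  # 0.5 0.082
```

The false-alarm rate is estimated from 98 quiet days, the hit rate from only 2 events, which is why scoring bold predictions is statistically starved.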
Is Forecasting Even Possible?
Taleb’s turkey highlights the danger in using past results to predict future outcomes
But what else is there to guide us but the past?
Perspective matters: when the event to be predicted is under the control of a human (such as killing a turkey or planting a bomb), someone with knowledge about that human might be able to predict it without historical information.
An Expert Turkey
Learns about the farmer’s (or other farmers’) plans
Observes that there are no old turkeys around, and draws an inference
These methods do not use observations of the focal event (previous turkey beheadings) to predict the future.
Two Kinds of Environment (Taleb)
Mediocristan, where single observations do little to change the aggregate. Processes are stationary, and statistical models are descriptive.
Regression-based prediction works in Mediocristan. Experts can potentially use better prediction models. Better can mean either a more accurate model or better parameter estimates. One can learn from results.
The More Challenging Environment
Extremistan, where the total can be significantly impacted by a single observation
The world of the Black Swan, where the past is not a good guide to the future
Even successful prediction may represent a case of being fooled by randomness (no re-test is available)
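The contrast between the two environments can be shown numerically. The quantities and figures below are invented toy data:

```python
# Toy contrast between Taleb's two environments (all numbers invented):
# in Mediocristan one observation barely moves the aggregate; in
# Extremistan a single draw can dominate it.
heights_cm = [170, 165, 180, 175, 172]                     # Mediocristan-like
wealth = [50_000, 60_000, 45_000, 55_000, 8_000_000_000]   # Extremistan-like

def mean(xs):
    return sum(xs) / len(xs)

print(mean(heights_cm))  # 172.4 -- close to every single observation
print(mean(wealth))      # 1600042000.0 -- one observation swamps the rest
```

No sample of heights will ever be upended by the next person measured, but one billionaire renders the average wealth of the other four observations irrelevant; that is the environment in which past frequencies stop being a guide.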
The Environment and Evaluation
Scoring outcomes is feasible in Mediocristan. Probabilistic forecasts can be compared to observed relative frequencies.
Aggregation over instances is meaningful because the utilities are comparable.
But not so in Extremistan, where a single inaccurate prediction may be much more consequential than a host of accurate ones.
Probabilistic forecasts in Extremistan are opinions, and cannot be compared to observed frequencies
Which Environment Do Experts Inhabit?
Both, of course. But Extremistan is where some really important events reside, and where accuracy of previous predictions need not be informative.
While weather forecasting is certainly useful, it is not typical of the kind of forecast we get excited about, namely predicting the black swan. For example, the furor over the failure of US intelligence to anticipate 9/11 was an indictment of prediction in Extremistan.
Predicting black swans that are the result of human action calls for getting an insider’s perspective: Meehl’s “broken leg” cue, insider trading, infiltration
Sorry, Ken Hammond
Because scoring outcomes in Extremistan is so problematic, I suggest evaluating performance by examining coherence (process) instead.
The expert turkey was able to predict catastrophe by doing what looks like what intelligence officers are supposed to do – learning about plans (spying) and drawing causal inferences from observations.
Evaluating Coherence
If we think we know the correct process, we could evaluate the expert on the basis of adherence to that process.
If we lack that confidence, what can be done? We can examine the discrimination and consistency
in the predictions (you may have heard this before). All this really means is that predictions should be responsive to the relevant evidence.
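One way such a coherence check can be operationalized, assuming the discrimination/consistency ratio is meant in the spirit of the Cochran–Weiss–Shanteau (CWS) index, is to divide the variance of judgments across distinct cases by the variance across repeated judgments of the same case. All judgments below are invented:

```python
# Hypothetical sketch of a discrimination/consistency ratio in the spirit of
# the Cochran-Weiss-Shanteau (CWS) index. All judgments are invented.
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cws(judgments):
    """judgments maps each case to that expert's repeated predictions for it.

    Discrimination: variance of mean judgments across distinct cases.
    Inconsistency: average variance across repeats of the same case.
    """
    case_means = [sum(v) / len(v) for v in judgments.values()]
    discrimination = variance(case_means)
    inconsistency = sum(variance(v) for v in judgments.values()) / len(judgments)
    return discrimination / inconsistency

# Distinct cases get distinct predictions; repeats are nearly stable.
expert = {"case A": [0.8, 0.8], "case B": [0.2, 0.3], "case C": [0.5, 0.5]}
print(cws(expert))  # large ratio: responsive to the evidence, stable on repeats
```

A high ratio says only that the predictions respond to the relevant evidence and do not wander on identical evidence; it says nothing about correspondence with outcomes, which is precisely the trade the talk proposes.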
Reality Check
It is unlikely that users of predictions will be satisfied to know that their forecaster discriminates and is consistent. People like correspondence. Most will demand a track record of accurate predictions.
Unfortunately, such a record is unlikely ever to be available in Extremistan.
Cart Before Horse
Millions of tax dollars are currently being spent on an IARPA project whose goal is to determine how best to aggregate predictions made by individual forecasters.
IMHO, that money is going down a well. Without better understanding of how to evaluate predictions in Extremistan, one cannot say whether one aggregation method is better than another. IARPA assumes all environments are Mediocristan.
Summary
Batting average, while basically appropriate for evaluating expert predictors in Mediocristan, should be improved by incorporating utilities
Calibration, a sophisticated version of batting average, might be generalized to include utilities.
Probabilities cannot be compared to observed frequencies in Extremistan. Predictors need to realize when they have crossed the border.