Section 10 – Ec1818, Jeremy Barofsky, jbarofsk@hsph.harvard.edu, April 14 and 15, 2010


Page 1: Section 10 – Ec1818 Jeremy Barofsky jbarofsk@hsph.harvard.edu April 14 and 15, 2010

Section 10 – Ec1818

Jeremy Barofsky, jbarofsk@hsph.harvard.edu

April 14 and 15, 2010

Page 2:

Outline (Lectures 18 and 19)

• Finish Tree Models

• Previous Papers / Discussion of Paper Topics

• Benford’s Law

• Prediction Markets / Wisdom of Crowds

– Information Cascades (Example)

Page 3:

Tree Data Mining Models: Definition

• Tree models classify data by partitioning it into groups that minimize the variance of the output variable.

• Made up of a root node and terminal nodes – rules that place observations in classes.

• Useful for classifying objects into groups with nonlinear interactions, less effective for linear relationships.

• Employs local data only: once the data are partitioned into two sub-samples at the root node, each half of the tree is analyzed separately.

• CART – Classification and regression tree analysis (created by statistician Leo Breiman)

Page 4:

Simple Example: Predicting Height

– Split the root node by minimizing variance to create the tree.

– The height prediction is the mean of heights in the terminal node.

– Minimize variance by splitting into M/F, then L/R.

– Draw the isomorphism between the partitioning diagram and the tree.

Gender  Hand  Height
M       R     1.0
F       R     0.6
M       L     0.9
F       R     0.7
F       L     0.8
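The variance-minimizing split on this toy table can be checked directly; a minimal sketch in plain Python (the function names are mine):

```python
# Toy height data from the slide: (gender, hand, height)
data = [("M", "R", 1.0), ("F", "R", 0.6), ("M", "L", 0.9),
        ("F", "R", 0.7), ("F", "L", 0.8)]

def variance(xs):
    """Population variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def split_score(data, col):
    """Weighted within-group variance of height after splitting on column col."""
    groups = {}
    for row in data:
        groups.setdefault(row[col], []).append(row[2])
    n = len(data)
    return sum(len(g) / n * variance(g) for g in groups.values())

# Smaller weighted variance = better first split.
print(split_score(data, 0))  # split on gender
print(split_score(data, 1))  # split on hand
```

Splitting on gender first gives the lower weighted variance, which is why the tree splits M/F before L/R.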

Page 5:

Example: Fast and Frugal Heuristics

• A patient comes into the ER with severe chest pain, and physicians must decide whether the patient is high-risk (admit to the coronary care unit) or low-risk.

• Models and most ERs use 19 different cues (blood pressure, age, ethnicity, comorbidities, etc.).

• Breiman uses a simple decision tree with at most 3 questions:

– Is systolic BP > 91? No → high risk. Yes → next question.

– Is age > 62.5? No → low risk. Yes → next question.

– Is heart rate elevated? Yes → high risk. No → low risk.

• More accurate at classification than many statistical models!!

• (Breiman et al., 1993; Gigerenzer and Todd, 1999)
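The three-question tree can be written out as a function; a sketch with the thresholds taken from the slide, and `tachycardia` standing in for the elevated-heart-rate question:

```python
def heart_attack_risk(systolic_bp, age, tachycardia):
    """Fast-and-frugal triage tree from the slide (Breiman-style).

    Thresholds follow the slide; `tachycardia` is my stand-in name for
    the 'elevated heart rate' question."""
    if systolic_bp <= 91:
        return "high risk"
    if age <= 62.5:
        return "low risk"
    return "high risk" if tachycardia else "low risk"

print(heart_attack_risk(88, 70, False))   # low BP -> high risk immediately
print(heart_attack_risk(120, 50, True))   # young enough -> low risk
print(heart_attack_risk(120, 70, True))   # older + elevated heart rate -> high risk
```

Three yes/no questions, no coefficients to estimate: that is the sense in which the heuristic is "fast and frugal".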

Page 6:

Solving Overfitting in Trees

• Bootstrap cross-validation: if there is not enough data to split into training and test datasets, randomly remove a percentage of the data, fit the model on the rest, and measure fit on the held-out portion. Do this 1000 times and average the model-fit measures.

• Pruning: CART was developed because there is no optimal rule for stopping partitions. Grow a large tree from the training data, then prune it back by optimizing fit to test data sets.

• Add a complexity cost: the loss function we minimize (previously just prediction error) now depends on both the errors and the number of nodes, say cost = misclassification error + a × (# nodes), where a is the price per node. Keep adding nodes until the marginal reduction in misclassification from one more node falls to a, the price of an additional node.
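The complexity-cost rule can be illustrated with made-up numbers (the `errors` values below are hypothetical, chosen only to show the stopping rule):

```python
# errors[k] = misclassification rate of the best tree with k nodes
# (hypothetical numbers: error falls as the tree grows, with shrinking gains).
errors = {1: 0.40, 2: 0.25, 3: 0.17, 4: 0.13, 5: 0.12, 6: 0.115}
a = 0.02  # price per additional node

def best_size(errors, a):
    """Tree size minimizing total cost = error + a * (# nodes)."""
    return min(errors, key=lambda k: errors[k] + a * k)

print(best_size(errors, a))  # stops at 4: going 4 -> 5 cuts error by only 0.01 < a
```

With a = 0 the rule just grows the biggest tree; a positive price per node makes it stop where the marginal error reduction drops below a.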

Page 7:

Random Forests

• Randomly sample part of the observations and fit one tree to the sample. Sample randomly again and fit another tree. Take a “vote” of the trees (how many trees classify the patient as high-risk?) and classify a new observation by majority vote.

• Also avoids overfitting – tree model that overfits gets outvoted.

• Prediction error for a forest decreases as the ratio (mean correlation between trees) / (prediction strength)² falls (like the Wisdom of Crowds).
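The sampling-and-voting step can be sketched with stub classifiers standing in for fitted trees (the stubs and names below are mine, not CART output):

```python
import random

random.seed(0)

def bootstrap_sample(data):
    """Random sample with replacement, as used to fit each tree."""
    return [random.choice(data) for _ in data]

def majority_vote(classifiers, x):
    """Classify x by majority vote over the fitted trees."""
    votes = [clf(x) for clf in classifiers]
    return max(set(votes), key=votes.count)

print(bootstrap_sample(list(range(10))))  # one resampled training set

# Stub "trees" fit on different samples: two say high-risk, one overfit tree disagrees.
trees = [lambda x: "high", lambda x: "high", lambda x: "low"]
print(majority_vote(trees, x=None))  # prints "high": the overfit tree is outvoted
```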

Page 8:

Paper Title: How well do prediction markets for American Idol perform?

• Question: Compare the accuracy of betting markets, expert opinion, and one online poll in predicting the outcomes of Seasons 5, 6, and 7 of American Idol.

• Data: Gambling website bodog.com for odds information by contestant and season, rankings by well-known commentator, and polling website Votefair.com.

• Methods: Check if odds data follow Zipf’s law by week, compare actual rank at end of show to rank implied by odds for each week, calculate RMSE for errors in ranks.

• Results: Weeks 1–3 odds exhibit Zipf’s law; odds early in the season are poor predictors of the final winner; no method dominates the others in RMSE week by week, but prediction markets are slightly better overall.

Page 9:

Paper Title: Freshman Rooming at Harvard

• Question: How successful is Harvard freshman rooming procedure?

• Data: None collected, paper describes housing process, how to measure success, and what data would be needed to do this measurement.

• Methods/Results: Success can’t be measured from the observed number of students who live together in future years. Instead, use the number of pair-wise links individuals would have wanted to keep as the compatibility parameter; the total number of links among N roommates is N(N−1)/2, collected via survey. Describes a binomial model with p = probability that any given link is preserved at the end of freshman year.
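The link count and the binomial model can be sketched directly (function names are mine):

```python
from math import comb

def total_links(n):
    """Number of pairwise links among n roommates: n(n-1)/2."""
    return n * (n - 1) // 2

def prob_k_links_kept(n, k, p):
    """Binomial model from the paper summary: probability that exactly k of
    the links survive freshman year if each survives independently w.p. p."""
    L = total_links(n)
    return comb(L, k) * p**k * (1 - p)**(L - k)

print(total_links(4))                      # a 4-person rooming group has 6 links
print(prob_k_links_kept(4, 6, 0.5))        # chance all 6 survive at p = 0.5
```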

Page 10:

Paper Title: How to fit a power law distribution for the richest and poorest

• Question: Can we characterize the income distribution of the richest individuals as following a Pareto distribution?

• Data: Annual data on billionaires from Forbes magazine, 1988–2008. World Bank “Global and Regional Poverty Trends” data – % under $1/$2 per day per nation.

• Methods: Describes the problems of binning data to determine whether a power law exists (from Newman, 2005). Uses log-log plots of the data overall, by nation, and by industry. Also examines the lower end.

• Results: Higher ends follow Pareto, lower end doesn’t based on % under $1 a day per nation.

Page 11:

Benford’s Law

• Astronomer Simon Newcomb (1881) observed that scientists used logarithm tables starting with 1 more often than others – the pages of log look-up books for numbers beginning with 1 were dirtier than the rest.

• Prob(first digit = d) = log10(1 + 1/d) for d = {1, …, 9}.

• This implies the first digit is 1 about 30% of the time, but 9 only 4.6% of the time.

• Physicist Frank Benford (1938) made the same observation with log tables, but also gathered data on baseball stats, atomic weights of elements, and areas of rivers.

D 1 2 3 4 5 6 7 8 9

P(D) (%) 30.1 17.6 12.5 9.7 7.9 6.7 5.8 5.1 4.6
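The table can be reproduced directly from the formula above:

```python
from math import log10

# First-digit probabilities under Benford's law: P(d) = log10(1 + 1/d).
benford = {d: log10(1 + 1 / d) for d in range(1, 10)}
for d, p in benford.items():
    print(d, round(100 * p, 1))  # matches the table: 30.1, 17.6, ..., 4.6
print(sum(benford.values()))     # the nine probabilities sum to 1
```

The probabilities telescope: the sum is log10(2/1 · 3/2 · … · 10/9) = log10(10) = 1.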

Page 12:

Benford’s Law: the Quran is divided into 114 chapters of unequal length called Suras; the slide plots the distribution of first digits of the number of verses per Sura.

Page 13:

Examples and General Law

• Other examples found later:

– Stock market averages (Dow and S&P indices)

– Numbers appearing on newspaper front pages

– Populations of U.S. counties in the 1990 census

• General Law: P(D1 = d1, D2 = d2, …, Dk = dk) = log10[1 + (∑i di · 10^(k−i))^(−1)], meaning that Prob(leading digits 314) = log10[1 + 1/314].

• The relative frequency of the 2nd digit decreases from 0 to 9 like the 1st digit’s, but less steeply (more uniform). By the 5th digit, a 1 is still more likely than a 9, but the differences are tiny – each value appears with probability approximately 1/10.

• Digit values are also not independent. E.g., the unconditional probability that the 2nd digit is 2 is 0.109, but conditional on the 1st digit being 1 it is 0.115 (Hill, 1998).

• Benford’s law distributed data is scale invariant.
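The general law is easy to compute; the sketch below also reproduces the second-digit figures quoted above (the function name is mine):

```python
from math import log10

def benford_prob(digits):
    """P(leading digit block d1 d2 ... dk) = log10(1 + 1/D),
    where D is the digits read as one integer (e.g. (3, 1, 4) -> 314)."""
    D = int("".join(map(str, digits)))
    return log10(1 + 1 / D)

print(benford_prob((3, 1, 4)))  # Prob(leading digits 3,1,4) = log10(1 + 1/314)

# Second-digit facts from the slide (Hill, 1998): sum over first digits for
# the unconditional case; divide by P(D1 = 1) for the conditional one.
p2_uncond = sum(benford_prob((d1, 2)) for d1 in range(1, 10))
p2_given1 = benford_prob((1, 2)) / log10(1 + 1 / 1)
print(round(p2_uncond, 3), round(p2_given1, 3))  # 0.109 0.115
```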

Page 14:

Uses of Benford’s Insight

• Optimization of computer processing: if the data a computer will process come from many unbiased random distributions, its calculations will be performed on Benford/log-distributed data, so processing power can be optimized for the inputs (mostly starting with 1) it will actually receive.

– Fraud detection: filled-in survey responses or made-up numbers tend to have uniformly distributed first digits, but we know genuine figures should be log-distributed. Applies to tax and accounting fraud.

– Separating the fakers from the non-fakers in class given a run of 200 coin flips.

– Used as evidence of fraud in the 2009 Iranian election (though one can’t reject the null that the voting data follow Benford’s Law using Pearson’s chi-square test)!!!
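The fraud check above is exactly a Pearson chi-square goodness-of-fit test against Benford's distribution; a minimal sketch (the function name is mine):

```python
from math import log10

def chi_square_benford(first_digit_counts):
    """Pearson chi-square statistic of observed first-digit counts
    against Benford's law (8 degrees of freedom)."""
    n = sum(first_digit_counts.values())
    stat = 0.0
    for d in range(1, 10):
        expected = n * log10(1 + 1 / d)
        observed = first_digit_counts.get(d, 0)
        stat += (observed - expected) ** 2 / expected
    return stat

# Uniform first digits (a typical fabrication pattern) score far above
# the 5% critical value of about 15.5 for 8 degrees of freedom.
uniform = {d: 100 for d in range(1, 10)}
print(chi_square_benford(uniform))
```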

Page 15:

Proof Problems and Hill’s theorem

• Proofs of Benford’s law ran into problems because they had to:

– prove that only scale-invariant datasets fit the law;

– satisfy countable additivity: P(1) + P(2) + … = 1.

• If scale invariant, then P(1) = P(1·K) for any constant K. But then P(1) + P(2) + … is not 1 but infinite. These two requirements are irreconcilable.

• Until….Hill looked at base-invariance using equivalent sets of numbers.

Page 16:

Hill’s Theorem

• Proof of Benford’s Law: if probability distributions are selected at random and random samples are taken from those distributions (picking from many distributions/scales produces a scale-invariant result even when the specific distributions aren’t scale-invariant), then the significant-digit frequencies of the combined sample converge to Benford’s distribution.

• We see the Benford distribution not because the same distributions are formed by some mystical underlying data process, but because random sampling from random distributions produces Benford.

• Example: take data on lottery numbers (uniform), heights (normal), and incomes (Pareto); combining samples from these distributions converges toward Benford.
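Hill's result can be illustrated by simulation; the sketch below mixes draws from uniform, normal, and Pareto distributions, each placed on a randomly chosen scale (the specific parameters are my assumptions, chosen only to mimic the lottery/height/income example):

```python
import random
from math import log10

random.seed(0)

def first_digit(x):
    """Leading significant digit: first character of scientific notation."""
    return int(f"{abs(x):.6e}"[0])

# Pool samples from several distributions, each on a random scale
# (scales spanning whole decades make the mixture scale-invariant).
samples = []
for _ in range(20000):
    scale = 10 ** random.uniform(0, 3)
    base = random.choice([
        random.uniform(1, 10),        # lottery-like (uniform)
        abs(random.gauss(170, 30)),   # height-like (normal)
        random.paretovariate(1.2),    # income-like (Pareto)
    ])
    samples.append(base * scale)

freq1 = sum(first_digit(x) == 1 for x in samples) / len(samples)
print(freq1)  # close to log10(2) ≈ 0.301
```

No single ingredient here is Benford-distributed; it is the random mixing of distributions and scales that produces the log distribution of first digits.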

Page 17:

Galton’s Ox Contest

• At the West of England Fat Stock and Poultry Exhibition, an ox weight-judging contest was held. Contestants could purchase a card to guess the ox’s weight. With 787 entries, the median guess was 1207 pounds and the real weight was 1198. The crowd was only 0.8% off!!!

• The interquartile range of estimates was 3.1% around the mean; the guesses were roughly normally distributed.

Page 18:

The Wisdom of Crowds (James Surowiecki, 2004)

• Characteristics of a wise crowd:

1) Diversity of opinion (private information)

2) Independence

3) Decentralization

4) Aggregation mechanism

• Can break down when:

– markets are wealth-weighted

– there is risk aversion

– incentives are weak

– traders are non-representative

Page 19:

Prediction Markets

• Prediction markets are financial markets in which assets’ cash value depends on the outcome of an event in the future.

• Contracts are written to pay $1 if event X occurs. We can interpret the market price of such a contract as the market’s prediction of the probability of event X.

• Efficient way to aggregate diverse opinions with well-aligned incentives.
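A sketch of the price-to-probability reading (ignoring fees, risk premia, and wealth weighting, all of which the later slides flag as caveats):

```python
def implied_probability(price, payout=1.0):
    """Price of a contract paying `payout` if X occurs, read as the
    market's probability of X (risk-neutral, frictionless reading)."""
    return price / payout

# A contract trading at 62 cents implies a 62% market probability.
# At the "true" probability q, expected profit per contract is q - price,
# which is zero exactly when q equals the price.
print(implied_probability(0.62))
```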

Page 20:

Wisdom of Crowds / Should you ask the audience? (Paper topic?)

• Why it should work:

– Provides incentives for information revelation

– Compiles information from all participants

– Anonymous

• Requires that observations be uncorrelated across individuals; otherwise herding / informational cascades can occur.

• Actual interpretation of prices in markets requires knowledge of risk preferences of the participants.

• Often markets are wealth-weighted.

• NY Times Best-Seller List (Bikhchandani, Hirshleifer, Welch, 1998, p. 151)

Page 21:

Informational Cascades

• The opposite of the wisdom of crowds – occurs when individuals only or primarily use information from others to make their decisions.

• Requires sequential and binary decisions; more likely with externalities and when it is difficult to produce an objective judgment on which to base the decision.

• Game Example:
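The transcript cuts off at the game; the canonical classroom example is the sequential urn-guessing game in the spirit of Bikhchandani, Hirshleifer, and Welch. A simulation sketch under that assumption (the simple counting rule below is a standard simplification, not necessarily the version used in section):

```python
import random

random.seed(1)

def run_cascade(true_urn, n_agents=10, p=2/3):
    """Sequential guessing game: each agent draws a private binary signal
    (correct with probability p), sees all earlier public guesses, and
    guesses the urn (1 or 0), counting each guess and the own signal as
    equally informative."""
    guesses = []
    for _ in range(n_agents):
        signal = true_urn if random.random() < p else 1 - true_urn
        lead = sum(1 if g == 1 else -1 for g in guesses)
        if lead >= 2:
            guesses.append(1)   # up-cascade: own signal can no longer matter
        elif lead <= -2:
            guesses.append(0)   # down-cascade
        else:
            score = lead + (1 if signal == 1 else -1)
            # on a tie (score 0), follow the private signal
            guesses.append(1 if score > 0 else 0 if score < 0 else signal)
    return guesses

print(run_cascade(true_urn=1))
```

Once the public lead reaches two, every later agent rationally ignores their private signal, so the crowd's guesses stop aggregating new information: exactly the failure mode the slide contrasts with the wisdom of crowds.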