

Tree Analysis – A Method for Constructing Edit Groups

Work Session on Statistical Data Editing, Oslo, Norway, 24-26 September 2012

By Anders Norberg, Statistics Sweden


Said about Tree Analysis

• Trees do not supersede other modeling techniques

• Different techniques do better with different data and in the hands of different analysts

• However, the winning technique is generally not known until all the contenders get a chance

• Trees are easy

A Tree

Trees provide a series of if-then rules.

Each rule assigns an observation to one segment of the tree, at which point another if-then rule is applied.

The initial segment, containing the entire data set, is the root node of the tree. The final nodes are called leaves. An intermediate node together with all its successors forms a branch of the tree.
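As a minimal sketch of this vocabulary in Python (the `Node` class, its field names, and the rule labels are illustrative assumptions, not taken from the presentation), a tree can be held as nested nodes, where the leaves are the nodes without children:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One segment of the data; `rule` is the if-then condition leading to it."""
    rule: str
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        # A leaf is a final node: no further rule is applied to it.
        return not self.children

    def leaves(self) -> List["Node"]:
        """Return the final nodes of the branch rooted at this node."""
        if self.is_leaf():
            return [self]
        found: List["Node"] = []
        for child in self.children:
            found.extend(child.leaves())
        return found


# The root holds the entire data set; the rule labels below are hypothetical.
root = Node("all", [
    Node("x <= 3", [Node("x == 1"), Node("x in 2..3")]),
    Node("x > 3"),
])
leaf_rules = [leaf.rule for leaf in root.leaves()]
```

Here `root.children[0]` together with its two successors is one branch, and `leaf_rules` lists the three final segments.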

The Root

We have a dataset, preferably large; here, a 2008 sample of white-collar workers.

One variable is considered dependent; here, salary per hour.

All: N = 684 366, Average = 196,21 SEK

First Split

The dataset is split into two by a simple rule containing one auxiliary/explanatory variable:

If 1<=Occup<=3 then Node=2; else Node=1;

All: N = 684 366, Ave. = 196,21, Expl = 10,5%
Occup = 1-3: N = 487 812, Ave. = 219,67
Occup = 4-9: N = 196 554, Ave. = 138,00
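A quick consistency check on these figures, sketched in Python: the size-weighted mean of the two child nodes should reproduce the root average. The numbers are copied from the slide, with decimal commas written as points; the variable names are illustrative only.

```python
# (N, average hourly salary in SEK) for the two child nodes, copied from the slide.
children = {
    "Occup = 1-3": (487_812, 219.67),
    "Occup = 4-9": (196_554, 138.00),
}

# The child sizes should add up to the root size.
total_n = sum(n for n, _ in children.values())

# Weight each child average by its share of the observations.
weighted_mean = sum(n * avg for n, avg in children.values()) / total_n
```

Rounded to two decimals, `weighted_mean` agrees with the root average of 196,21 SEK, and `total_n` equals the root N of 684 366.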

Second Split

One of the two new datasets is split by the same method:

If 1<=Occup<=3 then do; if Occup=1 then Node=4; else Node=3; end;
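The node rules written so far can be mirrored in Python as a rough sketch. The function name `assign_node` is hypothetical, and the sketch assumes `Occup` is the one-digit occupation code used in these first two splits.

```python
def assign_node(occup: int) -> int:
    """Return the node number for a one-digit occupation code after two splits."""
    if 1 <= occup <= 3:
        # Second split: applied only inside the Occup = 1-3 segment.
        return 4 if occup == 1 else 3
    # First split: everything outside Occup = 1-3 stays in node 1.
    return 1


# Node assignment for occupation codes 1 through 9.
nodes = [assign_node(code) for code in range(1, 10)]
```

Each observation ends up in exactly one node, which is what makes the leaves usable as mutually exclusive groups.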

All: N = 684 366, Ave. = 196,21, Expl = 10,5%
Occup = 1-3: N = 487 812, Ave. = 219,67, Expl = 5,9%
Occup = 1: N = 72 957, Ave. = 298,00
Occup = 2-3: N = 297 995, Ave. = 205,90
Occup = 4-9: N = 196 554, Ave. = 138,00

Do it again and again…

All: N = 684 366, Ave. = 196,21, Expl = 10,5%
Occup = 1-3: N = 487 812, Ave. = 219,67, Expl = 5,9%
Occup = 1: N = 72 957, Ave. = 298,00, Expl = 2,9%
Occup = 2-3: N = 297 995, Ave. = 205,90, Expl = 1,2%
Occup = 3: N = 232 453, Ave. = 191,71, Expl = 0,8%
Occup = 2: N = 182 402, Ave. = 223,98, Expl = 1,1%
Occup = 223, 224, 231-235, 243-246: N = 38 044, Ave. = 178,85
Occup = 'rest': N = 144 358, Ave. = 235,87
Occup = 'rest': N = 69 383, Ave. = 284,38, Expl = 1,0%
Occup = 121: N = 3 574, Ave. = 562,37, Expl = 0,5%
NUTS = 1: N = 1 332, Ave. = 700,31
NUTS > 1: N = 2 242, Ave. = 480,42
SNI1 = G, A, O, E, I, S, R, H, Q, P, N: N = 26 864, Ave. = 239,50
SNI1 = K, J, B, C, F, M, D, L: N = 42 519, Ave. = 312,73
Occup = 4-9: N = 196 554, Ave. = 138,00
Gender = Women: N = 96 427, Ave. = 170,47
Gender = Men: N = 136 026, Ave. = 206,76

…and again

All: N = 684 366, Ave. = 196,21, Expl = 10,5%
Occup = 1-3: N = 487 812, Ave. = 219,67, Expl = 5,9%
Occup = 1: N = 72 957, Ave. = 298,00, Expl = 2,9%
Occup = 2-3: N = 297 995, Ave. = 205,90, Expl = 1,2%
Occup = 3: N = 232 453, Ave. = 191,71, Expl = 0,8%
Occup = 2: N = 182 402, Ave. = 223,98, Expl = 1,1%
Occup = 223, 224, 231-235, 243-246: N = 38 044, Ave. = 178,85, Expl = 0,1%
Occup = 'rest': N = 144 358, Ave. = 235,87, Expl = 1,0%
Occup = 'rest': N = 69 383, Ave. = 284,38, Expl = 1,0%
Occup = 121: N = 3 574, Ave. = 562,37, Expl = 0,5%
NUTS = 1: N = 1 332, Ave. = 700,31, Expl = 0,4%
NUTS > 1: N = 2 242, Ave. = 480,42, Expl = 0,1%
Age = 18-29: N = 46 965, Ave. = 120,39, Expl = 0,0%
Age = 30-65: N = 149 589, Ave. = 143,53, Expl = 0,2%
Age = 30-65: N = 127 073, Ave. = 244,89, Expl = 0,6%
Age = 18-29: N = 17 285, Ave. = 169,58, Expl = 0,0%
SNI1 = G, A, O, E, I, S, R, H, Q, P, N: N = 26 864, Ave. = 239,50, Expl = 0,3%
SNI1 = K, J, B, C, F, M, D, L: N = 42 519, Ave. = 312,73, Expl = 0,8%
SNI1 not 'K': N = 1 210, Ave. = 646,34, Expl = 0,1%
SNI1 = 'K': N = 122, Ave. = 1 235,59, Expl = 0,1%
Occup = 4-9: N = 196 554, Ave. = 138,00, Expl = 0,2%
Gender = Women: N = 96 427, Ave. = 170,47, Expl = 0,2%
Gender = Men: N = 136 026, Ave. = 206,76, Expl = 1,1%
SNI1 not 'K': N = 119 114, Ave. = 196,77, Expl = 0,4%
SNI1 = 'K': N = 16 912, Ave. = 277,14, Expl = 0,6%
NUTS > 1: N = 71 878, Ave. = 226,65, Expl = 0,2%
NUTS = 1: N = 55 195, Ave. = 268,64, Expl = 0,2%
Age = 18-29: N = 13 578, Ave. = 150,42, Expl = 0,1%
Age = 30-65: N = 105 536, Ave. = 202,73, Expl = 0,3%
NUTS > 1: N = 8 603, Ave. = 222,34, Expl = 0,1%
NUTS = 1: N = 8 309, Ave. = 333,89, Expl = 0,4%
NUTS > 1: N = 30 198, Ave. = 286,17, Expl = 0,2%
NUTS = 1: N = 12 321, Ave. = 377,84, Expl = 0,3%
Age = 18-39: N = 3 358, Ave. = 303,17, Expl = 0,0%
Age = 40-65: N = 8 963, Ave. = 405,81, Expl = 0,2%
Occup = 123: N = 15 920, Ave. = 306,14, Expl = 0,1%
Occup = 'rest': N = 14 278, Ave. = 263,91, Expl = 0,1%
NUTS > 1: N = 16 532, Ave. = 216,85, Expl = 0,1%
NUTS = 1: N = 10 332, Ave. = 275,73, Expl = 0,2%

This Tree

• The criterion for the best split is minimization of the within-group sums of squares around the group means

• Has 20 leaves

• Explains 30% of the total sum of squares in the data

• Leaves can be used as edit groups
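The split criterion and the explained share can be illustrated with a small Python sketch. It assumes a single ordinal or interval-scale auxiliary variable `x` split at a threshold; nominal variables would need a search over category groupings, which is not shown. The function names and the toy data are hypothetical, not from the presentation.

```python
from typing import Optional, Sequence, Tuple


def sum_of_squares(values: Sequence[float]) -> float:
    """Sum of squared deviations around the mean (0.0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_threshold_split(
    x: Sequence[float], y: Sequence[float]
) -> Tuple[Optional[float], float]:
    """Find the cut c that minimizes SS(y | x <= c) + SS(y | x > c).

    Returns (c, explained share), where the share is 1 - within SS / total SS.
    c is None when no cut reduces the within-group sum of squares.
    """
    total_ss = sum_of_squares(y)
    if total_ss == 0.0:
        return None, 0.0  # nothing to explain
    best_cut, best_within = None, total_ss
    # Skip the largest value: cutting there would leave the right group empty.
    for cut in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= cut]
        right = [yi for xi, yi in zip(x, y) if xi > cut]
        within = sum_of_squares(left) + sum_of_squares(right)
        if within < best_within:
            best_cut, best_within = cut, within
    return best_cut, 1.0 - best_within / total_ss


# Toy data: the dependent variable jumps between x = 2 and x = 3.
x = [1, 1, 2, 2, 3, 3]
y = [10.0, 12.0, 11.0, 13.0, 30.0, 32.0]
cut, share = best_threshold_split(x, y)
```

On this toy data the chosen cut is `x <= 2`, which leaves a within-group sum of squares of 7 out of a total of 514.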

Method

• Auxiliary variables are of four scales:
  – Nominal
  – Ordinal
  – Bivariate
  – Interval

• Splitting should be stopped when the analysis detects that no further gain can be made, or some pre-set stopping rules are met.

• Alternatively, the data are split as much as possible and then the tree is later pruned.

• Manual intervention is possible
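A hedged Python sketch of splitting with pre-set stopping rules follows. The two thresholds `MIN_LEAF_SIZE` and `MIN_GAIN`, the function names, and the toy data are assumptions chosen for illustration; the presentation does not state which stopping rules were used, and pruning is not shown.

```python
from typing import List, Sequence, Tuple

MIN_LEAF_SIZE = 2  # hypothetical rule: never create a group smaller than this
MIN_GAIN = 2.0     # hypothetical rule: minimum reduction in within-group SS


def ss(values: Sequence[float]) -> float:
    """Sum of squared deviations around the mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def grow(x: Sequence[float], y: Sequence[float],
         path: str = "all") -> List[Tuple[str, int, float]]:
    """Split recursively on one variable x; return leaves as (rule, n, mean)."""
    best = None
    for cut in sorted(set(x))[:-1]:
        left = [(xi, yi) for xi, yi in zip(x, y) if xi <= cut]
        right = [(xi, yi) for xi, yi in zip(x, y) if xi > cut]
        if min(len(left), len(right)) < MIN_LEAF_SIZE:
            continue  # stopping rule: group would be too small
        gain = ss(y) - ss([v for _, v in left]) - ss([v for _, v in right])
        if best is None or gain > best[0]:
            best = (gain, cut, left, right)
    if best is None or best[0] < MIN_GAIN:
        # No admissible split, or no worthwhile gain: this node is a leaf.
        return [(path, len(y), sum(y) / len(y))]
    _, cut, left, right = best
    return (
        grow([a for a, _ in left], [b for _, b in left], f"{path} & x<={cut}")
        + grow([a for a, _ in right], [b for _, b in right], f"{path} & x>{cut}")
    )


x = [1, 1, 2, 2, 3, 3, 4, 4]
y = [10.0, 12.0, 11.0, 13.0, 30.0, 32.0, 31.0, 33.0]
leaves = grow(x, y)
```

With these settings the toy data is split once, at `x <= 2`; further splits inside each half would reduce the sum of squares by only 1, below `MIN_GAIN`, so both halves become leaves.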
