instance construction via likelihood-based data squashing

Post on 03-Jan-2016


DESCRIPTION

Instance Construction via Likelihood-Based Data Squashing. Madigan D., et al. (Ch. 12, Instance Selection and Construction for Data Mining (2001), Kluwer Academic Publishers). Summarized by Jinsan Yang, SNU Biointelligence Lab. Abstract: Data Compression Method: Squashing

TRANSCRIPT

Instance Construction via Likelihood-Based Data Squashing

Madigan D., et al. (Ch. 12, Instance Selection and Construction for Data Mining (2001), Kluwer Academic Publishers)

Summarized by: Jinsan Yang, SNU Biointelligence Lab

Abstract

Data Compression Method: Squashing

LDS: Likelihood based data squashing

Keywords: Instance Construction, Data Squashing

Outline

- Introduction
- The LDS Algorithm
- Evaluation: Logistic Regression
- Evaluation: Neural Networks
- Iterative LDS
- Discussion

Introduction

Massive data examples

- Large-scale retailing
- Telecommunications
- Astronomy
- Computational biology
- Internet logging

Some computational challenges

- Need for multiple passes of data access
- 10^5 to 10^6 times slower than main memory
- Current solution: scaling up existing algorithms
- Here: scaling down the data

Data squashing (DuMouchel et al., 1999): 750,000 → 8,443; outperforms a random sample of size 7,543 by a factor of 500 in MSE

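LDS Algorithm

Motivation: Bayesian rule

Given three data points $d_1, d_2, d_3$, estimate the parameter $\theta$:

$$p(\theta \mid d_1, d_2, d_3) \propto p(d_1 \mid \theta)\, p(d_2 \mid \theta)\, p(d_3 \mid \theta)\, p(\theta)$$

If $p(d_1 \mid \theta) \approx p(d_2 \mid \theta)$, squash $d_1, d_2$ by $d^*$ with

$$p(d^* \mid \theta)^2 = p(d_1 \mid \theta)\, p(d_2 \mid \theta)$$

Clusters by likelihood profile:

$$\bigl(p(d_i \mid \theta_1), \ldots, p(d_i \mid \theta_k)\bigr)$$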
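As a worked illustration of the squashing condition, consider a case that is not in the slides: a Gaussian likelihood with unit variance, $p(d \mid \theta) = (2\pi)^{-1/2} \exp\{-(d-\theta)^2/2\}$ (an assumption made only for this example). Taking the pseudo point to be the midpoint $d^* = (d_1 + d_2)/2$ gives

$$(d_1-\theta)^2 + (d_2-\theta)^2 = 2\,(d^*-\theta)^2 + \tfrac{1}{2}(d_1-d_2)^2$$

$$p(d_1 \mid \theta)\, p(d_2 \mid \theta) = p(d^* \mid \theta)^2 \, \exp\{-\tfrac{1}{4}(d_1-d_2)^2\}$$

The extra factor does not depend on $\theta$, so in this special case replacing $d_1, d_2$ by $d^*$ with weight 2 leaves the posterior for $\theta$ unchanged. For other likelihoods the match is generally only approximate, and it is best when $d_1$ and $d_2$ have similar likelihood profiles.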

LDS Algorithm

Details of LDS Algorithm

[Select] Values of $\theta$ by a central composite design

Central composite design for 3 factors
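The [Select] step can be sketched in a few lines. A central composite design for k factors combines the 2^k corner points, 2k axial points, and one center point, which gives 15 points for 3 factors. The function names, the rotatable choice of the axial distance `alpha`, and the scaling of the design around an initial estimate by its standard errors are assumptions made for this illustration, not details taken from the slides.

```python
from itertools import product


def central_composite_design(k, alpha=None):
    """Coded design points: 2^k corners, 2k axial points, 1 center point."""
    if alpha is None:
        # Rotatable design: a common textbook choice, assumed here.
        alpha = (2 ** k) ** 0.25
    corners = [tuple(float(s) for s in signs)
               for signs in product((-1, 1), repeat=k)]
    axial = []
    for j in range(k):
        for s in (-alpha, alpha):
            point = [0.0] * k
            point[j] = s
            axial.append(tuple(point))
    center = [(0.0,) * k]
    return corners + axial + center


def theta_grid(theta_hat, std_err, alpha=None):
    """Map coded points to parameter values around theta_hat (hypothetical scaling)."""
    design = central_composite_design(len(theta_hat), alpha)
    return [tuple(t + c * s for t, c, s in zip(theta_hat, point, std_err))
            for point in design]
```

For example, `theta_grid([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])` returns 15 candidate values of the parameter, one of which is the initial estimate itself.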

LDS Algorithm

[Profile] Evaluate the likelihood profiles

[Cluster] Cluster the mother data in a single pass

- Select n' random samples as initial cluster centers

- Assign each remaining data point to a cluster

[Construct] Construct the pseudo data: cluster center
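The [Profile], [Cluster], and [Construct] steps can be sketched as follows. This is a minimal illustration under several assumptions that the slides do not state: profiles are computed on the log scale, each point is assigned to the nearest center by Euclidean distance between profiles, the centers stay fixed during the pass, and each pseudo point is the mean of its cluster carrying the cluster size as a weight. The names `log_profile`, `squash`, and `loglik` are hypothetical.

```python
import math
import random


def log_profile(d, thetas, loglik):
    """Log-likelihood of one data point at each selected parameter value."""
    return [loglik(d, th) for th in thetas]


def squash(data, thetas, loglik, n_clusters, seed=0):
    """Single-pass clustering by profile; returns (pseudo points, weights)."""
    rng = random.Random(seed)
    center_idx = rng.sample(range(len(data)), n_clusters)
    centers = [log_profile(data[i], thetas, loglik) for i in center_idx]
    sums = [list(data[i]) for i in center_idx]   # running coordinate sums
    counts = [1] * n_clusters                    # cluster sizes
    chosen = set(center_idx)
    for i, d in enumerate(data):
        if i in chosen:
            continue
        prof = log_profile(d, thetas, loglik)
        # Nearest fixed center in profile space (assumed assignment rule).
        j = min(range(n_clusters), key=lambda c: math.dist(prof, centers[c]))
        counts[j] += 1
        sums[j] = [s + x for s, x in zip(sums[j], d)]
    # Pseudo point = cluster mean, weighted by cluster size (assumption).
    pseudo = [tuple(s / n for s in row) for row, n in zip(sums, counts)]
    return pseudo, counts
```

Each data point is a tuple of floats, and `loglik(d, theta)` is supplied by the caller for the model at hand. Because the pseudo points are weighted cluster means, the weighted mean of the squashed data equals the mean of the mother data.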

Evaluation: Logistic Regression

•Small-scale simulations:

•Initial estimate of $\theta$

•Plot: Log (Error Ratio)

•Three methods of initial parameter estimations

•100 data / 48 squashed data

$$\log \frac{p(y=1)}{1 - p(y=1)} = \theta_1 X_1 + \theta_2 X_2 + \theta_3 X_3 + \theta_4 X_4 + \theta_5 X_5$$
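A small simulation of this kind of model could look like the sketch below. It assumes standard normal covariates and made-up coefficient values, neither of which comes from the slides, and it shows how a weighted log-likelihood lets squashed data (pseudo points with weights) stand in for the full data. The function names are hypothetical.

```python
import math
import random


def prob_y1(x, theta):
    """p(y = 1) under the logistic model with linear predictor sum(theta_j * x_j)."""
    eta = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-eta))


def simulate(n, theta, seed=0):
    """Draw n (x, y) pairs; covariates are standard normal (assumption)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in theta]
        y = 1 if rng.random() < prob_y1(x, theta) else 0
        rows.append((x, y))
    return rows


def weighted_loglik(rows, weights, theta):
    """Log-likelihood in which each row counts `weight` times."""
    total = 0.0
    for (x, y), w in zip(rows, weights):
        p = prob_y1(x, theta)
        total += w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total
```

With all weights equal to 1 this is the ordinary log-likelihood of the 100 simulated points. A squashed data set would pass its pseudo points together with their cluster sizes as weights.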

Evaluation: Logistic Regression

Medium scale: 100,000, base: 1% simple random sampling

Evaluation: Logistic Regression

Large scale: 744,963, base: 1% simple random sampling

Evaluation: Neural Networks

Feed-forward, two input nodes, one hidden layer with 3 units, single binary output

Mother data: 10,000, squashed data: 1,000, repetitions: 30

Test data: 1,000 from the same network

Comparisons for P(whole) - P(reduced)
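The architecture and the comparison above can be sketched as follows. The sigmoid activations, the random weights standing in for trained networks, and all function names are assumptions made for this illustration. The slides do not specify the activation function or the training procedure.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_network(seed=0, n_in=2, n_hidden=3):
    """Random weights for a 2-3-1 feed-forward net (stand-in for a trained one)."""
    rng = random.Random(seed)
    return {
        "W1": [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_hidden)],
        "b1": [rng.gauss(0, 1) for _ in range(n_hidden)],
        "w2": [rng.gauss(0, 1) for _ in range(n_hidden)],
        "b2": rng.gauss(0, 1),
    }


def predict_proba(net, x):
    """Forward pass: hidden sigmoid units, then a single sigmoid output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(net["W1"], net["b1"])]
    return sigmoid(sum(w * h for w, h in zip(net["w2"], hidden)) + net["b2"])


def prob_differences(net_whole, net_reduced, test_inputs):
    """P(whole) - P(reduced) for each test input."""
    return [predict_proba(net_whole, x) - predict_proba(net_reduced, x)
            for x in test_inputs]
```

In the experiment described here, `net_whole` would be the network fitted to the mother data and `net_reduced` the one fitted to the squashed data or to a random sample. Differences close to zero on the test inputs indicate that the reduced data preserved the fit.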

Evaluation: Neural Networks

Iterative LDS

When the estimate of $\theta$ is not accurate:

1. Set $\theta$ from a simple random sample

2. Squash by LDS

3. Estimate $\theta$

4. Go to 2.
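The loop above can be sketched as a small driver function. The slides give no stopping rule, so the fixed number of iterations here is an assumption, as are the 1% initial sample and the names `iterative_lds`, `squash`, and `estimate`. The two callables are supplied by the caller.

```python
import random


def iterative_lds(data, squash, estimate, sample_frac=0.01, n_iter=3, seed=0):
    """Alternate squashing and estimation, starting from a random sample.

    squash(data, theta)      -> (pseudo_points, weights)
    estimate(points, weights) -> theta
    """
    rng = random.Random(seed)
    n0 = max(1, int(sample_frac * len(data)))
    sample = rng.sample(data, n0)
    theta = estimate(sample, [1.0] * n0)         # step 1: initial estimate
    for _ in range(n_iter):                      # step 4: repeat from step 2
        pseudo, weights = squash(data, theta)    # step 2: squash by LDS
        theta = estimate(pseudo, weights)        # step 3: re-estimate
    return theta
```

Each pass re-centers the squashing on the latest estimate, so a poor initial estimate from the random sample matters less after a few iterations. A convergence check on the change in the estimate could replace the fixed iteration count.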
