Datascope - a new tool for Logical Analysis of Data (LAD)

Sorin Alexe
RUTCOR, Rutgers University, Piscataway, NJ
e-mail: [email protected]
URL: rutcor.rutgers.edu/~salexe

DIMACS Mixer Series, September 19, 2002



LAD - Problem

[Diagram; labels: Dataset, Hidden Function, LAD Approximation]

LAD - Patterns

[Figure; panels: Positive Pattern, Negative Pattern]

LAD - Theories, Models, Classifications

[Figure; panels: Positive Theory, Negative Theory, Model]

Datascope Functions

Support Set Identification
Space Discretization
Pattern Detection
Model Construction
Discriminant / Prognostic Index
Classification
Feature Analysis

Datascope Dataflow

[Dataflow diagram; labels: Raw Data, Pre-Processing, Discretization, Cutpoints, Support Set, Feature Analysis, Significant Features, Pandect Generation, Pattern Space, Pattern Report, Theories/Models, Discriminant Construction, Diagnosis, Prognosis, Risk Stratification, User Excel Model, Matlab Solver, Internal Solver]

1. Support Set Identification

Selects Small Subset of Significant Features
Preserves Hidden Knowledge

Feature Ranking Criteria:
  Statistical Correlation with Outcome
  Combinatorial Entropy
  Distribution Monotonicity
  Class Separation
  Envelope Eccentricity

E.g., 10 proteins selected out of 15,144
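As a rough illustration of the first ranking criterion (statistical correlation with outcome), the sketch below scores each feature by the absolute Pearson correlation with a 0/1 outcome and keeps the top k. The function names and the toy data are hypothetical; this is not Datascope's implementation, and the other criteria on the slide are not covered.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences (0.0 if either is constant)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def rank_features(rows, outcome, k):
    """Return the indices of the k features most correlated (in absolute value) with the outcome."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, outcome)), j))
    # Highest score first; ties broken by feature index for determinism.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:k]]

# Toy data (assumed): 4 observations, 3 features, binary outcome.
rows = [
    [1.0, 5.0, 0.3],
    [2.0, 4.0, 0.1],
    [3.0, 6.0, 0.4],
    [4.0, 5.0, 0.2],
]
outcome = [0, 0, 1, 1]
support_set = rank_features(rows, outcome, k=2)
```

With real data the same call would be made with k set to the desired support set size (10 in the protein example).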

Data

Spreadsheet Oriented:
  OLE (via Clipboard) / Excel Spreadsheet / dBase tables

Training / Test Generation:
  Bootstrap
  k-Folding
  Jackknife

New Features
Correlation
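The three training/test generation schemes named above can be sketched as index generators. This is a minimal sketch under assumptions: jackknife is read as leave-one-out, bootstrap uses the unsampled observations as the test set, and all function names are hypothetical rather than taken from Datascope.

```python
import random

def bootstrap_split(n, rng):
    """Draw n training indices with replacement; indices never drawn form the test set."""
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test

def k_fold_splits(n, k, rng):
    """Shuffle the indices, cut them into k folds, and yield (train, test) once per fold."""
    indices = list(range(n))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, test

def jackknife_splits(n):
    """Leave-one-out: each observation is the test set exactly once."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]

rng = random.Random(0)  # fixed seed so the example is reproducible
folds = list(k_fold_splits(10, 3, rng))
boot_train, boot_test = bootstrap_split(10, rng)
```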

Data: Training/Test

2. Space Discretization

Criteria:
  Entropy
  Correlation with Output
  Bins (equipartitioning)
  Intervals
  Clustered
  Class Separation

Parameter Choice:
  User Defined
  Minimizing Support Set

Quality Measures:
  Entropy
  Separability
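To make the entropy criterion concrete, here is a minimal sketch that picks a single cutpoint for one numeric feature by minimizing the size-weighted class entropy of the two resulting intervals. It assumes binary 0/1 labels and one cutpoint per feature; the function names are hypothetical and the slides do not say how Datascope computes its entropy measure.

```python
from math import log2

def entropy(labels):
    """Binary entropy (in bits) of a list of 0/1 labels; 0.0 for empty or pure lists."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))

def best_cutpoint(values, labels):
    """Return (cutpoint, weighted_entropy) for the best midpoint between distinct sorted values."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no cutpoint can separate equal values
        cut = (lo + hi) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or score < best[1]:
            best = (cut, score)
    return best

# Toy feature (assumed): low values are negative, high values are positive.
cut, score = best_cutpoint([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1])
```

On this toy feature the classes separate perfectly, so the chosen cutpoint falls midway between 3 and 10 with zero residual entropy.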

[Figure; panels: Entropy, Correlation with Output, Bins, Intervals, Clustered, Class Separation]

3. Generation of Maximal Patterns

Pattern Type Selection:
  Prime
  Cones
  Intervals
  Spanned

Parameter Bound Settings:
  Prevalence:
    % of positive observations
    % of negative observations
  Homogeneity:
    on positive patterns
    on negative patterns
  Degree

Post-Generation Filters:
  By Characteristics
  Maximality
  Strongness
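The three bounded parameters can be illustrated on a single candidate positive pattern. The sketch assumes the usual LAD reading: degree is the number of conditions, prevalence is the share of positive observations covered, and homogeneity is the share of covered observations that are positive. The pattern encoding (feature index mapped to closed bounds) and the function names are hypothetical.

```python
def covers(pattern, x):
    """pattern maps feature index -> (low, high) closed bounds; None means unbounded on that side."""
    for j, (low, high) in pattern.items():
        if low is not None and x[j] < low:
            return False
        if high is not None and x[j] > high:
            return False
    return True

def pattern_stats(pattern, positives, negatives):
    """Degree, prevalence and homogeneity of a positive pattern on a labeled dataset."""
    pos_cov = sum(covers(pattern, x) for x in positives)
    neg_cov = sum(covers(pattern, x) for x in negatives)
    total = pos_cov + neg_cov
    return {
        "degree": len(pattern),
        "prevalence": pos_cov / len(positives),
        "homogeneity": pos_cov / total if total else 0.0,
    }

# Toy data (assumed): two numeric features per observation.
positives = [[5, 1], [6, 2], [7, 0], [2, 3]]
negatives = [[1, 1], [2, 5], [6, 9]]
# Pattern: x0 >= 4.5 and x1 <= 4.
pattern = {0: (4.5, None), 1: (None, 4)}
stats = pattern_stats(pattern, positives, negatives)
```

A generator would keep only candidates whose statistics satisfy the user-set bounds, for example a minimum prevalence and a minimum homogeneity.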

Positive Patterns

[Table; columns: Pattern Definition, Training Set, Test Set]

Negative Patterns

[Table; columns: Pattern Definition, Training Set, Test Set]

4. Theories and Models

Pandect

Theory Selection via:
  Greedy
  Bottleneck Greedy
  Lexicographic Greedy
  Set Covering Heuristics

Model Selection:
  2 Set-Covering Problems
  Quadratic Set-Covering Problem
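The plain greedy option for theory selection can be read as a greedy set cover: repeatedly pick the pattern that covers the most still-uncovered observations. The sketch below assumes that reading; the pattern names, coverage sets, and tie-breaking rule are hypothetical, and the bottleneck and lexicographic variants are not shown.

```python
def greedy_cover(coverage, universe):
    """Greedy set cover.

    coverage: dict mapping pattern name -> set of observation ids it covers.
    universe: iterable of observation ids that should be covered.
    Returns (chosen pattern names in order, observations left uncovered).
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Iterating over sorted names makes ties resolve to the alphabetically first pattern.
        best = max(sorted(coverage), key=lambda p: len(coverage[p] & uncovered))
        gained = coverage[best] & uncovered
        if not gained:
            break  # no remaining pattern covers anything new
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered

# Toy pandect (assumed): four positive patterns over five positive observations.
coverage = {
    "P1": {1, 2, 3},
    "P2": {3, 4},
    "P3": {4, 5},
    "P4": {1, 5},
}
theory, left_over = greedy_cover(coverage, {1, 2, 3, 4, 5})
```

Running the same routine once on positive patterns and once on negative patterns would give the two set-covering problems listed under model selection, under the same assumptions.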

4. Example (Model)

5. Example (Classification)


5. Discriminants

Weight Selection Methods:

Direct
  1. Prognostic Index
  2. Weighted Prognostic Index

LP-Based
  3. Distance Maximizing Separator (SVM)
  4. Cost Minimizing Separator
  5. Expected Value Separator

NLP-Based
  6. Regression in Pattern Space (ANN)
  7. Best Correlation with Output

(weighted sums of patterns)
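A discriminant as a weighted sum of patterns can be sketched as follows. The sketch assumes the common LAD convention that the prognostic index gives equal weights 1/p to the p positive patterns and 1/q to the q negative patterns, and that an observation with a zero score stays unclassified; the pattern definitions and names are hypothetical.

```python
def discriminant(x, pos_patterns, neg_patterns, pos_weights=None, neg_weights=None):
    """Weighted sum of triggered positive patterns minus weighted sum of triggered negative ones.

    Patterns are callables returning True when they cover x.
    With no weights given, equal weights are used (the assumed prognostic index).
    """
    if pos_weights is None:
        pos_weights = [1 / len(pos_patterns)] * len(pos_patterns)
    if neg_weights is None:
        neg_weights = [1 / len(neg_patterns)] * len(neg_patterns)
    pos_part = sum(w for w, p in zip(pos_weights, pos_patterns) if p(x))
    neg_part = sum(w for w, p in zip(neg_weights, neg_patterns) if p(x))
    return pos_part - neg_part

def classify(score):
    """Sign of the discriminant; a zero score leaves the observation unclassified."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "unclassified"

# Toy model (assumed): two positive patterns and one negative pattern.
pos_patterns = [lambda x: x[0] >= 5, lambda x: x[1] <= 2]
neg_patterns = [lambda x: x[0] < 3]

scores = [discriminant(x, pos_patterns, neg_patterns) for x in ([6, 1], [6, 9], [4, 9], [1, 9])]
classes = [classify(s) for s in scores]
```

Passing explicit `pos_weights` and `neg_weights` turns the same function into a weighted prognostic index; the LP-based and NLP-based methods differ only in how those weights are obtained.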

[Figure; panels: Prognostic Index, Weighted Prognostic Index, Expected Value Separator, Distance Maximizing Separator, Cost Minimizing Separator, Best Correlation with Output]

Accuracy, Sensitivity, Specificity

[Equation garbled in source: %83.93 %25.2*5.0 %75.97 %40.8*5.0 %24.884 1]

Reporting

Cutpoints
Discretized Space
Pandect
Coverage of Observations by Patterns
Pattern Report (Compact/Full Versions)
Theories/Models
Attribute Analysis
Log File

Pattern Space

[Figure; panels: Training, Test; axis: Patterns (+ + + + + + - - -); legend: Positive Observations, Unclassified Observations, Negative Observations]

Clustered Pattern Space

Validation Procedures

[Flow diagram; labels: Raw Data, Stratified Random Partition, Bootstrap, K-Folding, Jackknife, LAD Model on Training Set, Performance Evaluation, Accuracy, Sensitivity, Specificity]
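The validation flow (stratified random partition, model on the training set, performance evaluation) can be sketched in two helpers. This is an illustrative sketch only: the function names are hypothetical, labels are assumed to be 0/1, and the model-building step in between is left out.

```python
import random

def stratified_partition(labels, test_fraction, rng):
    """Split indices into train/test so each class keeps roughly the same share in both parts."""
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)

def evaluate(y_true, y_pred):
    """Accuracy, sensitivity (true positive rate) and specificity (true negative rate)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }

# Toy labels (assumed): 6 positive and 4 negative observations.
labels = [1] * 6 + [0] * 4
train_idx, test_idx = stratified_partition(labels, 0.5, random.Random(0))

# Toy predictions (assumed) to exercise the metrics.
metrics = evaluate([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
```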

Special Features

User Model Generation (Excel Files)
Datascope Macro Language: Multiple and Complex Experiments
Interface with Other Applications (Datascope Server)

Performance

Comparative results for 5 datasets from the Irvine repository: LAD and other 33 algorithms

Dataset   Naive   Best (B)   Worst (W)   LAD (L)   (L-B)/(W-B)   Accuracy
bcw       35      3          9           3.5        0.08          99.48%
bld       42      28         43          27.8      -0.01         100.28%
hea       44      14         34          14.7       0.04          99.19%
pid       33      22         31          21.5      -0.06         100.64%
vot       39      4          6           4.6        0.30          99.38%
average                                             0.07          99.79%

Tjen-Sien Lim, Wei-Yin Loh and Yu-Shan Shih, A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-three Old and New Classification Algorithms, Machine Learning, 40, 203-229 (2000)

http://www.ics.uci.edu/~mlearn/MLRepository.html
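The last two columns of the comparison appear to be derived from the Best (B), Worst (W) and LAD (L) values, read as error rates in percent. The sketch below checks one plausible reading: the first derived column as (L - B) / (W - B), and the accuracy column as LAD accuracy relative to the best accuracy, (100 - L) / (100 - B). Both formulas are inferred from the numbers, not stated on the slide.

```python
# Rows transcribed from the slide:
# (dataset, best B, worst W, LAD L, reported ratio, reported accuracy in %)
rows = [
    ("bcw", 3, 9, 3.5, 0.08, 99.48),
    ("bld", 28, 43, 27.8, -0.01, 100.28),
    ("hea", 14, 34, 14.7, 0.04, 99.19),
    ("pid", 22, 31, 21.5, -0.06, 100.64),
    ("vot", 4, 6, 4.6, 0.30, 99.38),
]

def derived_columns(best, worst, lad):
    """Assumed formulas: position of LAD between best and worst, and accuracy relative to best."""
    ratio = (lad - best) / (worst - best)
    relative_accuracy = 100.0 * (100.0 - lad) / (100.0 - best)
    return ratio, relative_accuracy

results = {name: derived_columns(b, w, l) for name, b, w, l, _, _ in rows}
mean_ratio = sum(r for r, _ in results.values()) / len(results)
mean_accuracy = sum(a for _, a in results.values()) / len(results)
```

Under this reading, a ratio near 0 means LAD is close to the best of the 33 algorithms, and a negative ratio (bld, pid) means LAD's error rate is below the best reported one.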

LAD Case Studies

Assessing Long-Term Mortality Risk After Exercise Electrocardiography

Ovarian Cancer Detection Using Proteomic Data

Combinatorial Analysis of Breast Cancer Data from Image Cytometry and Gene Expression Microarrays

Cell Proliferation on Medical Implants

Country Risk Rating