ppt presentation - codata, the committee on data for science and

13
College of Information Science & Technology University of Nebraska’s

Upload: tommy96

Post on 06-May-2015

503 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: PPT Presentation - CODATA, The Committee on Data for Science and

College of Information Science & Technology

University of Nebraska’s

Page 2: PPT Presentation - CODATA, The Committee on Data for Science and

Privacy-preserving Data Mining of Medical Data using Data Separation

Based Techniques

Gang Kou, Yi Peng, Yong Shi, and Zhengxin Chen

Page 3: PPT Presentation - CODATA, The Committee on Data for Science and

Data Mining and Knowledge Discovery

“Knowledge Discovery is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. ”— U. Fayyad, G. Piatetsky-Shapiro, and P. SmythAdvances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996

Page 4: PPT Presentation - CODATA, The Committee on Data for Science and

We are drowning in data, but starving for knowledge! Adopted from Jiawei Han, UIUC

Increasing potentialto supportbusiness decisions End User

Business Analyst

DataAnalyst

DBA

Data Presentation

Visualization Techniques

Data MiningInformation Discovery

Data Exploration

OLAP, MDA

Statistical Analysis, Querying and Reporting

Data Warehouses / Data Marts

Data SourcesPaper, Files, Information Providers, Database Systems, OLTP

Page 5: PPT Presentation - CODATA, The Committee on Data for Science and

A Caveman's WisdomSource: The Futurist, December 1982, Tom Chalkley

InformationInformation is data that has been given context and is meaningful to a particular audienceKnowledgeKnowledge is information with a broader, general understanding of the subject matter WisdomWisdom is knowledge embedded with the reflection or application of that knowledge

Page 6: PPT Presentation - CODATA, The Committee on Data for Science and

Introduction

Data mining is concerned with the extraction of useful knowledge from various types of data.

Medical data mining has been a popular data mining topic of late. Compared with other data mining areas, medical data mining has some unique characteristics. Since medical files are related to human subjects, privacy concern is taken more seriously than other data mining tasks.

This paper applied data separation-based techniques to preserve privacy in classification of medical data.

We implement these two approaches using medical datasets from UCI KDD archive and report the experimental results.

Page 7: PPT Presentation - CODATA, The Committee on Data for Science and

Medical Datasets in this research

Wisconsin prognostic breast cancer dataset has 699 records and 9 variables. These records belong to either benign or malignant class. We create nine different sub-datasets by removing one variable at a time.

The heart-disease dataset has 797 records and 13 variables. These records belong to either heart-disease or normal class. This data was collected from Cleveland Clinic Foundation, Hungarian Institute of Cardiology, University Hospital of Zurich, and Long Beach V.A. Medical Center.

Page 8: PPT Presentation - CODATA, The Committee on Data for Science and

Privacy-preserving Data Mining

– Data obfuscation• Nobody sees the real data so it hide the protected information• Approaches– Randomly modify data– Swap values between

records– Controlled modification of data to hide secrets

– Summarization• Only the needed facts are exposed • Approaches– Overall collection statistics– Limited query functionality

– Data separation• Data remains with trusted parties• Approaches– Vertical separation– Horizontal separation

Page 9: PPT Presentation - CODATA, The Committee on Data for Science and

Privacy-preserving Classification (Vertical) Process

Input: The Medical dataset M , each of the medical records has m attributes

Output: Average classification accuracies for benign and malignant of the dataset in 10-fold cross-validation; scores for all records; decision trees ensemble.

Step 1 Generate m subsets with one different attribute removed from M at each time.

Step 2 Training each subset with See5 with adaptive boosting and 10-fold cross validation to get m decision trees.

Step 3 Ensemble the final decision function D, via majority vote of the m decision trees from step 2.

Step 4 Classify M by the final decision function. END

Page 10: PPT Presentation - CODATA, The Committee on Data for Science and

Privacy-preserving Classification (Horizontal) Process

Input: The Medical dataset from r different sources, M1, M2, Mr, each of the medical records has m attributes

Output: Average classification accuracies for Normal and Heart-disease of the dataset in 10-fold cross-validation; scores for all records; decision trees ensemble.

Step 1 Training r datasets with See5 with adaptive boosting and 10-fold cross validation to get r decision trees .

Step 2 Ensemble the final decision function D, via majority vote of the r decision trees from step 1.

Step 3 Classify all r datasets by the final decision function. END

Page 11: PPT Presentation - CODATA, The Committee on Data for Science and

Privacy-preserving Data Mining

0.00%

1.00%

2.00%

3.00%

4.00%

5.00%

6.00%

7.00%

8.00%

All 1 2 3 4 5 6 7 8 9 MV

Malignant Error

Benign Error

0.00%

5.00%

10.00%

15.00%

20.00%

25.00%

30.00%

35.00%

40.00%

45.00%

All C H V MV

Heart-Disease

Normal

Vertical partition - each site uses a portion of the attributes to compute its results and the distributed results are assembled at a central trusted party using majority-vote ensemble method.

Horizontal partition - data are distributed among several sites. Each site computes its own data and a central party is responsible to integrate these results.

Page 12: PPT Presentation - CODATA, The Committee on Data for Science and

Conclusion

Privacy-preserving is an important issue in medical data mining. This paper investigates data separation techniques in medical data classification. The experiments demonstrate that data separation techniques can not only protect data privacy, but also increase classification accuracy sometimes (e.g., horizontally partitioned data).

Page 13: PPT Presentation - CODATA, The Committee on Data for Science and

University of Nebraska’s

Comment & Question谢谢大家! Thank You!