HAP 780 Data Mining in Health Care. Copyright © Janusz Wojtusiak, 2016. Data Preprocessing Part 1. Janusz Wojtusiak, PhD, George Mason University, Fall 2016.


Page 1: Data Preprocessing Part 1 - Health Informatics · hi.gmu.edu/images/lecture_pdf/02_HAP780_Preprocessing... · 2016. 9. 29.

Data Preprocessing Part 1

Janusz Wojtusiak, PhD
George Mason University

Fall 2016

HAP 780 Data Mining in Health Care

Page 2

“The world is full of obvious things which nobody by any chance ever observes.”

- Sherlock Holmes

Page 3

Why Preprocessing?

• Multiple sources
• Multiple formats
• Multiple representations
• Errors, noise
• Missing values
• Unnecessary attributes
• Non-representative data
• … and many more!

Page 4

Two Types of Preprocessing

• Before loading to database/software
  – How to get data from multiple sources into a database, data warehouse, or other format on which DM tools can be used.

• After loading to database/software
  – This is what is typically covered by data preprocessing: data cleaning, transformation, reduction, discretization, normalization, …
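Two of the after-loading steps named above, normalization and discretization, can be illustrated with a minimal sketch. The slide does not say which variants the course uses; min-max scaling and equal-width binning are assumed here as common choices, and the function names and the sample ages are hypothetical.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize_equal_width(values, bins):
    """Map each value to a bin index 0..bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / bins
    # The maximum value would land in bin `bins`, so clamp it into the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]


ages = [25, 30, 33, 48, 62]  # hypothetical patient ages
print(min_max_normalize(ages))
print(discretize_equal_width(ages, 3))
```

Equal-width binning is only one option; equal-frequency or clinically motivated cut points may suit skewed health data better.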

Page 5

Sources of Data for Data Mining

• EHR systems
• Billing
• Surveys
• Reports
• Web
• Excel spreadsheets
• Sensors

• Sometimes we mine together data from multiple sources. Simply speaking, we want to be able to mine any data and all available data.

Page 6

Forms of Data

• One-dimensional
  – Signals from sensors (EKG, accelerometer, etc.)

• Two-dimensional
  – Images

• Multidimensional
  – Flat data tables (attribute-value pairs)
  – Relational databases

• Multimedia

Page 7

Formats of Data

• Structured
  – Tables
  – Relational databases
  – Non-relational/NoSQL databases
  – Text files (comma-separated, special formats)
  – XML
  – Excel files
  – SAS data files
  – …
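As a rough illustration of why structured formats still need work before mining, the sketch below reads the same kind of patient record from a comma-separated text and from XML into one common representation (a list of dictionaries). The field names, the sample data, and the XML layout are assumptions made for this example, not formats described in the lecture.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data in two structured formats.
CSV_TEXT = "patient_id,sex,dx1\n1,M,250.0\n2,F,487\n"
XML_TEXT = """<patients>
  <patient><patient_id>3</patient_id><sex>F</sex><dx1>327.0</dx1></patient>
</patients>"""


def records_from_csv(text):
    """Parse comma-separated text with a header row into dictionaries."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def records_from_xml(text):
    """Parse <patient> elements into dictionaries keyed by child tag name."""
    root = ET.fromstring(text)
    return [
        {child.tag: child.text for child in patient}
        for patient in root.findall("patient")
    ]


# Both sources end up in the same attribute-value representation.
records = records_from_csv(CSV_TEXT) + records_from_xml(XML_TEXT)
for record in records:
    print(record)
```

All values arrive as strings here; assigning proper data types is a separate step.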

Page 8

Formats of Data

• Unstructured
  – Text files
  – Websites
  – Text fields in databases/structured data
  – Speech
  – Multimedia
  – …

Page 9

Dirty Data

• Noise

• Incompleteness

• Inconsistency

Page 10

Dirty Data

PTID DOB Age Sex ProvID Dx1 Dx2 Dx3 Dx4 Dx5

1 1/2/70 48 M 345 250.0

2 30 N 010.0 Patient is suffering formberculosis … 473.0

2 1/1/80 33 3456 487

34 487

5 9/8/60 F 327.0 327.2

The following records are imported after January

6 8/8/54 M 320 250.0 487 296.7 361.0 E858

7 Unknown M John Smith

8 25 F 377 150 151 038.9

How many problems are in this dataset?
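Some of the problems in a table like this can be found mechanically. The sketch below is only an illustration: the records are hypothetical, loosely modeled on the table (the original column alignment is not recoverable from the transcript), and the four checks and the names `is_date` and `find_problems` are assumptions rather than a method from the lecture.

```python
from collections import Counter
from datetime import datetime

# Hypothetical records loosely modeled on the slide; empty string = missing.
records = [
    {"PTID": "1", "DOB": "1/2/70", "Sex": "M", "ProvID": "345"},
    {"PTID": "2", "DOB": "", "Sex": "N", "ProvID": ""},
    {"PTID": "2", "DOB": "1/1/80", "Sex": "", "ProvID": "3456"},
    {"PTID": "7", "DOB": "Unknown", "Sex": "M", "ProvID": "John Smith"},
]


def is_date(text):
    """Return True if text parses as month/day/two-digit year."""
    try:
        datetime.strptime(text, "%m/%d/%y")
        return True
    except ValueError:
        return False


def find_problems(rows):
    """Return (row index, message) pairs for a few simple quality checks."""
    problems = []
    id_counts = Counter(row["PTID"] for row in rows)
    for index, row in enumerate(rows):
        if id_counts[row["PTID"]] > 1:
            problems.append((index, "duplicate PTID"))
        if row["Sex"] not in ("M", "F"):
            problems.append((index, "missing or invalid Sex"))
        if not is_date(row["DOB"]):
            problems.append((index, "missing or unparseable DOB"))
        if not row["ProvID"].isdigit():
            problems.append((index, "missing or non-numeric ProvID"))
    return problems


problems = find_problems(records)
for row_index, message in problems:
    print(row_index, message)
```

Checks like these catch only formal problems; an age that disagrees with the date of birth, or free text in a diagnosis column, needs rules specific to the data.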

Page 11

Dealing with Dirty Data

• Load data to database
  – Data types
  – Obvious problems in data files

• Data cleaning & transformation
  – Inconsistencies, missing values, sampling, attribute selection, discretization, …

http://www.prosoftsolutions.net/blog/bid/146041/Dirty-Data-What-is-it-how-does-it-cause-problems-and-what-is-the-solution

Page 12

Data Types

• Different names for the same
  – Field: used in databases
  – Attribute: used in data mining and machine learning
  – Variable: used in statistics
  – Feature: used in machine learning (usually means binary attribute)

• Database Attribute Types

• Analytic Attribute Types

Page 13

Fundamental Concepts

• Symbol: a physical entity, its state, or its behavior that conveys a choice from a predefined set of choices. The choices may refer to any entities (physical or abstract objects), to their properties, or their actions. The choice indicated by a symbol is called its meaning.

• Data: a recorded set of symbols characterizing a set of entities

• Information: interpreted data; data whose symbols have been assigned meaning

• Knowledge: information that is verified to be true or true to some degree, which can be obtained by direct observation or by inference

• Belief: hypothetical knowledge; knowledge that has not been validated, but is characterized by some measure of its relationship to the reality it describes.

Page 14

Symbols

Data

Information

Knowledge

Belief

Page 15

Fundamental Concepts

• Concept: a set of entities considered as a unit, and typically given a name

• Language: a system of symbols and rules for creating expressions from these symbols for the purpose of communicating information

• Description: an expression in some language that conveys information about a set of entities. The set being described is called the reference set. A concept description describes all entities belonging to the concept (concept instances)

• Generalization: a process of extending the reference set of a description, or its result

• Abstraction: a process of reducing information about a reference set, or its result

Page 16

Database Attribute Types

• System specific
• For example, in SQL Server 2012:

Page 17

Numeric and Date

Page 18

Strings and Other

Page 19

Analytic Data Types

• Symbolic
  – Symbols used to represent entities

• Numeric
  – Numbers, usually used for calculations
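The distinction matters in practice because a value that looks like a number is not always numeric in the analytic sense. The sketch below is a simple assumed heuristic, not the course's definition: a column is treated as numeric only if every value parses as a number, and the hypothetical `code_like` flag lets the analyst mark columns such as diagnosis codes as symbolic even though values like 250.0 parse as numbers.

```python
def infer_analytic_type(values, code_like=False):
    """Classify a column of string values as 'numeric' or 'symbolic'.

    code_like is a hypothetical override for columns whose values are
    identifiers or codes rather than quantities to calculate with.
    """
    if code_like:
        return "symbolic"
    try:
        for value in values:
            float(value)  # raises ValueError for non-numeric or empty text
    except ValueError:
        return "symbolic"
    return "numeric"


print(infer_analytic_type(["48", "30", "33"]))               # ages
print(infer_analytic_type(["M", "F", "M"]))                   # sex
print(infer_analytic_type(["250.0", "487"], code_like=True))  # diagnosis codes
```

Averaging diagnosis codes would be meaningless, which is why the override exists; parsing alone cannot tell a quantity from a code.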

Page 20

Analytic Attribute Types

Page 21

Extract, Transform, Load

• ETL is almost always used in the context of data warehouses, but is also applied to data mining
  – Extract data from external sources (often many)
  – Transform into uniform representation
  – Load into the target system (DW, DM)
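The three steps can be sketched end to end on a toy scale. Everything specific below is an assumption for illustration: two hypothetical sources (an EMR export and a billing export) that name and code the same attributes differently, and an in-memory SQLite database standing in for the target system.

```python
import csv
import io
import sqlite3

# Hypothetical exports from two source systems with different conventions.
EMR_CSV = "ptid,sex\n1,M\n2,F\n"
BILLING_CSV = "patient,gender\n3,male\n4,female\n"


def extract(text):
    """Extract: read rows from a comma-separated source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform_emr(row):
    """Transform: EMR rows already use the target coding."""
    return (int(row["ptid"]), row["sex"])


def transform_billing(row):
    """Transform: map billing column names and gender words to M/F."""
    sex = {"male": "M", "female": "F"}[row["gender"]]
    return (int(row["patient"]), sex)


def load(conn, rows):
    """Load: insert uniform rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS patient (ptid INTEGER PRIMARY KEY, sex TEXT)"
    )
    conn.executemany("INSERT INTO patient (ptid, sex) VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
rows = [transform_emr(r) for r in extract(EMR_CSV)]
rows += [transform_billing(r) for r in extract(BILLING_CSV)]
load(conn, rows)

result = conn.execute("SELECT ptid, sex FROM patient ORDER BY ptid").fetchall()
print(result)
```

Keeping one transform function per source makes the mapping to the uniform representation explicit and easy to review.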

Page 22

ETL in Context

[Diagram labels: EMR, Rx, Billing, PACS; Extract, Transform, Load; Data Warehouse; Flat Files; Reporting, Data Mining, Analysis]

Page 23

ETL in Context

[Diagram labels: EMR, Rx, Billing, PACS; Extract, Transform, Load; Flat file ready for Data Mining; Flat Files; Data Mining]

Page 24

Tools to Have

• File viewer
• Text file editor, EditPad Pro, Notepad++
  – Not word processor!

• Processing very large text files
  – awk, sed, grep, …

• File converters, built in software … or not
  – lots of free ones...
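Tools such as grep cope with very large files because they process one line at a time instead of loading the whole file. The sketch below shows the same idea in Python as a rough analogue, not a replacement for those tools; the function name `grep_lines`, the sample file contents, and the search pattern are all hypothetical.

```python
import os
import re
import tempfile


def grep_lines(path, pattern):
    """Yield (line number, line) for lines matching a regular expression.

    The file is read one line at a time, so memory use stays small even
    for very large files.
    """
    regex = re.compile(pattern)
    with open(path, "r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if regex.search(line):
                yield line_number, line.rstrip("\n")


# Create a small hypothetical file so the example is self-contained.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("ptid,dx1\n1,250.0\n2,487\n3,250.0\n")
    sample_path = tmp.name

# Find rows whose last field is the code 250.0.
matches = list(grep_lines(sample_path, r",250\.0$"))
print(matches)
os.remove(sample_path)
```

Because `grep_lines` is a generator, matches can also be consumed one by one without building a list first.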

Page 25

HAP 780

Janusz Wojtusiak, PhD
George Mason University

[email protected]