HAP 780 Data Mining in Health Care Copyright © Janusz Wojtusiak, 2016
Data Preprocessing Part 1
Janusz Wojtusiak, PhD
George Mason University
Fall 2016
HAP 780 Data Mining in Health Care
“The world is full of obvious things which nobody by any chance ever observes.”
- Sherlock Holmes
Why Preprocessing?
• Multiple sources
• Multiple formats
• Multiple representations
• Errors, noise
• Missing values
• Unnecessary attributes
• Non-representative data
• … and many more!
Two Types of Preprocessing
• Before loading to database/software
  – How to get data from multiple sources into a database, data warehouse, or other format on which DM tools can be used.
• After loading to database/software
  – This is what is typically covered by data preprocessing: data cleaning, transformation, reduction, discretization, normalization, …
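As a rough illustration of two of the after-loading steps named above, the sketch below applies min-max normalization and equal-width discretization to a short list of ages. The values, the function names, and the bin count are assumptions made for this example only.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width intervals (0-based)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [20, 30, 40, 60]  # hypothetical ages
normalized = min_max_normalize(ages)
binned = equal_width_bins(ages, 4)
```

Other normalization and discretization schemes exist (for example z-scores or equal-frequency bins); these two were picked only because they are short to write down.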
Sources of Data for Data Mining
• EHR systems
• Billing
• Surveys
• Reports
• Web
• Excel spreadsheets
• Sensors
• Sometimes we mine data from multiple sources together. Simply put, we want to be able to mine any and all available data.
Forms of Data
• One-dimensional
  – Signals from sensors (EKG, accelerometer, etc.)
• Two-dimensional
  – Images
• Multidimensional
  – Flat data tables (attribute-value pairs)
  – Relational databases
• Multimedia
Formats of Data
• Structured
  – Tables
  – Relational databases
  – Non-relational/NoSQL databases
  – Text files (comma-separated, special formats)
  – XML
  – Excel files
  – SAS data files
  – …
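To make the point about multiple structured formats concrete, here is a small sketch that reads the same two hypothetical patient records from a comma-separated text and from XML, using only the Python standard library. The field names and values are invented for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# The same hypothetical records in two structured formats.
csv_text = "patient_id,age,sex\n1,48,M\n2,30,F\n"
xml_text = """<patients>
  <patient id="1"><age>48</age><sex>M</sex></patient>
  <patient id="2"><age>30</age><sex>F</sex></patient>
</patients>"""

# Comma-separated text: each row becomes a dict keyed by the header line.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# XML: pull the same three fields out of each <patient> element.
root = ET.fromstring(xml_text)
xml_rows = [
    {"patient_id": p.get("id"), "age": p.findtext("age"), "sex": p.findtext("sex")}
    for p in root.findall("patient")
]
```

Both readers return every value as a string, so deciding which fields are numbers, dates, or codes is a separate step.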
Formats of Data
• Unstructured
  – Text files
  – Websites
  – Text fields in databases/structured data
  – Speech
  – Multimedia
  – …
Dirty Data
• Noise
• Incompleteness
• Inconsistency
Dirty Data
PTID DOB Age Sex ProvID Dx1 Dx2 Dx3 Dx4 Dx5
1 1/2/70 48 M 345 250.0
2 30 N 010.0 Patient is suffering formberculosis …
473.0
2 1/1/80 33 3456 487
34 487
5 9/8/60 F 327.0 327.2
The following records are imported after January
6 8/8/54 M 320 250.0 487 296.7 361.0 E858
7 Unknown
M John Smith
8 25 F 377 150 151 038.9
How many problems are in this dataset?
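Some of the problems in a table like the one above can be flagged automatically. The sketch below is a minimal illustration that checks for missing values, duplicate patient IDs, and unexpected sex codes. The records are hypothetical and only loosely echo the table, and the set of valid codes is an assumption.

```python
from collections import Counter

# Hypothetical records; None marks an empty cell.
records = [
    {"PTID": "1", "DOB": "1/2/70", "Age": "48", "Sex": "M"},
    {"PTID": "2", "DOB": None, "Age": "30", "Sex": "N"},
    {"PTID": "2", "DOB": "1/1/80", "Age": "33", "Sex": None},
    {"PTID": "5", "DOB": "9/8/60", "Age": None, "Sex": "F"},
]

VALID_SEX = {"M", "F"}  # assumed coding scheme


def find_problems(rows):
    """Return (row_index, description) pairs for each detected problem."""
    problems = []
    id_counts = Counter(r["PTID"] for r in rows)
    for i, r in enumerate(rows):
        for field, value in r.items():
            if value is None:
                problems.append((i, f"missing {field}"))
        if id_counts[r["PTID"]] > 1:
            problems.append((i, "duplicate PTID"))
        if r["Sex"] is not None and r["Sex"] not in VALID_SEX:
            problems.append((i, "invalid Sex code"))
    return problems


problems = find_problems(records)
```

Checks like these catch only mechanical problems. Others, such as free text in a diagnosis field or an age that disagrees with the date of birth, need rules written for the specific dataset.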
Dealing with Dirty Data
• Load data to database
  – Data types
  – Obvious problems in data files
• Data cleaning & transformation
  – Inconsistencies, missing values, sampling, attribute selection, discretization, …
http://www.prosoftsolutions.net/blog/bid/146041/Dirty-Data-What-is-it-how-does-it-cause-problems-and-what-is-the-solution
Data Types
• Different names for the same thing
  – Field: used in databases
  – Attribute: used in data mining and machine learning
  – Variable: used in statistics
  – Feature: used in machine learning (usually means a binary attribute)
• Database Attribute Types
• Analytic Attribute Types
Fundamental Concepts
• Symbol: a physical entity, its state, or its behavior that conveys a choice from a predefined set of choices. The choices may refer to any entities (physical or abstract objects), to their properties, or to their actions. The choice indicated by a symbol is called its meaning.
• Data: a recorded set of symbols characterizing a set of entities
• Information: interpreted data; data whose symbols have been assigned meaning
• Knowledge: information that is verified to be true or true to some degree, which can be obtained by direct observation or by inference
• Belief: hypothetical knowledge; knowledge that has not been validated, but is characterized by some measure of its relationship to the reality it describes.
[Diagram: Symbols, Data, Information, Knowledge, Belief]
Fundamental Concepts
• Concept: a set of entities considered as a unit, and typically given a name
• Language: a system of symbols and rules for creating expressions from these symbols for the purpose of communicating information
• Description: an expression in some language that conveys information about a set of entities. The set being described is called the reference set. A concept description describes all entities belonging to the concept (concept instances)
• Generalization: a process of extending the reference set of a description, or its result
• Abstraction: a process of reducing information about a reference set, or its result
Database Attribute Types
• System-specific
• For example, in SQL Server 2012:
Numeric and Date
Strings and Other
Analytic Data Types
• Symbolic
  – Symbols used to represent entities
• Numeric
  – Numbers, usually used for calculations
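The distinction matters in practice because some symbolic values look like numbers. The sketch below, using invented ages and diagnosis-style codes, shows that arithmetic is meaningful for the numeric attribute, while for the symbolic one only operations such as counting make sense, and that converting a code to a number can silently change it.

```python
from collections import Counter
from statistics import mean

ages = [48, 30, 33, 25]                      # numeric: arithmetic is meaningful
dx_codes = ["250.0", "010.0", "487", "487"]  # symbolic: codes only look numeric

mean_age = mean(ages)            # a sensible summary for a numeric attribute
code_counts = Counter(dx_codes)  # counting is the sensible summary for symbols

# Converting a code to a number silently changes it: "010.0" becomes 10.0,
# which no longer matches the original code string.
as_number = float("010.0")
round_trip_matches = str(as_number) == "010.0"
```

Storing such codes as text is one way to avoid the problem.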
Analytic Attribute Types
Extract, Transform, Load
• ETL is almost always used in the context of data warehouses, but it also applies to data mining
  – Extract data from external sources (often many)
  – Transform into a uniform representation
  – Load into the target system (DW, DM)
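The three steps above can be sketched with the Python standard library. This is a minimal illustration under stated assumptions: two hypothetical sources that encode sex differently, an invented target coding, and an in-memory SQLite table standing in for the target system.

```python
import csv
import io
import sqlite3

# Extract: two hypothetical sources that encode sex differently.
emr_csv = "patient_id,sex\n1,M\n2,F\n"
billing_csv = "patient_id,sex\n3,male\n4,female\n"

SEX_MAP = {"M": "M", "F": "F", "MALE": "M", "FEMALE": "F"}  # assumed target coding


def extract(text):
    """Read comma-separated text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Map every row to a uniform (int id, one-letter sex) representation."""
    return [(int(r["patient_id"]), SEX_MAP[r["sex"].strip().upper()]) for r in rows]


def load(conn, rows):
    """Insert the transformed rows into the target table."""
    conn.executemany("INSERT INTO patient (patient_id, sex) VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, sex TEXT)")
for source in (emr_csv, billing_csv):
    load(conn, transform(extract(source)))

loaded = conn.execute(
    "SELECT patient_id, sex FROM patient ORDER BY patient_id"
).fetchall()
```

In this sketch an unmapped code raises a `KeyError` during the transform step, so the bad value surfaces before anything reaches the target table.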
ETL in Context
[Diagram: sources (EMR, Rx, Billing, PACS, Flat Files); Extract, Transform, Load; Data Warehouse; Reporting, Data Mining, Analysis]
ETL in Context
[Diagram: sources (EMR, Rx, Billing, PACS, Flat Files); Extract, Transform, Load; flat file ready for Data Mining; Data Mining]
Tools to Have
• File viewer
• Text file editor (EditPad Pro, Notepad++)
  – Not a word processor!
• Processing very large text files
  – awk, sed, grep, …
• File converters, built into software … or not
  – lots of free ones ...
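Where awk, sed, or grep are not at hand, a similar line-by-line approach to very large text files can be sketched in Python. The function name, the sample contents, and the search pattern below are assumptions for illustration; the point is that the file is streamed rather than loaded into memory at once.

```python
import os
import re
import tempfile


def grep_lines(path, pattern):
    """Yield (line_number, line) for lines matching a regular expression.

    The file is read one line at a time, so memory use stays small
    even when the file is very large.
    """
    regex = re.compile(pattern)
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if regex.search(line):
                yield number, line.rstrip("\n")


# Small stand-in for a large file, written to a temporary location.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("1,250.0\n2,487\n3,327.0\n4,487\n")
    sample_path = tmp.name

# Find the lines whose last field is the code 487.
matches = list(grep_lines(sample_path, r",487$"))
os.remove(sample_path)
```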
HAP 780
Janusz Wojtusiak, PhD
George Mason University