a. holzinger 709.049 med. informatikhuman-centered.ai/wordpress/wp-content/uploads/... ·...

Post on 01-Jun-2020

Status as of Mon, 19.10.2015, 08:30

Dear Students, welcome to the second lecture of our course “Biomedical Informatics”. Please remember the definition from the last lecture: According to the American Medical Informatics Association (AMIA), the term Medical Informatics has now been expanded to Biomedical Informatics and is defined as “the interdisciplinary field that studies and pursues the effective use of biomedical data, information, and knowledge for scientific inquiry, problem solving, and decision making, motivated by efforts to improve human health” [1]. It is important to know: Bioinformatics + Medical Informatics = Biomedical Informatics (see Slide 1-42). Note: Computers are just the vehicles to realize the central goal: to harness the power of the machines to support and to amplify human intelligence [2].

[1] Shortliffe, E. H. 2011. Biomedical Informatics: Defining the Science and its Role in Health Professional Education. In: Holzinger, A. & Simonic, K.-M. (eds.) Information Quality in e-Health. Lecture Notes in Computer Science LNCS 7058. Heidelberg, New York: Springer, pp. 711-714.

[2] Holzinger, A. 2013. Human-Computer Interaction and Knowledge Discovery (HCI-KDD): What is the benefit of bringing those two fields to work together? In: Cuzzocrea, A., Kittl, C., Simos, D. E., Weippl, E. & Xu, L. (eds.) Multidisciplinary Research and Practice for Information Systems, Springer Lecture Notes in Computer Science LNCS 8127. Heidelberg, Berlin, New York: Springer, pp. 319-328.

Regarding the current trend towards personalized medicine, have a read of this paper: Holzinger, A. 2014. Trends in Interactive Knowledge Discovery for Personalized Medicine: Cognitive Science meets Machine Learning. IEEE Intelligent Informatics Bulletin, 15, (1), 6-14. Available online via: http://www.comp.hkbu.edu.hk/~cib/2014/Dec/article2/iib_vol15no1_article2.pdf

1WS 2015

A. Holzinger    LV 709.049 Med. Informatik                                

In this second lecture we start with a look at data sources, review some data structures, discuss standardization versus structurization, review the differences between data, information and knowledge, and close with an overview of information entropy.

A central topic is the dimensionality of data and the interrelated (connected) curse of dimensionality, which refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (thousands of dimensions) and that do not occur in low-dimensional settings such as the three-dimensional physical space of our everyday world.

A recommendable small booklet is:

Scheinerman, E. R. 2011. Mathematical Notation: A Guide for Engineers and Scientists. Baltimore (MD): Scheinerman. It also includes the most important LaTeX commands for producing maths symbols.

http://www.ams.jhu.edu/~ers/notation/

components through p and w (Golan, Judge, and Miller, 1996);

Holzinger, A., Dehmer, M. & Jurisica, I. 2014. Knowledge Discovery and interactive Data Mining in Bioinformatics - State-of-the-Art, future challenges and research directions. BMC Bioinformatics, 15, (S6), I1.

Hund, M., Sturm, W., Schreck, T., Ullrich, T., Keim, D., Majnaric, L. & Holzinger, A. 2015. Analysis of Patient Groups and Immunization Results Based on Subspace Clustering. In: Guo, Y., Friston, K., Aldo, F., Hill, S. & Peng, H. (eds.) Brain Informatics and Health, Lecture Notes in Artificial Intelligence LNAI 9250. Cham: Springer International Publishing, pp. 358-368.

Holzinger, A., Stocker, C. & Dehmer, M. 2014. Big Complex Biomedical Data: Towards a Taxonomy of Data. In: Obaidat, M. S. & Filipe, J. (eds.) Communications in Computer and Information Science CCIS 455. Berlin Heidelberg: Springer, pp. 3-18.

Related recommended reading:

Dong-Hee, S. & Min Jae, C. 2015. Ecological views of big data: Perspectives and issues. Telematics and Informatics, 32, (2), 311-320.

Dong, X. L. & Srivastava, D. 2015. Big Data Integration. Synthesis Lectures on Data Management, 7, (1), 1-198.

Wu, X. D., Zhu, X. Q., Wu, G. Q. & Ding, W. 2014. Data Mining with Big Data. IEEE Transactions on Knowledge and Data Engineering, 26, (1), 97-107.

Shneiderman, B. 2014. The Big Picture for Big Data: Visualization. Science, 343, (6172), 730-730.

Sackman, J. E. & Kuchenreuther, M. 2014. Marrying Big Data with Personalized Medicine. Biopharm International, 27, (8), 36-38.

With regard to data, the difference between classical statistics and modern machine learning is that machine learning discovers intricate structures in large data sets to indicate how a machine should change its internal parameters.

John von Neumann and his high-speed computer, approx. 1952

Our first question is: Where does the data come from? The second question: What kind of data is this? The third question: How big is this data? So, let us look at some biomedical data sources (see Slide 2-1):

Due to the increasing trend towards personalized and molecular medicine, biomedical data results from various sources in different structural dimensions, ranging from the microscopic world (e.g. genomics, epigenomics, metagenomics, proteomics, metabolomics) to the macroscopic world (e.g. disease spreading data of populations in public health informatics). Just for orientation: the glucose molecule has a size of 900 pm = $900 \cdot 10^{-12}$ m and the carbon atom approx. 300 pm. A hepatitis virus is relatively large with 45 nm = $45 \cdot 10^{-9}$ m, and the X chromosome is much bigger with 7 µm = $7 \cdot 10^{-6}$ m. Here a lot of “big data” is produced, e.g. genomics, metabolomics and proteomics data. This is really “big data”; the data sets are enormously large: for each individual we estimate many terabytes (1 TB = $1 \cdot 10^{12}$ Byte = 1000 GByte) of genomics data, we are confronted with petabytes of proteomics data, and the fusion of those for personalized medicine results in exabytes of data (1 EB = $1 \cdot 10^{18}$ Byte).

Of course these amounts are for each human individual; however, we have a current world population of 7 billion people (= $7 \cdot 10^{9}$ people; 1 billion in English corresponds to 1 milliard in other European languages). So you can see that this is really “big data”. This “natural” data is then fused with “produced” data, e.g. the unstructured data (text) in the patient records, or data from physiological sensors etc.; these data are also rapidly increasing in size and complexity. You can imagine that without computational intelligence we have no chance to survive in these complex big data sets.

http://learn.genetics.utah.edu/content/begin/cells/scale/

Figure labels (scale of biomedical structures and instruments):
C atom: 340 pm = $340 \cdot 10^{-12}$ m
Glucose molecule: 900 pm
Hepatitis virus: 45 nm = $45 \cdot 10^{-9}$ m
Microscope: $200 \cdot 10^{-9}$ m
Confocal microscopy: $20 \cdot 10^{-6}$ m
Electron microscopy: $0.1 \cdot 10^{-9}$ m
X chromosome: $7 \cdot 10^{-6}$ m
DNA: $2 \cdot 10^{-9}$ m
Enzyme = Metabolomics

Holzinger, A., Dehmer, M. & Jurisica, I. 2014. Knowledge Discovery and interactive Data Mining in Bioinformatics - State-of-the-Art, future challenges and research directions. BMC Bioinformatics, 15, (S6), I1.

Most of our computers are Von Neumann machines (see chapter 1); consequently, at the lowest physical layer, data is represented as patterns of electrical on/off states (1/0, H/L, high/low). We speak of a bit, also written Bit, the Basic indissoluble information unit (Shannon, 1948). Do not confuse this Bit with the IEC 60027-2 symbol bit (in small letters), which is used as a unit symbol together with binary prefixes (e.g. 1 Kibit = 1024 bit, 1 Byte = 8 bit). Refer to: http://physics.nist.gov/cuu/Units/binary.html

Beginning with the physical level of data we can determine various levels of data structures (see Slide 2-2):

1) Physical level: in a Von Neumann system: bit; in a quantum system: qubit.
Note: Regardless of its physical realization (e.g. voltage, or mechanical state, or black/white etc.), a bit is always logically either 0 or 1 (analogous to a light switch). A qubit has similarities to a classical bit, but is overall very different: A classical bit is a scalar variable with the single value of either 0 or 1, so the value is unique, deterministic and unambiguous. A qubit is more general in the sense that it represents a state defined by a pair of complex numbers $(a, b)$, whose squared magnitudes $|a|^2$ and $|b|^2$ express the probabilities that a reading of the value of the qubit will give 0 or 1, respectively. Thus, a qubit can be in the state 0, 1, or some mixture of the 0 and 1 states, referred to as a superposition. The weights of 0 and 1 in this superposition are determined by $(a, b)$ in the following way: $\text{qubit} \triangleq (a, b) \triangleq a \cdot |0\rangle + b \cdot |1\rangle$. Please be aware that this model of quantum computation is not the only one (Lanzagorta & Uhlmann, 2008).

For a recent overview of quantum computation please refer to: http://peterwittek.com/book.html

2) Logical level:
   1) Primitive data types, including: a) Boolean data type (true/false); b) numerical data type (e.g. integer, floating-point numbers (“reals”), etc.);
   2) Composite data types, including: a) array, b) record, c) union, d) set (stores values without any particular order, and no repeated values), e) object (contains others);
   3) String and text types, including: a) alphanumeric characters, b) alphanumeric strings (= sequence of characters to represent words and text).

3) Abstract level: including abstract data structures, e.g. queue (FIFO), stack (LIFO), set (no order, no repeated values), lists, hash table, arrays, trees, graphs, …

4) Technical level: Application data formats, e.g. text, vector graphics, pixel images, audio signals, video sequences, multimedia, …

5) Hospital level: Narrative (textual, natural language) patient record data (structured/unstructured and standardized/non-standardized), Omics data (genomics, proteomics, metabolomics, microarray data, fluxomics, phenomics), numerical measurements (physiological data, time series, lab results, vital signs, blood pressure, CO2 partial pressure, temperature, …), recorded signals (ECG, EEG, ENG, EMG, EOG, EP, …), graphics (sketches, drawings, handwriting, …), audio signals, images (cams, x-ray, MR, CT, PET, …), etc.
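The qubit model described under the physical level above can be illustrated with a small simulation. This is only a sketch: it assumes the usual normalization $|a|^2 + |b|^2 = 1$ (not stated in the notes), and the function names `normalize`, `probabilities` and `measure` are chosen here for illustration.

```python
import math
import random


def normalize(a: complex, b: complex) -> tuple[complex, complex]:
    """Scale the amplitude pair (a, b) so that |a|^2 + |b|^2 = 1."""
    norm = math.sqrt(abs(a) ** 2 + abs(b) ** 2)
    if norm == 0:
        raise ValueError("at least one amplitude must be non-zero")
    return a / norm, b / norm


def probabilities(a: complex, b: complex) -> tuple[float, float]:
    """Return (P(reading 0), P(reading 1)) for the state a|0> + b|1>."""
    a, b = normalize(a, b)
    return abs(a) ** 2, abs(b) ** 2


def measure(a: complex, b: complex, rng: random.Random) -> int:
    """Simulate one reading: 0 with probability |a|^2, otherwise 1."""
    p0, _ = probabilities(a, b)
    return 0 if rng.random() < p0 else 1


# A classical bit corresponds to the special cases (1, 0) and (0, 1).
# The pair (1, 1j) is an equal superposition of the 0 and 1 states.
p0, p1 = probabilities(1, 1j)
rng = random.Random(42)
counts = [0, 0]
for _ in range(10_000):
    counts[measure(1, 1j, rng)] += 1
print(p0, p1, counts)
```

Repeated readings of the same superposition give 0 and 1 in roughly the proportions $|a|^2$ and $|b|^2$, whereas a classical bit always returns the same value.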

In biomedical informatics we have a lot to do with abstract data types (ADT); consequently, we briefly review the most important ones here. For details please refer to a course on Algorithms & Data Structures, or to a classic textbook such as (Aho, Hopcroft & Ullman, 1983), (Cormen et al., 2009), or in German (Ottmann & Widmayer, 2012), (Holzinger, 2003). Please take into consideration that data structures and algorithms go hand in hand, so a must-have on the desk of every computer scientist is: Cormen, T. H., Leiserson, C. E., Rivest, R. L. & Stein, C. 2009. Introduction to Algorithms (3rd edition). Cambridge (MA): The MIT Press.

A list is a sequential collection of items $a_1, a_2, \ldots, a_n$ accessible one after another, beginning at the head and ending at the tail $z$. In a Von Neumann machine it is a widely used data structure for applications which do not need random access. It differs from the stack (last-in-first-out, LIFO) and queue (first-in-first-out, FIFO) data structures insofar as additions and removals can be made at any position in the list. In contrast to a simple set, the order is important. A typical example for the use of a list is a DNA sequence. The combination GGGTTTAAA is such a list; the elements of the list are the nucleotide bases. Nucleotides are the joined molecules which form the structural units of the RNA and the DNA and play the central role in metabolism.
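The contrast between list, stack, queue and set can be made concrete with the DNA example. The following is a minimal sketch using Python's built-in containers; the inserted base "C" is purely illustrative.

```python
from collections import deque

# A DNA sequence as a list: order matters and repeated bases are allowed.
dna = list("GGGTTTAAA")

# Lists allow additions and removals at any position.
dna.insert(3, "C")        # insert a base in the middle
removed = dna.pop(0)      # remove the item at the head
sequence = "".join(dna)   # "GGCTTTAAA"

# A stack (LIFO) adds and removes at one end only.
stack = []
stack.append("G")
stack.append("T")
top = stack.pop()         # "T", the last item pushed

# A queue (FIFO) adds at the tail and removes at the head.
queue = deque(["G", "T"])
first = queue.popleft()   # "G", the first item enqueued

# A set ignores order and repetition, so it cannot represent the sequence.
bases = set("GGGTTTAAA")
print(sequence, top, first, sorted(bases))
```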

A graph is a pair $G(V, E)$, where $V$ is a finite, non-empty set of vertices (nodes) and $E$ is a set of edges (lines, arcs), which are 2-element subsets of $V$. If $E$ is a set of ordered pairs of vertices (arcs, directed edges, arrows), then it is a directed graph (digraph). The distances between the vertices can be represented within a distance matrix (two-dimensional array).
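As a sketch of how such a graph can be held in memory, the following builds an adjacency list, an adjacency matrix, an incidence matrix and a distance matrix for a small undirected example. The four-vertex graph is made up for illustration, and distance is taken here as the number of edges on a shortest path, which is only one of many possible choices.

```python
from collections import deque

# Undirected graph G = (V, E): E holds 2-element subsets of V.
V = ["a", "b", "c", "d"]
E = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
index = {v: k for k, v in enumerate(V)}

# Adjacency list: each vertex maps to its neighbours.
adj_list = {v: [] for v in V}
for u, v in E:
    adj_list[u].append(v)
    adj_list[v].append(u)

# Adjacency matrix: entry [i][j] is 1 if {i, j} is an edge.
adj_matrix = [[0] * len(V) for _ in V]
for u, v in E:
    adj_matrix[index[u]][index[v]] = 1
    adj_matrix[index[v]][index[u]] = 1

# Incidence matrix: rows are vertices, columns are edges.
inc_matrix = [[0] * len(E) for _ in V]
for col, (u, v) in enumerate(E):
    inc_matrix[index[u]][col] = 1
    inc_matrix[index[v]][col] = 1


def hop_distances(source):
    """Breadth-first search: shortest path length from source to every vertex.
    Assumes the graph is connected."""
    dist = {source: 0}
    todo = deque([source])
    while todo:
        u = todo.popleft()
        for w in adj_list[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                todo.append(w)
    return [dist[v] for v in V]


# Distance matrix as a two-dimensional array.
dist_matrix = [hop_distances(v) for v in V]
for row in dist_matrix:
    print(row)
```

For a directed graph, only the entry for the ordered pair would be set in the adjacency matrix, so the matrix would no longer be symmetric.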

The vertices in a graph can be multidimensional objects, e.g. vectors containing the results of multiple gene expression measurements. For this purpose the distance between two vertices can be measured by various distance metrics. Graphs are ideally suited for representing networks in medicine and biology, e.g. metabolic pathways. In bioinformatics, distance matrices are used to represent protein structures in a coordinate-independent manner, as well as the pairwise distances between two sequences in sequence space. They are used in structural and sequential alignment, and for the determination of protein structures from NMR or X-ray crystallography.

Evolutionary dynamics act on populations. Neither genes, nor cells, nor individuals evolve; only populations evolve. The so-called Moran process describes the stochastic evolution of a finite population of constant size: In each time step, an individual is chosen for reproduction with a probability proportional to its fitness; a second individual is chosen for death. The offspring of the first individual replaces the second.

On a graph, individuals occupy the vertices. In each time step, an individual is selected with a probability proportional to its fitness; the weights of the outgoing edges determine the probabilities that the corresponding neighbor will be replaced by the offspring. The process is described by a stochastic matrix $W$, where $w_{ij}$ denotes the probability that an offspring of individual $i$ will replace individual $j$. At each time step, an edge $(i, j)$ is selected with a probability proportional to its weight and the fitness of the individual at its tail. The Moran process corresponds to a complete graph with identical weights (Lieberman, Hauert & Nowak, 2005).
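The evolutionary process on a graph described above can be sketched as a short simulation. The population size, the fitness values and the names `moran_step` and `run_until_fixation` are assumptions made for this illustration; the complete graph with identical weights reproduces the classical Moran process.

```python
import random


def moran_step(types, fitness, W, rng):
    """One time step: choose a reproducer i with probability proportional to
    fitness, then choose the replaced neighbour j from row i of W."""
    n = len(types)
    i = rng.choices(range(n), weights=[fitness[t] for t in types])[0]
    j = rng.choices(range(n), weights=W[i])[0]
    types[j] = types[i]  # the offspring of i replaces j


def run_until_fixation(types, fitness, W, rng, max_steps=100_000):
    """Iterate until one type occupies every vertex (or max_steps is hit)."""
    for step in range(max_steps):
        if len(set(types)) == 1:
            return types[0], step
        moran_step(types, fitness, W, rng)
    return None, max_steps


n = 5
# Complete graph with identical weights: each row of W sums to 1.
W = [[0.0 if i == j else 1.0 / (n - 1) for j in range(n)] for i in range(n)]
fitness = {"resident": 1.0, "mutant": 1.5}   # hypothetical fitness values
population = ["mutant"] + ["resident"] * (n - 1)

winner, steps = run_until_fixation(population, fitness, W, random.Random(1))
print(winner, steps)
```

Other population structures are obtained by changing $W$ only, for example by setting some entries to zero so that offspring can reach only certain neighbours.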

Graphs can be represented computationally by an adjacency list, an adjacency matrix and an incidence matrix. The first pre-processing step is to produce point cloud data sets from raw data, see: Holzinger, A., Malle, B., Bloice, M., Wiltgen, M., Ferri, M., Stanganelli, I. & Hofmann-Wellenhof, R. 2014. On the Generation of Point Cloud Data Sets: Step One in the Knowledge Discovery Process. In: Holzinger, A. & Jurisica, I. (eds.) Interactive Knowledge Discovery and Data Mining in Biomedical Informatics, Lecture Notes in Computer Science, LNCS 8401. Berlin Heidelberg: Springer, pp. 57-80. https://online.tugraz.at/tug_online/voe_main2.getVollText?pDocumentNr=974579&pCurrPk=83005

For the specific task of getting graphs from image data have a look at: Holzinger, A., Malle, B. & Giuliani, N. 2014. On Graph Extraction from Image Data. In: Slezak, D., Peters, J. F., Tan, A.-H. & Schwabe, L. (eds.) Lecture Notes in Artificial Intelligence, LNAI 8609. Heidelberg, Berlin: Springer, pp. 552-563. https://online.tugraz.at/tug_online/voe_main2.getVollText?pDocumentNr=868952&pCurrPk=80830

A tree is a collection of elements called nodes, one of which is distinguished as a root, along with a relation ("parenthood") that places a hierarchical structure on the nodes. A node, like an element of a list, can be of whatever type we wish. We often depict a node as a letter, a string, or a number with a circle around it. Formally, a tree can be defined recursively in the following manner:
1. A single node by itself is a tree. This node is also the root of the tree.
2. Suppose $n$ is a node and $T_1, T_2, \ldots, T_k$ are trees with roots $n_1, n_2, \ldots, n_k$, respectively. We can construct a new tree by making $n$ be the parent of nodes $n_1, n_2, \ldots, n_k$. In this tree $n$ is the root and $T_1, T_2, \ldots, T_k$ are the subtrees of the root. Nodes $n_1, n_2, \ldots, n_k$ are called the children of node $n$.
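The recursive definition translates almost literally into code. The following is a minimal sketch; the class name `Tree` and the helpers `make_tree`, `size` and `height` are illustrative choices, not taken from the cited textbooks.

```python
from dataclasses import dataclass, field


@dataclass
class Tree:
    label: str
    children: list["Tree"] = field(default_factory=list)


def make_tree(label, *subtrees):
    """Rule 2: make a new node the parent of the roots of the given subtrees.
    With no subtrees this is rule 1: a single node by itself is a tree."""
    return Tree(label, list(subtrees))


def size(tree):
    """Number of nodes, computed by following the recursive definition."""
    return 1 + sum(size(child) for child in tree.children)


def height(tree):
    """Number of edges on the longest path from the root to a leaf."""
    if not tree.children:
        return 0
    return 1 + max(height(child) for child in tree.children)


leaf_a = make_tree("a")
leaf_b = make_tree("b")
inner = make_tree("n1", leaf_a, leaf_b)
root = make_tree("n", inner, make_tree("n2"))
print(size(root), height(root))
```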

A dendrogram (from Greek dendron "tree", -gramma "drawing") is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering. Dendrograms are often used in computational biology to illustrate the clustering of genes or samples. The origin of such dendrograms can be found in (Darwin, 1859). The example by (Hufford et al., 2012) shows a neighbor-joining tree and the changing morphology of domesticated maize and its wild relatives. Taxa in the neighbor-joining tree are represented by different colors: parviglumis (green), landraces (red), improved lines (blue), mexicana (yellow) and Tripsacum (brown). The morphological changes are shown for female inflorescences and plant architecture during domestication and improvement.
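To show where the tree in a dendrogram comes from, here is a small sketch of agglomerative hierarchical clustering. It assumes single linkage on one-dimensional values, and the gene labels and expression values are invented for illustration; it is not the neighbor-joining method used in the maize example.

```python
from itertools import combinations


def linkage(pair):
    """Single linkage: smallest distance between members of two clusters."""
    (_, xs), (_, ys) = pair
    return min(abs(x - y) for x in xs for y in ys)


def hierarchical_clustering(values):
    """Repeatedly merge the two closest clusters.

    values maps a label to a number. Returns the final nested tuple (the
    tree a dendrogram draws) and the merge history as
    (left, right, distance) triples."""
    clusters = [(label, [x]) for label, x in values.items()]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=linkage)
        distance = linkage((a, b))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(((a[0], b[0]), a[1] + b[1]))
        merges.append((a[0], b[0], distance))
    return clusters[0][0], merges


# Hypothetical expression levels of four genes.
expression = {"g1": 1.0, "g2": 1.2, "g3": 5.0, "g4": 5.5}
tree, merges = hierarchical_clustering(expression)
print(tree)
for left, right, distance in merges:
    print(left, right, round(distance, 2))
```

The merge distances give the heights at which the branches of the dendrogram join.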

Please remember that the key problems in dealing with data include:
1) Heterogeneous data sources (need for data fusion and data integration)
2) Complexity of the data (high dimensionality)
3) Noisy, uncertain data (challenge of pre-processing)
4) The discrepancy between data, information and knowledge (various definitions)
5) Big data sets (manual handling of the data is impossible)

Now that we have seen some examples of data from the biomedical domain, we can look at the “big picture”. Manyika et al. (2011) identified four major data pools in US health care and describe the data as highly fragmented, with little overlap and low integration. Moreover, they report that approx. 30 % of clinical text/numerical data in the United States, including medical records, bills, laboratory and surgery reports, is still not generated electronically. Even when clinical data are in digital form, they are usually held by an individual provider and rarely shared (see Slide 2-4). Biomedical research data, e.g. clinical trials, predictive modeling etc., is produced by academia and pharmaceutical companies and stored in databases and libraries. Clinical data is produced in the hospital and stored in hospital information systems (HIS), picture archiving and communication systems (PACS), laboratory databases, etc. Much data is health business data produced by payors, providers, insurances, etc. Finally, there is an increasing pool of patient behavior and sentiment data, produced by various customers and stakeholders outside the typical clinical context, including the growing data from the wellness and ambient assisted living domain.

A major challenge in our networked world is the increasing amount of data, today called “big data”. The trend towards personalized medicine has resulted in a sheer mass of generated (-omics) data (see Slide 2-7). In the life sciences domain, most data models are characterized by complexity, which makes manual analysis very time-consuming and frequently practically impossible (Holzinger, 2013).

More and more Omics data are generated, including:
1) Genomics data (e.g. sequence annotation).
2) Transcriptomics data (e.g. microarray data); the transcriptome is the set of all RNA molecules, including mRNA, rRNA, tRNA and non-coding RNA, produced in the cells.
3) Proteomics data: Proteomic studies generate large volumes of raw experimental data and inferred biological results stored in data repositories, mostly openly available; an overview can be found here: (Riffle & Eng, 2009). The outcome of proteomics experiments is a list of proteins differentially modified or abundant in a certain phenotype. The large size of proteomics data sets requires specialized analytical tools, which deal with large lists of objects (Bessarabova et al., 2012).
4) Metabolomics (e.g. enzyme annotation); the metabolome represents the collection of all metabolites in a cell, tissue, organ or organism.
5) Protein-DNA interactions.
6) Protein-protein interactions; PPI are at the core of the entire interactomics system of any living cell.
7) Fluxomics (isotopic tracing, metabolic pathways).
8) Phenomics (biomarkers).
9) Epigenetics, the study of changes in gene expression that are caused by something other than changes in the DNA sequence, hence the prefix “epi-”.
10) Microbiomics.
11) Lipidomics.

Omics data integration helps to address interesting biological questions on the biological systems level towards personalized medicine (Joyce & Palsson, 2006).

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2908408/

For more information please refer to: Gomez-Cabrero, D., Abugessaisa, I., Maier, D., Teschendorff, A., Merkenschlager, M., Gisel, A., Ballestar, E., Bongcam-Rudloff, E., Conesa, A. & Tegner, J. 2014. Data integration in the era of omics: current and future challenges. BMC Systems Biology, 8, (Suppl 2), I1. http://www.biomedcentral.com/1752-0509/8/S2/I1

A further challenge is to integrate the data and to make it accessible to the clinician. While there is much research on the integration of heterogeneous information systems, a shortcoming is in the integration of available data. Data fusion is the process of merging multiple records representing the same real-world object into a single, consistent, accurate, and useful representation (Bleiholder & Naumann, 2008). An example for the mix of different data for solving a medical problem can be seen in Slide 2-8.
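A toy version of data fusion is sketched below. The record fields, the source names and the conflict rule (trust sources in a fixed priority order and skip missing values) are assumptions made for this example; real fusion strategies are considerably richer.

```python
def fuse(records, priority):
    """Merge records that describe the same real-world object.

    Conflict rule assumed for this sketch: for each attribute, keep the first
    non-missing value found when sources are visited in priority order."""
    fused = {}
    ordered = sorted(records, key=lambda rec: priority.index(rec["source"]))
    for rec in ordered:
        for key, value in rec.items():
            if key == "source" or value is None:
                continue
            fused.setdefault(key, value)
    return fused


# Two hypothetical records of the same patient from different systems.
records = [
    {"source": "his", "patient_id": 17, "birth_year": 1961, "crp_mg_l": 11.0},
    {"source": "lab", "patient_id": 17, "birth_year": None, "crp_mg_l": 12.4},
]

# The laboratory system is trusted first for values it actually provides.
patient = fuse(records, priority=["lab", "his"])
print(patient)
```

The laboratory value wins for the conflicting measurement, while the missing birth year is filled in from the hospital information system.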

A good example for complex medical data is RCQM, an application that manages the flow of data and information in the rheumatology outpatient clinic (50 patients per day, 5 days per week) of Graz University Hospital, on the basis of a quality management process model. Each examination produces 100+ clinical and functional parameters per patient. These amassed data are morphed into more usable information by applying scoring algorithms (e.g. Disease Activity Score, DAS) and are convoluted over time. Together with previous findings, physiological laboratory data, patient record data and Omics data from the Pathology department, these data constitute the information basis for analysis and evaluation of the disease activity. The challenge is in the increasing quantities of such highly complex, multi-dimensional and time series data; see an example here: Simonic, K. M., Holzinger, A., Bloice, M. & Hermann, J. Optimizing Long-Term Treatment of Rheumatoid Arthritis with Systematic Documentation. Proceedings of Pervasive Health - 5th International Conference on Pervasive Computing Technologies for Healthcare, 2011 Dublin. IEEE, 550-554. http://www.biomedcentral.com/1472-6947/13/103

Do not confuse structure with standardization (see Slide 2-9). Data can be standardized (e.g. numerical entries in laboratory reports) or non-standardized. A typical example is non-standardized text, imprecisely called “free text” or “unstructured data”, in an electronic patient record (Kreuzthaler et al., 2011).

Standardized data is the basis for accurate communication. In the medical domain, many different people work at different times in various locations. Data standards can ensure that information is interpreted by all users with the same understanding. Moreover, standardized data facilitate comparability of data and interoperability of systems. Standardization supports the reusability of the data, improves the efficiency of health care services and avoids errors by reducing duplicated efforts in data entry.

Data standardization refers to a) the data content; b) the terminologies that are used to represent the data; c) how data is exchanged; and d) how knowledge, e.g. clinical guidelines, protocols, decision support rules, checklists and standard operating procedures, is represented in the health information system (refer to IOM). Technical elements for data sharing require standardization of identification, record structure, terminology, messaging, privacy etc. The most used standardized data set to date is the International Classification of Diseases (ICD), which was first adopted in 1900 for collecting statistics (Ahmadian et al., 2011) and which we will discuss in → Lecture 3.

Non-standardized data is the majority of data and inhibits data quality, data exchange and interoperability.

Well-structured data is the minority of data and an idealistic case in which each data element has an associated defined structure, e.g. relational tables, the Resource Description Framework (RDF), or the Web Ontology Language (OWL) (see → Lecture 3). Note: Ill-structured is a term often used for the opposite of well-structured, although this term originally was used in the context of problem solving (Simon, 1973).

Semi-structured data is a form of structured data that does not conform to the strict formal structure of tables and data models associated with relational databases but contains tags or markers to separate structure and content, i.e. it is schema-less or self-describing; a typical example is a markup language such as XML (see → Lecture 3 and 4).

Weakly-structured data is most of the data in the whole universe, whether in macroscopic (astronomy) or microscopic structures (biology); see → Lecture 5.

Non-structured data or unstructured data is an imprecise definition used for information expressed in natural language, when no specific structure has been defined. This is an issue for debate: Text also has some structure: words, sentences, paragraphs. If we are very precise, unstructured data would mean that the data is completely randomized, which is usually called noise and is defined by (Duda, Hart & Stork, 2000) as any property of data which is not due to the underlying model but instead to randomness (either in the real world, from the sensors or the measurement procedure).
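To make the notion of semi-structured data more tangible, the following sketch parses a small, invented XML patient fragment with Python's standard library. The element and attribute names are hypothetical and do not follow any clinical standard.

```python
import xml.etree.ElementTree as ET

# Tags and attributes separate structure from content (self-describing),
# while the <note> element still carries free text.
document = """<patient id="17">
  <name>Jane Doe</name>
  <lab test="CRP" unit="mg/L">12.4</lab>
  <lab test="ESR" unit="mm/h">30</lab>
  <note>Patient reports morning stiffness of the hands.</note>
</patient>"""

root = ET.fromstring(document)

# The tagged part can be queried without a predefined relational schema.
labs = {
    lab.get("test"): (float(lab.text), lab.get("unit"))
    for lab in root.findall("lab")
}

# The narrative part remains natural-language text.
note = root.findtext("note")
print(root.get("id"), labs, note)
```

Adding a further `<lab>` element or a new tag requires no schema change, which is what schema-less means in practice.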

“Multivariate” and “multidimensional” are modern words and consequently overused in the literature. Each item of data is composed of variables, and if such a data item is defined by more than one variable it is called a multivariable data item. Variables are frequently classified into two categories: dependent or independent.

Some more readings can be found on the homepage of Yoshua Bengio, University of Montreal: http://www.iro.umontreal.ca/~bengioy/yoshua_en/research.html

And at the MILA Lab, the Montreal Institute for Learning Algorithms: http://www.mila.umontreal.ca/

In Physics, Engineering and Statistics a variable is a physical property of a subject whose quantity can be measured, e.g. mass, length, time, temperature, etc. In mathematics:
a 0-dimensional (nil-dimensional) space is a topological space that has dimension zero, i.e. an infinitesimally small point;
a 1-dimensional space is a line in $\mathbb{R}^1$;
a 2-dimensional space is a plane in $\mathbb{R}^2$;
a 3-dimensional space is a volume (e.g. a sphere, cube or cylinder) in $\mathbb{R}^3$.

SMILES data (.smi) consists of a string obtained by writing the symbol nodes encountered in a depth-first tree traversal of a chemical graph. The graph is first trimmed to remove hydrogen atoms, and cycles are broken to turn it into a spanning tree. Where cycles have been broken, numeric suffix labels are included to indicate the connected nodes.
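The traversal just described can be sketched in a few lines. This is a strongly simplified illustration and not a conforming SMILES writer: it assumes a hydrogen-suppressed graph with single bonds only, ignores aromaticity, charges and stereochemistry, supports at most nine ring labels, and does not produce canonical strings. The function name `to_smiles` and the input format are chosen for this example.

```python
def to_smiles(atoms, bonds, start):
    """atoms: dict id -> element symbol; bonds: list of (id, id) pairs."""
    neighbours = {a: [] for a in atoms}
    for u, v in bonds:
        neighbours[u].append(v)
        neighbours[v].append(u)

    children = {a: [] for a in atoms}      # edges of the spanning tree
    ring_labels = {a: [] for a in atoms}   # numeric suffixes per atom
    visited = set()
    closures = set()

    def explore(u, parent):
        """Depth-first search that records tree edges and broken cycles."""
        visited.add(u)
        for w in neighbours[u]:
            if w == parent:
                continue
            if w in visited:
                edge = frozenset((u, w))
                if edge not in closures:   # cycle broken here: label both ends
                    closures.add(edge)
                    label = str(len(closures))
                    ring_labels[u].append(label)
                    ring_labels[w].append(label)
            else:
                children[u].append(w)
                explore(w, u)

    def write(u):
        """Emit the atom, its ring labels, then its branches."""
        out = atoms[u] + "".join(ring_labels[u])
        for child in children[u][:-1]:
            out += "(" + write(child) + ")"   # side branches in parentheses
        if children[u]:
            out += write(children[u][-1])
        return out

    explore(start, None)
    return write(start)


ethanol = to_smiles({1: "C", 2: "C", 3: "O"}, [(1, 2), (2, 3)], start=1)
isobutane = to_smiles({1: "C", 2: "C", 3: "C", 4: "C"},
                      [(1, 2), (2, 3), (2, 4)], start=1)
ring = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)]
cyclohexane = to_smiles({i: "C" for i in range(1, 7)}, ring, start=1)
print(ethanol, isobutane, cyclohexane)
```

In the cyclohexane case the bond between the first and last carbon is the one that gets broken, and the shared label 1 on both atoms records that they are connected.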

Proteomic analysis of mesenchymal stem cells (MSCs). Two-dimensional gel electrophoresis was performed using whole protein cell extracts from P2 MSC cultures of patients with rheumatoid arthritis (RA) (n = 10) (A) and healthy controls (n = 6) (B). After scanning, spot detection, quantification and normalisation, gels were compared using Hierarchical Clustering Software and Pearson test (C). No cluster could be detected using these proteomic profiles.

Proteomic analysis: Two-dimensional electrophoresis was performed using P2 MSCs in patients with RA (n = 10) and healthy controls (n = 6) (fig 4A, B). By using the Hierarchical Clustering method, we could not define any cluster that might discriminate patient and control cells (fig 4C). The Pearson correlation coefficient was not significantly different between patient and control cells (r = 0.933 (0.022) and r = 0.929 (0.020), respectively). These data corroborate the lack of significant changes in cytokine production between patients and controls.
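For readers who have not computed it by hand, the Pearson correlation coefficient used to compare the gels can be sketched as follows. The spot intensities are invented for illustration and are not data from the cited study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient r of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical normalised spot intensities of the same spots on two gels.
gel_1 = [0.8, 1.9, 3.1, 4.2, 5.0]
gel_2 = [1.0, 2.1, 2.9, 4.0, 5.3]
r = pearson(gel_1, gel_2)
print(round(r, 3))
```

Values of r close to 1 indicate that two gels show very similar spot patterns, which is why similar coefficients within both groups argue against a separating cluster.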

http://www.rcsb.org/pdb/images/3ond_bio_r_500.jpg

The PDB (Protein Data Bank) is a large repository containing 3-D structural information, established in 1971. Data are stored in 2D but can in fact represent biological entities in three or more dimensions.

Transaxial (left), coronal (middle), and sagittal (right) images of a patient who was scanned for 30 min in list-mode with the BrainPET scanner; the recording was started 20 min after injection of about 300 MBq fluor-deoxy-glucose.

In Mathematics, hence in Informatics, however, a variable is associated with a space, often an n-dimensional Euclidean space $\mathbb{R}^n$, in which an entity (e.g. a function) or a phenomenon of continuous nature is defined. The data location within this space can be referenced by using a range of coordinate systems (e.g. Cartesian, polar coordinates, etc.): The dependent variables are those used to describe the entity (for example the function value), whilst the independent variables are those that represent the coordinate system used to describe the space in which the entity is defined. If a data set is composed of variables whose interpretation fits this definition, our goal is to understand how the ‘entity’ is defined within the n-dimensional Euclidean space $\mathbb{R}^n$. Sometimes we may distinguish between variables meaning measurement of a property and variables meaning a coordinate system, by referring to the former as variate and to the latter as dimension (Dos Santos & Brodlie, 2002), (dos Santos & Brodlie, 2004).

A space is a set of points. A metric space has an associated metric, which enables us to measure distances between points in that space and, in turn, implicitly define their neighborhoods. Consequently, a metric provides a space with a topology, and a metric space is a topological one. Topological spaces feel alien to us because we are accustomed to having a metric.

Biomedical example: A protein is a single chain of amino acids, which folds into a globular structure. The Thermodynamics Hypothesis states that a protein always folds into a state of minimum energy. To predict protein structure, we would like to model the folding of a protein computationally. As such, the protein folding problem becomes an optimization problem: We are looking for a path to the global minimum in a very high-dimensional energy landscape.

Let us collect $n$-dimensional observations $i$ in the Euclidean vector space $\mathbb{R}^n$, and we get:

$\mathbf{x}_i = [x_{i1}, \ldots, x_{in}]$   (Eq. 2-1)

This is a cloud of points sampled from any source (e.g. medical data, sensor network data, a solid 3-D object, a surface etc.). Those data points can be coordinated as an unordered sequence in an arbitrarily high-dimensional Euclidean space, where methods of algebraic topology can be applied. The main challenge is in mapping the data back into $\mathbb{R}^3$ or, to be more precise, into $\mathbb{R}^2$, because our retina is inherently perceiving data in $\mathbb{R}^2$. The cloud of such data points can be used as a computational representation of the respective data object. A temporal version can be found in motion-capture data, where geometric points are recorded as time series.

Now you will ask an obvious question: “How do we visualize a four-dimensional object?” The obvious answer is: “How do we visualize a three-dimensional object?” Humans do not see in three spatial dimensions directly, but via sequences of planar projections integrated in a manner that is sensed if not comprehended. Little children spend a significant time of their first year of life learning how to infer three-dimensional spatial data from paired planar projections, and many years of practice have tuned a remarkable ability to extract global structure from representations in a strictly lower dimension (Ghrist, 2008). Because we have the same problem here in this book, we must stay in $\mathbb{R}^2$; see therefore the example in Slide 2-12 (Zomorodian, 2005). In Einstein's theory of Special Relativity, Euclidean 3-space plus time (the "4th dimension") are unified into the Minkowski space.

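A metric space has an associated metric, which enables us to measure the distances between points in that space and implicitly defines their neighborhoods. Consequently, a metric provides a space with a topology, hence a metric space is a topological space. A set $X$ with a metric function $d$ is called a metric space. We give it the metric topology of $d$, where the set of open balls $B(x, r) = \{y \mid d(x, y) < r\}$ with $x \in X$ and $r > 0$ serves as the basis. Most of our “natural” spaces are a particular type of metric space, the Euclidean spaces: The Cartesian product of $n$ copies of $\mathbb{R}$, the set of real numbers, along with the Euclidean metric

$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$   (Eq. 2-2)

is the $n$-dimensional Euclidean space $\mathbb{R}^n$. We may induce a topology on subsets of metric spaces as follows: If $A \subseteq X$ with topology $T$, then we get the relative or induced topology $T_A$ by defining $T_A = \{S \cap A \mid S \in T\}$. For more information refer to (Zomorodian, 2005) or (Edelsbrunner & Harer, 2010).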
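The Euclidean metric and the neighborhoods it defines can be tried out directly. The sketch below is illustrative only; the function names `euclidean` and `open_ball` and the sample points are assumptions of this example.

```python
import math


def euclidean(x, y):
    """Euclidean metric d(x, y) between two points of the same dimension."""
    if len(x) != len(y):
        raise ValueError("points must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def open_ball(points, center, radius):
    """All sampled points whose distance to center is strictly below radius."""
    return [p for p in points if euclidean(p, center) < radius]


# A tiny point cloud in three dimensions.
cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (3.0, 4.0, 0.0)]
neighbourhood = open_ball(cloud, center=(0.0, 0.0, 0.0), radius=2.0)
print(euclidean((0.0, 0.0), (3.0, 4.0)), neighbourhood)
```

The point at distance exactly 2 is excluded because the ball is open, which is the sense in which the metric "implicitly defines" neighborhoods.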

KnowledgeDiscoveryfromData:Bygettinginsightintothedata;thegainedinformationcanbeusedtobuildupknowledge.Thegrandchallengeistomaphigherdimensionaldataintolowerdimensions,hencemakeitinteractivelyaccessibletotheend‐user(Holzinger,2012),(Holzinger,2013).Thismappingfrom → isthecoretaskofvisualizationandamajorcomponentforknowledgediscovery:Enablingeffectiveinteractivehumancontroloverpowerfulmachinealgorithmstosupporthumansensemaking(Holzinger,2012),(Holzinger,2013).

Holzinger,A.2013.Human–ComputerInteraction&KnowledgeDiscovery(HCI‐KDD):Whatisthebenefitofbringingthosetwofieldstoworktogether?In:AlfredoCuzzocrea,C.K.,Dimitris E.Simos,EdgarWeippl,Lida Xu (ed.)MultidisciplinaryResearchandPracticeforInformationSystems,SpringerLectureNotesinComputerScienceLNCS8127.Heidelberg,Berlin,NewYork:Springer,pp.319‐328.

An important topic is subspace clustering: Hund, M., Sturm, W., Schreck, T., Ullrich, T., Keim, D., Majnaric, L. & Holzinger, A. 2015. Analysis of Patient Groups and Immunization Results Based on Subspace Clustering. In: Guo, Y., Friston, K., Aldo, F., Hill, S. & Peng, H. (eds.) Brain Informatics and Health, Lecture Notes in Artificial Intelligence LNAI 9250. Cham: Springer International Publishing, pp. 358-368. https://online.tugraz.at/tug_online/voe_main2.getVollText?pDocumentNr=1198810&pCurrPk=85960


A multivariate data set is a data set that has many dependent variables, which might be correlated with each other to varying degrees. Usually this type of data set is associated with discrete data models.

A multidimensional data set is a data set that has many independent variables clearly identified, and one or more dependent variables associated with them. Usually this type of data set is associated with continuous data models.

In other words, every data item (or object) in a computer is represented (stored) as a set of features. Instead of the term features we may use the term dimensions, because an object with $n$ features can also be represented as a multidimensional point in an $n$-dimensional space. Dimensionality reduction is the process of mapping an $n$-dimensional point into a lower $k$-dimensional space; this is the main challenge in visualization, see → Lecture 9.
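One of the simplest mappings from $n$ to $k$ dimensions is a linear projection onto $k$ chosen axes. The sketch below illustrates only that idea; the function `project`, the sample record and the basis are hypothetical. Data-driven methods (for instance principal component analysis) would derive the basis from the data instead of fixing it by hand.

```python
def project(point, basis):
    """Map an n-dimensional point to k dimensions via dot products with k basis vectors."""
    for axis in basis:
        if len(axis) != len(point):
            raise ValueError("each basis vector must have the same length as the point")
    return tuple(sum(p * a for p, a in zip(point, axis)) for axis in basis)

# Hypothetical record with n = 4 features.
record = (37.2, 80.0, 120.0, 5.4)
# Assumed basis (k = 2): keep the first feature, average the second and third.
basis = [(1.0, 0.0, 0.0, 0.0), (0.0, 0.5, 0.5, 0.0)]
print(project(record, basis))
```

The fourth feature has zero weight on both axes and is therefore lost, which is exactly the price of reducing the dimension.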

The number of dimensions can sometimes be small, ranging from simple 1D data, such as temperature measured at different times, to 3D applications such as medical imaging, where data is captured within a volume. Standard techniques (contouring in 2D; isosurfacing and volume rendering in 3D) have emerged over the years to handle this sort of data. There is no dimension reduction issue in these applications, since the data and display dimensions essentially match.


Data can be categorized into qualitative (nominal and ordinal) and quantitative (interval and ratio) data. Interval and ratio data are parametric and are used with parametric tools in which distributions are predictable (and often Normal). Nominal and ordinal data are non-parametric and do not assume any particular distribution. They are used with non-parametric tools such as the histogram. The classic paper on the theory of scales of measurement is (Stevens, 1946).


We can summarize what we have learned so far about data: data can be numeric, non-numeric, or both. Non-numeric data can include anything from language data (text) to categorical, image, or video data. Data may range from completely structured, such as categorical data, to semi-structured, such as an XML file containing meta information, to unstructured, such as a narrative “free text”. Note that the term unstructured does not mean that the data are without any pattern, which would mean complete randomness and uncertainty, but rather that “unstructured data” are expressed in such a way that only humans can meaningfully interpret them. Structure provides information that can be interpreted to determine data organization and meaning, hence it provides a context for the information. The inherent structure in the data can form a basis for data representation. An important, yet often neglected, issue is the temporal characteristics of data: data of all types may have a temporal (time) association, and this association may be either discrete or continuous (Thomas & Cook, 2005). In medical informatics we have a permanent interaction between data, information and knowledge, with different definitions (Bemmel & Musen, 1997), see Slide 2-16:


Data are the physical entities at the lowest abstraction level, which are generated, e.g., by a patient (patient data) or a biological process (e.g. omics data). According to (Bemmel & Musen, 1997), data contain no meaning.

Information is derived by interpretation of the data by a clinician (human intelligence).

Knowledge is obtained by inductive reasoning with previously interpreted data, collected from many similar patients or processes, which is added to the so-called body of knowledge in medicine, the explicit knowledge. This knowledge is used for the interpretation of other data and to gain implicit knowledge, which guides the clinician in taking further action.


For hypothesis generation and testing, four types of inference exist (Peirce, 1955): abstraction, abduction, deduction, and induction. The first two drive hypothesis generation, while the latter two drive hypothesis testing, see Slide 2-17. Abstraction means that data are filtered according to their relevance for the problem solution and chunked in schemas representing an abstract description of the problem (e.g., abstracting that an adult male with a haemoglobin concentration of less than 14 g/dL is an anaemic patient). Following this, hypotheses that could account for the current situation are related through a process of abduction, characterized by a "backward flow" of inferences across a chain of directed relations which identify those initial conditions from which the current abstract representation of the problem originates. This provides tentative solutions to the problem at hand by way of hypotheses. For example, knowing that disease A will cause symptom B, abduction will try to identify the explanation for B, while deduction will forecast that a patient affected by disease A will manifest symptom B: both inferences use the same relation along two different directions (Patel & Ramoni, 1997). Abduction is characterized by a cyclical process of generating possible explanations (i.e., identification of a set of hypotheses that are able to account for the clinical case on the basis of the available data) and testing those explanations (i.e., evaluation of each generated hypothesis on the basis of its expected consequences) for the abnormal state of the patient at hand (Patel, Arocha & Zhang, 2004).

The hypothesis testing procedures can be inferred from Slide 2-17: general knowledge is gained from many patients, and this general knowledge is then applied to an individual patient. We have to distinguish between the following. Reasoning is the process by which a clinician reaches a conclusion after thinking about all the facts. Deduction consists of deriving a particular valid conclusion from a set of general premises. Induction consists of deriving a likely general conclusion from a set of particular statements. Reasoning in the “real world” does not appear to fit neatly into any of these basic types. Therefore, a third form of reasoning has been recognized by Peirce (1955), in which deduction and induction are intermixed.


The question “what is information?” is still an open question in basic research, and any definition depends on the view taken. For example, the definition given by Carl Friedrich von Weizsäcker, “Information is what is understood,” implies that information has both a sender and a receiver who have a common understanding of the representation and the means to convey information using some properties of physical systems, and his addendum, “Information has no absolute meaning; it exists relatively between two semantic levels,” implies the importance of context (Marinescu, 2011). Without doubt, information is a fundamentally important concept within our world, and life is complex information, see Slide 2-14:

Many systems, e.g. in the quantum world, do not obey the classical view of information. In the quantum world and in the life sciences, traditional information theory often fails to describe reality accurately, for example in the complexity of a living cell: all complex life is composed of eukaryotic (nucleated) cells (Lane & Martin, 2010). A good example of such a cell is the protist Euglena gracilis (in German “Augentierchen”) with a length of approx. 30 µm. Life can be seen as a delicate interplay of energy, entropy and information; essential functions of living beings correspond to the generation, consumption, processing, preservation and duplication of information.

P: Complexity <> Information <> Energy <> Entropy


The etymological origin of the word information can be traced back to the Latin “forma”, “informatio” and “informare”, to bring something into a shape (“in-a-form”). Consequently, the naive definition in computer science is “information is data in context”, and it is therefore different from data or knowledge. However, we follow the notion of (Boisot & Canals, 2004) and define information as an extraction from data that, by modifying the relevant probability distributions, has a direct influence on an agent’s knowledge base. For a better understanding of this concept, we first review the model of human information processing by Wickens (1984). This model emphasizes our view on data, information and knowledge: the physical data from the real world are perceived as information through perceptual filters, controlled by selective attention, and form hypotheses within the working memory. These hypotheses are the expectations that depend on our previous knowledge available in our mental model, stored in the long-term memory. The subjectively best alternative hypothesis is selected and processed further and may be taken as the outcome for an action. Because this system is a closed loop, we get feedback through new data perceived as new information, and the process goes on.


The incoming stimuli from the physical world must pass both a perceptual filter and a conceptual filter. The perceptual filter orientates the senses (e.g. the visual sense) to certain types of stimuli within a certain physical range (e.g. visual signal range, pre-knowledge, attention etc.). Only the stimuli which pass through this filter get registered as incoming data; everything else is filtered out. At this point it is important to follow our physical principle of data and to differentiate between two notions that are frequently confused: an experiment’s (raw, hard, measured, factual) data and its (meaningful, subjective) interpreted information results. Data are properties concerning only the instrument; they are the expression of a fact. The result concerns a property of the world. The subsequent conceptual filters extract information-bearing data from what has been previously registered. Both types of filters are influenced by the agents’ cognitive and affective expectations, stored in their mental models. The enormous utility of data resides in the fact that it can carry information about the physical world. This information may modify set expectations or the state of knowledge. These principles allow an agent to act in adaptive ways in the physical world (Boisot & Canals, 2004). Compare this process with the human information processing model by (Wickens, 1984), seen in Slide 2-19 and discussed in → Lecture 7.


Entropy has many different definitions and applications; it originated in statistical physics and is most often used as a measure of disorder. In information theory, entropy can be used as a measure of the uncertainty in a data set.

To see how useful entropy can be, have a look at this paper: Holzinger, A., Stocker, C., Peischl, B. & Simonic, K.-M. 2012. On Using Entropy for Enhancing Handwriting Preprocessing. Entropy, 14, (11), 2324-2350. http://www.mdpi.com/1099-4300/14/11/2324


The concept of entropy was first introduced in thermodynamics (Clausius, 1850), where it was used to provide a statement of the second law of thermodynamics. Later, statistical mechanics, through the work of Boltzmann, provided a connection between the macroscopic property of entropy and the microscopic state of a system. Shannon was the first to define entropy and mutual information in the context of information theory.

Shannon (1948) used a Gedankenexperiment (thought experiment) to propose a measure of uncertainty in a discrete distribution based on the Boltzmann entropy of classical statistical mechanics, see Slide 2-22:


An example shall demonstrate the usefulness of this approach:

1) Let $X$ be a discrete data set with associated probabilities $P$:

Eq. 2-5
$$X = \{x_1, \ldots, x_N\}, \quad P = \{p_1, \ldots, p_N\}$$

2) Now we apply Shannon’s equation Eq. 2-4:

Eq. 2-6
$$H(X) = -\sum_{i=1}^{N} p_i \log_2 p_i$$

3) We assume that our source has two values (ball = white, ball = black). Let us do the famous simple Gedankenexperiment (thought experiment): imagine a box which can contain two colored balls, black and white. This is our set of discrete symbols with associated probabilities. If we grab blindly into this box to get a ball, we are dealing with uncertainty, because we do not know which ball we touch. We can ask: is the ball black? NO. THEN it must be white, so we need one question to surely provide the right answer. Because it is a binary decision (YES/NO), the maximum number of (binary) questions required to reduce the uncertainty is $\log_2 N$, where $N$ is the number of possible outcomes. If there are $N$ events with equal probability $p$, then $p = 1/N$. If you have only 1 black ball, then $\log_2 1 = 0$, which means there is no uncertainty.

Eq. 2-7
$$X = \{b, w\} \quad \text{with} \quad P = \{p, 1 - p\}$$

4) Now we solve Eq. 2-6 numerically:

Eq. 2-8
$$H(X) = p \cdot \log_2 \frac{1}{p} + (1 - p) \cdot \log_2 \frac{1}{1 - p}$$

Since $p$ ranges from 0 (for impossible events) to 1 (for certain events), the surprise $\log_2(1/p)$ of a single outcome ranges from infinity (for impossible events) to 0 (for certain events). So we can summarize that the entropy is the weighted average of the surprise over all possible outcomes. For our example with the two balls we can draw the following function: the entropy value is 1 for $p = 0.5$, and it is 0 for either $p = 0$ or $p = 1$. This example might seem trivial, but the entropy principle has been developed a lot since Shannon, and there are many different methods which are very useful for dealing with data.
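The two-ball example can be checked numerically. The following is a minimal sketch of Eq. 2-8; the function name `binary_entropy` is an assumption for illustration, and the convention that a term with probability 0 contributes nothing is the usual one rather than something stated in the lecture.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a two-outcome source with probabilities p and 1 - p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p in (0.0, 1.0):
        return 0.0  # convention: a term with probability 0 contributes 0
    return p * math.log2(1.0 / p) + (1.0 - p) * math.log2(1.0 / (1.0 - p))

# Tabulate the curve described in the text: 0 at the ends, 1 bit at p = 0.5.
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(binary_entropy(p), 4))
```

The curve is symmetric around $p = 0.5$: a box that is 90 % black is exactly as predictable as one that is 90 % white.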


Shannon called it the information entropy (aka Shannon entropy) and defined:

Eq. 2-9
$$H = \log_2 \frac{1}{p} = -\log_2 p$$

where $p$ is the probability of the event occurring. If $p$ is not identical for all events, then the entropy $H$ is a weighted average over all probabilities, which Shannon defined as:

Eq. 2-10
$$H = -\sum_{i=1}^{N} p_i \log_2 p_i$$

Basically, the entropy $H$ approaches zero if we have a maximum of structure and, conversely, reaches high values if there is no structure; hence, ideally, if the entropy is at its maximum, we have complete randomness, total uncertainty. Low entropy means differences, structure, individuality; high entropy means no differences, no structure, no individuality. Consequently, life needs low entropy.


What we can infer from entropy values is, in principle:

1) Low entropy values mean high probability, high certainty, hence a high degree of structurization in the data.

2) High entropy values mean low probability, low certainty (≅ high uncertainty), hence a low degree of structurization in the data. Maximum entropy would mean complete randomness and total uncertainty.

Highly structured data contain low entropy; ideally, if everything is in order and there is no surprise (no uncertainty), the entropy is low:

Eq. 2-11
$$H = H_{\min} = 0$$

On the other hand, if the data are weakly structured, as for example in biological data, and there is no ability to guess (all data are equally likely), the entropy is high:

Eq. 2-12
$$H = H_{\max} = \log_2 N$$

If we follow this approach, “unstructured data” would mean complete randomness. Let us look at the history of entropy to understand what we can do in the future, see Slide 2-25.
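The two bounds can be verified with a few lines of code. This is a sketch under assumptions: the function name `shannon_entropy` and the example distributions are made up for illustration, and terms with zero probability are taken to contribute nothing.

```python
import math

def shannon_entropy(probabilities):
    """Shannon entropy in bits; terms with probability 0 contribute nothing."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0.0)

n = 8
certain = [1.0] + [0.0] * (n - 1)   # fully ordered: one outcome is sure
uniform = [1.0 / n] * n             # no ability to guess
skewed = [0.7, 0.1, 0.1, 0.05, 0.05, 0.0, 0.0, 0.0]  # hypothetical in-between case

print(shannon_entropy(certain))                 # the minimum
print(shannon_entropy(uniform), math.log2(n))   # the maximum and log2(N)
print(shannon_entropy(skewed))                  # lies between the two bounds
```

Every distribution over the same eight outcomes lands between these two extremes, so the value can be read as a position on a scale from full structure to full randomness.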


You might ask what the practical purpose of this approach is: there are manifold applications!

Example: Heart rate variability is the variation of the time interval between consecutive heartbeats. Entropy is a commonly used tool to describe the regularity of data sets. Entropy functions are defined using multiple parameters, the selection of which is controversial and depends on the intended purpose. Mayer et al. (2014) describe the results of tests conducted to support parameter selection, towards the goal of enabling further biomarker discovery. They dealt with approximate, sample, fuzzy, and fuzzy measure entropies. All data were obtained from PhysioNet (https://www.physionet.org), a free-access, online archive of physiological signals, and represent various medical conditions. Five tests were defined and conducted to examine the influence of: varying the threshold value r (as multiples of the sample standard deviation σ, or the entropy-maximizing r_Chon), the data length N, the weighting factors n for fuzzy and fuzzy measure entropies, and the thresholds r_F and r_L for fuzzy measure entropy. The results were tested for normality using Lilliefors' composite goodness-of-fit test. Subsequently, the p-value was calculated with either a two-sample t-test or a Wilcoxon rank sum test. The first test shows a cross-over of entropy values with regard to a change of r. Thus, a clear statement that a higher entropy corresponds to a higher irregularity is not possible; a higher entropy is rather an indicator of differences in regularity. N should be at least 200 data points for r = 0.2σ and should even exceed a length of 1000 for r = r_Chon. The results for the weighting parameters n for the fuzzy membership function show different behavior when coupled with different r values, therefore the weighting parameters have been chosen independently for the different threshold values. The tests concerning r_F and r_L showed that there is no optimal choice, but r = r_F = r_L is reasonable with r = r_Chon or r = 0.2σ. CONCLUSIONS: Some of the tests showed a dependency of the test significance on the data at hand. Nevertheless, as the medical conditions are unknown beforehand, compromises had to be made. Optimal parameter combinations are suggested for the methods considered. Yet, due to the high number of potential parameter combinations, further investigations of entropy for heart rate variability data will be necessary. Mayer, C., Bachler, M., Hortenhuber, M., Stocker, C., Holzinger, A. & Wassertheurer, S. 2014. Selection of entropy-measure parameters for knowledge discovery in heart rate variability data. BMC Bioinformatics, 15, (Suppl 6), S2. http://www.ncbi.nlm.nih.gov/pubmed/25078574


The origin may be found in the work of Jakob Bernoulli, describing the principle of insufficient reason: if we are ignorant of the ways an event can occur, the event will occur equally likely in any way. Thomas Bayes (1763) and Pierre-Simon Laplace (1774) carried on, and Harold Jeffreys and David Cox solidified it in Bayesian statistics, aka statistical inference. The second path leading to the classical Maximum Entropy, en route with the Shannon entropy, can be identified with the work of James Clerk Maxwell and Ludwig Boltzmann, continued by Willard Gibbs and finally Claude Elwood Shannon. This work is geared toward developing the mathematical tools for statistical modeling of problems in information. These two independent lines of research are very similar. The objective of the first line of research is to formulate a theory/methodology that allows understanding of the general characteristics (distribution) of a system from partial and incomplete information. In the second line of research, the same objective is expressed as determining how to assign (initial) numerical values of probabilities when only some (theoretical) limited global quantities of the investigated system are known. Recognizing the common basic objectives of these two lines of research aided Jaynes in the development of his classical work, the Maximum Entropy formalism. This formalism is based on the first line of research and the mathematics of the second line of research. The interrelationship between information theory, statistics and inference, and the Maximum Entropy (MaxEnt) principle became clear in the 1950s, and many different methods arose from these principles (Golan, 2008), see next slide.


Maximum Entropy (MaxEn), described by (Jaynes, 1957), is used to estimate unknown parameters of a multinomial discrete choice problem, whereas the Generalized Maximum Entropy (GME) model includes noise terms in the multinomial information constraints. Each noise term is modeled as the mean of a finite set of a priori known points in the interval $[-1, 1]$ with unknown probabilities, where no parametric assumptions about the error distribution are made. A GME model for the multinomial probabilities and for the distributions associated with the noise terms is derived by maximizing the joint entropy of multinomial and noise distributions, under the assumption of independence (Jaynes, 1957).

Topological Entropy (TopEn) was introduced by (Adler, Konheim & McAndrew, 1965) with the purpose of introducing the notion of entropy as an invariant for continuous mappings: let $(X, T)$ be a topological dynamical system, i.e., let $X$ be a nonempty compact Hausdorff space and $T: X \to X$ a continuous map; the TopEn is a nonnegative number which measures the complexity of the system (Adler, Downarowicz & Misiurewicz, 2008).

Graph Entropy was described by (Mowshowitz, 1968) to measure the structural information content of graphs, and a different definition, more focused on problems in information and coding theory, was introduced by (Körner, 1973). Graph entropy is often used for the characterization of the structure of graph-based systems, e.g. in mathematical biochemistry. In these applications the entropy of a graph is interpreted as its structural information content and serves as a complexity measure, and such a measure is associated with an equivalence relation defined on a finite graph; by application of Shannon’s Eq. 2-4 with the probability distribution we get a numerical value that serves as an index of the structural feature captured by the equivalence relation (Dehmer & Mowshowitz, 2011).

Minimum Entropy (MinEn), described by (Posner, 1975), provides us with the least random and the least uniform probability distribution of a data set, i.e. the minimum uncertainty, which is the limit of our knowledge and of the structure of the system. Often, classical pattern recognition is described as a quest for minimum entropy. Mathematically, it is more difficult to determine a minimum entropy probability distribution than a maximum entropy probability distribution; while the latter has a global maximum due to the concavity of the entropy, the former has to be obtained by calculating all local minima, consequently the minimum entropy probability distribution may not exist in many cases (Yuan & Kesavan, 1998).

Cross Entropy (CE), discussed by (Rubinstein, 1997), was motivated by an adaptive algorithm for estimating probabilities of rare events in complex stochastic networks, which involves variance minimization. CE can also be used for combinatorial optimization problems (COP). This is done by translating the “deterministic” optimization problem into a related “stochastic” optimization problem and then using rare event simulation techniques (De Boer et al., 2005).

Rényi entropy is a generalization of the Shannon entropy (information theory), and Tsallis entropy is a generalization of the standard Boltzmann–Gibbs entropy (statistical physics).

More important for us are the following. Approximate Entropy (ApEn), described by (Pincus, 1991), can be used to quantify regularity in data without any a priori knowledge about the system, see an example in Slide 2-20. Sample Entropy (SampEn) was used by (Richman & Moorman, 2000) for a new related measure of time series regularity. SampEn was designed to reduce the bias of ApEn and is better suited for data sets with known probabilistic content.


Problem: Monitoring body movements along with vital parameters during sleep provides important medical information regarding general health and can therefore be used to detect trends (large epidemiological studies) and to discover severe illnesses, including hypertension (which is increasing enormously in our society). This seemingly simple data, from only one night, demonstrates the complexity and the limits of standard methods (for example the Fast Fourier Transform) for discovering knowledge (for example deviations, similarities etc.). Due to the complexity and uncertainty of such data sets, standard methods (such as FFT) carry the danger of modeling artifacts. Since the knowledge of interest for medical purposes lies in anomalies (alterations, differences, atypicalities, irregularities), the application of entropic methods provides benefits. Photograph taken during the EU Project EMERGE and used with permission.


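1) We have a given data set $\langle x_n \rangle$, where capital $N$ is the number of data points:

Eq. 2-13
$$\langle x_n \rangle = \{x_1, x_2, \ldots, x_N\}$$

2) Now we form $m$-dimensional vectors:

Eq. 2-14
$$\vec{X}_i = \{x_i, x_{i+1}, \ldots, x_{i+m-1}\}$$

3) We measure the distance between every pair of vectors, i.e. the maximum absolute difference between their scalar components:

Eq. 2-15
$$\lVert \vec{X}_i, \vec{X}_j \rVert = \max_{k=1,2,\ldots,m} \lvert x_{i+k-1} - x_{j+k-1} \rvert$$

4) We look, so to say, in which dimension the biggest difference is; as a result we get the approximate entropy (if there is no difference we have zero relative entropy):

Eq. 2-16
$$\mathrm{ApEn}(m, r) = \lim_{N \to \infty} \left[ \phi^{m}(r) - \phi^{m+1}(r) \right]$$

where $m$ is the run length and $r$ is the tolerance window (let us assume that … is equal to …); ApEn(m, r) could also be written as $H(m, r)$.

5) $\phi^{m}(r)$ is computed by

Eq. 2-17
$$\phi^{m}(r) = \frac{1}{N - m + 1} \sum_{i=1}^{N-m+1} \ln C_r^m(i)$$

with

Eq. 2-18
$$C_r^m(i) = \frac{N^m(i)}{N - m + 1}$$

where $N^m(i)$ is the number of vectors $\vec{X}_j$ with $\lVert \vec{X}_i, \vec{X}_j \rVert \le r$.

6) $C_r^m(i)$ measures, within the tolerance $r$, the regularity of patterns similar to a given one of window length $m$.

7) Finally we increase the dimension to $m + 1$, repeat the preceding steps, and get as a result the approximate entropy ApEn(m, r).

ApEn(m, r, N) is approximately the negative natural logarithm of the conditional probability (CP) that a data set of length $N$, having repeated itself within a tolerance $r$ for $m$ points, will also repeat itself for $m + 1$ points. An important point to keep in mind about the parameter $r$ is that it is commonly expressed as a fraction of the standard deviation (SD) of the data, which makes ApEn a scale-invariant measure. A low value arises from a high probability of repeated template sequences in the data (Hornero et al., 2006).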
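The steps above can be sketched in a few lines for a finite series. This is an illustrative implementation under assumptions, not the code used in the cited studies: the names `approximate_entropy` and `_phi`, the defaults `m=2` and `r_fraction=0.2`, and the two test series are all chosen here for demonstration. The limit over N is replaced by the finite-N estimate, and each template vector counts itself as a match, which keeps the logarithm defined.

```python
import math
import random

def _phi(series, m, r):
    """Mean of ln C_r^m(i) over all N - m + 1 template vectors of length m."""
    n_vectors = len(series) - m + 1
    vectors = [series[i:i + m] for i in range(n_vectors)]
    total = 0.0
    for xi in vectors:
        # Count vectors whose maximum componentwise difference to xi is <= r.
        matches = sum(
            1 for xj in vectors
            if max(abs(a - b) for a, b in zip(xi, xj)) <= r
        )
        total += math.log(matches / n_vectors)
    return total / n_vectors

def approximate_entropy(series, m=2, r_fraction=0.2):
    """Finite-N ApEn with the tolerance r given as a fraction of the standard deviation."""
    n = len(series)
    if n < m + 2:
        raise ValueError("series too short for the chosen m")
    mean = sum(series) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    r = r_fraction * sd
    return _phi(series, m, r) - _phi(series, m + 1, r)

periodic = [0.0, 1.0] * 50                    # perfectly regular pattern
rng = random.Random(0)                        # fixed seed for repeatability
noisy = [rng.random() for _ in range(100)]    # irregular series

print(approximate_entropy(periodic))  # close to zero
print(approximate_entropy(noisy))     # clearly larger
```

Because every pair of template vectors is compared, the running time grows quadratically with the series length, which is acceptable for short physiological records but not for very long ones.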


In this slide we can see the plot of the normalized approximate entropy for each of the episodes and the median across all the episodes. From this figure we can see that the entropy is at a minimum where we have no alterations and increases when irregularities occur. If we have no differences, we get zero entropy.


A final example should make the advantage of such an entropy method clear: in the right diagram it is hard for a medical professional to discover irregularities, especially over a longer period, but an anomaly can easily be detected by displaying the measured relative ApEn. What can we learn from this experiment? Approximate entropy is relatively unaffected by noise; it can be applied to complex time series with good reproducibility; it is finite for stochastic, noisy, composite processes; the values correspond directly to irregularities; and it is applicable to many other areas, for example the classification of large sets of texts: the ability to guess algorithmically the subject of a text collection without having to read it would permit automated classification.


Algorithm 1: With “rotateDataPoints” defined, we can calculate the projection profile $p_{y;j}(\alpha)$ for a range of different angles $\alpha$, and with those we can compute the entropy $H_{y,\alpha}(X)$ for each angle. Holzinger et al. (2012) implemented this algorithm as described in Algorithm 2. For more information please refer to the paper. This shall just demonstrate the strengths and weaknesses of using entropy for skew and slant correction, which is important in handwriting recognition. However, the entropy-based skew correction does not outperform older methods such as skew correction based on the least squares method: the noise in the drawing distorts the real minima of the entropy distribution. In many cases where the global minimum was the wrong choice, there was a local minimum close to the real error angle. Even though both approaches yield satisfying results for words longer than five letters, we suggest further investigation into the entropy-based skew correction method, with noise reduction in mind. On the other hand, it shows that entropy is in fact useful when performing slant correction, as it does outperform the window-based approach! The conclusion is that the window-based method is too dependent on a number of factors. Its performance is influenced a lot by the outcome of zone detection and by the writing style of the writer. It is also influenced a lot by window selection. Holzinger, A., Stocker, C., Peischl, B. & Simonic, K.-M. 2012. On Using Entropy for Enhancing Handwriting Preprocessing. Entropy, 14, (11), 2324-2350.



My DEDICATION is to make data valuable … Thank you!


MFC = Minimum Foot Clearance; stride = step. You can see clearly what you can measure with entropy: you can determine anomalies, i.e. the balance problems of elderly gait.

MFC Poincaré plots. Top panels show the MFC time series from a healthy elderly subject (A) and its corresponding Poincaré plot (B). Bottom panels show the MFC time series from an elderly subject with balance problems (C) and its corresponding Poincaré plot (D).

Significant relationships of mean MFC with the Poincaré plot indexes (SD1, SD2) and ApEn (r = 0.70, p < 0.05; r = 0.86, p < 0.01; r = 0.74, p < 0.05) were found in the falls-risk elderly group. On the other hand, such relationships were absent in the healthy elderly group. In contrast, the ApEn values of the MFC data series were significantly (p < 0.05) correlated with the Poincaré plot indexes of MFC in the healthy elderly group, whereas correlations were absent in the falls-risk group. The ApEn values in the falls-risk group (mean ApEn = 0.18 ± 0.03) were significantly (p < 0.05) higher than those in the healthy group (mean ApEn = 0.13 ± 0.13). The higher ApEn values in the falls-risk group might indicate increased irregularities and randomness in their gait patterns and a loss of gait control mechanism. ApEn values of randomly shuffled MFC data of falls-risk subjects did not show any significant relationship with mean MFC.
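The Poincaré plot indexes SD1 and SD2 mentioned above are commonly computed from a lag-1 plot of each value against its successor: SD1 is the spread across the line of identity, SD2 the spread along it. The sketch below uses this common definition; whether the cited study computed them in exactly this way is an assumption, and the function name and sample series are hypothetical.

```python
import math
import statistics

def poincare_sd(series):
    """Return (SD1, SD2) of the lag-1 Poincaré plot of a series."""
    if len(series) < 3:
        raise ValueError("need at least three samples")
    current, following = series[:-1], series[1:]   # plotted points (x_n, x_{n+1})
    # Rotate each point by 45 degrees: one axis across, one along the identity line.
    across = [(b - a) / math.sqrt(2) for a, b in zip(current, following)]
    along = [(b + a) / math.sqrt(2) for a, b in zip(current, following)]
    return statistics.pstdev(across), statistics.pstdev(along)

# Hypothetical series, not real MFC measurements.
steady_trend = [1.0, 2.0, 3.0, 4.0, 5.0]        # changes slowly and evenly
alternating = [1.0, 2.0, 1.0, 2.0, 1.0, 2.0]    # flips from step to step

print(poincare_sd(steady_trend))   # SD1 is zero, SD2 is positive
print(poincare_sd(alternating))    # SD1 is positive, SD2 is zero
```

SD1 therefore reflects step-to-step variability, while SD2 reflects longer-term variability.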


Surrogate data records. A and B show the major components. A: the mean process, which has set point and spike modes. B: the baseline process, here meaning the heart rate variability, modeled as Gaussian random numbers. C: their sum, a surrogate data record. D–F: a more realistic surrogate with the same frequency content as the observed data. D: a clinically observed data record of 4,096 R-R intervals. The left-hand ordinate is labeled in ms and the right-hand ordinate in SD. E: a 4,096-point isospectral surrogate data set formed using the inverse Fourier transform of the periodogram of the data in D. F: the surrogate data after addition of a clinically observed deceleration lasting 50 points and scaled so that the variance of the record is increased from 1 to 2.


http://support.sas.com/documentation/cdl/en/etsug/60372/HTML/default/viewer.htm#etsug_entropy_sect018.htm

Where many other languages refer to tables, rows, and columns/fields, SAS uses the terms data sets, observations, and variables. There are only two kinds of variables in SAS: numeric and character (string). By default all numeric variables are stored as (8-byte) real. It is possible to reduce precision in external storage only. Date and datetime variables are numeric variables that inherit the C tradition and are stored as either the number of days (for date variables) or seconds (for datetime variables).
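To make the storage of dates as plain numbers concrete, the following sketch converts such values back into calendar dates. It relies on one fact that the text above does not state: SAS counts both days and seconds from 1 January 1960. The function names are hypothetical helpers and not part of any SAS interface.

```python
from datetime import date, datetime, timedelta

# SAS date and datetime values are counted from 1 January 1960.
SAS_EPOCH = datetime(1960, 1, 1)

def sas_date_to_date(days):
    """Convert a SAS date value (days since the epoch) to a calendar date."""
    return (SAS_EPOCH + timedelta(days=days)).date()

def sas_datetime_to_datetime(seconds):
    """Convert a SAS datetime value (seconds since the epoch) to a datetime."""
    return SAS_EPOCH + timedelta(seconds=seconds)

print(sas_date_to_date(0))
print(sas_date_to_date(366))            # 1960 was a leap year
print(sas_datetime_to_datetime(86400))  # one day, expressed in seconds
```

Because the stored value is an ordinary number, differences between two dates are simple subtractions, which is convenient for computing intervals such as length of stay.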

http://www.sas.com/technologies/analytics/statistics/stat/index.html


Hadoop and the MapReduce programming paradigm already have a substantial base in the bioinformatics community, in particular in the field of high-throughput next-generation sequencing analysis. This is due to the cost-effectiveness of Hadoop-based analysis on commodity Linux clusters, and in the cloud via data upload to cloud vendors who have implemented Hadoop/HBase, and due to the effectiveness and ease of use of the MapReduce method in the parallelization of many data analysis algorithms.
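The MapReduce idea can be illustrated without a cluster. The sketch below runs the three phases (map, shuffle, reduce) in a single process to count k-mers in sequencing reads. It does not use the Hadoop API; all function names and the two reads are hypothetical. In a real deployment the framework would distribute the mappers and reducers across machines and perform the shuffle itself.

```python
from collections import defaultdict

def map_phase(read, k):
    """Mapper: emit (k-mer, 1) for every k-mer in one sequencing read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: sum the counts collected for one k-mer."""
    return key, sum(values)

def count_kmers(reads, k):
    """Run map, shuffle and reduce sequentially in one process."""
    emitted = [pair for read in reads for pair in map_phase(read, k)]
    return dict(reduce_phase(key, values) for key, values in shuffle(emitted).items())

reads = ["ACGTAC", "GTACGT"]  # hypothetical reads
print(count_kmers(reads, 3))
```

Each mapper needs only its own read and each reducer only the values for its own key, which is why the method parallelizes so easily.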


The challenge we face is that an estimated average of only 5% of data are structured; the rest is semi-structured, weakly structured or, for the most part, unstructured. Maybe the most important field for the future is data mining, especially novel techniques of data mining that include both time and space (e.g. graph-based, entropy-based and topology-based data mining approaches). Read more here: Holzinger, A. 2014. Extravaganza Tutorial on Hot Ideas for Interactive Knowledge Discovery and Data Mining in Biomedical Informatics. In: Slezak, D., Tan, A.-H., Peters, J. F. & Schwabe, L. (eds.) Brain Informatics and Health, BIH 2014, Lecture Notes in Artificial Intelligence, LNAI 8609. Heidelberg, Berlin: Springer, pp. 502-515. https://online.tugraz.at/tug_online/voe_main2.getVollText?pDocumentNr=764238&pCurrPk=79139


http://minnesotafuturist.pbworks.com/w/page/21441129/DIKW

A funny description of data, information and knowledge.


A very eye-catching image. Nice to look at, but its usefulness is questionable.


All these models are very questionable. Please remember that in our lecture we follow the notion of Boisot & Canals.


The interesting aspect of this graphic is that it includes a time axis, which is important for decision making and predictive analytics. “Past behaviour is a good predictor for future behaviour”: although this is an oversimplification, scientists who study human behavior agree that past behavior may be a useful marker for future behavior, however only under certain specific conditions. Read more: Ajzen, I. 1991. The theory of planned behavior. Organizational Behavior and Human Decision Processes, 50, (2), 179-211.
