a. holzinger lv709human-centered.ai/wordpress/wp-content/uploads/... · the elements of statistical...

Status as of 08.11.2015 10:00

Dear Students, welcome to the 5th lecture of our course. Please remember from the last lecture the basic architecture of a hospital information system, the complexity of medical workflows, the challenges of data integration, data fusion, data curation; the building blocks of hospital information systems, databases, data warehouses, data marts; the difference between knowledge discovery and information retrieval; please remember the formal description of an information retrieval model. The best practice example is the PageRank algorithm, see: Hastie, T., Tibshirani, R. & Friedman, J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, New York, Springer. Or have a look at the reprint paper: Brin, S. & Page, L. 2012. Reprint of: The anatomy of a large-scale hypertextual web search engine. Computer Networks, 56, (18), 3825-3833. http://www.sciencedirect.com/science/article/pii/S1389128612003611 doi:10.1016/j.comnet.2012.10.007

Please always be aware of the definition of biomedical informatics (Medizinische Informatik): Biomedical Informatics is the inter-disciplinary field that studies and pursues the effective use of biomedical data, information, and knowledge for scientific inquiry, problem solving, and decision making, motivated by efforts to improve human health (and well-being).

1 WS 2015 A. Holzinger LV709.049 11.11.2015



In vivo (Latin for "within the living") is experimentation using a whole, living organism as opposed to a partial or dead organism, or an in vitro ("within the glass", i.e., in a test tube or petri dish) controlled environment.


It is widely acknowledged in machine learning that the performance of a learning algorithm is dependent on both its parameters and the training data. Yet, the bulk of algorithmic development has focused on adjusting model parameters without fully understanding the data that the learning algorithm is modeling. As such, algorithmic development for classification problems has largely been measured by classification accuracy, precision, or a similar metric on benchmark data sets. As most machine learning research is focused on the data set level, one is concerned with maximizing $p(h \mid t)$, where $h: X \to Y$ is a hypothesis or function mapping input feature vectors $X$ to their corresponding label vectors $Y$, and $t = \{(x_i, y_i) : x_i \in X \wedge y_i \in Y\}$ is a training set.

One of the methods for privacy preserving data mining is that of anonymization, in which a record is released only if it is indistinguishable from k other entities in the data. We note that methods such as k-anonymity are highly dependent upon spatial locality in order to effectively implement the technique in a statistically robust way. In high dimensional space the data becomes sparse, and the concept of spatial locality is no longer easy to define from an application point of view. Aggarwal, C. C. On k-anonymity and the curse of dimensionality. Proceedings of the 31st International Conference on Very Large Data Bases (VLDB), 2005. 901-909.
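The k-anonymity idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the column names and the helper `is_k_anonymous` are hypothetical, and the check uses the common convention that every combination of quasi-identifier values must occur at least k times in the released table.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

# Hypothetical, already generalized patient records (illustration only).
records = [
    {"age": "30-39", "zip": "80**", "diagnosis": "asthma"},
    {"age": "30-39", "zip": "80**", "diagnosis": "diabetes"},
    {"age": "40-49", "zip": "81**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "81**", "diagnosis": "influenza"},
]

# Two quasi-identifiers: every group has exactly two members.
two_anonymous = is_k_anonymous(records, ["age", "zip"], k=2)
three_anonymous = is_k_anonymous(records, ["age", "zip"], k=3)

# Adding one more dimension makes every record unique (sparsity).
with_more_dims = is_k_anonymous(records, ["age", "zip", "diagnosis"], k=2)
```

In this toy table the third call fails because each added attribute splits the groups further, which is the sparsity effect in high dimensions that the paragraph above refers to.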

Holzinger, A., Stocker, C. & Dehmer, M. 2014. Big Complex Biomedical Data: Towards a Taxonomy of Data. In: Obaidat, M. S. & Filipe, J. (eds.) Communications in Computer and Information Science CCIS 455. Berlin Heidelberg: Springer, pp. 3-18.


https://www.projectrhea.org/rhea/index.php/File:Complexitytable.png

P stands for "polynomial time". This is the subset of problems that can be guaranteed to be solved in a polynomial amount of time related to their input length. Problems in P commonly operate on single inputs, lists, or matrices, and can occasionally apply to graphs. The typical types of operations they perform are mathematical operators, sorting, finding minimum and maximum values, determinants, and many others.

NP stands for "nondeterministic polynomial time". These problems are ones that can be solved in polynomial time using a nondeterministic computer. This concept is a little harder to understand, so another definition that is a consequence of the first is often used. NP problems are problems that can be checked, or "certified", in polynomial time. The output of an NP solving program is called a certificate, and the polynomial time program that checks the certificate for its validity is called the certification program.

NP-hard: A problem is NP-hard if it is at least as hard as the hardest problems in NP. This leads to two possibilities: either the problem is in NP and also considered NP-hard, or it is more difficult than any NP problem.

NP-complete: This classification is the intersection of NP and NP-hard. If a problem is in NP and also NP-hard, then it is considered NP-complete. This class of problems is arguably the most interesting for its consequences on many other types of problems.
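To make the notion of a certificate more concrete, here is a minimal Python sketch of a certification program for the subset sum problem, a well-known NP-complete problem. The function name and the encoding of the certificate as a list of indices are assumptions made for this illustration. Finding a suitable subset may take exponential time in the worst case, but checking a proposed subset takes only one pass over the certificate.

```python
def verify_subset_sum(numbers, target, certificate):
    """Check in polynomial time whether the certificate proves a yes-instance.

    The certificate is a list of distinct indices into `numbers`
    (an encoding chosen for this sketch).
    """
    if len(set(certificate)) != len(certificate):
        return False  # each index may be used only once
    if any(not 0 <= i < len(numbers) for i in certificate):
        return False  # index out of range
    return sum(numbers[i] for i in certificate) == target

numbers = [3, 34, 4, 12, 5, 2]
valid = verify_subset_sum(numbers, 9, [2, 4])    # 4 + 5 == 9
invalid = verify_subset_sum(numbers, 9, [0, 1])  # 3 + 34 != 9
```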

For those who want to go deeper into complexity theory, there is excellent MIT OpenCourseWare material by Erik Demaine, http://erikdemaine.org/ https://www.youtube.com/watch?v=moPtwq_cVH8 You can do some experimentation of your own via http://www.algomation.com


Key problems in dealing with data in the life sciences include:
• Complexity of our world
• High-dimensionality (curse of dimensionality (Catchpoole et al., 2010))
• Most of the data is weakly-structured and unstructured

A grand challenge in health care is the complexity of data, implicating two issues: structurization and standardization. As we have learned in lecture 2, very little of the data is structured. Most of our data is weakly structured (Holzinger, 2012). In the language of business there is often the use of the word "unstructured", but we have to use this word with care; unstructured would mean, in a strict mathematical sense, that we are talking about total randomness and complete uncertainty, which would mean noise, where standard methods fail or lead to the modeling of artifacts, and only statistical approaches may help. The correct term would be unmodeled data, or we shall speak about unstructured information. Please mind the differences.

To the image above: Advances in genetics and genomics have accelerated the discovery-based (= hypotheses generating) research that provides a powerful complement to the direct hypothesis-driven molecular, cellular and systems sciences. For example, genetic and functional genomic studies have yielded important insights into neuronal function and disease. One of the most exciting and challenging frontiers in neuroscience involves harnessing the power of large-scale genetic, genomic and phenotypic data sets, and the development of tools for data integration and data mining (Geschwind & Konopka, 2009).


Do not confuse structure with standardization (see Slide 2-9). Data can be standardized (e.g. numerical entries in laboratory reports) and non-standardized. A typical example is non-standardized text, imprecisely called "Free-Text" or "unstructured data", in an electronic patient record (Kreuzthaler et al., 2011).

Standardized data is the basis for accurate communication. In the medical domain, many different people work at different times in various locations. Data standards can ensure that information is interpreted by all users with the same understanding. Moreover, standardized data facilitates comparability of data and interoperability of systems. It supports the reusability of the data, improves the efficiency of healthcare services and avoids errors by reducing duplicated efforts in data entry.

Data standardization refers to a) the data content; b) the terminologies that are used to represent the data; c) how data is exchanged; and d) how knowledge, e.g. clinical guidelines, protocols, decision support rules, checklists, standard operating procedures, is represented in the health information system (refer to IOM). Technical elements for data sharing require standardization of identification, record structure, terminology, messaging, privacy etc. The most used standardized data set to date is the International Classification of Diseases (ICD), which was first adopted in 1900 for collecting statistics (Ahmadian et al., 2011) and which is discussed in → Lecture 3.

Non-standardized data is the majority of data and inhibits data quality, data exchange and interoperability.

Well-structured data is the minority of data and an idealistic case in which each data element has an associated defined structure, e.g. relational tables, the Resource Description Framework (RDF), or the Web Ontology Language (OWL) (see → Lecture 3). Note: Ill-structured is a term often used for the opposite of well-structured, although this term originally was used in the context of problem solving (Simon, 1973).

Semi-structured data is a form of structured data that does not conform with the strict formal structure of tables and data models associated with relational databases but contains tags or markers to separate structure and content, i.e. it is schema-less or self-describing; a typical example is a markup language such as XML (see → Lecture 3 and 4).

Weakly-structured data makes up most of our data in the whole universe, whether it is in macroscopic (astronomy) or microscopic structures (biology); see → Lecture 5.

Non-structured data or unstructured data is an imprecise definition used for information expressed in natural language, when no specific structure has been defined. This is an issue for debate: text also has some structure: words, sentences, paragraphs. If we are very precise, unstructured data would mean that the data is completely randomized, which is usually called noise and is defined by (Duda, Hart & Stork, 2000) as any property of data which is not due to the underlying model but instead to randomness (either in the real world, from the sensors or the measurement procedure).


A look at the typical view of a hospital information system shows us the organization of well-structured data: Standardized and well-structured data is the basis for accurate communication. In the medical domain, many different people work at different times in various locations. Data standards can ensure that information is interpreted by all users with the same understanding. Moreover, standardized data facilitates comparability of data and interoperability of systems. It supports the reusability of the data, improves the efficiency of healthcare services and avoids errors by reducing duplicated efforts in data entry.

Remember: Data standardization refers to a) the data content; b) the terminologies that are used to represent the data; c) how data is exchanged; and d) how knowledge, e.g. clinical guidelines, protocols, decision support rules, checklists, standard operating procedures, is represented in the health information system.

Note: The opposite, i.e. non-standardized data, is the majority of data and inhibits data quality, data exchange and interoperability.

Remark: Care2x is an Open Source Information System, see: http://care2x.org See → Lecture 10 for more details.


This is a medical example for semi-structured data in XML (Holzinger, 2003). The eXtensible Markup Language (XML) is a flexible text format recommended by the W3C for data exchange and derived from SGML (ISO 8879), (Usdin & Graham, 1998). XML is often classified as semi-structured, however this is in some way misleading, as the data itself is still structured, but in a flexible rather than a static way (Forster & Vossen, 2012). Such data does not conform to the formal structure of tables and data models as for example in relational databases, but at least contains tags/markers to separate semantic elements and enforce hierarchies of records and fields within these data.


This example by (Rassinoux et al., 2003) shows how XML can be used in the hospital information system: The structure of any new document edited in the Patient Record (here: DPI) is based on a template defined in XML format (left). These templates play the role of DTDs or XML schemas as they precisely define the structure and content type of each paragraph, thus validating the document at the application level. Such a structure embeds a <HEADER> and a <BODY>.

The header encapsulates the properties that are inherent to the new document and that will be useful to further classify it, according to various criteria, including: the patient identification, the document type, the identifier of its redactors and of the hospitalization stay or ambulatory consultation to which the document will be attached in the patient trajectory, etc.

The body encapsulates the content, and is divided into two parts: The <STRUCDOC> part describes the semantic entities that compose the document. The <FULLDOC> part embeds the document itself with its page layout information, which can be stored either as a draft, a temporary text or as a definitive text. This format guarantees the storage of dynamic and controlled fields for data input, thus allowing the combination of free text and structured data entry in the document. Once the document is no longer editable, it is definitively saved in the RTF format. A CDATA section is used for storing the raw document whatever its format, as it makes it possible to ignore blocks of text containing characters that would otherwise be regarded as markup (Rassinoux, Lovis, Baud & Geissbuhler, 2003).
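The following Python sketch illustrates the header/body layout and the role of the CDATA section described above. It is not the template used by Rassinoux et al.: only the element names HEADER, BODY, STRUCDOC and FULLDOC come from the text, while the root element, all child elements and all values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; only HEADER, BODY, STRUCDOC and FULLDOC are taken
# from the text, everything else is invented for illustration.
document = """<DOCUMENT>
  <HEADER>
    <PATIENT_ID>P-0001</PATIENT_ID>
    <DOC_TYPE>discharge_letter</DOC_TYPE>
  </HEADER>
  <BODY>
    <STRUCDOC>
      <DIAGNOSIS code="J45">Asthma</DIAGNOSIS>
    </STRUCDOC>
    <FULLDOC><![CDATA[Free text with <angle brackets> & ampersands.]]></FULLDOC>
  </BODY>
</DOCUMENT>"""

root = ET.fromstring(document)

# Structured part: addressable through element paths and attributes.
patient_id = root.findtext("HEADER/PATIENT_ID")
diagnosis = root.find("BODY/STRUCDOC/DIAGNOSIS")

# Free-text part: the CDATA section is returned verbatim, so characters
# such as "<" and "&" are not interpreted as markup.
full_text = root.findtext("BODY/FULLDOC")
```

The structured entries can be queried by path, whereas the CDATA content comes back as one opaque string, which mirrors the combination of structured data entry and free text mentioned above.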


At the top of this slide you can see a sample XML describing genes from Drosophila melanogaster involved in long-term memory. Nested within the gene elements are sub-elements related to the parent. The first gene includes two nucleic acid sequences, a protein product, and a functional annotation. Additional information is provided by attributes, such as the organism. This example illustrates the difficulty of modeling many-to-many relationships, such as the relationship between genes and functions. Information about functions must be repeated under each gene with that function. If we invert the nesting (i.e., nesting genes inside function elements), then we must repeat information about genes with more than a single function. Below the XML we see the same information about genes using both RDF and OWL. Both genes are instances of the class FlyGene, which has been defined as the set of all Genes for the organism D. melanogaster. The functional information is represented using a hierarchical taxonomy, in which Long-Term Memory is a subclass of Memory (Louie et al., 2007).

Remark: Drosophila melanogaster is a model organism and shares many genes with humans. Although Drosophila is an insect whose genome has only about 14,000 genes (half of humans), a remarkable number of these have very close counterparts in humans; some even occur in the same order in the fly's DNA as in our own. This, plus the organism's more than 100-year history in the lab, makes it one of the most important models for studying basic biology and disease (see e.g. http://www.lbl.gov/Science-Articles/Archive/sabl/2007/Feb/drosophila.html).

Note: The relational data model requires preciseness: The data must be regular, complete and structured. However, in biology the relationships are mostly imprecise. Genomic medicine is extremely data intensive and there is an increasing diversity in the type of data: DNA sequence, mutation, expression arrays, haplotype, proteomic etc. In bioinformatics many heterogeneous data sources are used to model complex biological systems (Rassinoux, Lovis, Baud & Geissbuhler, 2003), (Achard, Vaysseix & Barillot, 2001). The challenge in genomic medicine is to integrate and analyze these diverse and huge data sources to elucidate physiology and in particular disease physiology.

XML is suited for describing semi-structured data, including a kind of natural modeling of biological entities, because it allows features such as nesting (see Slide 5-6 on top). Still, a key limitation of XML is that it is difficult to model complex relationships; for example, there is no obvious way to represent many-to-many relationships, which are needed to model complex pathways. The sample XML at the top of Slide 5-6 (Figure 5-9) and its RDF and OWL counterpart at the bottom of the slide, both described above, illustrate this difficulty (Louie et al., 2007).
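The difference between the nested view and the statement-based view can be sketched in plain Python, without any RDF library. The gene identifiers `geneA` and `geneB`, the predicate names and the helper functions are hypothetical; only the class names FlyGene, LongTermMemory and Memory follow the description above. The point of the sketch is that each statement is stored once, and that the subclass relation lets a query for Memory also find genes annotated with Long-Term Memory.

```python
# Nested (XML-like) view: the function is repeated under each gene.
nested = {
    "geneA": {"organism": "D. melanogaster", "functions": ["LongTermMemory"]},
    "geneB": {"organism": "D. melanogaster", "functions": ["LongTermMemory"]},
}

# Statement (RDF-like) view: (subject, predicate, object) triples,
# each fact stored exactly once. Predicate names are assumptions.
triples = {
    ("geneA", "type", "FlyGene"),
    ("geneB", "type", "FlyGene"),
    ("geneA", "hasFunction", "LongTermMemory"),
    ("geneB", "hasFunction", "LongTermMemory"),
    ("LongTermMemory", "subClassOf", "Memory"),
}

def superclasses(cls):
    """Follow subClassOf statements upwards and return all ancestors of cls."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == "subClassOf" and o not in found:
                found.add(o)
                frontier.append(o)
    return found

def genes_with_function(function):
    """Genes annotated with `function` or with any of its subclasses."""
    return sorted(
        s for s, p, o in triples
        if p == "hasFunction" and (o == function or function in superclasses(o))
    )

memory_genes = genes_with_function("Memory")
```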


The human protein interaction network and its connection to positive selection. Proteins likely to be under positive selection are colored in shades of red (light red, low likelihood of positive selection; dark red, high likelihood) (6). Proteins estimated not to be under positive selection are in yellow, and proteins for which the likelihood of positive selection was not estimated are in white (6).


http://www.barabasilab.com/pubs/CCNR-ALB_Publications/200907-24_Science-Decade/200907-24_Science-CoverImage.gif

An emerging trend in many scientific disciplines is a strong tendency toward being transformed into some form of information science. One important pathway in this transition has been via the application of network analysis. The basic methodology in this area is the representation of the structure of an object of investigation by a graph representing a relational structure. It is because of this general nature that graphs have been used in many diverse branches of science including bioinformatics, molecular and systems biology, theoretical physics, computer science, chemistry, engineering, drug discovery, and linguistics, to name just a few. An important feature of the book "Statistical and Machine Learning Approaches for Network Analysis" is to combine theoretical disciplines such as graph theory, machine learning, and statistical data analysis and, hence, to arrive at a new field to explore complex networks by using machine learning techniques in an interdisciplinary manner.

The age of network science has definitely arrived. Large-scale generation of genomic, proteomic, signaling, and metabolomic data is allowing the construction of complex networks that provide a new framework for understanding the molecular basis of physiological and pathological states. Networks and network-based methods have been used in biology to characterize genomic and genetic mechanisms as well as protein signaling. Diseases are looked upon as abnormal perturbations of critical cellular networks. Onset, progression, and intervention in complex diseases such as cancer and diabetes are analyzed today using network theory. Once the system is represented by a network, methods of network analysis can be applied to extract useful information regarding important system properties and to investigate its structure and function. Various statistical and machine learning methods have been developed for this purpose and have already been applied to networks.

Dehmer, M. & Basak, S. C. 2012. Statistical and Machine Learning Approaches for Network Analysis, Wiley Online Library.


The concept of network structures is fascinating, compelling and powerful and applicable in nearly any domain at any scale. Network theory can be traced back to graph theory, developed by Leonhard Euler in 1736 (see → Slide 5-8). However, stimulated by work such as that of Barabási, Albert & Jeong (1999), research on complex networks has only recently been applied to biomedical informatics. As an extension of classical graph theory, see for example (Diestel, 2010), complex network research focuses on the characterization, analysis, modeling and simulation of complex systems involving many elements and connections; examples include the internet, gene regulatory networks, protein-protein networks, social relationships, the Web and many more. Attention is given not only to identifying special patterns of connectivity, such as the shortest average path between pairs of nodes (Newman, 2003), but also to the evolution of connectivity and the growth of networks, an example from biology being the evolution of protein-protein interaction networks in different species (→ Slide 5-8).

In order to understand complex biological systems, the three following key concepts need to be considered:
(i) emergence, the discovery of links between elements of a system, because the study of individual elements such as genes, proteins and metabolites is insufficient to explain the behavior of whole systems;
(ii) robustness, biological systems maintain their main functions even under perturbations imposed by the environment; and
(iii) modularity, vertices sharing similar functions are highly connected.

Network theory can largely be applied for biomedical informatics, because many tools are already available (Costa, Rodrigues & Cristino, 2008).


Figure 1, p. 74 (preceding the yeast protein network)

A graph G(V,E) describes a structure which consists of nodes, also known as vertices, V, connected by a set of pairs of distinct nodes (links), called edges E, each of the form {a,b} with a, b ∈ V and a ≠ b. Graphs containing cycles and/or alternative paths are referred to as networks. The vertices and edges can have a range of properties defined as colors, which also may have quantitative values, referred to as weights. In this slide we see the basic building block symbols of a biological network as used in bioinformatics. The blue dots serve as network hubs, the red block is a critical node (on a critical link), the white balls are bottlenecks, the stars are second order hubs etc. (Hodgman, French & Westhead, 2010).
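The set-based definition of a graph given above translates almost literally into Python. In this minimal sketch the vertex names, the example edges, the weights and both helper functions are hypothetical; an undirected edge {a,b} is modeled as a `frozenset` so that {a,b} and {b,a} are the same edge and a self-loop {a,a} is rejected.

```python
# Vertices V and undirected edges E (example values are made up).
V = {"a", "b", "c", "d"}
E = {frozenset(pair) for pair in [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]}

# Optional quantitative edge property ("weight"), also made up.
weights = {frozenset(("a", "b")): 0.5, frozenset(("c", "d")): 2.0}

def is_valid_graph(vertices, edges):
    """Every edge must join two distinct vertices that belong to V."""
    return all(len(e) == 2 and e <= vertices for e in edges)

def neighbors(vertex, edges):
    """All vertices that share an edge with `vertex`."""
    return {v for e in edges if vertex in e for v in e if v != vertex}

valid = is_valid_graph(V, E)
with_self_loop = is_valid_graph(V, {frozenset(("a", "a"))})  # {a,a} has one element
neighbors_of_c = neighbors("c", E)
```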


In order to represent network data in computers it is not comfortable to use sets; more practical are matrices. The simplest form of a graph representation is the so-called adjacency matrix. In this slide we see an undirected (left) and a directed graph and their respective adjacency matrices. If the graph is undirected, the adjacency matrix is symmetric, i.e., the elements $a_{ij} = a_{ji}$ for any $i$ and $j$.
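As a small illustration of the adjacency matrix, the following Python sketch builds the matrix for an undirected and a directed version of the same edge list and checks the symmetry property. The example edges and the function names are assumptions made for this sketch; they do not reproduce the graphs shown in the slide.

```python
def adjacency_matrix(n, edges, directed=False):
    """Build an n x n 0/1 adjacency matrix from a list of (i, j) pairs."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = 1
        if not directed:
            a[j][i] = 1  # undirected: the link counts in both directions
    return a

def is_symmetric(a):
    """True if a[i][j] == a[j][i] for all i and j."""
    n = len(a)
    return all(a[i][j] == a[j][i] for i in range(n) for j in range(n))

edges = [(0, 1), (1, 2), (2, 0), (2, 3)]  # made-up example graph
undirected = adjacency_matrix(4, edges)
directed = adjacency_matrix(4, edges, directed=True)

undirected_is_symmetric = is_symmetric(undirected)
directed_is_symmetric = is_symmetric(directed)
```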


This tool is a nice example of the usefulness of adjacency matrices: The InfoVis Toolkit is an interactive graphics toolkit developed by Jean-Daniel Fekete at INRIA (The French National Institute for Computer Science and Control). The toolkit implements nine types of visualization: Scatter Plots, Time Series, Parallel Coordinates and Matrices for tables; Node-Link diagrams, Icicle trees and Treemaps for trees; Adjacency Matrices and Node-Link diagrams for graphs. Node-Link visualizations provide several variants (8 for graphs and 4 for trees). There are also a number of interactive controls and information displays, including dynamic query sliders, fisheye lenses, and excentric labels. Information about the InfoVis Toolkit can be found at http://ivtk.sourceforge.net

The InfoVis Toolkit provides interactive components such as range sliders and tailored control panels required to configure the visualizations. These components are integrated into a coherent framework that simplifies the management of rich data structures and the design and extension of visualizations. Supported data structures include tables, trees and graphs. All visualizations can use fisheye lenses and dynamic labeling (Fekete, 2004).


Illustration of the meaning of commonly used terms. The process of digital image formation in microscopy is described in other books. Image processing takes an image as input and produces a modified version of it (in the case shown, the object contours are enhanced using an operation known as edge detection, described in more detail elsewhere in this booklet). Image analysis concerns the extraction of object features from an image. In some sense, computer graphics is the inverse of image analysis: it produces an image from given primitives, which could be numbers (the case shown), or parameterized shapes, or mathematical functions. Computer vision aims at producing a high-level interpretation of what is contained in an image. This is also known as image understanding. Finally, the aim of visualization is to transform higher-dimensional image data into a more primitive representation to facilitate exploring the data.


The truly multi-disciplinary network science has led to a wide variety of quantitative measurements of the topological characteristics of networks (Costa et al., 2007). The identification between a graph and an adjacency matrix makes all the powerful methods of linear algebra, graph theory and statistical mechanics available to us for investigating specific network characteristics:

Order (a in Slide 5-11) = total number of nodes $n$.

Size = total number of links: $\sum_i \sum_j a_{ij}$ (for an undirected graph with a symmetric adjacency matrix, this double sum counts every link twice).

Clustering coefficient (b in Slide 5-11) = the degree of concentration of the connections of the node's neighbors in a graph; it gives a measure of local inhomogeneity of the link density, i.e. the level of connectedness of the graph. It is calculated as the ratio between the actual number $t_i$ of links connecting the neighborhood (the nodes immediately connected to a chosen node) of a node and the maximum possible number of links in that neighborhood: $C_i = \frac{2 t_i}{k_i (k_i - 1)}$, where $k_i$ is the degree of node $i$. For the whole network, the clustering coefficient is the arithmetic mean: $C = \frac{1}{n} \sum_i C_i$.

Path length (c in Slide 5-11) = the arithmetic mean of all the distances. The characteristic path length of node $i$ provides information about how closely node $i$ is connected to all other nodes in the network and is given by the distance $d(i,j)$ between node $i$ and all other nodes $j$ in the network. The path length $l$ provides important information about the level of global communication efficiency of a network: $l = \frac{1}{n(n-1)} \sum_{i \neq j} d_{ij}$.

Note: Numerical methods, e.g. Dijkstra's algorithm (1959), are used to calculate the shortest paths between any two nodes in a network.
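The measures above can be computed directly from an adjacency matrix. The following Python sketch does so for a small made-up undirected graph; the matrix, the function names and the convention that $C_i = 0$ for nodes with fewer than two neighbors are assumptions of this sketch. Because all links are unweighted, a breadth-first search is used for the distances instead of Dijkstra's algorithm, and the size is taken as half the double sum because each undirected link appears twice in a symmetric matrix.

```python
from collections import deque

# Made-up undirected example graph with links 0-1, 1-2, 2-0, 2-3.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

n = len(A)                              # order: number of nodes
size = sum(sum(row) for row in A) // 2  # links; each one appears twice in A

def neighbors(i):
    return [j for j in range(n) if A[i][j]]

def clustering(i):
    """C_i = 2 t_i / (k_i (k_i - 1)); taken as 0 when k_i < 2."""
    nb = neighbors(i)
    k = len(nb)
    if k < 2:
        return 0.0
    # t_i: number of links among the neighbors of node i
    t = sum(A[u][v] for idx, u in enumerate(nb) for v in nb[idx + 1:])
    return 2 * t / (k * (k - 1))

def distances_from(i):
    """Hop distances from node i to every reachable node (breadth-first search)."""
    dist = {i: 0}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        for v in neighbors(u):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

C = sum(clustering(i) for i in range(n)) / n

# Mean distance over all ordered pairs i != j (assumes a connected graph).
path_length = sum(
    d for i in range(n) for j, d in distances_from(i).items() if j != i
) / (n * (n - 1))
```

For this graph the two triangle corners 0 and 1 have $C_i = 1$, node 2 has $C_i = 1/3$ and the pendant node 3 contributes 0, so $C = 7/12$; the mean distance is $4/3$.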


Page 25: A. Holzinger LV709human-centered.ai/wordpress/wp-content/uploads/... · The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, New York, Springer

Centrality (d in Slide 5-12) = the level of “betweenness centrality” of a node $i$; it indicates how many of the shortest paths between the nodes of the network pass through node $i$. A high betweenness centrality indicates that this node is important in interconnecting the nodes of the network, marking a potential hub role (refer to → Slide 5-8) of this node in the overall network.

Nodal degree (e in Slide 5-12) = number of links connecting $i$ to its neighbors. The degree of node $i$ is defined as its total number of connections: $k_i = \sum_j a_{ij}$. The degree probability distribution $P(k)$ describes the probability that a node is connected to $k$ other nodes in the network.

Modularity (f in Slide 5-12) = describes the possible formation of communities in the network, indicating how strongly groups of nodes form relatively isolated sub-networks within the full network (refer also to → Slide 5-8).
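As a small illustration of the nodal degree and the degree distribution, the following sketch computes $k_i$ as a row sum of the adjacency matrix and $P(k)$ as the fraction of nodes with degree $k$. The star-shaped toy network `STAR` and the function names are assumptions made for this example only.

```python
from collections import Counter

def degrees(adj):
    """k_i = sum over j of a_ij, for every node i."""
    return [sum(row) for row in adj]

def degree_distribution(adj):
    """P(k): fraction of nodes that have exactly k connections."""
    k = degrees(adj)
    counts = Counter(k)
    return {deg: cnt / len(k) for deg, cnt in sorted(counts.items())}

# Toy example: one hub (node 0) connected to four leaves.
STAR = [
    [0, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
]
star_degrees = degrees(STAR)
star_pk = degree_distribution(STAR)
```

In this toy network most nodes have a single connection and one node is heavily connected, which is the kind of uneven distribution that hub nodes produce.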


Page 26: A. Holzinger LV709human-centered.ai/wordpress/wp-content/uploads/... · The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, New York, Springer

Regular network (a in Slide 5-13): has a local character, characterized by a high clustering coefficient (c in Slide 5-13) and a high path length (L, Slide 5-13). It takes a large number of steps to travel from a specific node to a node on the other end of the graph.

Random network (outer right in Slide 5-13): the opposite extreme, where all connections are distributed randomly across the network; the result is a graph with a random organization. In contrast to the local character of the regular network, a random network has a more global character, with a low C and a much shorter path length L than the regular network.

Small-world network (center of Slide 5-13): a particular case between these two; such networks are very robust and combine a high level of local and global efficiency. Watts & Strogatz (1998) showed that with a low probability p of randomly reconnecting a connection in the regular network, a so-called small-world organization arises. It has both a high C and a low L, combining a high level of local clustering with still a short average travel distance. Many networks in nature are small-world (e.g. the internet, protein networks, social networks, functional and structural brain networks etc.), combining a high level of segregation with a high level of global information integration. In addition, such networks can have a heavy-tailed connectivity distribution, in contrast to random networks in which the nodes roughly all have the same number of connections.

Scale-free networks (B in Slide 5-13): characterized by a degree probability distribution that follows a power-law function, indicating that on average a node has only a few connections, with the exception of a small number of nodes that are heavily connected. These nodes are often referred to as hub nodes (see → Slide 5-8) and they play a central role in the level of efficiency of the network, as they are responsible for keeping the overall travel distance in the network to a minimum. As these hub nodes play a key role in the organization of the network, scale-free networks tend to be vulnerable to specialized attacks on the hub nodes.

Modular networks (c in Slide 5-13): show the formation of so-called communities, consisting of a subset of nodes that are mostly connected to their direct neighbors in their community and to a lesser extent to the other nodes in the network. Such networks are characterized by a high level of modularity of the nodes.
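The rewiring idea of Watts & Strogatz (1998) can be sketched as follows. This is a simplified illustration, not their exact procedure: a ring lattice serves as the regular network, and each link is, with probability p, replaced by a link from the same start node to a randomly chosen node. The function names, the edge-set representation and the parameter values are assumptions for this example.

```python
import random

def ring_lattice(n, k):
    """Regular ring: every node is linked to its k nearest neighbors (k even, n > k)."""
    edges = set()
    for i in range(n):
        for step in range(1, k // 2 + 1):
            j = (i + step) % n
            edges.add((min(i, j), max(i, j)))
    return edges

def rewire(edges, n, p, rng):
    """With probability p, move the far end of a link to a randomly chosen node."""
    result = set(edges)
    for (i, j) in sorted(edges):
        if rng.random() < p:
            # Candidate targets: no self-loops and no duplicate links.
            candidates = [t for t in range(n)
                          if t != i and (min(i, t), max(i, t)) not in result]
            if candidates:
                result.remove((i, j))
                t = rng.choice(candidates)
                result.add((min(i, t), max(i, t)))
    return result

regular = ring_lattice(20, 4)                                 # p = 0: regular network
small_world = rewire(regular, 20, 0.1, random.Random(42))     # low p
random_like = rewire(regular, 20, 1.0, random.Random(42))     # p = 1: random organization
```

The number of links stays constant under rewiring, so differences in C and L between the three networks come from the organization of the links only.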


Page 27: A. Holzinger LV709human-centered.ai/wordpress/wp-content/uploads/... · The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, New York, Springer


Page 28: A. Holzinger LV709human-centered.ai/wordpress/wp-content/uploads/... · The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, New York, Springer

There are many ways to construct a proximity graph representation from a set of data points that are embedded in $\mathbb{R}^d$. Let us consider a set of data points $\{x_1, \dots, x_n\} \subset \mathbb{R}^d$. To each data point we associate a vertex of a proximity graph $G$ to define a set of vertices $V = \{v_1, v_2, \dots, v_n\}$. Determining the edge set $E$ of the proximity graph $G$ requires defining the neighbors of each vertex $v_i$ according to its embedding $x_i$. Consequently, a proximity graph is a graph in which two vertices are connected by an edge iff the data points associated to the vertices satisfy particular geometric requirements. Such particular geometric requirements are usually based on a metric measuring the distance between two data points. A usual choice of metric is the Euclidean metric. Look at the slide:

a) Our initial set of points in the plane $\mathbb{R}^2$.
b) ε-ball graph: $v_i \sim v_j$ if $x_j \in B(v_i; \varepsilon)$.
c) k-nearest-neighbor graph (k-NNG): $v_i \sim v_j$ if the distance between $x_i$ and $x_j$ is among the $k$ smallest distances from $x_i$ to other data points. The k-NNG is a directed graph, since one can have $x_i$ among the k-nearest neighbors of $x_j$ but not vice versa.
d) Euclidean Minimum Spanning Tree (EMST) graph: a connected tree sub-graph that contains all the vertices and has a minimum sum of edge weights. The weight of the edge between two vertices is the Euclidean distance between the corresponding data points.
e) Symmetric k-nearest-neighbor graph (Sk-NNG): $v_i \sim v_j$ if $x_i$ is among the k-nearest neighbors of $x_j$ or vice versa.
f) Mutual k-nearest-neighbor graph (Mk-NNG): $v_i \sim v_j$ if $x_i$ is among the k-nearest neighbors of $x_j$ and vice versa. All vertices in a mutual k-NN graph have a degree upper-bounded by $k$, which is not usually the case with standard k-NN graphs.
g) Relative Neighborhood Graph (RNG): $v_i \sim v_j$ iff there is no vertex in $B(v_i; D(v_i, v_j)) \cap B(v_j; D(v_i, v_j))$.
h) Gabriel Graph (GG).
i) The β-Skeleton Graph (β-SG).

For details please refer to (Lézoray & Grady, 2012), or to a classical graph theory book, e.g. (Harary, 1969), (Bondy & Murty, 1976), (Golumbic, 2004), (Diestel, 2010).
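Three of the constructions listed above (ε-ball, k-NN with its symmetric and mutual variants) can be sketched directly from their definitions. The sketch assumes the Euclidean metric, treats the ε-ball as closed (distance ≤ ε), breaks distance ties by index, and uses a brute-force search; the function names and the toy point set `POINTS` are hypothetical.

```python
import math

def epsilon_ball_graph(points, eps):
    """Undirected edge (i, j) whenever the Euclidean distance is at most eps."""
    n = len(points)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if math.dist(points[i], points[j]) <= eps}

def knn_sets(points, k):
    """For every point, the index set of its k nearest other points."""
    n = len(points)
    result = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: (math.dist(points[i], points[j]), j))
        result.append(set(others[:k]))
    return result

def knn_graph(points, k):
    """Directed k-NNG: arc (i, j) if x_j is among the k nearest neighbors of x_i."""
    nn = knn_sets(points, k)
    return {(i, j) for i in range(len(points)) for j in nn[i]}

def symmetric_knn_graph(points, k):
    """Undirected edge if either point is among the k nearest neighbors of the other."""
    return {(min(i, j), max(i, j)) for (i, j) in knn_graph(points, k)}

def mutual_knn_graph(points, k):
    """Undirected edge only if each point is among the k nearest neighbors of the other."""
    arcs = knn_graph(points, k)
    return {(i, j) for (i, j) in arcs if i < j and (j, i) in arcs}

# Toy example: three points on a line; point 2 is nearest to point 1,
# but point 1 is nearest to point 0, so the 1-NN relation is not symmetric.
POINTS = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0)]
directed = knn_graph(POINTS, 1)
symmetric = symmetric_knn_graph(POINTS, 1)
mutual = mutual_knn_graph(POINTS, 1)
ball = epsilon_ball_graph(POINTS, 2.0)
```

The toy example shows why the k-NNG is directed: the arc from point 2 to point 1 has no counterpart in the other direction, so it survives in the symmetric graph but not in the mutual one.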


Page 29: A. Holzinger LV709human-centered.ai/wordpress/wp-content/uploads/... · The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, New York, Springer

Slide 5-16: Graphs from Images

In this slide we see the examples of
a) a real image with the quadtree tessellation,
b) the region adjacency graph associated to the quadtree partition,
c) irregular tessellation using image-dependent superpixel Watershed Segmentation (Vincent & Soille, 1991),
d) irregular tessellation using image-dependent SLIC superpixels (Lucchi et al., 2010); SLIC = Simple Linear Iterative Clustering.


Page 30: A. Holzinger LV709human-centered.ai/wordpress/wp-content/uploads/... · The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, New York, Springer

A straightforward implementation of the original Vincent-Soille algorithm is difficult if plateaus occur. Therefore, an alternative approach was proposed by (Meijster & Roerdink, 1995), in which the image is first transformed to a directed valued graph with distinct neighbor values, called the components graph of f. On this graph the watershed transform can be computed by a simplified version of the Vincent-Soille algorithm, where FIFO queues are no longer necessary, since there are no plateaus in the graph (Roerdink & Meijster, 2000).


Page 31: A. Holzinger LV709human-centered.ai/wordpress/wp-content/uploads/... · The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, New York, Springer

The original natural digital image is first transformed into grey-scale, then the Watershed algorithm is applied and then the centroid function is calculated; the results are representative point sets in the plane.


Page 32: A. Holzinger LV709human-centered.ai/wordpress/wp-content/uploads/... · The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, New York, Springer

The Delaunay Triangulation (DT): $v_i \sim v_j$ iff there is a closed ball $B(\cdot; r)$ with $v_i$ and $v_j$ on its boundary and no other vertex $v_k$ contained in it. The dual to the DT is the Voronoi irregular tessellation, where each Voronoi cell is defined by the set $\{x \in \mathbb{R}^n \mid D(x, v_k) \le D(x, v_j) \text{ for all } v_j \neq v_k\}$. In such a graph, $\forall v_i, \deg(v_i) = 3$ (Lézoray & Grady, 2012).


Page 33: A. Holzinger LV709human-centered.ai/wordpress/wp-content/uploads/... · The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, New York, Springer

This animation shows the construction of a Delaunay graph: First the red points on the plane are drawn, then we insert the blue edges and the blue vertices of the Voronoi graph; finally the red edges are drawn, which build the Delaunay graph (Kropatsch, Burge & Glantz, 2001).

http://oldwww.prip.tuwien.ac.at/research/research-areas/structure-and-topology/graphs-in-image-analysis/graphs-in-image-analysis/use-of-graphs-in-image-analysis/voronoi-graph-and-delaunay-graph


Page 34: A. Holzinger LV709human-centered.ai/wordpress/wp-content/uploads/... · The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, New York, Springer

In this Slide we see the evaluated information-theoretic network measures on publication networks, here from the excellence network of RWTH Aachen University. Those measures can be understood as graph complexity measures which evaluate the structural complexity based on the corresponding concept. A possible useful interpretation of these measures helps to understand the differences in subgraphs of a cluster. For example, one could apply community detection algorithms and compare entropy measures of such detected communities. Relating these data to social measures (e.g. balanced scorecard data) of sub-communities could be used as indicators of collaboration success or lack thereof. The node size shows the node degree, whereas the node color shows the betweenness centrality; darker color means higher centrality (Holzinger et al., 2013a).


Page 35: A. Holzinger LV709human-centered.ai/wordpress/wp-content/uploads/... · The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, New York, Springer

A further example shall demonstrate the usefulness of graph theory and network analysis: This graph shows the medical knowledge space of a standard quick reference guide for emergency doctors and paramedics in the German-speaking area. It has been subsequently developed, tested in the medical real world and constantly improved for 20 years by Dr. med. Ralf Müller, emergency doctor at Graz-LKH University Hospital, and is practically in the pocket of every emergency and family doctor and paramedic in the German-speaking area (Holzinger et al., 2013b).

Up to now we know that graphs and graph theory are powerful tools to map data structures and to find novel connections between single data objects (Strogatz, 2001), (Dorogovtsev & Mendes, 2003). The inferred graphs can be further analyzed by using graph-theoretical, statistical and machine learning techniques (Dehmer, Emmert-Streib & Mehler, 2011). A mapping of the already existing “knowledge space”, approved in medical practice, as a conceptual graph and the subsequent visual and graph-theoretical analysis may provide novel insights on hidden patterns in the data. Another benefit of the graph-based data structure is the applicability of methods from network topology, network analysis and data mining, e.g. the small-world phenomenon (Barabasi & Albert, 1999), (Kleinberg, 2000), and cluster analysis (Koontz, Narendra & Fukunaga, 1976), (Wittkop et al., 2011).

The graph-theoretic data of the graph seen in this Slide include: number of nodes = 641, number of edges = 1250; red are agents, black are conditions, blue are pharmacological groups, grey are other documents. The average degree of this graph = 3.888, the average path length = 4.683, the network diameter = 9.


Page 36: A. Holzinger LV709human-centered.ai/wordpress/wp-content/uploads/... · The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, New York, Springer

The nodes of the sample graph represent: drugs, clinical guidelines, patient conditions (indication, contraindication), pharmacological groups, tables and calculations of medical scores, algorithms and other medical documents; and the edges represent 3 crucial types of relations inducing medical relevance between two active substances, i.e.: pharmacological groups, indications and contraindications. The following example will demonstrate the usefulness of this approach.


Page 37: A. Holzinger LV709human-centered.ai/wordpress/wp-content/uploads/... · The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, New York, Springer

This example shows us how conveniently we can find which path between two nodes is the shortest, as well as the navigation way between these nodes. Computing shortest paths is a fundamental and ubiquitous problem in network analysis. We can, e.g., apply the Dijkstra algorithm, which solves the shortest path problem for a graph with non-negative edge path costs, producing a shortest path tree. This algorithm is often used in routing and as a subroutine in other graph algorithms: For a given node, the algorithm finds the path with lowest cost (i.e. the shortest path) between that node and every other node (Henzinger et al., 1997).
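A compact sketch of Dijkstra's algorithm with a binary heap is given below. It assumes non-negative edge costs and a graph stored as a dictionary of dictionaries; the node labels, the costs in `GRAPH` and the helper `path_to` are hypothetical and only serve the illustration.

```python
import heapq

def dijkstra(graph, source):
    """Lowest path cost from source to every reachable node, plus the shortest path tree."""
    distances = {source: 0}
    previous = {}                     # predecessor of each node in the shortest path tree
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > distances.get(u, float("inf")):
            continue                  # outdated heap entry, a shorter path was already found
        for v, cost in graph.get(u, {}).items():
            nd = d + cost
            if nd < distances.get(v, float("inf")):
                distances[v] = nd
                previous[v] = u
                heapq.heappush(heap, (nd, v))
    return distances, previous

def path_to(previous, source, target):
    """Follow the predecessors back from target to source (target must be reachable)."""
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1]

# Toy example with four nodes and symmetric, non-negative costs.
GRAPH = {
    "A": {"B": 4, "C": 1},
    "B": {"A": 4, "C": 2, "D": 5},
    "C": {"A": 1, "B": 2, "D": 8},
    "D": {"B": 5, "C": 8},
}
distances, previous = dijkstra(GRAPH, "A")
route = path_to(previous, "A", "D")
```

The `previous` dictionary is the shortest path tree mentioned above: following it backwards from any node yields the navigation way to the source.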


Page 38: A. Holzinger LV709human-centered.ai/wordpress/wp-content/uploads/... · The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, New York, Springer

Here we see the relationship between Adrenaline (center black node) and Dobutamine (top left black node). Blue: pharmacological group; dark red: contraindication; light red: condition. The green nodes (from dark to light) are:
1. Application (one or more indications + corresponding dosages)
2. Single indication with additional details (e.g. “VF after 3rd Shock”)
3. Condition (e.g. VF, Ventricular Fibrillation)


Page 39: A. Holzinger LV709human-centered.ai/wordpress/wp-content/uploads/... · The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, New York, Springer

Our brain forms one integrative complex network, linking all brain regions and sub-networks together (Van Den Heuvel & Hulshoff Pol, 2010). Examining the organization of this network provides insights into how our brain works. Graph theory provides a framework in which the topology of complex networks can be examined; thus it can reveal novelties about both the local and global organization of functional brain networks. In the slide we can see how the modeling of the functional brain by a graph works: edges are the connections between regions that are functionally linked. First, the collection of nodes is defined (A); second, the existence of functional connections between the nodes in the network needs to be defined, resulting in a connectivity matrix (B). Finally, the existence of a connection between two points can be defined as whether their level of functional connectivity exceeds a certain predefined threshold (C) (Van Den Heuvel & Hulshoff Pol, 2010).
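The thresholding step (C) can be sketched as follows: every entry of the connectivity matrix that exceeds the threshold becomes a link in a binary adjacency matrix. The connectivity values, the threshold of 0.4 and the function name are assumptions chosen for illustration.

```python
def threshold_graph(connectivity, threshold):
    """Binary adjacency matrix: 1 where functional connectivity exceeds the threshold."""
    n = len(connectivity)
    return [[1 if i != j and connectivity[i][j] > threshold else 0
             for j in range(n)]
            for i in range(n)]

# Toy connectivity matrix for three regions (e.g. pairwise correlations).
CONNECTIVITY = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.5],
    [0.2, 0.5, 1.0],
]
brain_adj = threshold_graph(CONNECTIVITY, 0.4)
```

The diagonal is set to 0 so that a region is not linked to itself; the choice of threshold directly controls how dense the resulting graph is.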


Page 40: A. Holzinger LV709human-centered.ai/wordpress/wp-content/uploads/... · The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, New York, Springer

Development of the human heart starts 2 weeks after fertilization, with the formation of the cardiac crescent and the subsequent formation and looping of the primitive heart tube. Insight into the biology of molecular networks is an important field, as anomalies in these systems underlie a wide spectrum of polygenic human disorders, ranging from schizophrenia to congenital heart disease (CHD). Understanding the functional architecture of networks that organize the development of organs, see e.g. (Chien, Domian & Parker, 2008), lays the foundation of novel approaches in regenerative medicine, since manipulation of such systems is necessary for the success of tissue engineering technologies and stem cell therapy. Lage et al. (2010) developed a framework for gaining new insights into the systems biology of the protein networks driving organ development and related polygenic human disease phenotypes, exemplified with heart development and CHD.

In the Slide we see examples of four functional networks driving the development of different anatomical structures in the human heart. These four networks are constructed by analyzing the interaction patterns of four different sets of cardiac development (CD) proteins corresponding to the morphological groups ‘atrial septal defects,’ ‘abnormal atrioventricular valve morphology,’ ‘abnormal myocardial trabeculae morphology,’ and ‘abnormal outflow tract development’. CD proteins from the relevant groups are shown in orange and their interaction partners are shown in gray. Functional modules annotated by literature curation are indicated with a colored background. Centrally in the Figure is a haematoxylin-eosin stained frontal section of the heart from a 37-day human embryo, where tissues affected by the four networks are marked: AS (developing atrial septum), EC (endocardial cushions, which are anatomical precursors to the atrioventricular valves), VT (developing ventricular trabeculae), and OFT (developing outflow tract).


Page 41: A. Holzinger LV709human-centered.ai/wordpress/wp-content/uploads/... · The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, New York, Springer

In this Slide we see an overview of the modular organization of heart development: (A) Protein interaction networks are plotted at the resolution of functional modules. Each module is color coded according to functional assignment as determined by literature curation. The number of proteins in each module is proportional to the area of its corresponding node. Edges indicate direct (lines) or indirect (dotted lines) interactions between proteins from the relevant modules. (B) Recycling of functional modules during heart development. The bars represent functional modules and recycling is indicated by arrows. The bars follow the color code of (A) and the height of the bars represents the number of proteins in each module, as shown left on the y axis (Lage et al., 2010).

Note: Phenotype = an organism's observable characteristics (traits), e.g. morphology, biochemical/physiological properties, behaviour, etc. Phenotypes result from the expression of an organism's genes as well as the influence of environmental factors and the interactions between them. Genotype = inherited instructions within its genetic code.


Page 42: A. Holzinger LV709human-centered.ai/wordpress/wp-content/uploads/... · The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, New York, Springer

Diseases (e.g. obesity, diabetes, atherosclerosis etc.) result from multiple genetic and environmental factors and, importantly, from interactions between genetic and environmental factors. This Slide shows the vast networks of molecular interactions. It can be seen that the gastrointestinal (GI) tract, vasculature, immune system, heart and brain are all potentially involved in either the onset of diseases such as atherosclerosis or in comorbidities such as myocardial infarction and stroke brought on by such diseases. Further, the risks of comorbidities for diseases such as atherosclerosis are increased by other diseases, such as hypertension, which may, in turn, involve other organs, such as the kidney. The role that each organ and tissue type plays in a given disease is largely determined by genetic background and environment, where different perturbations to the genetic background (perturbations corresponding to DNA variations that affect gene function, which, in turn, leads to disease) and/or environment (changes in diet, levels of stress, level of activity, and so on) define the subtypes of disease manifested in any given individual.

Although the physiology of diseases such as atherosclerosis is beginning to be better understood, what has not been fully exploited to date are the vast networks of molecular interactions within the cells. We see clearly in the Slide that there is a diversity of molecular networks functioning in any given tissue, including genomics networks, networks of coding and noncoding RNA, protein interaction networks, protein state networks, signaling networks, and networks of metabolites. Further, these networks are not acting in isolation within each cell, but instead interact with one another to form complex, giant molecular networks within and between cells that drive all activity in the different tissues, as well as signaling between tissues. Variations in DNA and environment lead to changes in these molecular networks, which, in turn, induce complicated physiological processes that can manifest as disease.

Despite this vast complexity, the classic approach to elucidating genes that drive disease has focused on single genes or single linearly ordered pathways of genes thought to be associated with disease. This narrow approach is a natural consequence of the limited set of tools that were available for querying biological systems; such tools were not capable of enabling a more holistic approach, resulting in the adoption of a reductionist approach to teasing apart pathways associated with complex disease phenotypes. The emerging view is that complex biological systems are best modeled as highly modular, fluid systems exhibiting a plasticity that allows them to adapt to a vast array of conditions; the history of science demonstrates that this view, although long the ideal, was never within reach, given the unavailability of tools adequate for carrying out this type of research. The explosion of large-scale, high-throughput technologies in the biological sciences over the past 15 to 20 years has motivated a rapid paradigm shift away from reductionism in favor of a systems-level view of biology (Schadt & Lum, 2006).


Page 43: A. Holzinger LV709human-centered.ai/wordpress/wp-content/uploads/... · The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, New York, Springer

The three main types of biological networks:
(a) a transcriptional regulatory network has two components: transcription factors (TF) and target genes (TG), where a TF regulates the transcription of TGs;
(b) protein-protein interaction networks: two proteins are connected if there is a docking between them;
(c) a metabolic network is constructed considering the reactants, chemical reactions and enzymes.


Page 44: A. Holzinger LV709human-centered.ai/wordpress/wp-content/uploads/... · The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, New York, Springer

The extreme complexity of the E. coli transcriptional regulatory network. In this graphical representation, nodes are genes, and edges represent regulatory interactions. The network was reconstructed using data from the RegulonDB (Salgado et al., 2006). This figure highlights the extreme complexity in regulatory networks. To obtain a deeper understanding of regulatory complexity, scientists must first discover biologically relevant organizational principles to unravel the hidden architecture governing these networks (see Nature Education: http://www.nature.com/scitable/content/the-extreme-complexity-of-the-e-coli-14457504).

The complexity of organisms arises as a consequence of elaborate regulation of gene expression rather than from differences in genetic content in terms of the number of genes. The transcription network is a critical system that regulates gene expression in a cell. Transcription factors (TFs) respond to changes in the cellular environment, regulating the transcription of target genes (TGs) and connecting functional protein interactions to the genetic information encoded in inherited genomic DNA in order to control the timing and sites of gene expression during biological development. The interactions between TFs and TGs can be represented as a directed graph: The two types of nodes (TF and TG) are connected by arcs (see → Slide 5-31, arrows) when regulatory interaction occurs between regulators and targets.

Transcriptional regulatory networks display interesting properties that can be interpreted in a biological context to better understand the complex behavior of gene regulatory networks. At a local network level, these networks are organized in substructures such as motifs and modules. Motifs represent the simplest units of a network architecture required to create specific patterns of inter-regulation between TFs and TGs. The three most common types of motifs found in gene regulatory networks are: (1) single input, (2) multiple input and (3) feed-forward loop. Target genes belonging to the same single and multiple input motifs tend to be co-expressed, and the level of co-expression is higher when multiple transcription factors are involved. Modularity in the regulatory networks arises from groups of highly connected motifs that are hierarchically organized, in which modules are divided into smaller ones. The evolution of gene regulatory networks mainly occurs through extensive duplication of transcription factors and target genes with inheritance of regulatory interactions from ancestral genes, while the evolution of motifs does not show common ancestry but is a result of convergent evolution (Costa, Rodrigues & Cristino, 2008).
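To make the feed-forward loop motif concrete, the following sketch enumerates it in a small directed graph. It assumes the usual definition of a feed-forward loop (a regulator X that controls a second regulator Y, with both X and Y controlling the same target Z); the arc list and the names `TF1`, `TF2`, `G1`, `G2` are hypothetical.

```python
def feed_forward_loops(arcs):
    """All triples (x, y, z) with arcs x->y, y->z and x->z between three distinct nodes."""
    arc_set = set(arcs)
    targets = {}
    for u, v in arc_set:
        targets.setdefault(u, set()).add(v)
    loops = []
    for x, ys in targets.items():
        for y in ys:
            if y == x:
                continue                          # ignore self-regulation
            for z in targets.get(y, ()):
                if z not in (x, y) and (x, z) in arc_set:
                    loops.append((x, y, z))
    return sorted(loops)

# Toy regulatory network: arcs point from regulator to target.
ARCS = [
    ("TF1", "TF2"),   # TF1 regulates TF2
    ("TF1", "G1"),    # TF1 regulates G1 directly
    ("TF2", "G1"),    # TF2 regulates G1 as well -> feed-forward loop
    ("TF2", "G2"),    # G2 has a single input (TF2 only)
]
ffl = feed_forward_loops(ARCS)
```

Whether such a pattern counts as a network motif in the statistical sense additionally requires comparing its frequency with that in randomized networks, which this sketch does not do.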


Page 45: A. Holzinger LV709human-centered.ai/wordpress/wp-content/uploads/... · The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, New York, Springer

The interactions between proteins are essential to keep the molecular systems of living cells working properly. Protein-protein interaction (PPI) is important for various biological processes such as cell-cell communication, the perception of environmental changes, protein transport and modification. Complex network theory is suitable for studying protein-protein interaction maps because of its universality and integration in representing complex systems. In complex network analysis each protein is represented as a node and the physical interactions between proteins are indicated by the edges in the network. Many complex networks are naturally divided into communities or modules, where links within modules are much denser than those across modules (e.g. human individuals belonging to the same ethnic groups interact more than those from different ethnic groups). Cellular functions are also organized in a highly modular manner, where each module is a discrete object composed of a group of tightly linked components and performs a relatively independent task. It is interesting to ask whether this modularity in cellular function arises from modularity in molecular interaction networks such as the transcriptional regulatory network and the PPI network.

The Slide shows a hypothetical protein complex (A). Binary protein-protein interactions (PPI) are depicted by direct contacts between proteins. Although five proteins (A, B, C, D, and E) are identified through the use of a bait protein (red), only A and D directly bind to the bait. (B) shows the true PPI network topology of the protein complex. (C) depicts the PPI network topology of the protein complex inferred by the “matrix” model, where all proteins in a complex are assumed to interact with each other. Finally, (D) demonstrates the PPI network topology of the protein complex inferred by the “spoke” model, where all proteins in a complex are assumed to interact with the bait, but no other interactions are allowed (Wang & Zhang, 2007).
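The difference between the two inference models can be sketched with edge sets. The protein labels follow the hypothetical complex described above (one bait, five co-purified proteins A to E); the function names are assumptions for this illustration.

```python
from itertools import combinations

def spoke_model(bait, preys):
    """Spoke model: every identified protein interacts with the bait, and nothing else."""
    return {tuple(sorted((bait, p))) for p in preys}

def matrix_model(bait, preys):
    """Matrix model: all proteins in the complex interact with each other."""
    return {tuple(sorted(pair)) for pair in combinations([bait, *preys], 2)}

PREYS = ["A", "B", "C", "D", "E"]
spoke = spoke_model("bait", PREYS)
matrix = matrix_model("bait", PREYS)
```

For six proteins the spoke model infers 5 interactions and the matrix model 15. Since only A and D bind the bait directly in the example, both models include bait interactions that are not present in the true topology.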


Page 46: A. Holzinger LV709human-centered.ai/wordpress/wp-content/uploads/... · The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, New York, Springer

Correlated motif mining (CMM) is the challenge of finding overrepresented pairs of patterns (motifs) in sequences of interacting proteins. Algorithmic solutions for CMM thereby provide a computational method for predicting binding sites for protein interaction. The task is basically to select motifs X and Y (Figure 119) that truly represent an overrepresented consensus pattern in the sequences of the proteins in $V_x$ and $V_y$ respectively, in order to increase the likelihood that they correspond to or overlap with a so-called binding site, i.e. a site on the surface of the molecule that makes interactions between proteins from $V_x$ and $V_y$ possible through a molecular lock-and-key mechanism. We call $\{X, Y\}$ a $(k_x, k_y, k_{xy})$-motif pair of a PPI network $G = (V, E, \lambda)$ if $|V_x| = k_x$, $|V_y| = k_y$ and $|V_x \cap V_y| = k_{xy}$. It is called complete if all vertices from $V_x$ are connected with all vertices from $V_y$ (Boyen et al., 2011).
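The definition of a motif pair can be illustrated with a small sketch. It rests on simplifying assumptions that are not taken from the source: a protein is taken to belong to $V_x$ if its sequence contains the motif X as an exact substring (real motif models usually allow wildcards or mismatches), and the sequences, protein names and motifs below are invented toy data.

```python
def motif_pair_profile(sequences, edges, x, y):
    """Return (k_x, k_y, k_xy, complete) for the motif pair {x, y}."""
    vx = {v for v, seq in sequences.items() if x in seq}   # proteins containing motif X
    vy = {v for v, seq in sequences.items() if y in seq}   # proteins containing motif Y
    linked = {frozenset(e) for e in edges}
    # Complete: every protein in V_x interacts with every (other) protein in V_y.
    complete = all(frozenset((a, b)) in linked
                   for a in vx for b in vy if a != b)
    return len(vx), len(vy), len(vx & vy), complete

# Toy PPI network: lambda maps each protein to its sequence.
SEQUENCES = {"p1": "MKLVA", "p2": "AKLVT", "p3": "SDEGR", "p4": "TDEGK"}
EDGES = [("p1", "p3"), ("p1", "p4"), ("p2", "p3"), ("p2", "p4")]

profile = motif_pair_profile(SEQUENCES, EDGES, "KLV", "DEG")
profile_incomplete = motif_pair_profile(SEQUENCES, EDGES[:-1], "KLV", "DEG")
```

In the toy data the pair {KLV, DEG} is a complete (2, 2, 0)-motif pair; removing a single interaction makes it incomplete.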

In genetics, a sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and has, or is conjectured to have, a biological significance. For proteins, a sequence motif is distinguished from a structural motif, a motif formed by the three-dimensional arrangement of amino acids, which may not be adjacent. In a chain-like biological molecule, such as a protein or nucleic acid, a structural motif is a supersecondary structure, which appears also in a variety of other molecules. Motifs do not allow us to predict the biological functions because they are found in proteins and enzymes with dissimilar functions. Network motifs are connectivity patterns (sub-graphs) that occur much more often than they do in random networks. Most networks studied in biology, ecology and other fields have been found to show a small set of network motifs; surprisingly, in most cases the networks seem to be largely composed of these network motifs, occurring again and again.


Page 47: A. Holzinger LV709human-centered.ai/wordpress/wp-content/uploads/... · The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, New York, Springer

The general steepest ascent algorithm with abstract neighbor function applied to CMM (SA-CMM).

Since the decision problem associated with CMM is in NP, we can efficiently check if a motif pair has higher support than another, which makes it possible to tackle CMM as a search problem in the space of all possible (l,d)-motif pairs. If we add the assumption that similar motifs can be expected to get similar support, it has the typical form of a combinatorial optimization problem. In combinatorial optimization, the objective is to find a point in a discrete search space which maximizes a user-provided function f. A number of heuristic algorithms called metaheuristics are known to yield stable results, e.g. the steepest ascent algorithm (Aarts & Lenstra, 1997), illustrated as pseudocode in the Slide.
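The slide's pseudocode is not reproduced in this transcript, so the following is only a generic sketch of steepest ascent with an abstract neighbor function, not the SA-CMM procedure itself. The objective `f`, the neighbor function and the toy search space (integers) are assumptions made for illustration; in CMM the points would be motif pairs and `f` their support.

```python
def steepest_ascent(start, neighbors, f):
    """Move to the best neighbor until no neighbor improves f (a local maximum)."""
    current = start
    while True:
        best = max(neighbors(current), key=f, default=None)
        if best is None or f(best) <= f(current):
            return current            # no improving neighbor: local optimum reached
        current = best

# Toy discrete search space: the integers, where neighbors differ by one step.
def integer_neighbors(x):
    return [x - 1, x + 1]

def objective(x):
    return -(x - 7) ** 2              # single maximum at x = 7

optimum = steepest_ascent(0, integer_neighbors, objective)
```

Steepest ascent stops at the first local maximum it reaches, so on objectives with several peaks the result depends on the starting point; restarting from different points is a common remedy.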


Page 48: A. Holzinger LV709human-centered.ai/wordpress/wp-content/uploads/... · The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, New York, Springer

Metabolism is primarily determined by genes, environment and nutrition. It consists of chemical reactions catalyzed by enzymes to produce essential components such as amino acids, sugars and lipids, and also the energy necessary to synthesize and use them in constructing cellular components. Since the chemical reactions are organized into metabolic pathways, in which one chemical is transformed into another by enzymes and co-factors, such a structure can be naturally modeled as a complex network. In this way, metabolic networks are directed and weighted graphs, whose vertices can be metabolites, reactions and enzymes, with two types of edges that represent mass flow and catalytic reactions. One widely considered catalogue of metabolic pathways available online is the Kyoto Encyclopedia of Genes and Genomes (KEGG). In the Slide we see a simple metabolic network involving five metabolites M1-M5 and three enzymes E1-E3, of which the latter catalyzes an irreversible reaction (Hodgman, French & Westhead, 2010).


Page 49: A. Holzinger LV709human-centered.ai/wordpress/wp-content/uploads/... · The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, New York, Springer

Such metabolic structures can be very large, as can be seen in this Slide. TrmB (the Thermococcus regulator of maltose binding) acts as a repressor for genes encoding glycolytic enzymes and as an activator for genes encoding gluconeogenic enzymes. The enzyme-coding genes under TrmB control are included in the metabolic pathways shown in the Slide (13 are unique to archaea and 35 are conserved across species from all three domains of life). Integrated analysis of the metabolic and gene regulatory network architecture reveals various interesting scenarios (Schmid et al., 2009).

Electronic patient records (EPR) remain an unexplored but rich data source for discovering, e.g., correlations between diseases. Roque et al. (2011) describe a general approach for gathering phenotypic descriptions of patients from medical records in a systematic and non-cohort dependent manner: by extracting phenotype information from the "free-text" (= unstructured information) in such records, they demonstrated that they can extend the information contained in the structured record data, and use it for producing fine-grained patient stratification and disease co-occurrence statistics. Their approach uses a dictionary based on the International Classification of Diseases (ICD-10) ontology and is therefore in principle language independent. As a use case they show how records from a Danish psychiatric hospital lead to the identification of disease correlations, which subsequently can be mapped to systems biology frameworks.
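The dictionary idea can be sketched in a few lines of Python. This is not the pipeline of Roque et al.; the mini-dictionary `TERM_TO_ICD10`, the sample note and the plain whole-phrase matching are assumptions chosen for illustration. A realistic system would also have to handle negation, spelling variants and inflection.

```python
import re

# Hypothetical mini-dictionary mapping free-text terms to ICD-10 categories.
TERM_TO_ICD10 = {
    "schizophrenia": "F20",
    "depressive episode": "F32",
    "type 2 diabetes": "E11",
    "essential hypertension": "I10",
}


def extract_codes(note: str) -> set:
    """Return the ICD-10 codes whose dictionary term occurs in the note."""
    text = note.lower()
    found = set()
    for term, code in TERM_TO_ICD10.items():
        # Whole-phrase match so that a term is not found inside a longer word.
        if re.search(r"\b" + re.escape(term) + r"\b", text):
            found.add(code)
    return found


note = "Patient with known schizophrenia; also treated for type 2 diabetes."
codes = extract_codes(note)
```

Since only the dictionary is language specific, swapping in the terms of another language leaves the code unchanged, which is the sense in which the approach is language independent.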

Disease-disease correlations. Heatmap of the most significant 100 ICD10 codes, based on ranking the list of 802 candidate pairs by their comorbidity scores. Chapter colors are highlighted next to the ICD10 codes. Diseases that often occur together are colored red in the heatmap, while those with lower than expected co-occurrence are colored blue. The color label shows the log2 change of comorbidity between two diseases when compared to the expected level. doi:10.1371/journal.pcbi.1002141.g002

Roque et al. (2011) have used text mining to automatically extract clinically relevant terms from 5543 psychiatric patient records and mapped these to disease codes in the ICD10. They clustered patients together based on the similarity of their profiles. The result is a patient stratification based on more complete profiles than the primary diagnosis, which is typically used. Figure 124 illustrates the general approach to capture correlations between different disorders. Several clusters of ICD10 codes relating to the same anatomical area or type of disorder can be identified along the diagonal of the heatmap, ranging from trivial correlations (e.g., different arthritis disorders), to correlations of cause and effect codes (e.g., stroke and mental/behavioural disorders), to social and habitual correlations (e.g., drug abuse, liver diseases and HIV).
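The red/blue coloring of the heatmap can be illustrated with a small Python sketch. It assumes that the "expected level" is the co-occurrence count under independence of the two diagnoses; the exact comorbidity score used by Roque et al. may differ, and the patient profiles below are invented toy data.

```python
import math
from collections import Counter
from itertools import combinations


def log2_comorbidity(patients):
    """log2(observed / expected) for every pair of codes seen together.

    `patients` is a list of sets of ICD-10 codes, one set per patient.
    Expected co-occurrence assumes the two codes are independent.
    """
    n = len(patients)
    single = Counter(code for codes in patients for code in codes)
    pairs = Counter(pair
                    for codes in patients
                    for pair in combinations(sorted(codes), 2))
    return {
        (a, b): math.log2(observed / (single[a] * single[b] / n))
        for (a, b), observed in pairs.items()
    }


# Invented toy profiles (not data from the study).
patients = [
    {"F10", "K70"}, {"F10", "K70"}, {"F10"}, {"K70"},
    {"I10"}, {"I10"}, {"I10", "F10"}, {"E11"},
]
scores = log2_comorbidity(patients)
```

A positive score (red in the heatmap) means the pair occurs together more often than expected, a negative score (blue) less often. Pairs that never co-occur are not scored in this sketch, because the logarithm of zero is undefined.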

Homology (plural: homologies) originates from the Greek ὁμολογέω (homologeo), meaning "to conform" (in German: übereinstimmen). The term has its origins in Biology and Anthropology, where it is used for a correspondence of structures in two life forms with a common evolutionary origin (Darwin, 1859). In chemistry it is used for the relationship between the elements in the same group of the periodic table, or between organic compounds in a homologous series. In mathematics homology is a formalism for talking in a quantitative and unambiguous manner about how a space is connected (Edelsbrunner & Harer, 2010). Basically, homology is a concept that is used in many branches of algebra and topology. Historically, the term was first used in a topological sense by Henri Poincaré.

In Bioinformatics, homology modelling is a mature technique that can be used to address many problems in molecular medicine. Homology modelling is one of the most efficient methods to predict protein structures. With the increase in the number of medically relevant protein sequences, resulting from automated sequencing in the laboratory, and in the fraction of all known structural folds, homology modelling will be even more important to personalized and molecular medicine in the future. In homology modelling a protein sequence with an unknown structure (the target) is aligned with one or more protein sequences with known structures (the templates). The method is based on the principle that homologous proteins have similar structures. The prerequisite for successful homology modelling is a detectable similarity between the target sequence and the template sequences (more than 30%), allowing the construction of a correct alignment. Homology modelling is a knowledge-based structure prediction relying on observed features in known homologous protein structures. By exploiting this information from template structures, the structural model of the target protein can be constructed (Wiltgen & Tilz, 2009). Two well-known homology modelling programs, which are free for academic research, are MODELLER (http://salilab.org/modeller) and SWISS-MODEL (http://swissmodel.expasy.org). The slide shows the comparison of two proteins: the sequences of both proteins are 95% (53 of 56) identical (only residues 20, 30 and 45 differ), yet the structures are totally different.
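The identity figure quoted for the two proteins is simple arithmetic over an alignment, as the following Python sketch shows. The sequences are synthetic placeholders of length 56 that differ at positions 20, 30 and 45; they are not the real proteins from the slide, and the sketch assumes an ungapped alignment.

```python
def percent_identity(seq_a: str, seq_b: str):
    """Return (matching positions, percent identity) for two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches, 100.0 * matches / len(seq_a)


def substitute(seq: str, positions):
    """Change the residue at each 1-based position to a different letter."""
    residues = list(seq)
    for pos in positions:
        residues[pos - 1] = "G" if residues[pos - 1] != "G" else "A"
    return "".join(residues)


# Synthetic placeholder sequences, 56 residues long.
template = ("ACDEFGHIKLMNPQRSTVWY" * 3)[:56]
target = substitute(template, [20, 30, 45])

matches, identity = percent_identity(template, target)
```

Here 53 of 56 positions match, i.e. about 94.6%, which rounds to the 95% quoted above and lies far above the 30% threshold. The slide's example is a reminder that high identity makes a similar structure likely, but does not guarantee it.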

All the areas we have touched on in this lecture are extremely important for the concepts of personalized medicine and molecular medicine and will keep us busy over the next decades. Data mining is perhaps the most central and most important computational subject in this respect.

All these approaches are producing gigantic amounts of highly complex data sets!

See the recent article in Science: data in proteomics doubles every 18 months.

My DEDICATION is to make data valuable … Thank you! The Klein bottle is the symbol for geometry and topology.

Topological data analysis (TDA) is a fast growing branch of applied mathematics and of enormous importance for data mining and knowledge discovery, particularly from large, high-dimensional, incomplete, noisy and dirty data.

http://psychology.wikia.com/wiki/Information_retrieval

Network motifs in integrated molecular networks represent functional relationships between distinct data types. They aggregate to form dense topological structures corresponding to functional modules which cannot be detected by traditional graph clustering algorithms.

http://www.nature.com/nri/journal/v3/n10/fig_tab/nri1200_F2.html

http://www.maa.org/cvm/1998/01/tprppoh/article/Pictures/KleinBottle.gif

Nesting = recursion, subroutines, information hiding,

On top in Figure 39 we see a sample XML describing genes involved in long-term memory of a sample specimen Drosophila melanogaster. Nested within the gene elements are sub-elements related to the parent. The first gene includes two nucleic acid sequences, a protein product, and a functional annotation. Additional information is provided by attributes, such as the organism. This example illustrates the difficulty of modeling many-to-many relationships, such as the relationship between genes and functions. Information about functions must be repeated under each gene with that function. If we invert the nesting (i.e., nesting genes inside function elements), then we must repeat information about genes with more than a single function. At the bottom in Figure 39 we see the same information about genes, but using RDF and OWL. Both genes are instances of the class FlyGene, which has been defined as the set of all Genes for the organism D. melanogaster. The functional information is represented using a hierarchical taxonomy, in which Long-Term Memory is a subclass of Memory (Louie et al., 2007).
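Figure 39 itself is not reproduced in these notes, so the following Python sketch only imitates the contrast it describes. The element names, the gene identifiers `geneA` and `geneB`, and the predicate names are hypothetical; the triples are plain Python tuples standing in for RDF statements, not a real RDF/OWL library.

```python
import xml.etree.ElementTree as ET

# Nested form: the function annotation has to be repeated under every gene.
XML = """
<genes organism="Drosophila melanogaster">
  <gene name="geneA"><function>long-term memory</function></gene>
  <gene name="geneB"><function>long-term memory</function></gene>
</genes>
"""
root = ET.fromstring(XML.strip())
function_copies = [f.text for f in root.iter("function")]

# Triple form: each fact is stated once, and the taxonomy is explicit.
TRIPLES = {
    ("geneA", "type", "FlyGene"),
    ("geneB", "type", "FlyGene"),
    ("geneA", "hasFunction", "LongTermMemory"),
    ("geneB", "hasFunction", "LongTermMemory"),
    ("LongTermMemory", "subClassOf", "Memory"),
}


def subclasses_of(cls):
    """`cls` together with all of its direct and indirect subclasses."""
    result = {cls}
    changed = True
    while changed:
        changed = False
        for s, p, o in TRIPLES:
            if p == "subClassOf" and o in result and s not in result:
                result.add(s)
                changed = True
    return result


def genes_with_function(function):
    """Genes annotated with `function` or with any subclass of it."""
    wanted = subclasses_of(function)
    return sorted(s for s, p, o in TRIPLES
                  if p == "hasFunction" and o in wanted)


memory_genes = genes_with_function("Memory")
```

In the nested form the function text appears once per gene, whereas in the triple form the class Long-Term Memory is a single node, and a query for the broader class Memory finds both genes through the subclass relation.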

Macroscopic structure. Let us look into the macroscopic area first and let us look for some similarities. This is the globular star cluster M30 (NGC 7099), which includes some 100,000 stars, has a diameter of about 100 light-years, and is approx. 40,000 light-years away from Earth. Look at the structure, look at the similarity, and consider the time: by the time our eyes see this structure, it might already have vanished (Darwin Channel).

From these large macroscopic structures to tiny microscopic structures: here we see X-ray crystallography, which is a standard method to analyse the arrangement of objects (atoms, molecules) within a crystal structure. This data contains the mean positions of the entities within the substance, their chemical relationships, and various other properties. In the case of a protein structure, the data is stored, for example, in the Protein Data Bank (PDB). This database contains vast amounts of data. If a medical professional looks at the data, he or she sees only lengthy tables of numbers…
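To give an impression of these "tables of numbers", the sketch below reads one atom record in the fixed-column layout of the legacy PDB text format. The column positions are stated from the format description as an assumption of this sketch; the helper `format_atom` and the coordinate values are invented for illustration and do not come from a real PDB entry.

```python
def format_atom(serial, name, res_name, chain, res_seq, x, y, z, element):
    """Build one fixed-column ATOM line (toy helper, occupancy/B-factor fixed)."""
    return (
        "ATOM  " + f"{serial:>5}" + " " + f"{name:<4}" + " "
        + f"{res_name:>3}" + " " + chain + f"{res_seq:>4}" + "    "
        + f"{x:8.3f}{y:8.3f}{z:8.3f}" + f"{1.00:6.2f}{0.00:6.2f}"
        + " " * 10 + f"{element:>2}"
    )


def parse_atom(line):
    """Slice the fixed columns of an ATOM record into named fields."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),   # orthogonal coordinates in Angstrom
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "element": line[76:78].strip(),
    }


# Invented example values: nitrogen atom of a methionine residue.
line = format_atom(1, " N", "MET", "A", 1, 27.340, 24.430, 2.614, "N")
atom = parse_atom(line)
```

A real structure file contains thousands of such lines, one per atom, which is exactly why visualization and structural analysis are needed before a medical professional can make use of the data.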

Structures! This is now our keyword. When we talk about structures, we will see some really interesting aspects of structures. A good example for a data intensive and highly complex microscopic structure is a yeast protein network. Note: Yeasts (German: Hefe) are eukaryotic micro-organisms (fungi) with 1,500 known species currently, estimated to be only 1% of all yeast species. Yeasts are unicellular, typically measuring 4 µm in diameter. In this picture you can see the first protein interaction network (published by Jeong et al., 2001). The nodes are the proteins. The links are the physical interactions (bindings). The red nodes are lethal to the organism (when the protein is removed), the green ones are non-lethal and the yellow ones are not yet known. You may ask whether this structure is useful? Well, what we get out of this yeast is something which some of us may really like: Prost! The problem with such structures is that they are very big and that there are so many! Knowledge Management can help to discover such unknown structures amongst the enormous set of uncharacterized data. We will come back to such structural homologism later. Now let us take a closer look at what Knowledge Management can do for us.

When thinking about data, we should always keep two fundamental physical aspects in mind: time related aspects (e.g. entropy of data) and space related aspects (e.g. topology of data).

http://www.youtube.com/watch?v=oBkOYQ02chs TEDxWarwick 2010: Roger Penrose in Space-Time Geometry.
http://www.youtube.com/watch?v=aSz5BjExs9o Visualizing Eleven Dimensions

Clouds of data. Very often, data is represented as an unordered sequence of points in a Euclidean n-dimensional space E^n. Data coming from an array of sensor readings in an engineering testbed, from questionnaire responses in a psychology experiment, or from population sizes in a complex ecosystem all reside in a space of potentially high dimension. The global 'shape' of the data may often provide important information about the underlying phenomena which the data represents. One type of data set for which global features are present and significant is the so-called point cloud data coming from physical objects in 3-d. Touch probes, point lasers, or line lasers sweep a suspended body and sample the surface, recording coordinates of anchor points on the surface of the body. The cloud of such points can be quickly obtained and used in a computer representation of the object. A temporal version of this situation is to be found in motion-capture data, where geometric points are recorded as time series. In both of these settings, it is important to identify and recognize global features: where is the index finger, the key hole, the fracture?
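One of the simplest global features of a point cloud is the number of connected pieces it falls into at a given scale. The Python sketch below counts these pieces by linking all points closer than a threshold `eps` (a union-find over the neighbourhood graph). This is an illustrative choice, not a method prescribed by the lecture, and the five points in `cloud` are invented.

```python
import math
from itertools import combinations


def count_components(points, eps):
    """Number of connected components when points closer than eps are linked."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= eps:
            parent[find(i)] = find(j)  # merge the two components

    return len({find(i) for i in range(len(points))})


# Invented toy cloud in 3-d: two well-separated groups of points.
cloud = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (10, 10, 10), (11, 10, 10)]

fine = count_components(cloud, 0.5)     # every point on its own
medium = count_components(cloud, 1.5)   # the two groups emerge
coarse = count_components(cloud, 20.0)  # everything merges
```

Watching how this count changes as `eps` grows is the zero-dimensional case of the persistence idea used in topological data analysis: features that survive over a wide range of scales are considered significant.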

Network metrics: a = order, b = clustering coefficient, c = path length, d = centrality, e = nodal degree, f = modularity
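Four of these metrics can be computed in a few lines of plain Python, as sketched below for an invented five-node graph. The definitions used are the common textbook ones (local clustering coefficient, mean shortest path length over all node pairs) and are assumed here, since the slide only names the metrics. Centrality beyond nodal degree and modularity are left out of the sketch.

```python
from collections import deque
from itertools import combinations

# Invented toy graph: a triangle A-B-C with a tail C-D-E.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]


def build_graph(edges):
    """Undirected adjacency sets."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def clustering_coefficient(graph, node):
    """Fraction of pairs of neighbours of `node` that are linked themselves."""
    nbrs = graph[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in graph[u])
    return links / (k * (k - 1) / 2)


def distances_from(graph, source):
    """Shortest path lengths (in edges) from source, by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_path_length(graph):
    """Mean shortest path length over all node pairs (graph must be connected)."""
    nodes = sorted(graph)
    total = sum(d for u in nodes
                for v, d in distances_from(graph, u).items() if u < v)
    return total / (len(nodes) * (len(nodes) - 1) / 2)


graph = build_graph(EDGES)
order = len(graph)                                    # a = order
degree = {node: len(nbrs) for node, nbrs in graph.items()}  # e = nodal degree
cc_of_c = clustering_coefficient(graph, "C")          # b = clustering coefficient
avg_path = average_path_length(graph)                 # c = path length
```

In this toy graph node C has the highest degree and therefore acts as the hub; its clustering coefficient is low because only one of its three neighbour pairs is connected.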

http://www.google.com/patents/US6384826

Representative examples of disease complexes are displayed. Diseases are associated with tissues by using our disease-tissue matrix, and expression data are from the GNF data set. The expression levels of complexes are shown as z scores. If a disease is associated with more than 3 tissues, only the 3 most associated tissues are shown for clarity. In a given complex, proteins relevant to the disease in question are yellow. The figure shows the general tendency of overexpression of the complexes in the tissues in which they are involved in pathology compared with their expression level in other tissues. All members of the complexes can be seen in

Three-dimensional structure of ventricular muscle basket weave, coronary arterial tree, and pacemaker and conduction system. One of the central challenges of cell-based therapy for regenerating specific heart components is guiding transplanted cells into a functional syncytium with the existing three-dimensional architecture. Transplanted cells must make functional connections with neighboring specialized heart cells to result in a net gain of global function. Transplanted myogenic progenitors, for example, must align with and integrate into the existing ventricular muscle basket weave to allow synchronous contraction and relaxation of graft and host myocardium. Integration of pacemaker and conduction system progenitors into the appropriate tissue type is necessary to generate a biological pacemaker and avoid cardiac arrhythmia. For example, having a transplanted heart muscle progenitor integrate into the conduction system might have arrhythmogenic consequences, as would the introduction of cells with independent pacemaker potential in the heart. Similarly, cell-based therapies to promote coronary collateral formation or neo-arteriogenesis require functional integration of transplanted cells with the host coronary arterial tree.
