new a. holzinger 709.049 mi, 04.11 - human-centered.ai · 2015. 11. 4. · biomedical data...

Status as of Tue, 03.11.2015, 10:00

Dear Students, welcome to the 4th lecture of our course. Please remember from the last lecture: modeling of knowledge, medical ontologies, classification efforts and the International Classification of Diseases (ICD); Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT); Medical Subject Headings (MeSH); Unified Medical Language System (UMLS).

Please always be aware of the definition of biomedical informatics (German: Medizinische Informatik): Biomedical informatics is the interdisciplinary field that studies and pursues the effective use of biomedical data, information, and knowledge for scientific inquiry, problem solving, and decision making, motivated by efforts to improve human health (and well-being).

A. Holzinger, LV 709.049, WS 2015, Wed, 04.11.2015

Bayes' Rule; Biomedical data warehouse; Business hospital information system; Clinical workflow; Data integration; Enterprise data modeling; Information retrieval (IR); Probabilistic model; Quality of information retrieval; Set theoretic model; Vector Space Model (VSM)

At the end of this fourth lecture you
… have an overview of the general architecture of a Hospital Information System (details in Lecture 10: Medical Information Systems and Biomedical Knowledge Management);
… know some principles of hospital databases;
… have an overview of some biomedical databases;
… are familiar with some basics of information retrieval.

Amongst other problems, some key challenges include:
Increasingly large and complex data sets (“Big Data”) due to data-intensive research;
Increasing amounts of non-standardized and unstructured information (e.g. free text);
Data quality, data integration, universal access;
Privacy, security, safety and data protection issues (see → Lecture 11);
Time aspects in databases (Gschwandtner, Gärtner, Aigner & Miksch, 2012), (Johnston & Weis, 2010).

“Big Data resources are all a waste of time and money if data analysts cannot find, or fail to comprehend, the basic information that describes the data held in the resources” (Berman, 2013b). Data identification is certainly the most underappreciated and least understood Big Data issue. Measurements, annotations, properties, and classes of information have no informational meaning unless they are attached to an identifier that distinguishes one data object from all other data objects and that links together all of the information associated with the identified data object (Berman, 2013a). Communication of data between application systems must ensure security to avoid improper access, because trust, or the lack thereof, is the most essential factor blocking the adoption of rapidly evolving Web technology paradigms such as software as a service (SaaS) or data distribution services such as cloud computing (Sreenivasaiah, 2010).

Before we discuss information systems and learn about databases, let us start with a look into the hospital …

Let us start with a look into the hospital: in this slide we see a typical hospital scenario in which medical professionals are surrounded by information technology. An old dream of hospital managers has always been the “all-digital hospital”: to digitalize all workflows and to store all data electronically, moving towards a paperless hospital. Although much effort has been spent on this, most hospitals worldwide are still far from being an “all-digital hospital” (Waterson, Glenn & Eason, 2012).

An interesting study: all hospitals in the province of Styria (Austria) are well equipped with sophisticated information technology, which provides all-encompassing on-screen patient information. Previous research on the theoretical properties, advantages and disadvantages of reading from paper vs. reading from a screen resulted in the assumption that reading from a screen is slower, less accurate and more tiring. However, recent flat-screen technology, especially LCD-based, is of such high quality that this assumption should now be challenged. As the electronic storage and presentation of information has many advantages in addition to faster transfer and processing of the information, the use of electronic screens in clinics should outperform the traditional hard copy in both execution and preference ratings. In a study in the county hospital Styria, Austria, 111 medical professionals working in a real-life setting were each asked to read original and authentic diagnosis reports, a gynecological report and an internal medical document, on both screen and paper in a randomly assigned order. Reading comprehension was measured by the Chunked Reading Test, and speed and accuracy of reading performance were quantified. In order to get a full understanding of the clinicians' preferences, subjective ratings were also collected. Wilcoxon signed-rank tests showed no significant differences in reading performance between paper and screen. However, medical professionals showed a significant (90%) preference for reading from paper. Despite the high quality and the benefits of electronic media, paper still has some qualities which cannot be provided electronically to date (Holzinger et al., 2011).

BTW: Graz University Hospital is the flagship hospital of the Styrian KAGES with 23 county hospitals and is amongst the largest hospitals in Europe.

Mega issues related to hospital information systems include: data integration, data fusion, standardization issues, clinical process analysis, modeling, compliance issues, evidence-based treatment and decision support, privacy, security, safety and data protection, and knowledge discovery and data mining. All of these are connected with the central topic of this lecture: databases.

BTW: KAGES uses openMEDOCS, based on ish.med, which is based on SAP R/3; an overview of different business hospital information system vendors can be found here:

The teamwork in the hospital requires a lot of communication and information exchange. The vision of a business enterprise hospital information system is to cover all workflows, organizational processes and information flows electronically.

Note: The quality of the work of physicians is heavily influenced by the usability of their available equipment. In the slide you see a typical work meeting of medical professionals, where they discuss the patient cases jointly. It is important to study and understand the workflows of the end users and to involve them in the development of information systems as early as possible through a user-centered design process (Holzinger, 2003). Experiments showed that by studying the workflows the engineers gain deep insights into how to develop an appropriate application for a specified target end-user group (Holzinger, Geierhofer, Ackerl & Searle, 2005).

The aforementioned goal of an “all-digital hospital” produces “big data”, and remarkably much of the data is unstructured text. Interestingly, the main and most important output is the medical report (German: Arztbrief). In the example it is the report of a medical image: the relevant item is not the image itself but the report (Holzinger, Geierhofer & Errath, 2007b). Handling unstructured data is a mega challenge for computers.

Let us briefly compare human intelligence with machine intelligence. A good example of the complexity we are facing in hospital information processing is the difference between chess and human natural language processing: chess is a finite, mathematically well-defined search space, hence we have a well-defined computational space, with limited numbers of moves and states, grounded in explicit, unambiguous mathematical rules. Human language is exactly the opposite: ambiguous, contextual and implicit; grounded in the human cognitive space, with a seemingly infinite number of ways to express one and the same meaning.

Note: IBM Deep Blue defeated the World Chess Champion Garry Kasparov in a six-game match in 1997. A number of factors contributed to this success, including: a single-chip chess search engine, a massively parallel system with multiple levels of parallelism, a strong emphasis on search extensions, a complex evaluation function, and effective use of a Grandmaster game database. Technically, Deep Blue was a massively parallel system designed for carrying out chess game tree searches. The system was composed of a 30-node IBM RS/6000 SP computer and 480 single-chip chess search engines, with 16 chess chips per SP processor. The SP system consisted of 28 nodes with 120 MHz P2SC processors and 2 nodes with 135 MHz P2SC processors. The nodes communicated with each other via a high-speed switch, and all nodes had 1 GB of RAM and 4 GB of disk. During the 1997 match with Kasparov, the system ran the AIX 4.2 operating system. The chess chips in Deep Blue were each capable of searching up to 2.5 million chess positions per second and communicated with their host node via a fast micro channel bus (Campbell, Hoane & Hsu, 2002).

You may remember what we learned in the last lecture about workflows and workflow modelling.

Healthcare processes require the cooperation of different organizational units and various medical disciplines. In such an environment optimal process support becomes crucial. In this slide we see a typical organizational process for medical order entry and result reporting, which is used to coordinate the inter-departmental communication between a ward (ambulatory setting) and the radiology unit. The depicted process is not tailored to a specific clinical pathway, but shows an example of a characteristic organizational procedure of the hospital: An order (in German: Anweisung, Verschreibung) is placed by a physician at the ward or in an ambulatory setting. The indication is checked in the radiology department and, depending on the result, the order placer is informed whether the request has been rejected or scheduled. The actual radiological examination and the corresponding documentation are done in the examination room. The radiology report is generated afterwards and has to be validated by the physician with his signature. The report is sent back to the order placer. This is an example of a fundamental process of clinical practice and captures the organizational knowledge necessary to coordinate the healthcare process among different people and organizational units; i.e., the focus is on the support of core organizational processes (Lenz & Reichert, 2007).
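To make the order-entry and result-reporting process a bit more tangible, here is a minimal sketch of it as a finite state machine. The state and event names (`OrderState`, `"schedule"`, `"send_report"`, and so on) are hypothetical labels chosen for illustration; they are not taken from Lenz & Reichert or from any real hospital information system.

```python
from enum import Enum


class OrderState(Enum):
    """Hypothetical states of a radiology order, following the steps described above."""
    PLACED = "placed"          # order placed by the physician at the ward
    REJECTED = "rejected"      # indication check failed, order placer is informed
    SCHEDULED = "scheduled"    # indication check passed, examination is scheduled
    EXAMINED = "examined"      # examination and documentation are done
    REPORTED = "reported"      # radiology report has been generated
    VALIDATED = "validated"    # report signed by the physician
    RETURNED = "returned"      # report sent back to the order placer


# Allowed transitions: (current state, event) -> next state.
TRANSITIONS = {
    (OrderState.PLACED, "reject"): OrderState.REJECTED,
    (OrderState.PLACED, "schedule"): OrderState.SCHEDULED,
    (OrderState.SCHEDULED, "examine"): OrderState.EXAMINED,
    (OrderState.EXAMINED, "write_report"): OrderState.REPORTED,
    (OrderState.REPORTED, "validate"): OrderState.VALIDATED,
    (OrderState.VALIDATED, "send_report"): OrderState.RETURNED,
}


def advance(state, event):
    """Return the next state, or raise ValueError if the event is not allowed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not allowed in state {state.name}") from None


def run(events, state=OrderState.PLACED):
    """Apply a sequence of events, starting from a freshly placed order."""
    for event in events:
        state = advance(state, event)
    return state


final_state = run(["schedule", "examine", "write_report", "validate", "send_report"])
print(final_state.name)
```

Even this toy version shows why digitalizing such processes is hard: every exception the real workflow tolerates (a rescheduled examination, a corrected report) needs its own explicit transition.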

This is just to give you an idea of how complicated such processes can be; you can imagine how difficult it is to digitalize all the data involved.

The medical treatment process is often described as a diagnostic-therapeutic cycle (Bemmel & Musen, 1997) including: observation, reasoning, and action. Please remember that in medicine we deal with uncertain information (Holzinger & Simonic, 2011), and each pass of the diagnostic-therapeutic cycle can be seen as a step in decreasing the uncertainty about the patient's disease. Consequently, the observation process always starts with the patient history (“looking into the past”) and proceeds with diagnostic procedures which are selected based on available information. The aim of the HIS is to assist healthcare personnel in making informed decisions. Maybe the most important question to be answered is how to determine what is relevant. Availability of relevant information is a precondition for (good) medical decisions, and the medical knowledge guides these decisions (Lenz & Reichert, 2007). Following the principles of evidence-based medicine (EBM), physicians are required to formulate questions based on patients' problems, search the literature for answers, evaluate the evidence for its validity and usefulness, and finally apply the (new) information to the patient's treatment (Hawkins, 2005). The limiting factor is the short time a clinician has to make a decision (Gigerenzer & Gaissmaier, 2011).
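One way to picture how each pass of the cycle reduces uncertainty is Bayes' rule, which is among the keywords of this lecture. The sketch below is only an illustration: the prior, sensitivity and specificity are invented numbers, and it assumes that the two findings are conditionally independent given the disease status, which real diagnostic findings often are not.

```python
def posterior(prior, sensitivity, specificity):
    """Probability of disease after one positive finding, by Bayes' rule."""
    true_positive = sensitivity * prior
    false_positive = (1.0 - specificity) * (1.0 - prior)
    return true_positive / (true_positive + false_positive)


# Hypothetical numbers, for illustration only.
p0 = 0.10                                                  # belief before any test
p1 = posterior(p0, sensitivity=0.90, specificity=0.80)     # after a first positive finding
p2 = posterior(p1, sensitivity=0.90, specificity=0.80)     # after a second positive finding

print(f"{p0:.3f} -> {p1:.3f} -> {p2:.3f}")
```

Each pass takes the previous posterior as its new prior, so the belief moves further away from the initial guess with every informative observation.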

This slide shows a classical conceptual model: the heart is a central data and communication structure. Patients “enter” the system (logically) on the left side through the admission, transfer and discharge functions of the core and leave the system, at least partially, through the right side. The main focus is a central database, although alternative solutions have opted for a more distributed construction of databases; nonetheless, central ordering principles have to be kept to achieve the necessary integration of information and its distribution to the various points where it is needed, be it in the area of hospital management or in the field of care provision. This central database serves the central operational purposes of the hospital in the context of its dual goals (Haux et al., 1998), (Reichertz, 2006), (Haux, 2006).

Here you see the typical architecture of such a system.

ICU = Intensive Care Unit; NICU = Neonatal Intensive Care Unit; PICU = Pediatric Intensive Care Unit

There are many different application architectures in use, and we will come back to them later in → Lecture 10, so here is just ONE example of an enterprise business hospital information system, as it is called professionally. However, we now want to concentrate on some technical issues of databases.

In a hospital there are data, data, data, …

In this classical image by (Shortliffe, Perrault, Wiederhold & Fagan, 2001) it becomes very obvious that databases are central components of a hospital information system. A very interesting slide is the next, where we see a historical example from the “stone age” of computer science.

This picture by (Gardner, Pryor & Warner, 1999) is interesting insofar as it clearly shows a mega issue that persists to the present: to integrate and fuse different data and to make them accessible to the clinician. While there is much research on the integration of heterogeneous information systems, a shortcoming lies in the integration of the available data. Just to clarify the difference between data integration and data fusion:

Data integration involves combining data residing in different distributed sources and providing users with a unified view of and access to these data. It has become the focus of extensive theoretical and practical work, and numerous open problems remain unsolved (Lenzerini, 2002).

Data fusion is the process of merging multiple records representing the same real-world object into a single, consistent, accurate, and useful representation (Bleiholder & Naumann, 2008).

The trend towards P4 medicine (Predictive, Preventive, Participatory, Personalized) has resulted in a sheer mass of generated (-omics) data; hence a main challenge is the integration and fusion of heterogeneous data sources, especially the integration of data from the clinical domain with sources from the biological domain.
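The distinction can be sketched in a few lines of code. In this hedged toy example, `integrate` only builds a unified view over two sources, while `fuse_records` merges the records that describe the same patient into one representation. The field names, the sample records and the conflict rule ("the newest non-missing value wins") are assumptions made for illustration; real fusion strategies are considerably more elaborate.

```python
from datetime import date


def integrate(sources):
    """Data integration: a unified view over several sources, grouped by identifier."""
    view = {}
    for source in sources:
        for record in source:
            view.setdefault(record["patient_id"], []).append(record)
    return view


def fuse_records(records):
    """Data fusion: merge records of the same real-world object into one record.

    Assumed conflict rule: process the oldest record first, so that newer
    non-missing values overwrite older ones.
    """
    fused = {}
    for record in sorted(records, key=lambda r: r["updated"]):
        for field, value in record.items():
            if value is not None:
                fused[field] = value
    return fused


# Two hypothetical sources describing the same patient.
his = [{"patient_id": "P1", "name": "Doe, J.", "blood_type": None,
        "updated": date(2015, 1, 10)}]
lab = [{"patient_id": "P1", "name": None, "blood_type": "A+",
        "updated": date(2015, 3, 2)}]

unified_view = integrate([his, lab])                      # two records for P1, side by side
fused = {pid: fuse_records(recs) for pid, recs in unified_view.items()}
print(fused["P1"])
```

Note that both functions depend on a shared identifier (`patient_id`), which is exactly the data identification problem highlighted earlier in this lecture.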

Integration; data fusion; for data analysis; the central goal: to support decision-making processes; data virtualization; abstract layer; business intelligence; service-oriented architecture

Database (DB) is the organized collection of data through a certain data structure (e.g. hash table, adjacency matrix, graph structure, etc.).

Database management system (DBMS) is software which operates the DB. Well-known DBMSs include: Oracle, IBM DB2, Microsoft SQL Server, Microsoft Access, MySQL, SQLite. Examples of graph databases include InfoGrid, Neo4j, or BrightstarDB. A DB is generally not portable between DBMSs, but different DBMSs can interoperate by using standards such as SQL and ODBC.

Database system (DBS) = DB + DBMS. The term database system emphasizes that data is managed in terms of accuracy, availability, resilience, and usability.

Data warehouse (DWH) is an integrated repository used for reporting and long-term storage of analysis data.

Data marts (DM) are access layers of a DWH and are used as temporary repositories for data analysis.
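As a small illustration of these terms, the following sketch uses SQLite (one of the DBMSs listed above) through Python's standard `sqlite3` module. The `patient` table, its columns and the sample rows are hypothetical. The SQLite engine plays the role of the DBMS, the in-memory table is the DB, and SQL is the standard through which we talk to it.

```python
import sqlite3

# The DBMS (SQLite) creates and operates the DB; ":memory:" keeps it in RAM.
conn = sqlite3.connect(":memory:")

# Hypothetical schema, for illustration only.
conn.execute(
    "CREATE TABLE patient ("
    "  id INTEGER PRIMARY KEY,"
    "  name TEXT NOT NULL,"
    "  ward TEXT"
    ")"
)

# Parameterized statements keep the data separate from the SQL text.
conn.executemany(
    "INSERT INTO patient (name, ward) VALUES (?, ?)",
    [("Doe, Jane", "ICU"), ("Roe, Richard", "Radiology")],
)
conn.commit()

icu_patients = conn.execute(
    "SELECT id, name FROM patient WHERE ward = ? ORDER BY id", ("ICU",)
).fetchall()
conn.close()

print(icu_patients)
```

Because the query is plain SQL, much the same statements would run on other relational DBMSs, which is the interoperability point made above; the stored database file itself, however, would not be portable.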

Recommended reading includes (Plattner, 2013) and (Robinson, Webber & Eifrem, 2013):
Robinson, I., Webber, J. & Eifrem, E. 2013. Graph Databases, O'Reilly Media.
Plattner, H. 2013. A Course in In-Memory Data Management: The Inner Mechanics of In-Memory Databases, Heidelberg New York Dordrecht London, Springer.
One of the standard textbooks is the 6th edition of "Database System Concepts" by (Silberschatz, Korth & Sudarshan, 2010).

A DWH is an integrated system, specifically designed for enterprise business decision support, and can be used in hospitals and in biomedical applications. In Slide 4-13 we see an example of a hospital data warehouse: on the left there are the (heterogeneous) data sources, such as PACS (Picture Archiving & Communication System) and RIS (Radiological Information System), and, apart from the core HIS, some special databases which can also include proprietary and legacy systems. For the data staging and area servers the Common Object Request Broker Architecture (CORBA) is used, a standard defined by the Object Management Group (OMG) that supports multiple-platform interoperability (Zhang, Zhang, Tjandra & Wong, 2004). This is a standard hospital information architecture and, typically, comes with no integration of laboratory data sources and, most of all, no omics data integration, as for example from pathology or a biobank.

A DWH can be subdivided into so-called data marts (DM), which can be seen as specific access layers of a DWH, each oriented to a specific team. Slide 4-14 shows the architecture of the Mayo Clinic DWH, which incrementally instantiates each component of the architecture on demand. Data integration proceeds from left to right (leftmost you see the primary data sources; moving right, the data are integrated into staging and replication services, with further refinement). The layers are:
1) Subjects = the highest-level areas that define the activities of the enterprise (e.g. Individual);
2) Concepts = the collections of data that are contained in one or more subject areas (e.g., Patient, Provider, Referrer, etc.);
3) Business Information Models = the organization of the data that support the processes and workflows of the enterprise's defined Concepts (Chute, Beck, Fisk & Mohr, 2010).

Cloud computing is a good example of software as a service: flexible space via a network. This reminds us of the early days of computing, with mainframes and thin-client terminals.

A standard environment for the production and processing of genomic data can be seen in Slide 4-15: sequencing labs submit their data to large databases, e.g. GenBank, National Center of Biotechnology Information (NCBI); European Bioinformatics Institute (EMBL) database; DNA Data Bank of Japan (DDBJ); Short Read Archive (SRA); Gene Expression Omnibus (GEO) or the microarray database ArrayExpress. These maintain, organize and distribute the sequencing data. Most users access the information either through web-based applications or through integrators, such as Ensembl, the University of California at Santa Cruz (UCSC) Genome Browser or Galaxy. The end users have to download genomic data from these primary and secondary sources (Stein, 2010).

Remember: Sequencing is the process of determining the precise order of nucleotides within a DNA molecule, i.e. the order of the four bases (adenine, guanine, cytosine, and thymine) in a strand of DNA. The advent of rapid DNA sequencing methods has greatly accelerated biological and medical research and discovery and produces large data sets. Sequencing has become indispensable for basic biological research and in numerous applied fields such as diagnostics and biotechnology.

Note: A biobank is a physical place which stores biological specimens, and in some cases also data (Roden et al., 2008).

Here we see a cloud-based genome informatics system. Instead of separate genome data sets stored at various locations, the data sets are stored in the cloud as virtual databases. Web services run on top of these data sets, including the primary archives and the integrators, running as virtual machines within the cloud. Casual users, who are accustomed to accessing the data via the NCBI, DDBJ, Ensembl or UCSC, work as usual; the fact that these servers are located inside the cloud is invisible to them. Power users can continue to download the data, but have an attractive alternative: instead of moving the data to the computational cluster, they move the computational cluster to the data (Stein, 2010).

Note: Cloud computing is based on sharing of resources to achieve coherence and economies of scale over a network (similar to the electricity grid). At the foundation of cloud computing is the broader concept of converged infrastructure and shared services. Cloud providers offer their services according to several fundamental models: 1) Infrastructure as a service (IaaS), 2) Platform as a service (PaaS), 3) Software as a service (SaaS).

Just an example of a cloud-based service: the Master Index is the core entity of the PACS Cloud and contains information about the other modules, including Gateways and Cloud Slaves (repository and database). It also provides authentication services to institutional gateways, and all identifiable information related to patients is stored in a master index database, which is fundamental to ensuring confidentiality and privacy. The Cloud Slaves provide, on the one hand, storage of sightless data (object repositories) and, on the other hand, a database containing all non-identifiable metadata extracted from DICOM studies, i.e. the most demanding task concerning computational power (Bastiao-Silva, Costa, Silva & Oliveira, 2011).

We have to distinguish between federated data and warehoused data. A federated database system is a meta-database management system which transparently maps multiple heterogeneous and autonomous database systems into a single federated database; this can be a “virtual database”, without the data integration found in data warehouses. In the slide we can see the data integration architecture on the y-axis and the knowledge representation methodologies on the x-axis, and where current data integration systems lie along this continuum. The essence of this image is that there is no “best solution”: a system designed to have full control of data and fast queries can have difficulty expressing complex biological concepts and integrating them. Systems that employ highly expressive knowledge representation methodologies such as ontologies are better able to represent and integrate complex biological concepts but have much less tractable queries (Louie et al., 2007).
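The difference between the two architectures can be sketched with two tiny SQLite databases standing in for autonomous sources. Everything here (the `gene` table, the helper functions `federated_query` and `build_warehouse`) is a hypothetical toy setup: a federated query visits the live sources at query time, while a warehouse holds a copy made at load time.

```python
import sqlite3


def make_source(rows):
    """Create one autonomous source database with a small, hypothetical gene table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE gene (symbol TEXT, organism TEXT)")
    conn.executemany("INSERT INTO gene VALUES (?, ?)", rows)
    return conn


def federated_query(sources, sql):
    """Federated access: send the query to every source at query time; nothing is copied."""
    results = []
    for conn in sources:
        results.extend(conn.execute(sql).fetchall())
    return results


def build_warehouse(sources):
    """Warehoused access: copy the data of all sources into one integrated store."""
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE gene (symbol TEXT, organism TEXT)")
    for conn in sources:
        rows = conn.execute("SELECT symbol, organism FROM gene").fetchall()
        warehouse.executemany("INSERT INTO gene VALUES (?, ?)", rows)
    return warehouse


source_a = make_source([("TP53", "human")])
source_b = make_source([("Trp53", "mouse")])
sources = [source_a, source_b]

warehouse = build_warehouse(sources)

# A source changes after the warehouse was loaded.
source_b.execute("INSERT INTO gene VALUES (?, ?)", ("Brca1", "mouse"))

query = "SELECT symbol FROM gene ORDER BY symbol"
federated_rows = federated_query(sources, query)    # sees the new row
warehouse_rows = warehouse.execute(query).fetchall()  # still the snapshot
print(len(federated_rows), len(warehouse_rows))
```

The trade-off shows up even at this scale: the federated query is always current but depends on every source being reachable, whereas the warehouse answers from a single place but is only as fresh as its last load.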

Obviously there is a difference between the databases for the hospital information system and the databases which are used for scientific work.

Whereas databases for use in HIS are process-centered and central to the electronic patient record, biomedical databases are libraries of all sorts of life science data, collected from scientific experiments and computational analyses. Such databases contain experimental biological data from clinical work, genomics, proteomics, metabolomics, microarray gene expression, phylogenetics, pharmacogenomics, etc. Examples:
Text: e.g. PubMed, OMIM (Online Mendelian Inheritance in Man);
Sequence data: e.g. Entrez, GenBank (DNA), UniProt (protein);
Protein structures: e.g. PDB, Structural Classification of Proteins (SCOP), CATH (Protein Structure Classification).
An overview can be found in (Masic & Milinovic, 2012), online open access via: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3544328

Note: Pharmacogenomics is the technology for analyzing how genetic makeup affects an individual's response to drugs. It deals with the influence of genetic variation on drug response in patients by correlating gene expression or single-nucleotide polymorphisms with efficacy and toxicity. The central aim is to optimize drug therapy to ensure maximum effectiveness with minimal adverse effects, which makes it a core element of personalized medicine.

A good video can be seen here: https://www.youtube.com/watch?v=DSHhep_w6pk

The Protein Data Bank archive: information about the 3D shapes of proteins, nucleic acids, and complex assemblies helps students and researchers understand all aspects of biomedicine, from protein synthesis to health and disease. As a member of the wwPDB, the RCSB PDB curates and annotates PDB data.

The RCSB PDB builds upon the data by creating tools and resources for research and education in molecular biology, structural biology, computational biology, and beyond.

Remember: Proteins are the molecules used by the cell for performing and controlling cellular processes, including: degradation and biosynthesis of molecules, physiological signaling, energy storage and conversion, formation of cellular structures, etc. Protein structures are determined with crystallographic X-ray methods or by nuclear magnetic resonance spectroscopy. Once the atomic coordinates of the protein structure have been determined, a table of these coordinates is deposited in the Protein Data Bank (PDB), an international repository for 3D structure files: http://www.rcsb.org/pdb/
This database is handled by the RCSB (Research Collaboratory for Structural Bioinformatics) at Rutgers University and UC San Diego. PDB is the most important source for protein structures. Before a new protein structure is added, a careful examination of the data must be carried out to guarantee the quality of the structure. The PDB data file contains, among other things, the coordinates of all the atoms of the protein (Wiltgen & Holzinger, 2005), (Wiltgen, Holzinger & Tilz, 2007).
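To give an idea of what such a coordinate table looks like, here is a minimal sketch that reads one ATOM record of the legacy fixed-column PDB text format. The sample record is made up for illustration (it is not copied from a real entry), `parse_atom_line` is a hypothetical helper, and only a few of the record's fields are extracted; for real work a dedicated structure-parsing library is the safer choice.

```python
# A made-up ATOM record, assembled field by field so the column layout is visible.
SAMPLE_ATOM_LINE = (
    "ATOM  "      # columns 1-6:   record name
    "    1"       # columns 7-11:  atom serial number
    " "           # column 12:     blank
    " N  "        # columns 13-16: atom name
    " "           # column 17:     alternate location indicator
    "MET"         # columns 18-20: residue name
    " "           # column 21:     blank
    "A"           # column 22:     chain identifier
    "   1"        # columns 23-26: residue sequence number
    "    "        # columns 27-30: insertion code and blanks
    "  27.340"    # columns 31-38: x coordinate (angstrom)
    "  24.430"    # columns 39-46: y coordinate (angstrom)
    "   2.614"    # columns 47-54: z coordinate (angstrom)
)


def parse_atom_line(line):
    """Extract a few fields from one ATOM/HETATM record (fixed columns, 0-based slices)."""
    if not line.startswith(("ATOM  ", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "chain": line[21],
        "residue_number": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }


atom = parse_atom_line(SAMPLE_ATOM_LINE)
print(atom)
```

The fields are sliced by position rather than split on whitespace because, in a fixed-column format, neighboring values may touch each other without a separating blank.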

A PDB structure entry should be cited with its PDB ID and primary reference. For example:
PDB ID: 102L
D.W. Heinz, W.A. Baase, F.W. Dahlquist, B.W. Matthews (1993) How Amino-Acid Insertions are Allowed in an Alpha-Helix of T4 Lysozyme. Nature 361: 561.

An entry without a published reference can be cited with the PDB ID, author names, and title:
PDB ID: 1CI0
W. Shi, D.A. Ostrov, S.E. Gerchman, V. Graziano, H. Kycia, B. Studier, S.C. Almo, S.K. Burley, New York Structural GenomiX Research Consortium (NYSGXRC). The Structure of PNP Oxidase from S. cerevisiae.

An entry may also be referenced using its Digital Object Identifier (DOI). The DOIs for PDB entries all have the same format: 10.2210/pdbXXXX/pdb, where XXXX should be replaced with the desired PDB ID. The DOI can be used as part of a URL to obtain this data file (http://dx.doi.org/10.2210/pdb4hhb/pdb), or can be entered in a DOI resolver (such as http://www.crossref.org/) to automatically link to pdb4hhb.ent.gz on the main PDB ftp archive (ftp://ftp.wwpdb.org). For example, the DOI for PDB entry 4HHB is "10.2210/pdb4hhb/pdb". This links directly to the entry in the PDB file format on the FTP server.

Images from Structure Summary pages should cite the RCSB PDB and the PDB entry:
Image from the RCSB PDB (www.rcsb.org) of PDB ID 1BNA (H.R. Drew, R.M. Wing, T. Takano, C. Broka, S. Tanaka, K. Itakura, R.E. Dickerson (1981) Structure of a B-DNA dodecamer: conformation and dynamics. Proc. Natl. Acad. Sci. USA 78: 2179-2183).

Images created using PDB data and other software should cite the PDB ID and the molecular graphics program used:
Image of 1AOI (K. Luger, A.W. Mader, R.K. Richmond, D.F. Sargent, T.J. Richmond (1997) structure of the core particle at 2.8 Å resolution. Nature 389: 251-260) created with Protein Workshop (J.L. Moreland, A. Gramada, O.V. Buzko, Q. Zhang, P.E. Bourne (2005) The Molecular Biology Toolkit (MBT): a modular platform for developing molecular visualization applications. BMC Bioinformatics 6: 21).
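The DOI pattern described above is regular enough to generate. The following sketch builds the DOI and the resolver URL from a PDB ID; the function names `pdb_doi` and `pdb_doi_url` are hypothetical, and the check that an ID consists of exactly four alphanumeric characters is a simplifying assumption for the classic four-character IDs used in this lecture.

```python
import re


def pdb_doi(pdb_id):
    """Build the DOI of a PDB entry following the pattern 10.2210/pdbXXXX/pdb."""
    pdb_id = pdb_id.strip().lower()
    # Assumption: classic PDB IDs are four alphanumeric characters.
    if not re.fullmatch(r"[0-9a-z]{4}", pdb_id):
        raise ValueError(f"not a four-character PDB ID: {pdb_id!r}")
    return f"10.2210/pdb{pdb_id}/pdb"


def pdb_doi_url(pdb_id, resolver="http://dx.doi.org/"):
    """Prefix the DOI with a resolver address to obtain a clickable URL."""
    return resolver + pdb_doi(pdb_id)


print(pdb_doi("4HHB"))
print(pdb_doi_url("4HHB"))
```

The ID is lower-cased because the examples in the text write the entry 4HHB as pdb4hhb inside the DOI.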

Remember the structural dimensions which we discussed in Lecture 1 and Lecture 2. This slide by (Kampen, 2013) is a very nice overview of various databases addressing the different microscopic dimensions. Additionally, the data on the level of the hospital information systems are added, so that you have a good summary of the aforementioned. If we set aside literature databases and ontologies (in the upper right corner of this slide), we start with:
Genome databases: Ensembl http://www.ensembl.org/index.html
Nucleotide sequence: EMBL-Bank http://www.ebi.ac.uk/ena/
Gene expression: ArrayExpress http://www.ebi.ac.uk/arrayexpress
Proteomes: UniProt http://www.uniprot.org/
Proteins: InterPro http://www.ebi.ac.uk/interpro/
Protein structure: PDB http://www.rcsb.org/pdb/home/home.do
Protein interactions: IntAct http://www.ebi.ac.uk/intact/
Chemical entities: ChEMBL https://www.ebi.ac.uk/chembl/
Pathways: Reactome http://www.reactome.org/
Systems: BioModels http://www.ebi.ac.uk/biomodels-main/

Ensembl (not to be confused with Ensemble ;-) is a good example of a genome database. It is a joint project between the European Bioinformatics Institute and the Wellcome Trust Sanger Institute, launched in 1999 in response to the imminent completion of the Human Genome Project (Flicek et al., 2011). Its aim remains to provide a centralized resource for geneticists and molecular biologists studying the genomes of our own species and of other vertebrates and model organisms. Ensembl provides one of several well-known genome browsers for the retrieval of genomic information.

ArrayExpress is a database of functional genomics experiments that can be queried and the data downloaded. It includes gene expression data from microarray and high-throughput sequencing studies. Data is collected according to the MIAME and MINSEQE standards. Experiments are submitted directly to ArrayExpress or are imported from the NCBI GEO database.

MIAME = Minimum Information About a Microarray Experiment. This is the data that is needed to enable the interpretation of the results of the experiment unambiguously and potentially to reproduce the experiment (Brazma et al., 2001). The six most critical elements contributing towards MIAME are:
1) The raw data for each hybridisation (e.g., CEL or GPR files);
2) The final processed (normalised) data for the set of hybridisations in the experiment;
3) The essential sample annotation, including experimental factors and their values;
4) The experimental design, including sample data relationships;
5) Annotation of the array (e.g., gene identifiers, genomic coordinates, probe oligonucleotide sequences or reference commercial array catalog number);
6) Laboratory and data processing protocols (e.g., what normalisation method has been used to obtain the final processed data).
See: http://www.mged.org/Workgroups/MIAME/miame.html
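The six MIAME elements lend themselves to a simple completeness check before a data set is submitted. The sketch below is purely illustrative: the key names in `MIAME_ELEMENTS`, the helper `missing_miame_elements` and the example submission are assumptions made here, not the submission format of ArrayExpress or GEO.

```python
# Hypothetical key names for the six MIAME elements listed above.
MIAME_ELEMENTS = (
    "raw_data",             # 1) raw data for each hybridisation
    "processed_data",       # 2) final processed (normalised) data
    "sample_annotation",    # 3) sample annotation and experimental factors
    "experimental_design",  # 4) design, including sample data relationships
    "array_annotation",     # 5) annotation of the array
    "protocols",            # 6) laboratory and data processing protocols
)


def missing_miame_elements(submission):
    """Return the MIAME elements that are absent or empty in a submission dict."""
    return [key for key in MIAME_ELEMENTS if not submission.get(key)]


# A made-up, incomplete submission.
submission = {
    "raw_data": ["sample1.CEL", "sample2.CEL"],
    "processed_data": "normalised_matrix.txt",
    "sample_annotation": {"sample1": {"treatment": "control"}},
    "experimental_design": "two-group comparison",
    "array_annotation": "",   # present but empty, so it counts as missing
}

missing = missing_miame_elements(submission)
print(missing)
```

A check like this only tests for presence; whether the annotation is actually sufficient to interpret and reproduce the experiment still needs human curation.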


IntAct is an open source database for protein-protein interactions. The web interface provides both textual and graphical representations of such protein interactions, and allows exploring interaction networks in the context of the GO annotations of the interacting proteins. Moreover, a web service allows direct computational access to retrieve interaction networks in XML format. IntAct contains binary and complex interactions imported from the literature and curated in collaboration with the Swiss-Prot team, making intensive use of controlled vocabularies to ensure data consistency (Hermjakob et al., 2004). http://www.ebi.ac.uk/intact


The BioModels Database is a freely accessible online resource for storing, viewing, retrieving, and analyzing published, peer-reviewed quantitative models of biochemical and cellular systems. The structure and behavior of each simulation model are thoroughly checked; in addition, model elements are annotated with terms from controlled vocabularies and linked to relevant data resources. Models can be examined online or downloaded in various formats, and reaction network diagrams can be generated from the models in several formats. BioModels Database also provides features such as online simulation and the extraction of components from large-scale models into smaller sub-models. The system provides a range of web services that external software systems can use to access up-to-date data from the database (Li et al., 2010). http://www.ebi.ac.uk/biomodels/
Note: Quantitative models of biochemical and cellular systems are used to answer research questions in the biological sciences, and digital modeling is of growing interest in molecular and systems biology. A well-known example is the Virtual Human (Kell, 2007).


The largest monastery library in the world: a good example of a well-defined knowledge space.

Yes, perfectly correct: this Golden Retriever is bringing back the wooden stick; he is retrieving it. This is exactly what the word "to retrieve" means: bringing something back.


Please remember the basic differences between retrieval and discovery: Retrieval is bringing back an already known object, whereas discovery is finding something which was previously unknown. In other words: Retrieval deals with known objects, and Discovery/Mining is finding new things; in our case, new insight (sensemaking) into data. Slide 4-26 makes it clear:

Maimon & Rokach (2010) define Knowledge Discovery in Databases (KDD) as an automatic, exploratory analysis and modeling of large data repositories and the organized process of identifying valid, novel, useful and understandable patterns from large and complex data sets. Data Mining (DM) is the core of the KDD process (Witten, Frank & Hall, 2011). The term KDD actually goes back to the machine learning and Artificial Intelligence (AI) community (Piatetsky-Shapiro, 2000). Interestingly, the first application in this area was again in medical informatics: the program Rx was the first that analyzed data from about 50,000 Stanford patients and looked for unexpected side effects of drugs (Blum & Wiederhold, 1985). The term really became popular with the paper by Fayyad, Piatetsky-Shapiro & Smyth (1996), who described the KDD process as consisting of 9 subsequent steps:


1. Learning from the application domain: includes understanding relevant previous knowledge, the goals of the application and a certain amount of domain expertise;
2. Creating a target dataset: includes selecting a dataset or focusing on a subset of variables or data samples on which discovery shall be performed;
3. Data cleansing (and preprocessing): includes removing noise or outliers, strategies for handling missing data, etc.;
4. Data reduction and projection: includes finding useful features to represent the data, dimensionality reduction, etc.;
5. Choosing the function of data mining: includes deciding the purpose and principle of the model for mining algorithms (e.g., summarization, classification, regression and clustering);
6. Choosing the data mining algorithm: includes selecting method(s) to be used for searching for patterns in the data, such as deciding which models and parameters may be appropriate (e.g., models for categorical data are different from models on vectors over reals) and matching a particular data mining method with the criteria of the KDD process;
7. Data mining: searching for patterns of interest in a representational form or a set of such representations, including classification rules or trees, regression, clustering, sequence modeling, dependency and line analysis;
8. Interpretation: includes interpreting the discovered patterns and possibly returning to any of the previous steps, as well as possible visualization of the extracted patterns, removing redundant or irrelevant patterns and translating the useful ones into terms understandable by users;
9. Using discovered knowledge: includes incorporating this knowledge into the performance of the system, taking actions based on the knowledge or documenting it and reporting it to interested parties, as well as checking for, and resolving, potential conflicts with previously believed knowledge (Holzinger, 2013).

In information retrieval, a query q is defined as a formulation q = (N, L); matching it against an index I, written Matching(q, I), retrieves the relevant data that satisfy the search query (Baeza-Yates & Ribeiro-Neto, 2011).


Please remember the differences between data objects and information objects: data is an abstract representation in the computational space; information is perceivable for the cognitive space. (Note that this does not mean that information is automatically knowledge; to gain knowledge we must use both our perception and our cognition, i.e. human intelligence.)


An excellent starting point for the distinction between data retrieval (DR) and information retrieval (IR) is the work of Van Rijsbergen (1979): the most important difference is that the data model in DR is deterministic, whereas we speak about probable information in the IR model; hence information retrieval is probabilistic (Simonic & Holzinger, 2010).
* Monothetic = type in which all members are identical on all characteristics;
** Polythetic = type in which all members are similar, but not identical.


IR can be defined as a recall of already existing information, not aiming at the discovery of new structures as is the goal in Knowledge Discovery and Data Mining (see → Lecture 6). As we have already heard several times, in hospital information systems most of the data consists of medical documents, which consist mostly of unstructured information: text. But: What is text? From a computational perspective, text consists of sequences of character strings, the syntax (Hotho, Nürnberger & Paaß, 2005); hence it is an abstract representation of natural language, and the challenges are in semantics (meaning). Text processing belongs to the field of natural language processing (NLP), which is highly interdisciplinary, dealing with the interaction between the cognitive space (natural languages) and the computational space (formal languages). As such, NLP is closely related to HCI. Text mining is a subfield of data mining. The original goal of IR was to find documents which contain answers to questions, not to find the answers themselves (Hearst, 1999). For this purpose statistical measures and methods are used, and we need a formal description first.


This is the general principle: The end user formulates his query via the user interface, in the form of text operations ("user need"). The next step is the representation of the documents (logical document view D in the formal model in → Slide 4-30) and the representation of the reasoning strategy (query logical view Q; compare with → Slide 4-30 and → Slide 4-31). The result is a ranking of the retrieved documents, which is displayed via the user interface.


Modeling the IR process is complex, because we are dealing with imprecise, vague and uncertain elements; it is difficult to formalize due to the high influence of human factors, i.e. relevance and information needs, which are highly subjective and context specific. However, in the definition of any IR model we can identify some common aspects (Canfora & Cerulo, 2004). The first step is the representation of documents and information needs. From these representations a reasoning strategy can be defined, which solves a representation similarity problem to compute the relevance of documents with respect to the queries. Various strategies have been introduced with the aim of improving the IR process. We classify these methodologies under two main aspects: Representation (query & document, see → Slide 4-33) and Reasoning (application of diverse methods, see → Slide 4-34). Let the IR model be a quadruple

Eq. 4-1   IR = {D, Q, F, R(q_i, d_j)}

D is a set composed of logical views (representation component) of the documents within a collection;
Q is a set of logical views (representation component) of the user information needs (these are called queries);
F is a framework for modeling document representations, queries and their relationships (reasoning component); this includes sets and Boolean relations, vectors and linear algebra operations, sample spaces and probability distributions;
R(q_i, d_j) is a ranking function (→ Slide 4-31) that associates a real number with a query representation q_i ∈ Q and a document representation d_j ∈ D. Such a ranking defines an ordering among the documents with regard to the query q_i.
The end user in → Slide 4-29 formulates his query in the form of a text operation; the next step is the representation (logical view D) of the documents and the representation of the reasoning strategy; both logical views D and Q (compare with Slide 4-31) result in a ranking of the retrieved documents.
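As a rough illustration of the quadruple, the following sketch treats the logical views in D and Q as sets of index terms and uses a simple term-overlap (Jaccard) score as the ranking function R(q_i, d_j). The overlap score, the toy documents and the name `rank` are assumptions made for illustration only; the lecture does not prescribe a particular R.

```python
def rank(query_terms, docs):
    """Order document ids by a hypothetical ranking function R(q_i, d_j)."""
    def r(q, d):
        # Jaccard overlap |q ∩ d| / |q ∪ d|: an assumed stand-in for R.
        union = q | d
        return len(q & d) / len(union) if union else 0.0

    scored = [(r(query_terms, terms), doc_id) for doc_id, terms in docs.items()]
    # Highest score first; ties are broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored


# D: logical views of the documents as sets of index terms (toy data).
DOCS = {
    "d1": {"fracture", "femur", "surgery"},
    "d2": {"fever", "infection"},
    "d3": {"fracture", "fever"},
}
# Q: one query, represented the same way as the documents.
RANKING = rank({"fracture", "fever"}, DOCS)
```

The framework F here is plain set theory; swapping `r` for a vector-based or probabilistic score changes the model without changing the overall structure.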


The logical views D and Q result in the ranking function R(q_i, d_j) according to (Baeza-Yates & Ribeiro-Neto, 2011).

Spoken: "R of q subscript i and d subscript j".


Guess which algorithm this is? A short description can be found in Hastie, T., Tibshirani, R. & Friedman, J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, New York, Springer.


Yes! There are a lot of different methods, each having particular advantages and disadvantages. We cannot discuss much here, but we can get a rough overview.


The representation component is an essential part of every IR system, as it is the representation of the information itself (visible to the user): information can be processed if it is represented in an appropriate way. Queries are the representation of the information needs of a user.
Note: A text can be characterized by four attributes: syntax, structure, semantics, and style. A text has a given syntax and a structure, which are usually dictated by the application or by the person who created it. Text also has semantics, specified by the author of the document. Additionally, a document may have a presentation style associated with it, which specifies how it should be displayed or printed. In many approaches to text representation the style is coupled with the document syntax and structure (e.g., LaTeX). XML separates the representation of syntax and structure, defined either by a DTD or an XSD, from style, which is captured by XSL (Canfora & Cerulo, 2004).
Note: An n-gram is a subsequence of n items from a given sequence. The items in question can be phonemes, syllables, letters, words or base pairs, according to the application. An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bi-gram" (or, less commonly, a "di-gram"); size 3 is a "tri-gram"; size 4 is a "four-gram"; and size 5 or more is simply called an "n-gram". Some language models built from n-grams are "(n−1)-order Markov models". An n-gram model is a type of probabilistic model for predicting the next item in such a sequence. n-gram models are used in various areas of statistical natural language processing and genetic sequence analysis.
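To make the n-gram definition concrete, here is a minimal sketch that extracts contiguous n-grams from any sequence, whether words of a sentence or base pairs of a DNA string. The function name `ngrams` and the sample inputs are illustrative assumptions.

```python
def ngrams(items, n):
    """Return all contiguous subsequences of length n as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of length L has L - n + 1 n-grams (none if n > L).
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


WORDS = "the patient has a fever".split()
BIGRAMS = ngrams(WORDS, 2)             # word bi-grams
BASE_TRIGRAMS = ngrams("GATTACA", 3)   # tri-grams over base pairs
```

Counting how often each n-gram occurs in a corpus is the usual first step towards the probabilistic n-gram models mentioned above.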


Deep learning algorithms are based on distributed representations, with the assumption that observed data is generated by the interactions of many different factors on different levels. Deep learning adds the assumption that these factors are organized into multiple levels, corresponding to different levels of abstraction or composition; various numbers of layers and layer sizes can be used to provide different amounts of abstraction. Bengio, Y.; Courville, A.; Vincent, P. (2013). "Representation Learning: A Review and New Perspectives". IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8): 1798–1828.

Reasoning refers to the set of methods, models, and technologies used to match document and query representations in the retrieval task. Closely related to the reasoning component is the concept of relevance. The primary goal of an IR system is to retrieve the documents relevant to a query. The reasoning component defines the framework to measure the relevance between documents and queries using their representations (Canfora & Cerulo, 2004). Google, for example, uses a keyword-based vector space model (see → Slide 4-38) along with graph-based probability theories and fuzzy set theories. Slide 4-35 shows a concise overview of some selected methods, according to various document properties.


There are many methods of IR; for details consult a standard reference, e.g. Baeza-Yates & Ribeiro-Neto (2011). Set theoretic approaches include the Classic Set-based Boolean, the Extended Boolean and the Fuzzy approach; algebraic approaches include the Generalized Vector Model, Latent Semantic Indexing (LSI) and Neural Networks; and the probabilistic approach includes Bayesian Networks, Language Models and Inference Networks. We will discuss only a few, and these very briefly, so that you have a quick overview: the set theoretic approach (Boolean Model) in Slide 4-36 and Slide 4-37; the Vector Space Model in Slide 4-38 to Slide 4-42; and the Probabilistic Model in Slide 4-43 to Slide 4-44.


Documents and queries are represented as sets of index terms; queries are Boolean expressions (AND, OR, NOT). For the Boolean model, the index term weight variables are binary, i.e. w_(i,j) ∈ {0, 1}. A query q is a conventional Boolean expression. Let q_dnf be the disjunctive normal form of the query q. Further, let q_cc be any of the conjunctive components of q_dnf. The similarity sim(d_j, q) of a document d_j to the query q is defined as 1 if there is a conjunctive component q_cc of q_dnf whose binary index term weights all agree with those of d_j, and as 0 otherwise.

If sim(d_j, q) = 1, then the Boolean model predicts that the document d_j is relevant to the query q. Otherwise the prediction is that the document is not relevant. For details please refer to (Baeza-Yates & Ribeiro-Neto, 2011).
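A minimal sketch of Boolean matching follows, assuming the query is already given in disjunctive normal form. Each conjunctive component is encoded as a mapping from index term to its required binary weight (1 = term must occur, 0 = term must not occur); this encoding, the function name `boolean_sim` and the toy data are assumptions for illustration, not the notation of the cited textbook.

```python
def boolean_sim(doc_terms, q_dnf):
    """Return 1 if the document satisfies any conjunctive component, else 0."""
    for q_cc in q_dnf:
        # The component matches when every listed term has the required weight.
        if all((term in doc_terms) == bool(weight)
               for term, weight in q_cc.items()):
            return 1
    return 0


# Toy query: (glaucoma AND NOT surgery) OR (fever AND infection)
Q_DNF = [{"glaucoma": 1, "surgery": 0}, {"fever": 1, "infection": 1}]
DOCS = {
    "d1": {"glaucoma", "therapy"},
    "d2": {"glaucoma", "surgery"},
    "d3": {"fever", "infection", "pathogen"},
}
RESULT = {doc_id: boolean_sim(terms, Q_DNF) for doc_id, terms in DOCS.items()}
```

Note that the result is strictly 0 or 1: d1 and d3 are both "relevant", and the model offers no way to say which of them matches better.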


The Boolean Model has several advantages: it is easy to understand, it has an exact formalism, and its query language is expressive. However, it also has serious disadvantages: there are no partial matches; the "bag-of-words" representation does not accurately consider the semantics of documents (Vallet, Fernández & Castells, 2005); the query language is complicated; and, finally, the retrieved documents cannot be ranked.

The Extended Boolean Model (EBM) by (Salton, Fox & Wu, 1983) overcomes some of these disadvantages by making use of partial matching and term weights, similar to the vector space model. Moreover, because vector-processing systems suffer from one major disadvantage, namely that the structure inherent in the standard Boolean query formulation is absent, the EBM combines the characteristics of the Vector Space Model with the properties of Boolean algebra. Hence, the EBM can also be applied when the initial query statements are available as natural language formulations of user needs, rather than as conventional Boolean formulations.


The vector space model (VSM) represents documents as vectors in the m-dimensional space (Salton, Wong & Yang, 1975). Thus, documents can be compared by vector operations, and queries can be performed by encoding the query terms, similarly to the documents, in a query vector. This query vector can be compared to each document, which returns a result list by ordering the documents according to the computed similarity. The main task of the vector space representation of documents is to find an appropriate encoding of the feature vector. Each element of a vector usually represents a word (see → Slide 4-40) of the document collection. The size of the vector is defined by the number of words of the complete document collection. The easiest way of document encoding is to use binary term vectors: a vector element is set to 1 if the corresponding word is used in the document and to 0 if the word is not (Equation 4-4). This encoding results in a simple Boolean comparison.
To improve the performance, term weighting schemes are usually used, where the weights reflect the importance of a word in a specific document of the considered collection. Large weights are assigned to terms that are used frequently in relevant documents but rarely in the whole document collection (Salton & Buckley, 1988). Thus a weight w(d; t) for a term t in document d is computed as term frequency tf(d; t) times inverse document frequency idf(t), which describes the term specificity within the document collection. The ranking can be made by using the cosine similarity (see → Slide 4-41). The cosine of the angle between two vectors is a measure of how "similar" they are, which in turn is a measure of the similarity of these strings. If the vectors are of unit length, the cosine of the angle between them is simply the dot product of the vectors (Tata & Patel, 2007).
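The weighting and ranking described above can be sketched as follows. The sketch assumes the common variant idf(t) = log(N / df(t)), stores vectors sparsely as dictionaries, and weights query terms by their idf alone; these choices, the function names and the toy documents are illustrative assumptions, since the lecture does not fix a specific tf-idf formula.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return ({doc_id: sparse tf-idf vector}, {term: idf}) for token lists."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = {doc_id: {t: tf * idf[t] for t, tf in Counter(tokens).items()}
               for doc_id, tokens in docs.items()}
    return vectors, idf


def cosine(u, v):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


DOCS = {
    "d1": "fracture of the femur".split(),
    "d2": "fever and infection".split(),
    "d3": "femur fracture with fever".split(),
}
VECTORS, IDF = tfidf_vectors(DOCS)
# Encode the query like a document: each query term weighted by its idf.
QUERY = {t: IDF.get(t, 0.0) for t in ["femur", "fracture"]}
RANKED = sorted(DOCS, key=lambda d: cosine(QUERY, VECTORS[d]), reverse=True)
```

Unlike the Boolean model, every document receives a graded score, so partial matches are ranked rather than discarded.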


As a result we get a matrix representation, and now we can apply vector algebra, or in particular linear algebra; here still in R^3. Mathematically, we can work in arbitrarily high dimensional spaces. The major problem involved is the mapping back into R^2. One very positive aspect is that we can aim for sparse matrices, which saves a lot of computational power.


Turney, P. D. & Pantel, P. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37, (1), 141-188.

Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. Turney & Pantel (2010) survey the use of VSMs for semantic processing of text. They organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term-document, word-context, and pair-pattern matrices, yielding three classes of applications. They survey a broad range of applications in these three categories and take a detailed look at a specific open source project in each category. Their goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field.


In the retrieval procedure, documents similar to the query are ranked by cosine similarity in the m-dimensional vector space. The information need Q is likewise encoded as an m-dimensional vector.

The advantages of this method are that it is a simple mathematical model, that the matrices are sparse (i.e. a favorable data structure), and that retrieval can be carried out in O(n), so ranking is relatively fast.

Disadvantage: the word order is lost (bag-of-words approach).

There are many further methods, e.g. Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation (LDA), etc.


The advantages of the algebraic VSM include that it is easy to understand, partial matches are possible, documents can be sorted by rank, and it uses term-weighting schemes; on the other hand, there is a higher computational effort to calculate similarity, and the "bag-of-words" representation does not accurately consider the semantics of documents (Vallet, Fernández & Castells, 2005).


For the probabilistic model, the index term weight variables are all binary, i.e. ω_ij ∈ {0, 1} and ω_iq ∈ {0, 1}. A query q is a subset of index terms. Let R be the set of documents known (or initially guessed) to be relevant. Let R̅ be the complement of R (this is the set of non-relevant documents). Let P(R | d_j) be the probability that the document d_j is relevant to the query q, and P(R̅ | d_j) the probability that d_j is non-relevant to q. The similarity sim(d_j, q) of the document d_j to the query q is defined as the ratio sim(d_j, q) = P(R | d_j) / P(R̅ | d_j).
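As a short worked step, the odds ratio of relevance can be rewritten with Bayes' rule; this is the standard textbook manipulation and is given here only as a sketch, with P(d_j | R) denoting the probability of randomly selecting d_j from the relevant set:

```latex
\[
\mathrm{sim}(d_j, q)
  = \frac{P(R \mid d_j)}{P(\bar{R} \mid d_j)}
  = \frac{P(d_j \mid R)\, P(R)}{P(d_j \mid \bar{R})\, P(\bar{R})}
\]
```

The factor P(d_j) cancels between numerator and denominator, and since P(R) and P(R̅) are the same for all documents they do not affect the ranking.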


As in all models we have certain pros and cons. The probabilistic model has a big advantage: the documents can be ranked by relevance. On the downside, it is a binary model (binary weights), the index terms are assumed to be independent, there is a lack of document normalization, and there is a need to guess the initial separation of documents into relevant and non-relevant sets.


Well, there are two main measurements.


Recall and Precision: hard as a bone.

Following this definition: Recall = Correct / (Correct + Missing) and Precision = Correct / (Correct + Spurious).

Precision P is the fraction of retrieved documents that are relevant to the search:
P = |{set of relevant docs} ∩ {set of found docs}| / |{set of found docs}|
Recall R is the fraction of the documents relevant to the query that are successfully retrieved:
R = |{set of relevant docs} ∩ {set of found docs}| / |{set of relevant docs}|
A combination of precision and recall is the harmonic mean of both, which is called the F-measure:
F = 2 ∙ (P ∙ R) / (P + R)
In classification 5 terms are used: true positives (= correct); true negatives (= correct); false positives (= spurious); false negatives (= spurious); not detected (= missing).
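The three measures can be computed directly from the set of relevant documents and the set of found documents. The sketch below is a minimal illustration; the function name and the toy document ids are assumptions, and empty sets are mapped to 0.0 by convention to avoid division by zero.

```python
def precision_recall_f(relevant, found):
    """Return (precision, recall, F-measure) for two sets of document ids."""
    hits = len(relevant & found)          # relevant AND found ("correct")
    precision = hits / len(found) if found else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    # Harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


RELEVANT = {"d1", "d2", "d3", "d4"}
FOUND = {"d1", "d2", "d5"}
P, R, F = precision_recall_f(RELEVANT, FOUND)
```

Here two of the three found documents are relevant (P = 2/3) and two of the four relevant documents were found (R = 1/2), giving F = 4/7.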


In this slide we see an overview of the linguistic processing pipeline, which describes the steps that are performed from the document to its semantic representation. The domain knowledge used in the semantic retrieval system is modeled in the form of the medical semantic network ID MACS® (MSN). It uses the Wingert Nomenclature (WNC) as its medical terminology. The WNC is based on the German version of SNOMED developed by Friedrich Wingert. Although its main focus is on German, it supports, to a lesser extent, several other languages including English and French. The MSN forms a simple ontology whose concepts are organized in a taxonomy (isA hierarchy) and a mereology (anatomical partOf hierarchy). Further relations between concepts are modeled by labeled edges. The MSN is divided into several subdomains, including:
– topography (i.e., anatomical concepts)
– morphology (e.g., fracture, fever)
– function (e.g., respiration)
– diseases (e.g., glaucoma)
– agents (e.g., pathogens, pharmaceutical substances)
Currently, the MSN contains more than 90,000 terms and 300,000 unique relations. The query language follows a simple grammar, namely:
Query ::= Disjunction
Disjunction ::= Conjunction | Conjunction ";" Disjunction
Conjunction ::= Atom | Atom "," Conjunction
Atom ::= Term | "!" Term
Thus a query forms a Boolean expression in disjunctive form over search terms. Semantic query expansion has been discussed in several previous works (Kingsland, Harbourt, Syed & Schuyler, 1993), (Aronson, Rindflesch & Browne, 1994), (Efthimiadis, 1996). The approach is as follows: each search term is indexed (using the linguistic processing methods described above) and replaced by the identifier of the WNC concept matching the term. These concept identifiers are called WNC indices. If the search term refers to a combination of several concepts in the WNC (e.g., Gastroparesis = Stomach + Paresis), the search term is replaced by a conjunction of the WNC indices (Kreuzthaler et al., 2011).
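The query grammar above is small enough to parse with plain string splitting. The sketch below reads ";" as OR, "," as AND and "!" as NOT, which is inferred from the statement that a query forms a Boolean expression in disjunctive form; it also assumes that terms themselves never contain these three characters. The function names are hypothetical, and matching is done on plain terms rather than on WNC indices.

```python
def parse_query(text):
    """Parse e.g. 'a, b; !c' into a list of conjunctions of (term, negated)."""
    disjunction = []
    for conj_text in text.split(";"):        # Disjunction: Conjunction ";" ...
        conjunction = []
        for atom in conj_text.split(","):    # Conjunction: Atom "," ...
            atom = atom.strip()
            negated = atom.startswith("!")   # Atom: Term | "!" Term
            term = atom[1:].strip() if negated else atom
            if not term:
                raise ValueError(f"empty term in query: {text!r}")
            conjunction.append((term.lower(), negated))
        disjunction.append(conjunction)
    return disjunction


def matches(doc_terms, parsed):
    """True if the document's term set satisfies at least one conjunction."""
    return any(all((term in doc_terms) != negated for term, negated in conj)
               for conj in parsed)


PARSED = parse_query("stomach, paresis; glaucoma, !surgery")
```

In a semantic retrieval system each parsed term would additionally be mapped to its concept identifier before matching; that step is omitted here because it depends on the terminology server.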


As can be seen from this slide, the medical domain expert outperforms the other retrieval methods, achieving high precision at a high recall level. Interestingly, the semantic-based information retrieval tool achieves approximately the same recall level as the medical domain expert while having a lower precision value. This result is good, considering the effort the medical domain expert has to make to translate the information need into a query string. In contrast, the input for the information retrieval tool is short and clear, so less effort is needed to transform the information need into the query language understood by the tool.
Keyword search has a high precision value but a lower recall value. This is plausible: information needs that can be described by the chosen keyword(s) will achieve a high precision value, so if documents are found they will be relevant, but the recall level will generally suffer. Looking at Slide 4-47, keyword search achieves approximately the same precision as IR Tool One but a far worse recall. It is also possible that no search results are found at all with keyword search, as can be seen for the "Neubildung, Darm" information need (see Appendix B and Appendix A). In contrast, for this information need IR Tool One has about the same precision and recall levels as the medical domain expert, reflecting the semantic processing chain of the tool.
The LSA statistical retrieval method has, compared to the other methods, a lower precision for all measured recall levels. This gives the impression that LSA is suitable for obtaining high precision values for a particular number of search results, but is hard to use to achieve both high precision and high recall, which is needed, for example, in clinical studies.


The future of big data is … big, and there will be many challenges for us to solve!


The grand question of the future is how to make sense out of the data; mega questions include: "What is interesting?" and "What is relevant?"


Adverse Drug Events (ADE) are very common, and therefore special care must be taken of the order entry. The medication orders in different medication systems: (a) Kardex system, (b) TIMED system, and (c) CPOE system.

Physicians must enter their medication orders into the system; nurses may not accept any hand-written prescription. A physician enters a medication order by selecting a drug and its dosage form, strength, administration route, dosage regimen, start date and time.




NO = No Stock

The medication ordering and administration processes in the Kardex system and the TIMED system; MO (Medication Order); HIS (Hospital Information System); NS (Non-Stock); for requesting urgent NS drugs, nurses often directly referred to the pharmacy with hand-written requests.

Comparison of Figs. 2 and 3 shows that the medication ordering and administration process after the implementation resembles that of the Kardex system, while it is completely different from that of the TIMED system. In both Kardex and TIMED units, we compared nurse attitudes towards the computerized process in the post-implementation phase with their attitudes towards the paper-based process in the pre-implementation phase.

There is no clear definition of this, but it is definitely about the management of data, information and knowledge for decision support. Let us look into a practical example, physician orders, where a lot of errors happened in the past due to a mess of paper-based orders producing a lot of paper chaos (you all know the post-it syndrome).


Observations and results of investigations (including history, signs, and symptoms) are converted by clinical staff into decisions and appropriate actions. Control usually requires the use of records and external sources of knowledge.

The care of each patient can be considered to be a control loop in which data from observations and investigations lead to decisions and actions designed to take care of a patient's problems and their consequences in a safe, effective, and legitimate manner. This loop occurs in all specialties and is the source of all the activities of a health care facility such as a hospital. Though complex, these activities can be set out as four concentric shells.


The clinical control loop is at the core of a complex organisation represented by four "shells" that exchange data. Activity shells of the clinical control loop:

Clinical management shell: Assessment of observations and results of investigations. Formulation of decisions, including those based on observations, investigations, and procedures carried out during a consultation.

Clinical administrative shell: Administrative activities which facilitate the clinical management shell and link it to the other shells, such as arranging appointments and investigations, clinical correspondence, filing results, and clinical audit.

Clinical services shell: Investigative, therapeutic, and general services provided by laboratories, imaging facilities, therapy units, operating theatres, wards, supplies departments, transport, etc.

General management shell: General management of health care, by hospital managers, financial controllers, health care purchasers, and statutory authorities.


Example of a visualized information system architecture, here of the computer-supported part of the hospital information system of the Medical School Hanover from 1984 ([1], p. 9).


Mayo's Enterprise Data Modeling (EDM) provides a context for Mayo enterprise activities.


Care2x is a generic multi-language open-source project that implements a modern Hospital Information System. The project was started in May 2002 with the release of the first beta version of Care2x by a nurse who was dissatisfied with the HIS in the hospital where he was working. To date, the development team has grown to over 100 members from over 20 countries. Care2x is a web-based HIS that is built upon other open-source projects: the Apache web server from the Apache Foundation (http://www.apache.org/), the script language PHP (http://www.php.org/) and the relational database management system MySQL (http://www.mysql.com/). Several source code branches exist that try to integrate the option to choose from other RDBMS such as Oracle and PostgreSQL. The latter is already supported in the current version at the time of writing, "deployment 2.1". For our investigations we chose the most feature-rich version that was available from the Care2x web page in early fall 2004. This release had the version number "pre-deployment 2.0.2". Some minor deficiencies that we report later may already be fixed in the current version "deployment 2.1".


This is just to show you an example of a global database schema. Each molecule (“Molecules” table) may have more than one conformation (“Conformations” table) and it may come from more than one source (“Sources” table). There are two types of experiments (“Experiments” table) that are done on molecules: computational docking and biological assays. The results (“DockingResults” and “AssayResults” tables) of these experiments were captured in the database. Each type of experiment is done on a particular p53 mutant (“Mutants” table) and has a score (“Scores” table) associated with it.
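To make the relationships in this schema more concrete, here is a minimal sketch using Python's built-in `sqlite3` module. Only the table names and the relationships between them come from the description above. All column names, the foreign-key layout (for example, that docking results attach to a conformation and assay results to a molecule), and the sample rows are assumptions made for illustration, not the actual schema of the original database.

```python
import sqlite3

# Table names follow the slide; every column name below is hypothetical.
SCHEMA = """
CREATE TABLE Molecules (molecule_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- one molecule can have several conformations
CREATE TABLE Conformations (
    conformation_id INTEGER PRIMARY KEY,
    molecule_id INTEGER NOT NULL REFERENCES Molecules(molecule_id)
);
-- one molecule can come from several sources (one row per molecule/source pair)
CREATE TABLE Sources (
    source_id INTEGER PRIMARY KEY,
    molecule_id INTEGER NOT NULL REFERENCES Molecules(molecule_id),
    name TEXT NOT NULL
);
CREATE TABLE Mutants (mutant_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- two experiment types, each done on a particular p53 mutant
CREATE TABLE Experiments (
    experiment_id INTEGER PRIMARY KEY,
    kind TEXT NOT NULL CHECK (kind IN ('docking', 'assay')),
    mutant_id INTEGER NOT NULL REFERENCES Mutants(mutant_id)
);
CREATE TABLE Scores (score_id INTEGER PRIMARY KEY, value REAL NOT NULL);
CREATE TABLE DockingResults (
    experiment_id INTEGER NOT NULL REFERENCES Experiments(experiment_id),
    conformation_id INTEGER NOT NULL REFERENCES Conformations(conformation_id),
    score_id INTEGER NOT NULL REFERENCES Scores(score_id)
);
CREATE TABLE AssayResults (
    experiment_id INTEGER NOT NULL REFERENCES Experiments(experiment_id),
    molecule_id INTEGER NOT NULL REFERENCES Molecules(molecule_id),
    score_id INTEGER NOT NULL REFERENCES Scores(score_id)
);
"""


def build_demo_db():
    """Create the sketch schema in memory and add one made-up molecule."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO Molecules VALUES (1, 'example-molecule')")
    conn.executemany("INSERT INTO Conformations VALUES (?, 1)", [(1,), (2,)])
    conn.executemany(
        "INSERT INTO Sources VALUES (?, 1, ?)", [(1, "source-a"), (2, "source-b")]
    )
    return conn


def conformations_per_molecule(conn):
    """Return {molecule name: number of stored conformations}."""
    rows = conn.execute(
        "SELECT m.name, COUNT(c.conformation_id) "
        "FROM Molecules m "
        "LEFT JOIN Conformations c ON c.molecule_id = m.molecule_id "
        "GROUP BY m.molecule_id"
    )
    return dict(rows.fetchall())
```

Calling `conformations_per_molecule(build_demo_db())` counts the conformations stored for each molecule, which is the kind of one-to-many lookup the schema is designed to support.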


ODBC = Open Database Connectivity, an API in C for accessing DBMS.

System architecture and the hybrid strategy for data integration. Docking and small molecule data use the mediation approach, while the functional and structural assay data use the data warehousing approach. The CRDB is both a mediator and a data warehouse. “Mutants” and “Molecular” are data marts of the warehouse. The ODBC drivers are wrappers in the mediation approach. Dashed lines indicate integration planned in the future.
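The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration of the idea, not the CRDB implementation: the class names `Wrapper` and `HybridIntegrator`, the row layout, and the sample data are all hypothetical. A mediated source is queried live through its wrapper at query time, whereas a warehoused source is copied once and then answered from the local copy.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class Wrapper:
    """Plays the role of an ODBC driver: forwards a query to one remote source."""

    def __init__(self, fetch: Callable[[str], List[Row]]):
        self._fetch = fetch

    def query(self, molecule: str) -> List[Row]:
        return self._fetch(molecule)


class HybridIntegrator:
    """Hypothetical stand-in for a system that is both mediator and warehouse."""

    def __init__(self) -> None:
        self.wrappers: Dict[str, Wrapper] = {}      # mediation: queried live
        self.warehouse: Dict[str, List[Row]] = {}   # warehousing: local copies

    def register_wrapper(self, name: str, wrapper: Wrapper) -> None:
        self.wrappers[name] = wrapper

    def load_into_warehouse(self, name: str, rows: List[Row]) -> None:
        # Copy the rows so later changes at the source are not visible
        # until the warehouse is reloaded.
        self.warehouse[name] = [dict(r) for r in rows]

    def query(self, molecule: str) -> Dict[str, List[Row]]:
        result: Dict[str, List[Row]] = {}
        for name, wrapper in self.wrappers.items():
            result[name] = wrapper.query(molecule)
        for name, rows in self.warehouse.items():
            result[name] = [r for r in rows if r.get("molecule") == molecule]
        return result


# Made-up sources: docking data is mediated, assay data is warehoused.
remote_docking: List[Row] = [{"molecule": "mol-1", "score": -7.2}]
assay_source: List[Row] = [{"molecule": "mol-1", "activity": 0.8}]

crdb = HybridIntegrator()
crdb.register_wrapper(
    "docking",
    Wrapper(lambda m: [r for r in remote_docking if r["molecule"] == m]),
)
crdb.load_into_warehouse("assays", assay_source)
```

If a new row is later appended to `remote_docking`, the next `crdb.query("mol-1")` sees it immediately, while a row appended to `assay_source` stays invisible until the warehouse is reloaded. That trade-off between freshness and local query speed is the usual reason for mixing both strategies.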


The atomic coordinates of a protein are deposited into the Protein Data Bank (PDB), an international repository for 3D structure files. At the moment, PDB contains more than 26,000 protein structures.
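In the classic PDB flat-file format, each atom is stored as a fixed-width `ATOM` (or `HETATM`) record, with the x, y and z coordinates in columns 31 to 54. The following sketch reads the coordinates from one such record. The names `Atom`, `parse_atom_record` and `make_sample_record` are hypothetical, and the sample record contains made-up values rather than data from a real PDB entry.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    serial: int
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float


def parse_atom_record(line: str) -> Atom:
    """Read one fixed-width ATOM/HETATM record.

    The format description counts columns from 1; Python slices count from 0.
    """
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return Atom(
        serial=int(line[6:11]),           # columns 7-11
        name=line[12:16].strip(),         # columns 13-16
        residue=line[17:20].strip(),      # columns 18-20
        chain=line[21],                   # column 22
        residue_number=int(line[22:26]),  # columns 23-26
        x=float(line[30:38]),             # columns 31-38
        y=float(line[38:46]),             # columns 39-46
        z=float(line[46:54]),             # columns 47-54
    )


def make_sample_record() -> str:
    """Build a made-up ATOM record with explicit field widths."""
    return (
        f"{'ATOM':<6}{1:>5}{'':1}{'N':^4}{'':1}{'MET':>3}{'':1}{'A':1}{1:>4}{'':4}"
        f"{27.340:8.3f}{24.430:8.3f}{2.614:8.3f}{1.00:6.2f}{9.67:6.2f}"
    )
```

`parse_atom_record(make_sample_record())` yields an `Atom` whose `x`, `y` and `z` fields hold the three coordinates. For real work, an established parser from a bioinformatics library is preferable to hand-written slicing.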


We will deal with visualizations in lecture 9; here is just an appetizer of what you can display.

This shows a cervical cancer query visualization. The Gene nodes are positioned using both chromosome number and organism name. This positioning method allows users to focus on a particular gene and species using NVSS’s slider filters. Nodes are size-coded according to their indegree, which provides an additional visual cue about the node’s importance.
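Size-coding by indegree can be sketched as follows: count the incoming edges of every node, then map that count to a drawing size. The linear scaling and the size range used here are assumptions for illustration (the mapping used in NVSS is not stated above), and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Edge = Tuple[str, str]  # (source node, target node)


def indegrees(edges: Iterable[Edge]) -> Dict[str, int]:
    """Count incoming edges per node; nodes with no incoming edge get 0."""
    edge_list = list(edges)
    counts = Counter(target for _, target in edge_list)
    nodes = {node for edge in edge_list for node in edge}
    return {node: counts.get(node, 0) for node in nodes}


def node_sizes(
    edges: Iterable[Edge], min_size: float = 4.0, max_size: float = 20.0
) -> Dict[str, float]:
    """Scale each node's size linearly between min_size and max_size."""
    degrees = indegrees(edges)
    top = max(degrees.values(), default=0)
    if top == 0:
        return {node: min_size for node in degrees}
    span = max_size - min_size
    return {node: min_size + span * d / top for node, d in degrees.items()}
```

For the edges `a -> c`, `b -> c` and `c -> d`, node `c` has the highest indegree and is drawn largest, while `a` and `b` have no incoming edges and get the minimum size.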


http://psychology.wikia.com/wiki/Information_retrieval
http://www.eecs.wsu.edu/mgd/gdb.html (Graph Datasets)
