new a. holzinger 709.049 mi, 04.11 - human-centered.ai · 2015. 11. 4. · biomedical data...
Status as of Tue, 03.11.2015, 10:00
Dear Students, welcome to the 4th lecture of our course. Please remember from the last lecture: modeling of knowledge, medical ontologies, classification efforts and the International Classification of Diseases (ICD); Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT); Medical Subject Headings (MeSH); Unified Medical Language System (UMLS).
Please always be aware of the definition of biomedical informatics (Medizinische Informatik): Biomedical informatics is the interdisciplinary field that studies and pursues the effective use of biomedical data, information, and knowledge for scientific inquiry, problem solving, and decision making, motivated by efforts to improve human health (and well-being).
1WS 2015
A. Holzinger LV 709.049 Mi, 04.11.2015
Bayes' Rule
Biomedical data warehouse
Business hospital information system
Clinical workflow
Data integration
Enterprise data modeling
Information retrieval (IR)
Probabilistic Model
Quality of information retrieval
Set theoretic model
Vector Space Model (VSM)
At the end of this fourth lecture you ...
... have an overview of the general architecture of a Hospital Information System (details in Lecture 10: Medical Information Systems and Biomedical Knowledge Management);
... know some principles of hospital databases;
... have an overview of some biomedical databases;
... are familiar with some basics of information retrieval.
Amongst other problems, some key challenges include:
Increasingly large and complex data sets ("Big Data") due to data-intensive research
Increasing amounts of non-standardized and unstructured information (e.g. free text)
Data quality, data integration, universal access
Privacy, security, safety and data protection issues (see → Lecture 11)
Time aspects in databases (Gschwandtner, Gärtner, Aigner & Miksch, 2012), (Johnston & Weis, 2010).
"Big Data resources are all a waste of time and money if data analysts cannot find, or fail to comprehend, the basic information that describes the data held in the resources" (Berman, 2013b). Data identification is certainly the most underappreciated and least understood Big Data issue. Measurements, annotations, properties, and classes of information have no informational meaning unless they are attached to an identifier that distinguishes one data object from all other data objects and that links together all of the information associated with the identified data object (Berman, 2013a). Communication of data between application systems must ensure security to avoid improper access, because trust, or the lack thereof, is the most essential factor blocking the adoption of rapidly evolving Web technology paradigms such as software as a service (SaaS) or data distribution services such as Cloud computing (Sreenivasaiah, 2010).
Before we discuss information systems and learn about databases, let us start with a look into the hospital ...
Let us start with a look into the hospital: In this slide we see a typical hospital scenario: medical professionals are surrounded by information technology. An old dream of hospital managers has always been the "all-digital hospital": to digitalize all workflows and to store all data electronically, moving towards a paperless hospital. Although much effort has been spent on this goal, most hospitals worldwide are still far away from being an "all-digital hospital" (Waterson, Glenn & Eason, 2012).
An interesting study: All hospitals in the province of Styria (Austria) are well equipped with sophisticated information technology, which provides all-encompassing on-screen patient information. Previous research on the theoretical properties, advantages and disadvantages of reading from paper vs. reading from a screen has resulted in the assumption that reading from a screen is slower, less accurate and more tiring. However, recent flat-screen technology, especially on the basis of LCD, is of such high quality that this assumption should now be challenged. As the electronic storage and presentation of information has many advantages in addition to a faster transfer and processing of the information, the usage of electronic screens in clinics should outperform the traditional hard copy in both execution and preference ratings. In a study in the county hospital Styria, Austria, 111 medical professionals working in a real-life setting were each asked to read original and authentic diagnosis reports, a gynecological report and an internal medical document, on both screen and paper in a randomly assigned order. Reading comprehension was measured by the Chunked Reading Test, and speed and accuracy of reading performance were quantified. In order to get a full understanding of the clinicians' preferences, subjective ratings were also collected. Wilcoxon Signed Rank Tests showed no significant differences in reading performance between paper and screen. However, medical professionals showed a significant (90%) preference for reading from paper. Despite the high quality and the benefits of electronic media, paper still has some qualities which cannot be provided electronically to date (Holzinger et al., 2011).
BTW: Graz University Hospital is the flagship hospital of the Styrian KAGES with 23 county hospitals and is amongst the largest hospitals in Europe.
Mega issues related to hospital information systems include: data integration, data fusion, standardization issues, clinical process analysis, modeling, compliance issues, evidence-based treatment and decision support, privacy, security, safety and data protection, and knowledge discovery and data mining. All of these are connected with the central topic of this lecture: databases.
BTW: KAGES uses openMEDOCS, based on ish.med, which in turn is based on SAP R/3. An overview of different business hospital information system vendors can be found here:
Teamwork in the hospital requires a lot of communication and information exchange. The vision of a business enterprise hospital information system is to cover all workflows, organizational processes and information flows electronically.
Note: The quality of the work of physicians is heavily influenced by the usability of their available equipment. In the slide you see a typical work meeting of medical professionals, where they discuss the patient cases jointly. It is important to study and understand the workflows of the end users and to involve them in the development of information systems as early as possible through a user-centered design process (Holzinger, 2003). Experiments showed that by studying the workflows the engineers get deep insights into how to develop an appropriate application for a specified target end user group (Holzinger, Geierhofer, Ackerl & Searle, 2005).
The aforementioned goal of an "all-digital hospital" produces "big data", and remarkably much of the data is unstructured text. Interestingly, the main and most important output is the medical report (Arztbrief, i.e. the doctor's letter): In the example it is the report on a medical image; the relevant item is not the image itself but the report (Holzinger, Geierhofer & Errath, 2007b). Handling unstructured data is a mega challenge for computers.
Let us briefly compare human intelligence with machine intelligence. A good example of the complexity we are facing in hospital information processing is the difference between chess and human natural language processing: Chess is a finite, mathematically well-defined search space; hence we have a well-defined computational space, with limited numbers of moves and states, grounded in explicit, unambiguous mathematical rules. Human language is exactly the opposite: ambiguous, contextual and implicit; grounded in the human cognitive space, with a seemingly infinite number of ways to express one and the same meaning.
Note: IBM Deep Blue defeated the World Chess Champion Garry Kasparov in a six-game match in 1997. There were a number of factors that contributed to this success, including: a single-chip chess search engine, a massively parallel system with multiple levels of parallelism, a strong emphasis on search extensions, a complex evaluation function, and effective use of a Grandmaster game database. Technically, Deep Blue was a massively parallel system designed for carrying out chess game tree searches. The system was composed of a 30-node IBM RS/6000 SP computer and 480 single-chip chess search engines, with 16 chess chips per SP processor. The SP system consisted of 28 nodes with 120 MHz P2SC processors and 2 nodes with 135 MHz P2SC processors. The nodes communicated with each other via a high-speed switch, and all nodes had 1 GB of RAM and 4 GB of disk. During the 1997 match with Kasparov, the system ran the AIX 4.2 operating system. The chess chips in Deep Blue were each capable of searching up to 2.5 million chess positions per second and communicated with their host node via a fast Micro Channel bus (Campbell, Hoane & Hsu, 2002).
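To get a feeling for the size of this "well-defined computational space", the figures quoted above can be multiplied out. The following is only a back-of-the-envelope sketch in Python: the function names are made up for this illustration, and the result is a theoretical upper bound that assumes every chip runs at its peak rate all the time, which a real search will not achieve.

```python
# Figures quoted in the text above (Campbell, Hoane & Hsu, 2002).
NODES = 30                         # 28 nodes at 120 MHz + 2 nodes at 135 MHz
CHIPS_PER_NODE = 16                # chess chips per SP processor
PEAK_POSITIONS_PER_CHIP = 2.5e6    # "up to" 2.5 million positions per second


def total_chips(nodes: int = NODES, chips_per_node: int = CHIPS_PER_NODE) -> int:
    """Number of chess chips in the whole machine."""
    return nodes * chips_per_node


def peak_positions_per_second(chips: int, per_chip: float = PEAK_POSITIONS_PER_CHIP) -> float:
    """Upper bound only: assumes every chip searches at its peak rate."""
    return chips * per_chip


chips = total_chips()
print(chips)                               # 480, consistent with the text
print(peak_positions_per_second(chips))    # 1200000000.0
```

Even such an enormous number of positions per second works only because the rules of chess are explicit; no comparable brute-force enumeration exists for the meanings of a medical report.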
You may remember what we learned in the last lecture about workflows and workflow modelling.
Healthcare processes require the cooperation of different organizational units and various medical disciplines. In such an environment optimal process support becomes crucial. In this slide we see a typical organizational process for medical order entry and result reporting, which is used to coordinate the inter-departmental communication between a ward (ambulatory setting) and the radiology unit. The depicted process is not tailored to a specific clinical pathway, but shows an example of a characteristic organizational procedure of the hospital: An order (in German: Anweisung, Verschreibung) is placed by a physician at the ward or in an ambulatory setting. The indication is checked in the radiology department and, depending on the result, the order placer is informed whether the request has been rejected or scheduled. The actual radiological examination and the corresponding documentation are done in the examination room. The radiology report is generated afterwards and has to be validated by the physician with his or her signature. The report is sent back to the order placer. This is an example of a fundamental process of clinical practice; it captures the organizational knowledge necessary to coordinate the healthcare process among different people and organizational units, i.e., the focus is on the support of core organizational processes (Lenz & Reichert, 2007).
This is just to give you an idea of how complicated such processes can be; you can imagine how difficult it is to digitalize all the data involved.
The medical treatment process is often described as a diagnostic-therapeutic cycle (Bemmel & Musen, 1997) including: observation, reasoning, and action. Please remember that in medicine we deal with uncertain information (Holzinger & Simonic, 2011), and each pass of the diagnostic-therapeutic cycle can be seen as a step in decreasing the uncertainty about the patient's disease. Consequently, the observation process always starts with the patient history ("looking into the past") and proceeds with diagnostic procedures, which are selected based on the available information. The aim of the HIS is to assist healthcare personnel in making informed decisions. Maybe the most important question to be answered is how to determine what is relevant. Availability of relevant information is a precondition for (good) medical decisions, and medical knowledge guides these decisions (Lenz & Reichert, 2007). Following the principles of evidence-based medicine (EBM), physicians are required to formulate questions based on patients' problems, search the literature for answers, evaluate the evidence for its validity and usefulness, and finally apply the (new) information to patient treatment (Hawkins, 2005). The limiting factor is the short time a clinician has to make a decision (Gigerenzer & Gaissmaier, 2011).
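One way to picture "each pass of the cycle decreases the uncertainty" is Bayes' rule, which is among the keywords of this lecture. The following is a minimal sketch, not a clinical tool: the prior, the sensitivities and the specificities are hypothetical numbers chosen only for illustration, and the sketch assumes that the two test results are conditionally independent given the disease status.

```python
def posterior(prior: float, sensitivity: float, specificity: float,
              positive: bool = True) -> float:
    """Bayes' rule for a binary test: P(disease | test result)."""
    if positive:
        p_result_given_d = sensitivity
        p_result_given_not_d = 1.0 - specificity
    else:
        p_result_given_d = 1.0 - sensitivity
        p_result_given_not_d = specificity
    evidence = p_result_given_d * prior + p_result_given_not_d * (1.0 - prior)
    return p_result_given_d * prior / evidence


# Hypothetical numbers, for illustration only.
belief = 0.10  # belief after taking the patient history
for sens, spec in [(0.90, 0.80), (0.95, 0.90)]:  # two positive test results
    belief = posterior(belief, sens, spec, positive=True)
    print(round(belief, 3))
```

Each loop iteration stands for one pass of the cycle: the posterior of the previous observation becomes the prior for the next diagnostic procedure.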
This slide shows a classical conceptual model: The heart is a central data and communication structure. The patients "enter" the system (logically) on the left side through the admission, transfer and discharge functions of the core and leave the system, at least partially, on the right side. The main focus is on a central database, although alternative solutions have opted for a more distributed construction of databases; nonetheless, central ordering principles have to be kept to achieve the necessary integration of information and its distribution to the various points where it is needed, be it in the area of hospital management or in the field of care provision. This central database serves the central operational purposes of the hospital in the context of its dual goals (Haux et al., 1998), (Reichertz, 2006), (Haux, 2006).
Here you see the typical architecture of such a system.
ICU = Intensive Care Unit
NICU = Neonatal Intensive Care Unit
PICU = Pediatric Intensive Care Unit
There are many different application architectures in use, and we will come back to them later in → Lecture 10, so here is just ONE example of an enterprise business hospital information system, as it is called professionally. However, we now want to concentrate on some technical issues of databases.
In a hospital there are data, data, data, ...
In this classical image by (Shortliffe, Perrault, Wiederhold & Fagan, 2001) it becomes very obvious that databases are central components of a hospital information system. A very interesting slide is the next one, where we see a historical example from the "stone age" of computer science.
This picture by (Gardner, Pryor & Warner, 1999) is interesting insofar as it clearly shows a mega issue that persists to the present: to integrate and fuse different data and to make them accessible to the clinician. While there is much research on the integration of heterogeneous information systems, a shortcoming lies in the integration of the available data. Just to clarify the differences between data integration and data fusion:
Data integration involves combining data residing in different distributed sources and providing users with a unified view of and access to these data. It has become the focus of extensive theoretical and practical work, and numerous open problems remain unsolved (Lenzerini, 2002).
Data fusion is the process of merging multiple records representing the same real-world object into a single, consistent, accurate, and useful representation (Bleiholder & Naumann, 2008).
The trend towards P4 medicine (Predictive, Preventive, Participatory, Personalized) has resulted in a sheer mass of generated (-omics) data; hence a main challenge lies in the integration and fusion of heterogeneous data sources, especially in the integration of data from the clinical domain with sources from the biological domain.
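The difference between the two terms can be made concrete with a toy sketch in Python. Everything in it is an assumption for illustration: the two source systems, the field names, and in particular the conflict rule ("first non-missing value wins, in source order"), which is only one of many possible fusion strategies.

```python
from collections import defaultdict

# Hypothetical records for the same patient held in two source systems.
his_records = [{"patient_id": "P1", "name": "Doe, Jane", "birth_year": 1970, "blood_type": None}]
lab_records = [{"patient_id": "P1", "name": "Jane Doe", "birth_year": 1970, "blood_type": "A+"}]


def integrate(*sources):
    """Data integration: a unified view that groups records by identifier
    and keeps every source record as it is."""
    view = defaultdict(list)
    for source_name, records in sources:
        for rec in records:
            view[rec["patient_id"]].append((source_name, rec))
    return dict(view)


def fuse(grouped):
    """Data fusion: merge the records of each object into one representation.
    Conflict rule (an assumption): first non-missing value wins."""
    fused = {}
    for pid, tagged in grouped.items():
        merged = {}
        for _source, rec in tagged:
            for key, value in rec.items():
                if merged.get(key) is None and value is not None:
                    merged[key] = value
        fused[pid] = merged
    return fused


view = integrate(("HIS", his_records), ("LAB", lab_records))
print(len(view["P1"]))    # both source records remain visible
print(fuse(view)["P1"])   # one merged record per patient
```

Note that both functions rely on a shared identifier (`patient_id`); without it, neither the unified view nor the merged record could be built.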
Integration; data fusion; for data analysis; the central goal: to support decision-making processes; data virtualization; abstract layer; business intelligence; service-oriented architecture
Database (DB) is the organized collection of data through a certain data structure (e.g. hash table, adjacency matrix, graph structure, etc.).
Database management system (DBMS) is the software which operates the DB. Well-known DBMSs include: Oracle, IBM DB2, Microsoft SQL Server, Microsoft Access, MySQL, SQLite. Examples of graph databases include InfoGrid, Neo4j, or BrightstarDB. The DB used is not generally portable, but different DBMSs can interoperate by using standards such as SQL and ODBC.
Database system (DBS) = DB + DBMS. The term database system emphasizes that data is managed in terms of accuracy, availability, resilience, and usability.
Data warehouse (DWH) is an integrated repository used for reporting and long-term storage of analysis data.
Data marts (DM) are access layers of a DWH and are used as temporary repositories for data analysis.
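To see the split between DB and DBMS in practice, here is a minimal sketch using SQLite (one of the DBMSs named above) through Python's standard `sqlite3` module. The table layout and the sample rows are invented for this illustration and do not describe any real hospital schema.

```python
import sqlite3

# The DBMS is SQLite; the DB itself lives in memory for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE report ("
    "id INTEGER PRIMARY KEY, "
    "patient_id INTEGER REFERENCES patient(id), "
    "text TEXT)"
)
conn.executemany(
    "INSERT INTO patient (id, name) VALUES (?, ?)",
    [(1, "Patient A"), (2, "Patient B")],
)
conn.executemany(
    "INSERT INTO report (patient_id, text) VALUES (?, ?)",
    [(1, "Chest X-ray: no findings"), (1, "Follow-up: unchanged"), (2, "MRI knee: effusion")],
)
conn.commit()


def reports_per_patient(connection):
    """Plain SQL: the query text is not specific to SQLite."""
    rows = connection.execute(
        "SELECT p.name, COUNT(r.id) FROM patient p "
        "LEFT JOIN report r ON r.patient_id = p.id "
        "GROUP BY p.id, p.name ORDER BY p.id"
    )
    return rows.fetchall()


print(reports_per_patient(conn))   # [('Patient A', 2), ('Patient B', 1)]
```

The application only sends SQL; how the rows are stored and found is the business of the DBMS. This is why a standard query language lets different DBMSs interoperate even though the stored DB is not portable.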
Recommended reading includes (Plattner, 2013) and (Robinson, Webber & Eifrem, 2013):
Robinson, I., Webber, J. & Eifrem, E. 2013. Graph Databases, O'Reilly Media.
Plattner, H. 2013. A Course in In-Memory Data Management: The Inner Mechanics of In-Memory Databases, Heidelberg New York Dordrecht London, Springer.
One of the standard textbooks is the 6th edition of "Database System Concepts" by (Silberschatz, Korth & Sudarshan, 2010).
A DWH is an integrated system, specifically designed for enterprise business decision support, and can be used in hospitals and in biomedical applications. In Slide 4-13 we see an example of a hospital data warehouse: On the left there are the (heterogeneous) data sources, such as PACS (Picture Archiving & Communication System) and RIS (Radiological Information System), and, apart from the core HIS, some special databases which can also include proprietary and legacy systems. For the data staging and area servers the Common Object Request Broker Architecture (CORBA) is used, a standard defined by the Object Management Group (OMG) that supports multiple-platform interoperability (Zhang, Zhang, Tjandra & Wong, 2004). This is a standard hospital information architecture and, typically, has no integration of laboratory data sources and, most of all, no Omics data integration, for example from pathology or a biobank.
A DWH can be subdivided into so-called data marts (DM), which can be seen as specific access layers of a DWH, oriented to a specific team. Slide 4-14 shows the architecture of the Mayo Clinic DWH, which incrementally instantiates each component of the architecture on demand. Data integration proceeds from left to right (leftmost you see the primary data sources; moving right, the data are integrated into staging and replication services, with further refinement). The layers are:
1) Subjects = the highest-level areas that define the activities of the enterprise (e.g. Individual);
2) Concepts = the collections of data that are contained in one or more subject areas (e.g., Patient, Provider, Referrer, etc.);
3) Business Information Models = the organization of the data that support the processes and workflows of the enterprise's defined Concepts (Chute, Beck, Fisk & Mohr, 2010).
Cloud computing is a good example of Software as a Service: flexible space via the network. This reminds us of the early days of computing, with mainframes and thin-client terminals.
A standard environment for the production and processing of genomic data can be seen in Slide 4-15: Sequencing labs submit their data to large databases, e.g. GenBank, National Center for Biotechnology Information (NCBI); European Bioinformatics Institute (EMBL) database; DNA Data Bank of Japan (DDBJ); Short Read Archive (SRA); Gene Expression Omnibus (GEO) or the microarray database ArrayExpress. These maintain, organize and distribute the sequencing data. Most users access the information either through web-based applications or through integrators, such as Ensembl, the University of California at Santa Cruz (UCSC) Genome Browser or Galaxy. The end users have to download genomic data from these primary and secondary sources (Stein, 2010).
Remember: Sequencing is the process of determining the precise order of nucleotides within a DNA molecule, i.e. the order of the four bases (adenine, guanine, cytosine, and thymine) in a strand of DNA. The advent of rapid DNA sequencing methods has greatly accelerated biological and medical research and discovery and produces large data sets. Sequencing has become indispensable for basic biological research and in numerous applied fields such as diagnostics and biotechnology.
Note: A biobank is a physical place which stores biological specimens, and in some cases also data (Roden et al., 2008).
Here we see a cloud-based genome informatics system. Instead of separate genome datasets stored at various locations, the datasets are stored in the cloud as virtual databases. Web services run on top of these datasets, including the primary archives and the integrators, running as virtual machines within the cloud. Casual users, who are accustomed to accessing the data via the NCBI, DDBJ, Ensembl or UCSC, work as usual; the fact that these servers are located inside the cloud is invisible to them. Power users can continue to download the data, but have an attractive alternative: instead of moving the data to the computational cluster, they move the computational cluster to the data (Stein, 2010).
Note: Cloud computing is based on the sharing of resources to achieve coherence and economies of scale over a network (similar to the electricity grid). At the foundation of cloud computing is the broader concept of converged infrastructure and shared services. Cloud providers offer their services according to several fundamental models: 1) Infrastructure as a Service (IaaS), 2) Platform as a Service (PaaS), 3) Software as a Service (SaaS).
Just an example of a cloud-based service: The Master Index is the PACS Cloud core entity and contains information about other modules, including Gateways and Cloud Slaves (repository and database). It also provides authentication services to institutional gateways, and all identifiable information related to patients is stored in a master index database, which is fundamental to ensuring confidentiality and privacy. The Cloud Slaves provide, on the one hand, storage of sightless data (object repositories) and, on the other hand, a database containing all non-identifiable metadata extracted from DICOM studies, i.e. the most demanding task concerning computational power (Bastiao-Silva, Costa, Silva & Oliveira, 2011).
We have to distinguish between federated data and warehoused data. A federated database system is a meta-database management system which transparently maps multiple heterogeneous and autonomous database systems into a single federated database; this can be a "virtual database", without data integration as it is done in data warehouses. In the slide we can see the data integration architecture on the y-axis and the knowledge representation methodologies on the x-axis, and where current data integration systems lie along this continuum. The essence of this image is that there is no "best solution": A system designed to have full control of data and fast queries can have difficulty expressing complex biological concepts and integrating them. Systems that employ highly expressive knowledge representation methodologies, such as ontologies, are better able to represent and integrate complex biological concepts but have much less tractable queries (Louie et al., 2007).
Obviously there is a difference between the databases for the Hospital Information System and the databases which are used for scientific work.
Whereas databases for use in a HIS are process-centered and central for the electronic patient record, biomedical databases are libraries of all sorts of life science data, collected from scientific experiments and computational analyses. Such databases contain experimental biological data from clinical work, genomics, proteomics, metabolomics, microarray gene expression, phylogenetics, pharmacogenomics, etc. Examples:
Text: e.g. PubMed, OMIM (Online Mendelian Inheritance in Man);
Sequence data: e.g. Entrez, GenBank (DNA), UniProt (protein);
Protein structures: e.g. PDB, Structural Classification of Proteins (SCOP), CATH (Protein Structure Classification).
An overview can be found in (Masic & Milinovic, 2012), online open access via: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3544328
Note: Pharmacogenomics is the technology for analyzing how genetic makeup affects an individual's response to drugs; it deals with the influence of genetic variation on drug response in patients by correlating gene expression or single-nucleotide polymorphisms with efficacy and toxicity. The central aim is to optimize drug therapy to ensure maximum effectiveness with minimal adverse effects, which is at the core of personalized medicine.
A good video can be seen here: https://www.youtube.com/watch?v=DSHhep_w6pk
The Protein Data Bank archive: information about the 3D shapes of proteins, nucleic acids, and complex assemblies helps students and researchers understand all aspects of biomedicine, from protein synthesis to health and disease. As a member of the wwPDB, the RCSB PDB curates and annotates PDB data.
The RCSB PDB builds upon the data by creating tools and resources for research and education in molecular biology, structural biology, computational biology, and beyond.
Remember: Proteins are the molecules used by the cell for performing and controlling cellular processes, including: degradation and biosynthesis of molecules, physiological signaling, energy storage and conversion, formation of cellular structures, etc. Protein structures are determined with crystallographic X-ray methods or by nuclear magnetic resonance spectroscopy. Once the atomic coordinates of the protein structure have been determined, a table of these coordinates is deposited into the Protein Data Bank (PDB), an international repository for 3D structure files: http://www.rcsb.org/pdb/ This database is handled by the RCSB (Research Collaboratory for Structural Bioinformatics) at Rutgers University and UC San Diego. PDB is the most important source for protein structures. Before a new protein structure is added, a careful examination of the data must be carried out to guarantee the quality of the structure. The PDB data file contains, among others, the coordinates of all the atoms of the protein (Wiltgen & Holzinger, 2005), (Wiltgen, Holzinger & Tilz, 2007).
A PDB structure entry should be cited with its PDB ID and primary reference. For example:
PDB ID: 102L. D.W. Heinz, W.A. Baase, F.W. Dahlquist, B.W. Matthews (1993) How Amino-Acid Insertions are Allowed in an Alpha-Helix of T4 Lysozyme. Nature 361: 561.
An entry without a published reference can be cited with the PDB ID, author names, and title:
PDB ID: 1CI0. W. Shi, D.A. Ostrov, S.E. Gerchman, V. Graziano, H. Kycia, B. Studier, S.C. Almo, S.K. Burley, New York Structural GenomiX Research Consortium (NYSGXRC). The Structure of PNP Oxidase from S. cerevisiae.
An entry may also be referenced using its Digital Object Identifier (DOI). The DOIs for PDB entries all have the same format: 10.2210/pdbXXXX/pdb, where XXXX should be replaced with the desired PDB ID. The DOI can be used as part of a URL to obtain this data file (http://dx.doi.org/10.2210/pdb4hhb/pdb), or can be entered in a DOI resolver (such as http://www.crossref.org/) to automatically link to pdb4hhb.ent.gz on the main PDB ftp archive (ftp://ftp.wwpdb.org). For example, the DOI for PDB entry 4HHB is "10.2210/pdb4hhb/pdb". This links directly to the entry in the PDB file format on the FTP server.
Images from Structure Summary pages should cite the RCSB PDB and the PDB entry:
Image from the RCSB PDB (www.rcsb.org) of PDB ID 1BNA (H.R. Drew, R.M. Wing, T. Takano, C. Broka, S. Tanaka, K. Itakura, R.E. Dickerson (1981) Structure of a B-DNA dodecamer: conformation and dynamics. Proc. Natl. Acad. Sci. USA 78: 2179-2183).
Images created using PDB data and other software should cite the PDB ID and the molecular graphics program used:
Image of 1AOI (K. Luger, A.W. Mader, R.K. Richmond, D.F. Sargent, T.J. Richmond (1997) structure of the core particle at 2.8 Å resolution. Nature 389: 251-260) created with Protein Workshop (J.L. Moreland, A. Gramada, O.V. Buzko, Q. Zhang, P.E. Bourne (2005) The Molecular Biology Toolkit (MBT): a modular platform for developing molecular visualization applications. BMC Bioinformatics 6: 21).
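Because the DOI pattern described above is purely mechanical, it can be generated from the PDB ID. The following is a small sketch; the function names are invented here, and the check for a "classic" four-character ID (one digit followed by three letters or digits) is an assumption of this sketch rather than something stated in the citation guidelines.

```python
import re

# Classic four-character PDB ID: an assumption made for this sketch.
_PDB_ID = re.compile(r"^[0-9][A-Za-z0-9]{3}$")


def pdb_doi(pdb_id: str) -> str:
    """Build the DOI of a PDB entry following the pattern 10.2210/pdbXXXX/pdb."""
    if not _PDB_ID.match(pdb_id):
        raise ValueError(f"not a four-character PDB ID: {pdb_id!r}")
    return f"10.2210/pdb{pdb_id.lower()}/pdb"


def pdb_doi_url(pdb_id: str) -> str:
    """Resolver URL in the form used in the text above."""
    return "http://dx.doi.org/" + pdb_doi(pdb_id)


print(pdb_doi("4HHB"))        # 10.2210/pdb4hhb/pdb
print(pdb_doi_url("1BNA"))    # http://dx.doi.org/10.2210/pdb1bna/pdb
```

The ID is lower-cased because the example in the text writes the DOI for entry 4HHB as "10.2210/pdb4hhb/pdb".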
Remember the structural dimensions which we discussed in Lecture 1 and Lecture 2. This slide by (Kampen, 2013) is a very nice overview of various databases addressing the different microscopic dimensions. Additionally, the data on the level of the hospital information systems are added, so that you have a good summary of the aforementioned. If we set aside literature databases and ontologies (in the upper right corner of this slide), we start with:
Genome databases: Ensembl http://www.ensembl.org/index.html
Nucleotide sequence: EMBL-Bank http://www.ebi.ac.uk/ena/
Gene expression: ArrayExpress http://www.ebi.ac.uk/arrayexpress
Proteomes: UniProt http://www.uniprot.org/
Proteins: InterPro http://www.ebi.ac.uk/interpro/
Protein structure: PDB http://www.rcsb.org/pdb/home/home.do
Protein interactions: IntAct http://www.ebi.ac.uk/intact/
Chemical entities: ChEMBL https://www.ebi.ac.uk/chembl/
Pathways: Reactome http://www.reactome.org/
Systems: BioModels http://www.ebi.ac.uk/biomodels-main/
Ensembl (not to be confused with Ensemble ;-) is a good example of a genome database. It is a joint project between the European Bioinformatics Institute and the Wellcome Trust Sanger Institute, which was launched in 1999 in response to the imminent completion of the Human Genome Project (Flicek et al., 2011). Its aim remains to provide a centralized resource for geneticists and molecular biologists studying the genomes of our own species and of other vertebrates and model organisms. Ensembl provides one of several well-known genome browsers for the retrieval of genomic information.
ArrayExpress is a database of functional genomics experiments that can be queried and the data downloaded. It includes gene expression data from microarray and high-throughput sequencing studies. Data is collected to MIAME and MINSEQE standards. Experiments are submitted directly to ArrayExpress or are imported from the NCBI GEO database.
MIAME = Minimum Information About a Microarray Experiment. This is the data that is needed to enable the interpretation of the results of the experiment unambiguously and potentially to reproduce the experiment (Brazma et al., 2001). The six most critical elements contributing towards MIAME are:
1) The raw data for each hybridisation (e.g., CEL or GPR files);
2) The final processed (normalised) data for the set of hybridisations in the experiment;
3) The essential sample annotation including experimental factors and their values;
4) The experimental design including sample data relationships;
5) Annotation of the array (e.g., gene identifiers, genomic coordinates, probe oligonucleotide sequences or reference commercial array catalog number);
6) Laboratory and data processing protocols (e.g., what normalisation method has been used to obtain the final processed data);
see: http://www.mged.org/Workgroups/MIAME/miame.html
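The six MIAME elements listed above behave like a checklist, so a submission record can be checked against them mechanically. This is only an illustrative sketch: the key names and the dictionary layout are made up here and do not correspond to the actual ArrayExpress submission format.

```python
# Hypothetical key names, one per MIAME element listed above.
MIAME_ELEMENTS = (
    "raw_data",             # 1) raw data for each hybridisation
    "processed_data",       # 2) final processed (normalised) data
    "sample_annotation",    # 3) sample annotation and experimental factors
    "experimental_design",  # 4) design including sample data relationships
    "array_annotation",     # 5) annotation of the array
    "protocols",            # 6) laboratory and data processing protocols
)


def missing_miame_elements(submission: dict) -> list:
    """Return the MIAME elements that are absent or empty in a submission."""
    return [key for key in MIAME_ELEMENTS if not submission.get(key)]


draft = {
    "raw_data": ["sample1.CEL", "sample2.CEL"],
    "processed_data": "normalised_matrix.txt",
    "sample_annotation": {"sample1": {"treatment": "control"}},
    "protocols": [],
}
print(missing_miame_elements(draft))
```

For the draft above the function reports the design, the array annotation and the (empty) protocols as missing, which is exactly the information a reader would need to interpret or reproduce the experiment.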
IntAct is an open source database for protein-protein interactions. The web interface provides both textual and graphical representations of such protein interactions and allows exploring interaction networks in the context of the GO annotations of the interacting proteins. Moreover, a web service allows direct computational access to retrieve interaction networks in XML format. IntAct contains binary and complex interactions imported from the literature and curated in collaboration with the Swiss-Prot team, making intensive use of controlled vocabularies to ensure data consistency (Hermjakob et al., 2004). http://www.ebi.ac.uk/intact
The BioModels Database is a freely accessible online resource for storing, viewing, retrieving, and analyzing published, peer-reviewed quantitative models of biochemical and cellular systems. The structure and behavior of each simulation model are thoroughly checked; in addition, model elements are annotated with terms from controlled vocabularies as well as linked to relevant data resources. Models can be examined online or downloaded in various formats, and reaction network diagrams can be generated from the models in several formats. BioModels Database also provides features such as online simulation and the extraction of components from large-scale models into smaller sub-models. The system provides a range of web services that external software systems can use to access up-to-date data from the database (Li et al., 2010). http://www.ebi.ac.uk/biomodels/
Note: Quantitative models of biochemical and cellular systems are used to answer research questions in the biological sciences, and digital modeling is of growing interest in molecular and systems biology. A well-known example is the Virtual Human (Kell, 2007).
The largest monastery library in the world: a good example of a well-defined knowledge space.
Yes, perfectly correct: this Golden Retriever is bringing back the wooden stick; he is retrieving it. This is exactly what the word "to retrieve" means: bringing something back.
Please remember the basic differences between retrieval and discovery: Retrieval is bringing back an already known object, whereas discovery is finding something which was previously unknown. In other words: Retrieval deals with known objects, and Discovery/Mining is finding new things, in our case new insight (sensemaking) into data. Slide 4-26 makes it clear:
Maimon & Rokach (2010) define Knowledge Discovery in Databases (KDD) as an automatic, exploratory analysis and modeling of large data repositories and the organized process of identifying valid, novel, useful and understandable patterns from large and complex data sets. Data Mining (DM) is the core of the KDD process (Witten, Frank & Hall, 2011). The term KDD actually goes back to the machine learning and Artificial Intelligence (AI) community (Piatetsky-Shapiro, 2000). Interestingly, the first application in this area was again in medical informatics: The program Rx was the first that analyzed data from about 50,000 Stanford patients and looked for unexpected side effects of drugs (Blum & Wiederhold, 1985). The term really became popular with the paper by (Fayyad, Piatetsky-Shapiro & Smyth, 1996), who described the KDD process as consisting of 9 subsequent steps:
1. Learning from the application domain: includes understanding relevant previous knowledge, the goals of the application and a certain amount of domain expertise;
2. Creating a target dataset: includes selecting a dataset or focusing on a subset of variables or data samples on which discovery shall be performed;
3. Data cleansing (and preprocessing): includes removing noise or outliers, strategies for handling missing data, etc.;
4. Data reduction and projection: includes finding useful features to represent the data, dimensionality reduction, etc.;
5. Choosing the function of data mining: includes deciding the purpose and principle of the model for mining algorithms (e.g., summarization, classification, regression and clustering);
6. Choosing the data mining algorithm: includes selecting method(s) to be used for searching for patterns in the data, such as deciding which models and parameters may be appropriate (e.g., models for categorical data are different from models on vectors over reals) and matching a particular data mining method with the criteria of the KDD process;
7. Data mining: searching for patterns of interest in a representational form or a set of such representations, including classification rules or trees, regression, clustering, sequence modeling, dependency and line analysis;
8. Interpretation: includes interpreting the discovered patterns and possibly returning to any of the previous steps, as well as possible visualization of the extracted patterns, removing redundant or irrelevant patterns and translating the useful ones into terms understandable by users;
9. Using discovered knowledge: includes incorporating this knowledge into the performance of the system, taking actions based on the knowledge or documenting it and reporting it to interested parties, as well as checking for, and resolving, potential conflicts with previously believed knowledge (Holzinger, 2013).
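Some of the nine steps above can be sketched as a tiny pipeline. This is a toy illustration under strong assumptions, not the method of Fayyad et al.: the data are invented, only steps 2, 3, 5 to 7 and 8 are represented (step 4 is skipped, steps 1 and 9 are human activities), and the "mining algorithm" is a deliberately simple one-rule classifier chosen for this sketch.

```python
from statistics import mean

# Hypothetical raw records: (age, systolic blood pressure, outcome flag).
# None marks a missing measurement.
raw = [(34, 118, 0), (61, 152, 1), (47, None, 0),
       (70, 165, 1), (29, 110, 0), (58, 149, 1)]


def select_target(records):
    """Step 2: focus on a subset of variables (pressure and outcome)."""
    return [(bp, outcome) for _age, bp, outcome in records]


def cleanse(records):
    """Step 3: handle missing data, here simply by dropping the record."""
    return [r for r in records if r[0] is not None]


def mine_threshold(records):
    """Steps 5 to 7: a one-rule classifier. The decision threshold is the
    midpoint between the mean pressures of the two outcome classes."""
    pos = [bp for bp, y in records if y == 1]
    neg = [bp for bp, y in records if y == 0]
    return (mean(pos) + mean(neg)) / 2


def interpret(threshold):
    """Step 8: translate the pattern into terms understandable by users."""
    return (f"in this sample, systolic pressure above {threshold:.0f} "
            "goes along with the outcome")


data = cleanse(select_target(raw))
threshold = mine_threshold(data)
print(interpret(threshold))
```

In a real project step 8 would often send you back to an earlier step, for example to reconsider how missing values are treated, before any result is put to use in step 9.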
In information retrieval, a query q is defined as a formulation q = (N, L); matching it against an index I, written Matching(q, I), retrieves the relevant data that satisfy the search query (Baeza-Yates & Ribeiro-Neto, 2011).
Please remember the differences between data objects and information objects: data is an abstract representation in the computational space; information is perceivable in the cognitive space. (Note that this does not mean that information is automatically knowledge: to get knowledge we must use both our perception and cognition, i.e. human intelligence.)
An excellent starting point for distinguishing data retrieval (DR) from information retrieval (IR) is the work of Van Rijsbergen (1979): the most important difference is that the data model in DR is deterministic, whereas in the IR model we speak about probable information, hence information retrieval is probabilistic (Simonic & Holzinger, 2010).
* Monothetic = type in which all members are identical on all characteristics;
** Polythetic = type in which all members are similar, but not identical.
IR can be defined as the recall of already existing information; it does not aim at the discovery of new structures, which is the goal in Knowledge Discovery and Data Mining (see → Lecture 6). As we have already heard several times, most of the data in hospital information systems consists of medical documents, which in turn consist mostly of unstructured information: text. But what is text? From a computational perspective, text consists of sequences of character strings, the syntax (Hotho, Nürnberger & Paaß, 2005); hence it is an abstract representation of natural language, and the challenges are in semantics (meaning). Text processing belongs to the field of Natural Language Processing (NLP), which is highly interdisciplinary, dealing with the interaction between the cognitive space (natural languages) and the computational space (formal languages). As such, NLP is closely related to HCI. Text mining is a subfield of data mining. The original goal of IR was to find documents which contain answers to questions, not to find the answers themselves (Hearst, 1999). For this purpose statistical measures and methods are used, and we need a formal description first.
This is the general principle: the end user formulates a query via the user interface, in the form of a text operation ("user need"). The next step is the representation of the documents (logical document view D in the formal model in → Slide 4-30) and the representation of the reasoning strategy, the query logical view Q (compare with → Slide 4-30 and → Slide 4-31). The result is a ranking of the retrieved documents, which is displayed via the user interface.
Modeling the IR process is complex, because we are dealing with imprecise, vague and uncertain elements. It is difficult to formalize due to the strong influence of human factors, i.e. relevance and information needs, which are highly subjective and context specific. However, in the definition of any IR model we can identify some common aspects (Canfora & Cerulo, 2004). The first step is the representation of documents and information needs. From these representations a reasoning strategy can be defined, which solves a representation similarity problem to compute the relevance of documents with respect to the queries. Various strategies have been introduced with the aim of improving the IR process. We classify these methodologies under two main aspects: Representation (query & document, see → Slide 4-33) and Reasoning (application of diverse methods, see → Slide 4-34). Let the IR model be a quadruple
Eq. 4-1: IR = {D, Q, F, R(q_i, d_j)}
D is a set composed of logical views (representation component) of the documents within a collection;
Q is a set of logical views (representation component) of the user information needs (these are called queries);
F is a framework for modeling document representations, queries and their relationships (reasoning component); this includes sets and Boolean relations, vectors and linear algebra operations, sample spaces and probability distributions;
R(q_i, d_j) is a ranking function (→ Slide 4-31) that associates a real number with a query representation q_i ∈ Q and a document representation d_j ∈ D. Such a ranking defines an ordering among the documents with regard to the query q_i.
The end user in → Slide 4-29 formulates a query in the form of a text operation; the next step is the representation of the documents (logical view D) and of the reasoning strategy; both logical views D and Q (compare with Slide 4-31) result in a ranking of the retrieved documents.
The logical views D and Q result in the ranking function R(q_i, d_j) according to (Baeza-Yates & Ribeiro-Neto, 2011).
Speak: R indexed d subscript j and q subscript i.
Guess which algorithm this is? A short description can be found in Hastie, T., Tibshirani, R. & Friedman, J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, New York, Springer.
Yes! There are a lot of different methods, each with particular advantages and disadvantages. We cannot discuss much here, but we can get a rough overview.
The representation component is an essential part of every IR system, as it is the representation of the information itself (visible to the user): information can be processed only if it is represented in an appropriate way. Queries are the representation of the information needs of a user.
Note: A text can be characterized by four attributes: syntax, structure, semantics, and style. A text has a given syntax and a structure, which are usually dictated by the application or by the person who created it. Text also has semantics, specified by the author of the document. Additionally, a document may have a presentation style associated with it, which specifies how it should be displayed or printed. In many approaches to text representation the style is coupled with the document syntax and structure (e.g. LaTeX). XML separates the representation of syntax and structure, defined either by a DTD or an XSD, from style, which is captured by XSL (Canfora & Cerulo, 2004).
Note: An n-gram is a subsequence of n items from a given sequence. The items in question can be phonemes, syllables, letters, words or base pairs, according to the application. An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bi-gram" (or, less commonly, a "di-gram"); size 3 is a "tri-gram"; size 4 is a "four-gram" and size 5 or more is simply called an "n-gram". Some language models built from n-grams are "(n−1)-order Markov models". An n-gram model is a type of probabilistic model for predicting the next item in such a sequence. n-gram models are used in various areas of statistical natural language processing and genetic sequence analysis.
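The n-gram note above can be illustrated with a minimal Python sketch. The function names `ngrams` and `bigram_model` are hypothetical and chosen only for this example; the model is a plain maximum-likelihood bigram estimate (a first-order Markov model) without any smoothing, which a real system would most likely need.

```python
from collections import Counter, defaultdict


def ngrams(items, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def bigram_model(items):
    """Estimate P(next item | previous item) from bigram counts.

    This is an unsmoothed maximum-likelihood estimate, used here only
    to illustrate the idea of predicting the next item in a sequence.
    """
    counts = defaultdict(Counter)
    for prev, nxt in ngrams(items, 2):
        counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(ctr.values()) for nxt, c in ctr.items()}
        for prev, ctr in counts.items()
    }


if __name__ == "__main__":
    # Letters or base pairs work the same way as words.
    print(ngrams("ACGT", 2))
    print(ngrams("the patient has fever".split(), 3))
    print(bigram_model("a b a c a b".split())["a"])
```

The same `ngrams` function works for words, letters or base pairs because it only relies on slicing a sequence.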
Deep learning algorithms are based on distributed representations, with the assumption that observed data is generated by the interactions of many different factors on different levels. Deep learning adds the assumption that these factors are organized into multiple levels, corresponding to different levels of abstraction or composition; various numbers of layers and layer sizes can be used to provide different amounts of abstraction. Bengio, Y., Courville, A. & Vincent, P. (2013). "Representation Learning: A Review and New Perspectives". IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8): 1798-1828.
Reasoning refers to the set of methods, models, and technologies used to match document and query representations in the retrieval task. Closely related to the reasoning component is the concept of relevance. The primary goal of an IR system is to retrieve the documents relevant to a query. The reasoning component defines the framework to measure the relevance between documents and queries using their representations (Canfora & Cerulo, 2004). Google, for example, uses a keyword-based vector space model (see → Slide 4-38) along with graph-based probability theories and fuzzy set theories. Slide 4-35 shows a concise overview of some selected methods, according to various document properties.
There are many methods of IR; for details consult a standard reference, e.g. Baeza-Yates & Ribeiro-Neto (2011). Set theoretic approaches include the Classic Set-based Boolean, the Extended Boolean and the Fuzzy Approach; Algebraic approaches include the Generalized Vector Model, Latent Semantic Indexing (LSI) and Neural Networks; and the Probabilistic approach includes Bayesian Networks, Language Models and Inference Networks. We will discuss only a few, and these very briefly, so that you have a quick overview: the set theoretic approach (Boolean Model) in Slide 4-36 and Slide 4-37; the Vector Space Model in Slide 4-38 to Slide 4-42; and the Probabilistic Model in Slide 4-43 to Slide 4-44.
Documents/queries are represented as a set of index terms; queries are Boolean expressions (AND, OR, NOT). For the Boolean model, the index term weight variables are binary, i.e. w_{i,j} ∈ {0, 1}. A query q is a conventional Boolean expression. Let q_dnf be the disjunctive normal form of the query q. Further, let q_cc be any of the conjunctive components of q_dnf. The similarity of a document d_j to the query q is defined as sim(d_j, q) = 1 if there exists a conjunctive component q_cc in q_dnf whose binary term weights all agree with those of d_j, and sim(d_j, q) = 0 otherwise.
If sim(d_j, q) = 1 then the Boolean model predicts that the document d_j is relevant to the query q. Otherwise the prediction is that the document is not relevant. For details please refer to (Baeza-Yates & Ribeiro-Neto, 2011).
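A minimal sketch of the Boolean similarity, assuming that each conjunctive component q_cc is given as a mapping from query terms to binary weights and that a document is a set of index terms. The function name `sim_boolean` and the example query q = k_a AND (k_b OR NOT k_c) are illustrative assumptions, not part of the slides.

```python
def sim_boolean(doc_terms, q_dnf):
    """Return 1 if the document matches any conjunctive component, else 0.

    doc_terms: set of index terms present in the document.
    q_dnf: list of conjunctive components; each component maps a query
           term to its required binary weight (1 = present, 0 = absent).
    """
    for q_cc in q_dnf:
        if all((term in doc_terms) == bool(weight)
               for term, weight in q_cc.items()):
            return 1
    return 0


# Hypothetical query q = ka AND (kb OR NOT kc), expanded by hand into
# its disjunctive normal form over (ka, kb, kc):
# (1,1,1) OR (1,1,0) OR (1,0,0)
Q_DNF = [
    {"ka": 1, "kb": 1, "kc": 1},
    {"ka": 1, "kb": 1, "kc": 0},
    {"ka": 1, "kb": 0, "kc": 0},
]

if __name__ == "__main__":
    print(sim_boolean({"ka", "kb"}, Q_DNF))  # ka and kb present, kc absent
    print(sim_boolean({"ka", "kc"}, Q_DNF))  # kb absent but kc present
```

Because the result is strictly 0 or 1, there is no way to order the matching documents, which is the ranking limitation of the Boolean model.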
The Boolean model has several advantages: it is easy to understand, it has an exact formalism, and its query language is expressive. However, it also has serious disadvantages: there are no partial matches, the "bag-of-words" representation does not accurately consider the semantics of documents (Vallet, Fernández & Castells, 2005), the query language is complicated, and the retrieved documents cannot be ranked.
The Extended Boolean Model (EBM) by Salton, Fox & Wu (1983) overcomes some of these disadvantages by making use of partial matching and term weights, similar to the vector space model. Moreover, because vector-processing systems suffer from one major disadvantage, namely that the structure inherent in the standard Boolean query formulation is absent, the EBM combines the characteristics of the Vector Space Model with the properties of Boolean algebra. Hence, the EBM can also be applied when the initial query statements are available as natural language formulations of user needs, rather than as conventional Boolean formulations.
The vector space model (VSM) represents documents as vectors in an m-dimensional space (Salton, Wong & Yang, 1975). Thus, documents can be compared by vector operations, and queries can be performed by encoding the query terms, in the same way as the documents, in a query vector. This query vector can be compared to each document, which returns a result list by ordering the documents according to the computed similarity. The main task of the vector space representation of documents is to find an appropriate encoding of the feature vector. Each element of a vector usually represents a word (see → Slide 4-40) of the document collection. The size of the vector is defined by the number of words of the complete document collection. The easiest way of document encoding is to use binary term vectors, i.e. a vector element is set to 1 if the corresponding word is used in the document and to 0 if it is not (Equation 4-4). This encoding results in a simple Boolean comparison. To improve performance, term weighting schemes are usually used, where the weights reflect the importance of a word in a specific document of the considered collection. Large weights are assigned to terms that are used frequently in relevant documents but rarely in the whole document collection (Salton & Buckley, 1988). Thus a weight w(d, t) for a term t in document d is computed as the term frequency tf(d, t) times the inverse document frequency idf(t), which describes the term specificity within the document collection. The ranking can be made by using the cosine similarity (see → Slide 4-41). The cosine of the angle between two vectors is a measure of how "similar" they are, which in turn is a measure of the similarity of the underlying strings. If the vectors are of unit length, the cosine of the angle between them is simply the dot product of the vectors (Tata & Patel, 2007).
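The tf-idf weighting and cosine ranking described above can be sketched as follows. This is a minimal illustration under stated assumptions: documents are already tokenized into lists of words, idf(t) is taken as log(N / df(t)) (one common variant; the slides do not fix a formula), and the names `tfidf_vectors`, `cosine` and `rank` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Encode tokenized documents as tf-idf vectors over the vocabulary."""
    vocab = sorted({t for d in docs for t in d})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, idf, vectors


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Return (score, document index) pairs, best match first."""
    vocab, idf, vectors = tfidf_vectors(docs)
    q_tf = Counter(query)
    q_vec = [q_tf[t] * idf[t] for t in vocab]  # query encoded like a document
    scores = [(cosine(q_vec, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


if __name__ == "__main__":
    docs = [["fever", "cough"],
            ["fracture", "arm"],
            ["fever", "fever", "headache"]]
    for score, idx in rank(["fever"], docs):
        print(idx, round(score, 3))
```

Unlike the Boolean model, every document receives a real-valued score, so partial matches are possible and the result list can be sorted.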
As a result we get a matrix representation, and now we can apply vector algebra, or in particular linear algebra, here still in R^3. Mathematically, we can work in arbitrarily high dimensional spaces. The major problem involved is the mapping back into R^2. One very positive aspect is that the resulting matrices are typically sparse, which saves a lot of computational power.
Turney, P. D. & Pantel, P. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37, (1), 141-188.
Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. Turney & Pantel (2010) survey the use of VSMs for semantic processing of text. They organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term-document, word-context, and pair-pattern matrices, yielding three classes of applications. They survey a broad range of applications in these three categories and take a detailed look at a specific open source project in each category. Their goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field.
In the retrieval procedure, similar documents are ranked by cosine similarity in the m-dimensional vector space.
The information need Q is encoded as a query vector q = (q_1, q_2, …, q_m), which is then compared with the document vectors.
The advantages of this method are that it is a simple mathematical model, that the matrices are sparse (a favourable data structure), and that retrieval can be carried out in O(n), so the ranking is relatively fast.
Disadvantage: the word order is lost (bag-of-words approach).
There are many further methods, e.g. Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation (LDA), etc.
The advantages of the algebraic VSM are that it is easy to understand, partial matches are possible, documents can be sorted by rank, and it uses term-weighting schemes; on the other hand, calculating the similarity requires a higher computational effort, and the "bag-of-words" representation does not accurately consider the semantics of documents (Vallet, Fernández & Castells, 2005).
For the probabilistic model, the index weight variables are all binary, i.e. ω_ij ∈ {0, 1}, ω_iq ∈ {0, 1}. A query q is a subset of index terms. Let R be the set of documents known (or initially guessed) to be relevant. Let R̅ be the complement of R (this is the set of non-relevant documents). Let P(R | d_j) be the probability that the document d_j is relevant to the query q and P(R̅ | d_j) be the probability that d_j is non-relevant to q. The similarity sim(d_j, q) of the document d_j to the query q is defined as the ratio sim(d_j, q) = P(R | d_j) / P(R̅ | d_j).
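As a hedged sketch of how this ratio is usually evaluated, the derivation below follows the standard textbook treatment rather than anything shown on the slide. The symbols k_i (index terms) and the assumption that index terms occur independently are introduced here for illustration. The first line is Bayes' rule; the second line is rank-equivalent only, obtained by taking logarithms and dropping factors that are the same for every document.

```latex
\begin{align}
\mathrm{sim}(d_j, q)
  &= \frac{P(R \mid d_j)}{P(\bar{R} \mid d_j)}
   = \frac{P(d_j \mid R)\, P(R)}{P(d_j \mid \bar{R})\, P(\bar{R})} \\
  &\sim \sum_{i} \omega_{iq}\, \omega_{ij}
     \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)}
          + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)
\end{align}
```

Since P(R) and P(R̅) do not depend on the document, they do not affect the ranking; the probabilities P(k_i | R) and P(k_i | R̅) are not known in advance and have to be guessed initially and then refined.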
As with all models, there are certain pros and cons. The probabilistic model has one big advantage: the documents can be ranked by relevance. On the disadvantageous side, it is a binary model (binary weights), the index terms are assumed to be independent, there is no document normalization, and there is a need to guess the initial separation of documents into relevant and non-relevant sets.
Well, there are two main measurements.
Recall and Precision: hard as a bone.
Following this definition: Recall = Correct / (Correct + Missing) and Precision = Correct / (Correct + Spurious).
Precision P is the fraction of retrieved documents that are relevant to the search: P = |{set of relevant docs} ∩ {set of found docs}| / |{set of found docs}|.
Recall R is the fraction of the documents relevant to the query that are successfully retrieved: R = |{set of relevant docs} ∩ {set of found docs}| / |{set of relevant docs}|.
A combination of precision and recall is the harmonic mean of both, which is called the F-measure: F = 2 ∙ (P ∙ R) / (P + R).
In classification the following terms are used: true positives (= correct); true negatives (= correct); false positives (= spurious); false negatives (= missing, i.e. relevant items that were not detected).
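The three measures can be computed directly from the two sets. The following is a minimal sketch; the function name `precision_recall_f` and the convention of returning 0.0 when a denominator is empty are assumptions made for this example.

```python
def precision_recall_f(relevant, retrieved):
    """Compute precision, recall and F-measure for one query.

    relevant:  iterable of document ids that are relevant to the query.
    retrieved: iterable of document ids returned by the system.
    """
    relevant, retrieved = set(relevant), set(retrieved)
    correct = len(relevant & retrieved)           # relevant and found
    precision = correct / len(retrieved) if retrieved else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # 4 relevant documents; the system finds 3, of which 2 are relevant.
    p, r, f = precision_recall_f({1, 2, 3, 4}, {3, 4, 5})
    print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

In the terms used above, document 5 is spurious (false positive) and documents 1 and 2 are missing (false negatives).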
In this slide we see an overview of the linguistic processing pipeline that describes the steps performed from the document to its semantic representation. The domain knowledge used in the semantic retrieval system is modeled in the form of the medical semantic network ID MACS® (MSN). It uses the Wingert Nomenclature (WNC) as its medical terminology. The WNC is based on the German version of SNOMED developed by Friedrich Wingert. Although its main focus is on German, it also supports, to a lesser extent, several other languages including English and French. The MSN forms a simple ontology whose concepts are organized in a taxonomy (isA hierarchy) and a mereology (anatomical partOf hierarchy). Further relations between concepts are modeled by labeled edges. The MSN is divided into several subdomains, including:
- topography (i.e., anatomical concepts)
- morphology (e.g., fracture, fever)
- function (e.g., respiration)
- diseases (e.g., glaucoma)
- agents (e.g., pathogens, pharmaceutical substances)
Currently, the MSN contains more than 90,000 terms and 300,000 unique relations. The query language follows a simple grammar, namely:
    Query ::= Disjunction
    Disjunction ::= Conjunction | Conjunction ";" Disjunction
    Conjunction ::= Atom | Atom "," Conjunction
    Atom ::= Term | "!" Term
Thus a query forms a Boolean expression in disjunctive form over search terms. Semantic query expansion has been discussed in several previous works (Kingsland, Harbourt, Syed & Schuyler, 1993), (Aronson, Rindflesch & Browne, 1994), (Efthimiadis, 1996). The approach is as follows: each search term is indexed (using the linguistic processing methods described above) and replaced by the identifier of the WNC concept matching the term. These concept identifiers are called WNC indices. If the search term refers to a combination of several concepts in the WNC (e.g., Gastroparesis = Stomach + Paresis), the search term is replaced by a conjunction of the corresponding WNC indices (Kreuzthaler et al., 2011).
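The query grammar above is small enough to parse by splitting on its separators. The sketch below is only an illustration: it assumes that a Term is any non-empty string without ";", "," or "!", it matches plain lower-cased terms against a set of document terms instead of WNC indices, and the names `parse_query` and `matches` are hypothetical.

```python
def parse_query(query):
    """Parse a query into disjunctive form.

    Returns a list of conjunctions; each conjunction is a list of
    (term, negated) pairs. ";" separates disjuncts, "," separates the
    atoms of a conjunction, and a leading "!" negates a term.
    """
    disjunction = []
    for conj_text in query.split(";"):
        conjunction = []
        for atom_text in conj_text.split(","):
            atom = atom_text.strip()
            negated = atom.startswith("!")
            term = atom[1:].strip() if negated else atom
            if not term:
                raise ValueError(f"empty term in query: {query!r}")
            conjunction.append((term.lower(), negated))
        disjunction.append(conjunction)
    return disjunction


def matches(disjunction, doc_terms):
    """True if at least one conjunction is satisfied by the document."""
    return any(
        all((term in doc_terms) != negated for term, negated in conj)
        for conj in disjunction
    )


if __name__ == "__main__":
    q = parse_query("stomach, paresis; !fever")
    print(q)
    print(matches(q, {"stomach", "paresis", "fever"}))
    print(matches(q, {"fever"}))
```

In the system described above, each parsed term would additionally be replaced by its WNC index (or a conjunction of indices) before matching; that expansion step is not modeled here.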
As can be seen from this slide, the medical domain expert outperforms the other retrieval methods, achieving high precision at a high recall level. Interestingly, the semantic-based information retrieval tool achieves approximately the same recall level as the medical domain expert while having a lower precision value. This is a good result, considering the effort the medical domain expert has to make to translate the information need into a query string. In contrast, the input for the information retrieval tool is short and clear, so less effort is needed to transform the information need into the query language understood by the tool. Keyword search has a high precision value but a lower recall value. This result is plausible: information needs that can be described by the chosen keyword(s) will achieve a high precision value, so if documents are found they will be relevant, but the recall level will generally suffer. Looking at Slide 4-47, keyword search achieves approximately the same precision as IR Tool One but a far worse recall. It is also possible that no search results are found at all when using the keyword search methodology, as can be seen for the information need "Neubildung, Darm" (neoplasm, intestine) (see Appendix B and Appendix A). In contrast, for this information need IR Tool One has about the same precision and recall levels as the medical domain expert, reflecting the semantic processing chain of the tool. The LSA statistical retrieval method has, when compared to the other methods, a lower precision for all measured recall levels. This result gives the impression that LSA is applicable for getting high precision values for a particular amount of search results, but is hard to use for achieving both high precision and high recall values, which is needed for example in clinical studies.
The future of big data is … big, and there will be many challenges for us to solve!
The grand question of the future is how to make sense of the data; mega questions include: "What is interesting?" and "What is relevant?"
Adverse Drug Events (ADE) are very common, and therefore special care must be taken with order entry. The medication orders in different medication systems: (a) Kardex system, (b) TIMED system, and (c) CPOE system.
Physicians must enter their medication orders into the system; nurses may not accept any hand-written prescription. A physician enters a medication order by selecting a drug and its dosage form, strength, administration route, dosage regimen, start date and time.
NS = Non-Stock
The medication ordering and administration processes in the Kardex system and the TIMED system; MO (Medication Order); HIS (Hospital Information System); NS (Non-Stock); for requesting urgent NS drugs, nurses often directly referred to the pharmacy with hand-written requests.
Comparison of Figs. 2 and 3 shows that the medication ordering and administration process after the implementation resembles that of the Kardex system, while it is completely different from that of the TIMED system. In both Kardex and TIMED units, we compared nurse attitudes towards the computerized process in the post-implementation phase with their attitudes towards the paper-based process in the pre-implementation phase.
There is no clear definition of this, but it is definitely about the management of data, information and knowledge for decision support. Let us look at a practical example, the physician order, where a lot of errors happened in the past because paper-based orders produced a lot of paper chaos (you all know the post-it syndrome).
Observations and results of investigations (including history, signs, and symptoms) are converted by clinical staff into decisions and appropriate actions. Control usually requires the use of records and external sources of knowledge.
The care of each patient can be considered to be a control loop in which data from observations and investigations lead to decisions and actions designed to take care of a patient's problems and their consequences in a safe, effective, and legitimate manner. This loop occurs in all specialties and is the source of all the activities of a health care facility such as a hospital. Though complex, these activities can be set out as four concentric shells.
The clinical control loop is at the core of a complex organisation represented by four "shells" that exchange data. Activity shells of the clinical control loop:
Clinical management shell: Assessment of observations and results of investigations. Formulation of decisions, including those based on observations, investigations, and procedures carried out during a consultation.
Clinical administrative shell: Administrative activities which facilitate the clinical management shell and link it to the other shells, such as arranging appointments and investigations, clinical correspondence, filing results, and clinical audit.
Clinical services shell: Investigative, therapeutic, and general services provided by laboratories, imaging facilities, therapy units, operating theatres, wards, supplies departments, transport, etc.
General management shell: General management of health care, by hospital managers, financial controllers, health care purchasers, and statutory authorities.
Example of a visualized information system architecture, here of the computer-supported part of the hospital information system of the Medical School Hanover from 1984 ([1], p. 9).
Mayo's Enterprise Data Modeling (EDM) provides a context for Mayo enterprise activities.
Care2x is a generic multi-language open-source project that implements a modern Hospital Information System. The project was started in May 2002 with the release of the first beta version of Care2x by a nurse who was dissatisfied with the HIS in the hospital where he was working. To date, the development team has grown to over 100 members from over 20 countries. Care2x is a web-based HIS that is built upon other open-source projects: the Apache web server from the Apache Foundation (http://www.apache.org/), the script language PHP (http://www.php.org/) and the relational database management system mySQL (http://www.mysql.com/). There exist several source code branches that try to integrate the option to choose from other RDBMS like Oracle and postgreSQL. The latter is already supported in the current version at the time of writing: "deployment 2.1". For our investigations we have chosen the most feature-rich version that was available from the Care2x web page in early fall 2004. This release had the version number "pre-deployment 2.0.2". Some minor deficiencies that we report later may already be fixed in the current version "deployment 2.1".
This is just to show you an example of a global database schema. Each molecule ("Molecules" table) may have more than one conformation ("Conformations" table) and it may come from more than one source ("Sources" table). There are two types of experiments ("Experiments" table) that are done on molecules: computational docking and biological assays. The results ("DockingResults" and "AssayResults" tables) of these experiments were captured in the database. Each type of experiment is done on a particular p53 mutant ("Mutants" table) and has a score ("Scores" table) associated with it.
Open Database Connectivity (ODBC): an API in C for accessing DBMS.
System architecture and the hybrid strategy for data integration. Docking and small molecule data use the mediation approach, while the functional and structural assay data use the data warehousing approach. The CRDB is both a mediator and a data warehouse. "Mutants" and "Molecular" are data marts of the warehouse. The ODBC drivers are wrappers in the mediation approach. Dashed lines indicate integration planned in the future.
The atomic coordinates of a protein are deposited in the Protein Data Bank (PDB), an international repository for 3D structure files. At the moment the PDB contains more than 26,000 protein structures.
We will deal with visualizations in lecture 9; here is just an appetizer of what you can display.
This shows a cervical cancer query visualization. The gene nodes are positioned using both chromosome number and organism name. This positioning method allows users to focus on a particular gene and species using NVSS's slider filters. Nodes are size-coded according to their in-degree, which provides an additional visual cue about the node's importance.
http://psychology.wikia.com/wiki/Information_retrieval
http://www.eecs.wsu.edu/mgd/gdb.html (Graph Datasets)