[ieee 2011 24th international symposium on computer-based medical systems (cbms) - bristol, united...

Predictions in Antibiotics Resistance and nosocomial infections monitoring.

Mary Gerontini
Athens University of Economics
Department of Informatics
76, Patission Str., GR10434 Athens-Greece

[email protected]

Michalis Vazirgiannis∗

Athens University of Economics
Department of Informatics

[email protected]

Alkiviadis C. Vatopoulos
National School of Public Health, Athens, Greece
Department of Microbiology
196, Alexandras Str., GR11521 Athens-Greece

[email protected]

Michalis Polemis
National School of Public Health, Athens, Greece

Department of [email protected]

Abstract

Nosocomial infections and antibiotic resistance are regarded as critical issues both in clinical medicine and in public health; understanding their epidemiology is therefore a priority in the health sector. Our research aims at demonstrating that data mining techniques, such as regression, classification and association rules, can assist in discovering interesting patterns in the epidemiological trends of antibiotic resistance in Greek hospitals. In this work, we present a novel framework which integrates data from multiple hospitals and discovers association rules that are stored in a data warehouse. Furthermore, this data warehouse is used as a source for extracting interesting and valid predictions by applying techniques such as regression and classification. Our system is fully operational and treats real-world data from WHONET, a software package installed in the majority of Greek member hospitals of the "Greek System for Surveillance of Antimicrobial Resistance" network. The contributions of the proposed framework are (i) a standardized workflow for the seamless integration of data produced in various hospitals into a consistent data warehouse and (ii) the use of mechanisms to predict hidden future behavior on large datasets, using regression and classification algorithms, which can provide significant surveillance warnings.

∗Prof. M. Vazirgiannis is partially supported by the DIGITEO Chair grant LEVETONE in France and the Research Centre of the Athens University of Economics and Business, Greece.

1 Introduction

The increasing rate of isolation, during recent years, of antibiotic-resistant bacteria in hospitals and in the community is considered a major public health threat in many parts of the world. The respective infections are difficult to treat, since few antibiotics remain active against the infectious agent. Moreover, these antibiotics are expensive and, in some cases, toxic and pharmacokinetically/pharmacodynamically less appropriate.

Antibiotic resistance is the result of various genetic alterations in the bacterial cell, such as mutations of the target site of the antibiotic, the acquisition of efflux pumps and, more importantly, the acquisition through horizontal gene transfer among bacteria of genes encoding enzymes that destroy the respective antibiotic, such as beta-lactamases. In that respect, the epidemiology of antibiotic resistance is the result of the spread of pathogenic bacteria evolving through mutations and/or the acquisition of genes.

Surveillance of antibiotic resistance is based on monitoring the aforementioned mobility of bacteria, as well as of genes, and is carried out either through the collection and study of representative bacterial isolates or through the analysis of routinely collected data from the microbiology laboratories of the hospitals.

In Greece, a national network for the continuous monitoring of bacterial antibiotic resistance in Greek hospitals (Greek System for the Surveillance of Antimicrobial Resistance) has been in place since 1995 [6]. Its function is based on the assumption that the routine results of the antibiotic sensitivity tests performed daily in each hospital clinical laboratory should be considered a major resource for antibiotic resistance surveillance.

Moreover, since the quality and compatibility of these data are in principle uncertain, our approach is to work in parallel on both accessing the data and assessing their quality. This is accomplished through the establishment of a quality control procedure and the adoption of a common code and data format in all hospitals through the use of the WHONET software [7], originally developed by the WHO Collaborating Centre for Surveillance of Antibiotic Resistance in Boston, USA, and further developed in the Division of Emerging and other Communicable Diseases Surveillance and Control, WHO (WHO/EMC), Geneva, Switzerland [7][8].

The WHONET software was adopted as a common software platform because of its user-friendly, flexible features, its ability to expand the pyramidal reporting structure and its capability to interface with other statistical packages and programs. Data are collected from all sources every 6 months and analyzed, and the relevant reports are published on the respective Web site: www.mednet.gr/whonet. However, the complexity of the antibiotic resistance phenomenon, which involves many bacterial species, evolving bacterial clones and horizontally transferred genes, motivated the pursuit of techniques for further analyzing these data in order to reveal hidden associations, time trends and time/space clustering, which are important for an effective strategy to confront the antibiotic resistance epidemic. For this reason, numerous data mining algorithms have recently been used to extract knowledge from large databases [3][4][5].

Since traditional manual activities such as antibiogram summaries have proven to be time consuming, the resulting measures and patterns are often not up to date and many useful patterns remain undiscovered. For these reasons, we designed and developed a web-based framework that contributes to antibiotic resistance surveillance by identifying outbreaks in antibiotic resistance and applying extensive analysis to hospital data. Several other studies [2][3] have also analyzed such data. Our system (i) supports data collection from multiple hospitals via a user-friendly interface with noise-cleaning capabilities, storing the data in a central data warehouse; (ii) uses state-of-the-art data mining algorithms (such as association rules, Support Vector Machines and linear regression) to extract useful, previously unknown patterns and to build antibiotic sensitivity prediction models as well as nosocomial infection forecasts; and (iii) offers an advanced visualization and reporting mechanism based on customizable graphs, in order to instantly present critical information to experts and thus make data management and decision making easier and more effective.

The paper is organized as follows: in Section 2 we present the architecture of our model and the data format, in Section 3 we present the results of our manifold data analysis, and in Section 4 we conclude with a brief summary and further research directions.

2 System Model

Our system is a web-based framework which manages the collection and integration of incoming public health data from multiple hospitals. While in previous work [2] we reported on a system to extract association rules, here we extend this idea by providing new features for prediction and visualization of data, capitalizing on these association rules. Specifically, we provide visualization of the temporal validity of these rules. We also attempt predictions regarding (a) the future validity of the rules and (b) the antibiotic resistance of bacteria, based on the public health data stored in a data warehouse.

2.1 Data

The data set used in this work was collected via the WHONET system during the period 2003-2009. The data are stored, preprocessed, cleaned and formatted using an interface which we developed for this purpose. Here we discuss the attributes used in our analysis. Each record describes a bacterial strain isolated from a patient, together with a sensitivity test, and contains the following features: organism group (bacterial species), specimen group, the department in which the patient was hospitalized, the period of time in which the strain was isolated (we use trimesters in our implementation) and the resistance to the antibiotics tested against the specific bacteria; see Table 1 for a summary of the data we capitalize on. The data set consisted of 1768 training instances (retrieved association rules), including 442 organism groups, 53 specimen groups, 56 hospitals in Greece and 41 types of antibiotics.

2.2 Extraction of Association Rules

An initial data mining step involved the extraction of association rules (a very popular technique for discovering correlations) representing non-obvious relations and hidden patterns in public health data. The produced rules are aggregated and stored in an appropriate warehouse which provides easy access to them. The algorithm used to produce the association rules is Apriori [2][3]; its pseudocode is described in Figure 1.

Name                    Values
Specimen Group          wound, blood, urine
Hospital                GR0001, GR002, ...
Period                  1-3/2003, 4-7/2003, ...
Department              icu, meth, out, ...
Organism Group          E. Coli, ...
Antibiotic Resistance   Resistant, Intermediate, Sensitive, ...

Table 1. Data attributes and values

The extracted rules have one of two formats:

Specimen, Hospital, Department, Period, Organism → AntibioticResistance

or

Specimen, Hospital, Department, Period → Organism

and are used for further statistical analysis in order to predict the future behavior of these rules. Samples of the extracted rules are displayed in Tables 2 and 3.

Figure 1. Apriori Algorithm
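The level-wise search that Apriori performs can be illustrated with a minimal Python sketch. It is not the implementation used in the system described here: the function name `apriori`, the `attribute=value` item encoding and the toy records are assumptions introduced only for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset (frozenset): support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        prev = list(current)
        # Join step: unite frequent (k-1)-itemsets into size-k candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy records; the values are made up for illustration.
records = [
    {"specimen=urine", "department=out", "organism=eco"},
    {"specimen=urine", "department=out", "organism=eco"},
    {"specimen=urine", "department=icu", "organism=eco"},
    {"specimen=blood", "department=icu", "organism=sau"},
]
frequent = apriori(records, min_support=0.5)
```

Rules of the two formats given above could then be formed from each frequent itemset by placing the organism (or the antibiotic resistance) item on the right-hand side and the remaining items on the left-hand side.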

LHS                                        RHS
Specimen   Hospital   Depart.   Period     Pathogen
Genital    GR61       out       4-6/2003   Strept.

Table 2. Sample for first type of extracted association rules

2.3 Prediction methods design

The temporal dimension of the nosocomial infection and antibiotic resistance data is a critical one. We attempt, via state-of-the-art data mining methods, to predict the validity of the extracted association rules. The techniques we used include time series analysis (via linear regression) and Support Vector Machines. Hereafter we elaborate on the usage and results of those methods.

LHS                                     RHS
Spc.   Hosp.   Dep.   Per.     Path.    Antib.   Res.
UR     GR61    out    1-6/05   E.coli   CLI      R

Table 3. Sample for second type of extracted association rules

2.3.1 Linear Regression

In most hospitals, it is vital to forecast the trends in the isolation rate of a variety of pathogens and in antibiotic resistance. In that respect, we used the previously extracted association rules in order to check their temporal validity and their future behavior. In this work we used two basic interestingness measures, confidence and leverage, which quantify the interestingness and validity of a specific rule.
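As a small illustration of the two measures, the following Python sketch computes them for a single rule, assuming the standard definitions: confidence(A → B) = supp(A ∪ B) / supp(A) and leverage(A → B) = supp(A ∪ B) − supp(A) · supp(B). The function name `rule_measures` and the toy records are hypothetical.

```python
def rule_measures(records, lhs, rhs):
    """Return (support, confidence, leverage) of the rule lhs -> rhs."""
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    n = len(records)
    n_lhs = sum(1 for r in records if lhs <= r)
    n_rhs = sum(1 for r in records if rhs <= r)
    n_both = sum(1 for r in records if (lhs | rhs) <= r)
    support = n_both / n
    # Confidence: how often the RHS holds among records matching the LHS.
    confidence = n_both / n_lhs if n_lhs else 0.0
    # Leverage: observed co-occurrence minus the co-occurrence expected
    # if LHS and RHS were statistically independent.
    leverage = support - (n_lhs / n) * (n_rhs / n)
    return support, confidence, leverage


# Hypothetical records for one trimester (made-up values).
records = [
    frozenset({"specimen=urine", "organism=eco"}),
    frozenset({"specimen=urine", "organism=eco"}),
    frozenset({"specimen=blood", "organism=sau"}),
    frozenset({"specimen=blood", "organism=eco"}),
]
support, confidence, leverage = rule_measures(
    records, lhs={"specimen=urine"}, rhs={"organism=eco"}
)
```

Evaluating such a function once per trimester yields, for each rule, the two time series that the regression below takes as input.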

Assume a set of $n$ association rules for which we observe the leverage and confidence values, the most interesting measures for association rules, at $m$ time steps. Let $x_{1i} = (x_{1i1}, \ldots, x_{1im})$ be the leverage values of the $i$-th rule at the time points $t = (t_1, \ldots, t_m)$ and $x_{2i} = (x_{2i1}, \ldots, x_{2im})$ be the confidence values of the $i$-th rule at the same time points. Further, we assume that the $n \times m$ design matrix $X_1$ stores all the observed leverage values and the $n \times m$ design matrix $X_2$ stores all the observed confidence values, such that each row corresponds to a rule and each column to a time point. Given these observations, we aim to predict the leverage $X_{1i}(*)$ and the confidence $X_{2i}(*)$ of each rule $i$ at some time $t^*$. Typically $t^*$ corresponds to a future time point, i.e. $t^* > t_j$ for $j = 1, \ldots, m$. We now discuss a simple prediction method, based on linear regression, where the input variable corresponds to time and the response variable is the leverage or the confidence value. The general linear regression equation for a line that fits the data is

$$x = a + b\,t$$

where $t$ is the independent variable, time (represented by the id of the respective trimester), $x$ is the dependent variable (confidence or leverage) and $a$, $b$ are the constant regression parameters that must be computed to optimally fit a line to the available data points. The parameters $a$ and $b$ are determined by the following equations:

$$a = \frac{(\sum x)(\sum t^2) - (\sum t)(\sum t x)}{n\,(\sum t^2) - (\sum t)^2}$$

$$b = \frac{n\,(\sum t x) - (\sum t)(\sum x)}{n\,(\sum t^2) - (\sum t)^2}$$
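The closed-form estimates above translate directly into code. The sketch below is a minimal Python version, assuming that $n$ in these formulas denotes the number of observed time points used in the fit; the names `fit_line` and `predict` and the sample values are illustrative only.

```python
def fit_line(t, x):
    """Least-squares intercept a and slope b of the line x = a + b * t."""
    n = len(t)  # number of observed (time, value) pairs
    s_t = sum(t)
    s_x = sum(x)
    s_tt = sum(ti * ti for ti in t)
    s_tx = sum(ti * xi for ti, xi in zip(t, x))
    denom = n * s_tt - s_t ** 2
    if denom == 0:
        raise ValueError("at least two distinct time points are required")
    a = (s_x * s_tt - s_t * s_tx) / denom
    b = (n * s_tx - s_t * s_x) / denom
    return a, b


def predict(a, b, t_star):
    """Value of the fitted line at a (typically future) time point t_star."""
    return a + b * t_star


# Made-up confidence values of one rule at trimester ids 1..4.
t_obs = [1, 2, 3, 4]
x_obs = [0.50, 0.55, 0.60, 0.65]
a, b = fit_line(t_obs, x_obs)
x_next = predict(a, b, 5)  # forecast for the fifth trimester
```

Because the sample values lie exactly on a line, the fit recovers an intercept of about 0.45 and a slope of about 0.05, up to floating-point rounding.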

2.3.2 Classification

The second prediction method we employ in our framework is the SVM classification algorithm, which aims to predict, season by season, the antibiotic resistance of certain organisms in hospitals. A summary of the attributes used can be seen in Table 1.

We approached the classification problem with three classifiers. After an extensive series of experiments, the Support Vector Machine algorithm presented the best classification results.

Support Vector Machines (SVM) are learning predictors based on the Structural Risk Minimization (SRM) principle from statistical learning theory. The SRM principle seeks to minimize an upper bound of the generalization error rather than minimizing the training error (Empirical Risk Minimization, ERM). This approach results in better generalization than conventional techniques based on the ERM principle [4].

Consider an $n$-dimensional object $x$ with coordinates $x = (x_1, x_2, x_3, \ldots, x_n)$, where each $x_i \in \mathbb{R}$ for $i = 1, 2, \ldots, n$. Each object $x_j$ belongs to a class $y_j \in \{-1, +1\}$. Furthermore, we have a training set $T$ of $m$ objects together with their classes, $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$. The objects are embedded in a dot product space $S$, i.e. $x_1, x_2, \ldots, x_m \in S$. Any hyperplane in the space $S$ can be written as $\{x \in S \mid w \cdot x + b = 0\}$. The dot product $w \cdot x$ is defined by:

$$w \cdot x = \sum_{i=1}^{n} w_i x_i$$

A training set of objects is linearly separable if there exists at least one linear classifier, defined by the pair $(w, b)$, which correctly classifies all training objects. This linear classifier is represented by the hyperplane $H$ ($w \cdot x + b = 0$) and defines a region for class $+1$ objects ($w \cdot x + b > 0$) and another region for class $-1$ objects ($w \cdot x + b < 0$).

After training, the classifier is ready to predict the class membership of new objects, different from those used in training. The class of an object $x_k$ is determined by the equation:

$$\mathrm{class}(x_k) = \begin{cases} +1 & \text{if } w \cdot x_k + b > 0 \\ -1 & \text{if } w \cdot x_k + b < 0 \end{cases}$$

Therefore, the classification of new objects depends only on the sign of the expression $w \cdot x + b$. In our implementation, the objects $x$ are association rules in a specific form, and with them we can predict the future resistance of specific pathogen organisms.
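The decision rule can be sketched in a few lines of Python. Since the rule attributes are categorical, the sketch assumes a one-hot encoding to obtain real-valued vectors; this encoding, the weight vector `w`, the bias `b` and the reading of +1 as "resistant" are all hypothetical and are not taken from the trained model.

```python
def one_hot(record, vocabulary):
    """Encode a dict of categorical attributes as a 0/1 vector."""
    return [1.0 if record.get(attr) == value else 0.0
            for attr, value in vocabulary]


def dot(w, x):
    """Dot product of two equally long vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))


def classify(w, b, x):
    """Return +1 or -1 according to the sign of w . x + b."""
    score = dot(w, x) + b
    if score > 0:
        return +1
    if score < 0:
        return -1
    return 0  # exactly on the hyperplane: not covered by the rule above


# Hypothetical vocabulary of (attribute, value) pairs and made-up parameters.
vocabulary = [("specimen", "urine"), ("specimen", "blood"),
              ("department", "out"), ("department", "icu")]
w = [0.5, -0.5, 0.25, -0.75]
b = -0.1

x_urine_out = one_hot({"specimen": "urine", "department": "out"}, vocabulary)
x_blood_icu = one_hot({"specimen": "blood", "department": "icu"}, vocabulary)
label_urine_out = classify(w, b, x_urine_out)
label_blood_icu = classify(w, b, x_blood_icu)
```

In an actual SVM the pair (w, b) would be obtained by training; here it is fixed by hand only to show how the sign of the score determines the class.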

3 Experimental Methodology

3.1 Experimental protocol

A very important concept in machine learning and data mining is overfitting, which occurs when a model is fit too closely to a limited set of training data points. The resulting model then cannot predict unknown data to a satisfactory degree, and thus its accuracy is low. Many techniques exist to tackle this issue. One of them is cross validation, which we used in order to produce an accurate prediction model without setting data aside solely for testing.

In cross validation we divide the data into k folds. For each fold, we use the whole data set excluding the current fold as the learning set, and the current fold as the test set. The mean error over the folds gives a low-bias estimator. In our implementation we used 10-fold cross validation, since many references report that accuracy differences for additional folds are insignificant [10]. In the linear regression approach, on the other hand, we chose the following method to avoid overfitting. The user defines the number T of time points, ranging inside the interval [3, n − 1), where n is the last (most recent) timestamp in our data. This range was chosen because, after extensive experimentation, a stable prediction model could only be extracted when three or more confidence/leverage observations per rule were available. In addition, the most accurate model extracted consisted of m observations, where m is equal to (timestamps − 1).
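The k-fold procedure can be sketched as follows. This is an illustrative Python version, not the Weka routine used in the experiments: the helper names, the majority-class placeholder model and the toy labels are assumptions, and the folds are taken in order without the shuffling or stratification that a real experiment would normally apply.

```python
from collections import Counter


def k_fold_indices(n_items, k):
    """Yield (train_idx, test_idx) pairs; each item is tested exactly once."""
    indices = list(range(n_items))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(labels, k, train_fn, error_fn):
    """Mean test error over the k folds."""
    errors = []
    for train, test in k_fold_indices(len(labels), k):
        model = train_fn([labels[i] for i in train])
        errors.append(error_fn(model, [labels[i] for i in test]))
    return sum(errors) / len(errors)


def train_majority(train_labels):
    """Placeholder model: always predict the most frequent training label."""
    return Counter(train_labels).most_common(1)[0][0]


def error_rate(model, test_labels):
    """Fraction of test labels that differ from the constant prediction."""
    return sum(1 for y in test_labels if y != model) / len(test_labels)


# Fold layout for a data set of 1768 instances and k = 10.
folds = list(k_fold_indices(1768, 10))

# Made-up labels: 3 resistant (R) and 7 sensitive (S) isolates, 5 folds.
toy_labels = ["R"] * 3 + ["S"] * 7
mean_error = cross_validate(toy_labels, 5, train_majority, error_rate)
```

Any classifier could be substituted for the majority-class placeholder by passing different `train_fn` and `error_fn` callables.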

In addition, we measure the accuracy of our model through a variety of statistical measures: TP Rate, FP Rate, Precision, Recall and F-measure. TP is the number of items correctly assigned to a given class, and FP is the number of items falsely assigned to that class. Precision is a measure of exactness, recall is a measure of completeness and the F-measure is the weighted harmonic mean of precision and recall.
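For a single class treated as "positive", these measures can be computed from the four confusion-matrix counts. The Python sketch below assumes the balanced F-measure (equal weight on precision and recall); the function name `binary_metrics` and the toy labels are illustrative only.

```python
def binary_metrics(y_true, y_pred, positive):
    """TP rate (recall), FP rate, precision and balanced F-measure."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    tp_rate = tp / (tp + fn) if tp + fn else 0.0  # identical to recall
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + tp_rate:
        f_measure = 2 * precision * tp_rate / (precision + tp_rate)
    else:
        f_measure = 0.0
    return {"tp_rate": tp_rate, "fp_rate": fp_rate,
            "precision": precision, "recall": tp_rate,
            "f_measure": f_measure}


# Made-up true and predicted resistance labels (R = resistant, S = sensitive).
y_true = ["R", "R", "R", "S", "S", "S", "S", "S"]
y_pred = ["R", "R", "S", "R", "S", "S", "S", "S"]
metrics = binary_metrics(y_true, y_pred, positive="R")
```

With more than two classes, per-class values of this kind are usually averaged, which is one reason why a reported F-measure need not equal the harmonic mean of the reported average precision and recall.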

3.2 Experiments and Results and Goals

3.2.1 Linear Regression Analysis

Regarding the association rules, we experimented with all the pathogen microorganisms as members of the RHS of the rules, and the results are comparable. Due to lack of space we report only the results on Escherichia coli, which are representative of the whole result set. As we can see in Figure 2, there are some repeated patterns and outbreaks for the specific association rules with regard to confidence and leverage. From this, we can infer that in certain recurring periods there is a high isolation rate of this pathogen (E. coli in our case), so we should be prepared to prevent its spread. We claim that hospitals could benefit from this framework and exploit these observations in order to prevent hospital-acquired infections. Figure 2 presents some of our predictions. The results show that the predicted values for confidence and leverage are very close to the real values, which supports the robustness of the prediction framework in this context. The prediction error is calculated as follows: error = |expected value − predicted value|.

Figure 2. Confidence and leverage values produced via the Apriori algorithm. The values show repeated peaks of leverage and confidence over time, which indicate high Sensitivity (S) to the antibiotic AMC for the E. coli organism (eco).

Regarding the linear regression method, we used ten (10) time points in order to determine the line (x = a + bt, see Section 2.3.1) and we predicted six (6) time points based on the fitted line. The graph of Figure 3 illustrates the prediction error and shows how low it is for each rule. Through linear regression we can predict the future importance of a rule and, as a result, forecast outbreaks of a pathogen organism. For example, if a rule like Urine, Gr0061, out, t1 → E.coli occurred with a confidence near 1, we could infer that Escherichia coli is isolated at a rate higher than expected in urine specimens at hospital Gr0061, in the outpatient department, at time point t1. Furthermore, thanks to the visualization of the data, we can observe some repeated trends and patterns over the months.

Figure 3. Regression error rate for predicted confidence and leverage values
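The fit-then-forecast protocol described in this subsection (a line fitted on ten time points, six later points predicted, absolute error per point) can be sketched in Python as follows. The confidence series is synthetic, generated only for illustration, and does not reproduce the measurements behind the figures.

```python
def fit_line(t, x):
    """Least-squares intercept a and slope b of the line x = a + b * t."""
    n = len(t)
    s_t, s_x = sum(t), sum(x)
    s_tt = sum(ti * ti for ti in t)
    s_tx = sum(ti * xi for ti, xi in zip(t, x))
    denom = n * s_tt - s_t ** 2
    a = (s_x * s_tt - s_t * s_tx) / denom
    b = (n * s_tx - s_t * s_x) / denom
    return a, b


# Synthetic confidence values of one rule over 16 trimesters: a slow upward
# trend plus a small recurring bump every fourth trimester (made-up numbers).
trimesters = list(range(1, 17))
confidence = [0.60 + 0.01 * t + (0.02 if t % 4 == 0 else 0.0)
              for t in trimesters]

n_fit = 10  # time points used to determine the line
a, b = fit_line(trimesters[:n_fit], confidence[:n_fit])

# Forecast the remaining six time points and measure the absolute error.
predicted = [a + b * t for t in trimesters[n_fit:]]
errors = [abs(actual - pred)
          for actual, pred in zip(confidence[n_fit:], predicted)]
```

On this synthetic series the largest errors fall on the trimesters carrying the recurring bump, which a straight line cannot capture; this is the kind of repeated pattern that the visualization makes visible.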

3.2.2 Classification

Valid and reliable automatic disease classifiers are considered vital components of an antibiotic resistance monitoring system. In our work we measured the actual performance of three classifiers (Naive Bayes, SVM and C4.5, as implemented in the open source library Weka 3.7) designed for the early detection of special cases of antibiotic resistance that have regularly occurred in hospitals. We formulated a classification problem aiming to predict the antibiotic resistance of a pathogen based on the following attributes: hospital, specimen group, hospital department, pathogen organism and the respective season. The accuracy results of our implementation are shown in Table 4. As we can see, all three algorithms have similar results according to the measures mentioned above. However, Support Vector Machines achieved the best results in comparison to the other algorithms according to the F-measure, which is essential for distinguishing accurate from inaccurate structures. Furthermore, the TP Rate values are considerably high, which means that all three algorithms can correctly predict pathogen organisms which may be observed in a hospital. With this type of prediction it is feasible to forecast possible diseases that could be acquired during the upcoming trimester and the resistance to antibiotics for these diseases.

These models are trained on historical data stored in our data warehouse, and a prediction of the antibiotic resistance of the pathogen organism is made for the next trimester. For example, for a given organism, specimen group, antibiotic, hospital and season we can predict the antibiotic resistance of the specific organism with 98 percent accuracy. Table 4 shows the prediction accuracy of each aforementioned algorithm.

Measure / Model   Naive Bayes   C4.5    SVM
TP Rate           0.946         0.935   0.978
FP Rate           0.469         0.78    0.157
Precision         0.942         0.915   0.978
Recall            0.946         0.935   0.978
F-measure         0.943         0.919   0.978

Table 4. Prediction Quality for Escherichia coli

The results are encouraging, as all the measures reveal a quite precise prediction rate for all algorithms. In most cases the Support Vector Machines algorithm gives the best prediction results.

4 Conclusion and Discussion

Surveillance of nosocomial infections and of antibiotic resistance are two of the most important functions of a hospital infection control program. In public health, and more specifically in the surveillance of antibiotic resistance, it is important to discover new associations and patterns before they become widespread in a hospital or a region. Furthermore, it is very important to predict future behavior from epidemic data so that hospitals can be prepared for outbreaks in the isolation of pathogen organisms. In this paper we have presented a fully functional and implemented framework for prediction and visualization in this context. The system capitalizes on the real-world data of the Greek national network for continuous monitoring of bacterial antibiotic resistance in Greek hospitals (Greek System for the Surveillance of Antimicrobial Resistance), in place since 1995 [6]. We achieved robust and accurate predictions that are quite promising in terms of better understanding the problem and patterns of nosocomial infections. Moreover, the system offers a friendly interface which can be used by people who are not data mining experts. The results were achieved using patient data from the last seven years. Finally, future work will be devoted to using larger data set collections spanning longer time periods. Likewise, infection control systems require, or will require, data mining tools such as clustering for further research on future trends. The system implementation and full functionality are available online at http://195.251.235.83/en/index.html .

References

[1] R. P. Trueblood, J. N. Lovett, Jr., Data Mining and Statistical Analysis Using SQL, Apress, Berkeley, California, 2001.

[2] Eugenia G. Giannopoulou, V. P. Kemerlis, Michalis Polemis, J. Papaparaskevas, Alkiviadis C. Vatopoulos, Michalis Vazirgiannis, A Large Scale Data Mining Approach to Antibiotic Resistance Surveillance, Twentieth IEEE International Symposium on Computer-Based Medical Systems (CBMS), pp. 439-444, 2007.

[3] Mykola Pechenizkiy, Alexey Tsymbal, Seppo Puuronen, Michael Shifrin, Irina Alexandrova, Knowledge Discovery from Microbiology Data: Many-Sided Analysis of Antibiotic Resistance in Nosocomial Infections, in: WM05, 3rd International Conference on Professional Knowledge Management: Experience and Visions, Kaiserslautern, Germany, pp. 360-372, April 2005.

[4] G. Cohen, M. Hilario, H. Sax, S. Hugonnet, C. Pellegrini, A. Geissbuhler, An Application of One-Class Support Vector Machines to Nosocomial Infection Detection, in: Proc. of Medical Informatics, 2004.

[5] Brossette SE, Sprague AP, Jones WT, et al. A data mining system for infection control surveillance. Methods Inf Med 2000;39:303-10.

[6] Vatopoulos AC, Kalapothaki V, Legakis NJ. An electronic network for the surveillance of antimicrobial resistance in bacterial nosocomial isolates in Greece. The Greek Network for the Surveillance of Antimicrobial Resistance. Bull World Health Organ. 1999;77:595-601.

[7] O'Brien TF, Stelling JM. WHONET: an information system for monitoring antimicrobial resistance. Emerg Infect Dis. 1995;1:66.

[8] Stelling JM. WHONET: removing obstacles to the full use of information about antimicrobial resistance. Diagn Microbiol Infect Dis. 1996;25:162-8.

[9] Samore M, Lichtenberg D, Saubermann L, Kawachi C, Carmeli Y. A clinical data repository enhances hospital infection control. Proc AMIA Annu Fall Symp. 1997:56-60.

[10] Sterlin, P. Overfitting prevention with cross-validation. Master's thesis. University Pierre and Marie Curie (Paris VI): Paris, France, 200
