

Data Mining on DNA Sequences of Hepatitis B Virus

Kwong-Sak Leung, Kin Hong Lee, Jin-Feng Wang,

Eddie Y.T. Ng, Henry L.Y. Chan, Stephen K.W. Tsui, Tony S.K. Mok,

Pete Chi-Hang Tse, and Joseph Jao-Yiu Sung

Abstract—Extraction of meaningful information from large experimental data sets is a key element in bioinformatics research. One of the challenges is to identify genomic markers in Hepatitis B Virus (HBV) that are associated with HCC (liver cancer) development by comparing the complete genomic sequences of HBV among patients with HCC and those without HCC. In this study, a data mining framework, which includes molecular evolution analysis, clustering, feature selection, classifier learning, and classification, is introduced. Our research group has collected HBV DNA sequences, of either genotype B or C, from over 200 patients specifically for this project. In the molecular evolution analysis and clustering, three subgroups have been identified in genotype C, and a clustering method has been developed to separate the subgroups. In the feature selection process, potential markers are selected based on Information Gain for further classifier learning. Then, meaningful rules are learned by our algorithm, called Rule Learning, which is based on an Evolutionary Algorithm. Also, a new classification method based on the Nonlinear Integral has been developed. The good performance of this method comes from the use of the fuzzy measure and the relevant nonlinear integral. The nonadditivity of the fuzzy measure reflects the importance of the feature attributes as well as their interactions. These two classifiers give explicit information on the importance of the individual mutated sites and their interactions toward the classification (potential causes of liver cancer in our case). A thorough comparison of these two methods with existing methods is detailed. Important mutation markers (sites) have been found for genotype B and for each of the genotype C subgroups C1, C2, and C3. These two classification methods have been applied to classify never-seen-before examples for validation. The results show that the classification methods have more than 70 percent accuracy and 80 percent sensitivity for most data sets, which is considered high for an initial scanning method for liver cancer diagnosis.

Index Terms—Data mining, DNA sequences of HBV, mutation sites, nonlinear integrals, rule learning, the signed fuzzy measures.


1 INTRODUCTION

In Asia, infection with Hepatitis B virus (HBV) is a major health problem. At least 10 percent of the Chinese population (120 million people) are HBV carriers, and up to 25 percent of HBV carriers will die as a result of HBV-related complications including liver cirrhosis and hepatocellular carcinoma (HCC), i.e., liver cancer. Chronic infection by HBV increases the risk of HCC by more than 100-fold [1]. The relationship between HBV genotype and viral mutation with hepatocarcinogenesis is controversial. A case control study from Taiwan suggested that genotype C HBV is more closely associated with cirrhosis and HCC in those who are older than 50 years, whereas genotype B is more common in patients with HCC aged less than 50 years [2]. Our previous cohort study of 426 cases of chronic hepatitis B patients also revealed a higher risk of HCC and liver cirrhosis in genotype C infection [3]. On the other hand, reports from Japan and China did not confirm the higher malignant potential of genotype C HBV [4], [5]. The aim of this study is to find the genomic markers of the HBV and clinical information which are useful in predicting occurrence of liver cancer and response to therapy.

In this study, we look into the clinical data prepared by the clinicians, and the HBV DNA genomes prepared by the biochemists of our research group [6], [7]. Patients taking part in this study were selected by the clinicians carefully, according to their age, sex, and past clinical status. Chronic hepatitis B patients recruited since 1997 were prospectively followed up for the development of HCC to avoid selection bias. HCC was diagnosed by a combination of alpha fetoprotein, imaging, and histology. Liver cirrhosis was defined as ultrasonic features of cirrhosis together with hypersplenism, ascites, varices, and/or encephalopathy [3]. Clinical attributes for analysis were chosen by clinicians based on their expert knowledge. Primer Express software version 2.0 (PE Applied Biosystems, Foster City, CA) was used to find suitable primers and probes. TaqMan real-time

428 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. 2, MARCH/APRIL 2011

. K.S. Leung, K.H. Lee, and J.F. Wang are with the Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, NT, Hong Kong. E-mail: {ksleung, khlee, jfwang}@cse.cuhk.edu.hk.

. E.Y.T. Ng’s contact information is not available.

. H.L.Y. Chan is with the Department of Medicine and Therapeutics, The Chinese University of Hong Kong, 9/F Prince of Wales Hospital, Shatin, Hong Kong. E-mail: [email protected].

. S.K.W. Tsui is with the Department of Biochemistry (Medicine), Institute of Science and Technology, The Chinese University of Hong Kong, Shatin, NT, Hong Kong. E-mail: [email protected].

. T.S.K. Mok is with the Department of Clinical Oncology, The Chinese University of Hong Kong, Shatin, NT, Hong Kong. E-mail: [email protected].

. P.C.H. Tse is with the Department of Medicine and Therapeutics, Room 415, CC, PWH, Shatin, NT, Hong Kong. E-mail: [email protected].

. J.J.Y. Sung is with the Department of Medicine and Therapeutics, Room 114009, 9/F, Clinical Sciences Building, Prince of Wales Hospital, Shatin, NT, Hong Kong. E-mail: [email protected].

Manuscript received 23 June 2008; revised 4 Dec. 2008; accepted 7 Jan. 2009; published online 14 Jan. 2009. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBB-2008-06-0116. Digital Object Identifier no. 10.1109/TCBB.2009.6.

1545-5963/11/$26.00 © 2011 IEEE Published by the IEEE CS, CI, and EMB Societies & the ACM


PCR technology was used to differentiate the nucleotide variant [6]. Because the focus of this paper is on the study of data mining techniques, the selection process and criteria of patients and the research experiments run by our Biochemistry Department will not be discussed in detail.

In [8], HBV DNA sequences were taken from 13 patients. Keum et al. [9] amplified a conserved core region and a surface antigen region of HBV DNA by PCR from sera of 27 Korean chronic hepatitis B patients for detecting hepatitis B virus mutants. Our project is one of the biggest HBV DNA full-sequence collection and analysis studies of its kind. We have collected DNA sequences from 98 Control (normal) and 100 HCC (cancer) patients specifically for this project. The DNA sequences of HBV are not exactly the same for each group, and they possess some individual nucleotide mutations that may or may not be related to HCC. From previous studies, HBV can be divided into seven genotypes, each of which differs from the others by more than 8 percent of its nucleotides. In Hong Kong, genotypes B and C are the predominant types, and all the examples we have are of these two genotypes. To reduce the noise of genotypic difference among the sequences collected, we propose to analyze these DNA examples in each genotype separately.

Classification is one of the most studied data mining tasks. The objective is to predict the value (the class) of a user-specified goal attribute based on the values of other attributes, called the predictive (feature) attributes. The goal attribute might be the prediction of whether or not a patient has cancer, while the predictive feature attributes might be the mutation sites of the patient’s virus DNA.

The focus of this study is to identify genetic marker(s) for liver cancer (HCC) from HBV DNA sequences. There are similar medical research reports, but all of them focus on specific gene positions, proteins, or part of a virus genome. Our project, in contrast, is the first study on the complete viral genome. One past study is an HIV genomic study [10]. The researchers aligned each DNA sequence with a reference sequence, selected the genes using their expert knowledge, and used Decision Tree and Support Vector Machine (SVM) for analysis. In [11], the researchers focused on the identification of HBV DNA sequences that are predictive of response to one therapy. Some sites in the sequences were observed to have caused the effect. Chan et al. studied the risk factors in HBV sequences with respect to medicine [3], [6]. Here, we apply soft computing tools to predict positive patients and analyze the effective mutation sites in the HBV DNA sequences.

The aim of this study is to develop a data mining framework which contains an appropriate classifier for liver cancer based on HBV DNA and clinical data. We develop two new algorithms based on rule learning (RL) and nonlinear integral (NI). We then carry out a thorough comparative study on these two new models with existing classifiers. The classification model should have high sensitivity and acceptable accuracy and specificity for HCC diagnosis and prediction. The model learned should also give clear indication of the degrees of influence of the attributes toward the classification goal and whether there are any interactions among the predictive attributes. In this paper, we identified the important mutation sites (markers) in the HBV sequences that could have caused or been related to liver cancer. We use information entropy for finding genetic markers of HCC in the HBV genome data and propose a new classification model based on nonlinear integrals.

This paper is organized as follows: Section 2 describes the data mining framework which includes the new rule learning and the nonlinear integral classification models in detail. All the methods and data sets used in this project are detailed in Section 3. The experimental results and the comparative studies are presented in Section 4. Section 5 concludes with the summary and the discussion of some directions for future work.

2 DATA MINING FRAMEWORK

The data mining framework developed is shown in Fig. 1. There are nine modules. After the molecular evolutionary analysis, the data are passed to the Clustering Module to check whether clusters exist based on the phylogenetic tree analysis. If clusters are found, each cluster will be analyzed separately for potential genetic marker sites because it will minimize the noise produced by the genotype differences and give much better classification accuracy. For each cluster (or genotype), the data are divided into training and test sets. The training examples are then passed to the Feature Selection Module to find the useful features (genetic marker sites) for classification. The potentially useful features (attributes) are extracted and passed to the Classifier Learning Module wherein a classifier is learned. The features selected are also sent to the Preprocessing Module to extract the values of these features in the testing data set for testing in the Classification Module. Finally, the prediction results of the classifier are verified and evaluated based on the testing examples. If the evaluation results are unsatisfactory, i.e., stopping criteria are not satisfied, the learning process is repeated starting from the feature selection; otherwise, the classifier will be validated by never-seen-before examples. The following sections will explain how the features are selected and also the basic principles of the classifier.

Fig. 1. Data mining framework.

2.1 Molecular Evolutionary Analysis

Serum samples from 49 patients infected with HBV genotype C, as determined by previous genotype-specific restriction fragment length polymorphism analysis, were studied [3]. All serum samples were kept in a −80°C freezer for storage. All patients were ethnic Chinese and were followed up in the Hepatitis Clinic of the Prince of Wales Hospital (Hong Kong). All patients were positive for hepatitis B surface antigen for at least six months and had no evidence of hepatocellular carcinoma. Sixty-nine full-genome nucleotide sequences of HBV genotype C and 12 full-genome nucleotide sequences of nongenotype C HBV were also retrieved from the GenBank database for comparison. All reference sequences from GenBank were derived from patients with chronic hepatitis B; HBV nucleotide sequences from patients with acute hepatitis B, hepatocellular carcinoma, or patients treated with antiviral agents were excluded. The geographical origins of patients harboring different HBV genotype C genomes in GenBank were retrieved from the respective original publications and the descriptions in the GenBank database.

The full-genome nucleotide sequences of the isolates of HBV genotype C from our center were compared with those of the isolates of HBV genotype C and nongenotype C HBV retrieved from the GenBank database. Nucleotide sequences are multiple-aligned using ClustalW version 1.83 [12] and corrected manually by visual inspection. Genetic distances are estimated by Kimura’s two-parameter method and the phylogenetic trees are constructed by the neighbor-joining method [13], [14]. The reliability of the pairwise comparison and phylogenetic tree analysis is assessed and assured by bootstrap resampling with 1,000 replicates. Phylogenetic and molecular evolutionary analyses are done using MEGA version 3.0 [15].
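As an illustration of the distance step, the following is a minimal sketch of Kimura's two-parameter distance for one pair of aligned sequences. It is not the MEGA implementation used in this study; the function name `k2p_distance` and the choice to skip gaps and ambiguous bases are assumptions made for the example.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def k2p_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned sequences.

    Hypothetical helper: sites containing a gap or an ambiguous base
    are skipped (an assumption, not taken from the paper).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; every other change is a transversion
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared    # proportion of transitional differences
    q = transversions / compared  # proportion of transversional differences
    # d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q))
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

For very divergent pairs the logarithm argument can become nonpositive, in which case the function raises `ValueError`; a matrix of such pairwise distances is what a neighbor-joining routine would take as input.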

2.2 Clustering

Since different HBV subgroups are likely to be the result of divergence from genomic mutations over time, knowledge of the geographical distribution and genomic relatedness of the HBV genotype C subgroups will be useful in gaining an understanding of the spread of HBV in Asia. Hepatitis B virus genotype B (HBV/B) has been classified into five subgenotypes. In [16], a phylogenetic analysis of the complete genome sequences from the examples obtained from the Arctic and those from Japan and Asia revealed six distinct clusters within HBV/B. Within each HBV genotype C subgroup, several clusters with genomic resemblance to one another can be identified. The most well-defined example is the cluster in Okinawa, where the prevalence of HBV genotype C is much lower than that in the rest of Japan [17].

There are two genotypes, B and C, in the 200 plus HBV DNA sequences we collected specifically for this project. While genotype B HBV appears to be a homogeneous group [18], the phylogenetic tree results show that there exist three main clusters in genotype C among the HBV strains collected (Fig. 2) [7]. We label them as C1, C2, and C3, respectively. Subgrouping of HBV genotype C was based on an intersubgroup difference of nucleotide sequence of > 4 percent [19]. This is in concordance with our previous phylogenetic analyses with published full-length sequences in GenBank. The main reason for us to find markers separately within the clusters (subgenotypes) obtained from clustering analysis is that these subgenotypes exhibit mutations (nucleotide site differences) caused by geographical diversity which are not markers for carcinogenic diagnosis. If we were to analyze all these subgenotype data as one genotype group, their intergenotypic differences would become distracting noise in the data mining process for markers. These three clusters can be identified by the combinations of four nucleotides. These three clusters will be analyzed separately in the classifier learning part.
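The > 4 percent criterion can be sketched as a simple check on aligned sequences. The snippet below is only an illustration: the text does not say how the intersubgroup difference is aggregated, so averaging over all cross-group pairs is an assumption, and all function names are hypothetical.

```python
from itertools import product

def percent_difference(seq_a, seq_b):
    """Percentage of aligned, unambiguous sites at which two sequences differ."""
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return 100.0 * sum(a != b for a, b in pairs) / len(pairs)

def mean_intergroup_difference(group_a, group_b):
    """Average percent difference over all cross-group sequence pairs.

    Averaging is an assumption; the paper only states the 4 percent threshold.
    """
    diffs = [percent_difference(a, b) for a, b in product(group_a, group_b)]
    return sum(diffs) / len(diffs)

def are_distinct_subgroups(group_a, group_b, threshold=4.0):
    """True when the two groups differ by more than `threshold` percent."""
    return mean_intergroup_difference(group_a, group_b) > threshold
```

With full-genome alignments, `group_a` and `group_b` would each hold the sequences assigned to one candidate cluster of the phylogenetic tree.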

2.3 Feature Selection Algorithm

The main purpose of feature selection [20], [21], [22], [23] is to reduce the number of features used in classification while maintaining acceptable classification accuracy. For example, the Sequential Forward Floating Selection (SFFS) algorithm proposed by Pudil et al. [24] was one of the commonly used algorithms [25]. The main advantage of this method is that it produces a hierarchy of feature subsets with the best selection for each dimension. However, we aim at global performance of the whole framework, so we adopt a simpler algorithm based on information gain to select initial features.

In our approach, the information gain criterion [26] is used to find the useful features to distinguish between the Control (normal) and the HCC (cancer) groups of HBV carriers.

Information gain is a common criterion for feature selection. The information gain of a feature (attribute) is the uncertainty (entropy) that can be reduced if the attribute is used for classification. Hence, the higher the information gain, the better. Equation (1) gives the entropy $E$ of an attribute $X$ with $m$ values, $X_1, X_2, \ldots, X_m$, where $P(X_i)$ is the probability of the value $X_i$:

\[ E(X) = \sum_{i=1}^{m} -P(X_i) \log_2 P(X_i). \tag{1} \]


Fig. 2. Phylogenetic tree of genotype C.


Specific to a typical DNA classification problem, we assume that the data have $K$ classes, $C = \{c_1, c_2, \ldots, c_K\}$. Each aligned site position has $m$ possible nucleotides $V_1, V_2, \ldots, V_m$. We define $|c_k|$, $k = 1, 2, \ldots, K$, as the number of sequences in class $c_k$; $|c_{ki}|$ is the number of sequences in class $c_k$ whose character at the aligned site is $V_i$, which can be A, T, G, or C in our case. The Remainder of $X$, $R(X)$, is defined as follows:

\[ R(X) = \sum_{i=1}^{m} \frac{\sum_{k=1}^{K} |c_{ki}|}{\sum_{k=1}^{K} |c_k|} \, E\bigl(P(c_{1i}), \ldots, P(c_{Ki})\bigr). \tag{2} \]

The Information Gain $IG_j$ of the aligned site $j$ is the difference between the original information content $E(C)$ of the data set and the amount of information needed to classify all the unclassified data left in the data set after applying site $j$ for classification:

\[ IG_j = E(C) - R(j). \tag{3} \]

The features are ranked by their information gains, and then the top-ranked features are chosen as the potential attributes used in the classifier. A site with higher information gain will contribute more to the classification and be able to distinguish more examples (cases).
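The entropy, remainder, and information gain defined above can be sketched in a few lines. This is an illustrative implementation on toy data, not the code used in the study; the function names and the four toy sequences are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels, as in Eq. (1)."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(column, labels):
    """IG of one aligned site: E(C) minus the remainder R of Eq. (2)."""
    total = len(labels)
    remainder = 0.0
    for nucleotide in set(column):
        # class labels of the sequences that carry this nucleotide at the site
        subset = [c for v, c in zip(column, labels) if v == nucleotide]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def rank_sites(sequences, labels, top=3):
    """Return the `top` (site index, IG) pairs with the highest gain."""
    gains = [(j, information_gain([s[j] for s in sequences], labels))
             for j in range(len(sequences[0]))]
    return sorted(gains, key=lambda t: t[1], reverse=True)[:top]

# Toy aligned sequences (hypothetical, 0-based site indices)
sequences = ["ACGT", "ACGA", "TCGT", "TCCA"]
labels = ["HCC", "HCC", "CONTROL", "CONTROL"]
print(rank_sites(sequences, labels))
```

In the toy data, site 0 separates the two classes perfectly and therefore ranks first, while a site that is identical in every sequence carries no information.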

2.4 Current Classification Algorithms

There are several common classification models such as Naïve Bayesian Network [27], [28], [29], Decision Tree, Neural Networks, and Rule Learning using Evolutionary Algorithm [30]. The learning processes of Naïve Bayesian Networks and Decision Trees are faster. However, they cannot cope well with feature interactions. Neural Networks are treated as black box learning, and it is difficult for a human to understand or interpret the classification explicitly.

By contrast, Rule Learning using Evolutionary Algorithm performs a global search and can cope with feature interactions better than the previous classification models [31], [32]. Also, the classification rules generated are simple and easily interpretable by human experts, whose own reasoning often follows a very similar rule-based approach. Therefore, the Rule Learning using Evolutionary Algorithm approach is clearly a better choice in terms of interpretability of the knowledge acquired through the classifier learned.

Rule learning tries to learn rules from a set of training data (examples). It can be modeled as a search problem of finding the best rules that classify the training examples with minimum error. However, the search space can be very large, so a robust search algorithm is required. Here, Generic Genetic Programming (GGP) [33], [34], which is a type of Evolutionary Algorithm (EA), is adopted as our search and optimization algorithm. First, a population is initialized by generating individuals (each a set of rules) randomly. A fitness function is used to evaluate how good an individual is, that is, how many cases it can classify correctly. Then, some individuals are selected to evolve (generate) new individuals with the genetic operators. Individuals become better and better through the evolution process until the termination criterion is met.

The input is the training data set, and the output is a rule set, which can classify the training data with higher accuracy. We assume that there are $n$ features (attributes), $X = \{x_1, x_2, \ldots, x_n\}$, and $K$ classes, $C = \{c_1, c_2, \ldots, c_K\}$. Each attribute $x_j$ takes one of its $m_j$ possible values. Each rule includes two components, the antecedent (IF part) and the consequence (THEN part), as follows:

IF $(x_1 = v_1) \wedge (x_2 = v_2) \wedge \cdots \wedge (x_l = v_l)$ THEN Class is $C = c_k$,

where the antecedent includes $l$ ($l \in [1, n]$) unique attributes $x_1, x_2, \ldots, x_l \in \{x_1, x_2, \ldots, x_n\}$; $v_1, v_2, \ldots, v_l \in \{A, G, T, C\}$; and $c_k$, $k \in \{1, 2, \ldots, K\}$, is a certain class to which the object is to be classified. In our case, we have only two classes, namely HCC and CONTROL. There are $l$ unique attributes present in and $n - l$ attributes absent from each rule. Each attribute present in the antecedent can only take one of its possible values, $\{A, G, T, C\}$. All the rules in the output rule set are connected by ELSE IF, meaning that the order of application of the rules must be followed.

We use a simple example to illustrate the rules deduced by the Rule Learning. For HBV data set B, which will be introduced in the following section, we have learned the rules for diagnosing liver cancer (HCC) and nonliver cancer (CONTROL) cases. The rules are given as follows:

IF A1762 and G1764 and C53, then HCC,
ELSE IF T1762 and A1764 and CG2712, then HCC,
ELSE IF T1762 and A1764 and T2712 and C2525, then HCC,
ELSE CONTROL.
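The ordered IF / ELSE IF / ELSE semantics, together with the fitness measure described earlier (the number of cases classified correctly), can be sketched as follows. This is an illustration only, not the GGP system itself. The rule encoding and function names are hypothetical, and the second rule is left out of the sketch because its condition CG2712 does not have the single-nucleotide form assumed by `parse_condition`.

```python
def parse_condition(token):
    """Split a token such as 'A1762' into (site number, nucleotide)."""
    return int(token[1:]), token[0]

# Ordered rule list: (conditions, predicted class).
# Only the first and third rules of the example are encoded (see lead-in).
RULES = [
    (["A1762", "G1764", "C53"], "HCC"),
    (["T1762", "A1764", "T2712", "C2525"], "HCC"),
]
DEFAULT_CLASS = "CONTROL"

def classify(sites, rules=RULES, default=DEFAULT_CLASS):
    """Apply the rules in order; the first rule whose conditions all hold wins.

    `sites` maps a site number to the nucleotide observed at that site.
    """
    for conditions, label in rules:
        if all(sites.get(pos) == nt
               for pos, nt in map(parse_condition, conditions)):
            return label
    return default  # the final ELSE branch

def fitness(examples, rules=RULES):
    """Number of (sites, true_class) examples the rule list classifies correctly."""
    return sum(classify(sites, rules) == truth for sites, truth in examples)
```

Because the rules are tried in order, moving a rule earlier or later in the list can change the prediction for sequences that satisfy more than one antecedent.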

Although Rule Learning based on EA can interpret the interaction of features, the degree of the interaction cannot be analyzed exactly by a measure. Hence, we introduce the Fuzzy Measure to describe the interaction with respect to the classification. A new classification model is proposed based on Nonlinear Integrals with respect to the signed Fuzzy Measure in the following section.

2.5 Classification Based on Nonlinear Integrals

In classification, we are given a data set consisting of $N$ example records, called the training set, where each record contains the value of a classifying attribute $Y$ and the values of the feature attributes $x_1, x_2, \ldots, x_m$. The positive integer $N$ is the data size. The classifying attribute indicates the class to which each example belongs, and it is a categorical attribute with values coming from an unordered finite domain. The set of all possible values of the classifying attribute is denoted by $C = \{c_1, c_2, \ldots, c_K\}$, where each $c_k$, $k = 1, 2, \ldots, K$, refers to a specified class. The feature attributes are numerical, and their values are described by an $m$-dimensional vector, $(f(x_1), f(x_2), \ldots, f(x_m))$. The range of the vector, a subset of $m$-dimensional Euclidean space, is called the feature space. Thus, the $j$th example record consists of the $j$th observation for all feature attributes and the classifying attribute, and is denoted by $(f_j(x_1), f_j(x_2), \ldots, f_j(x_m), Y_j)$, $j = 1, 2, \ldots, N$ [35].

In this section, a method of classification based on nonlinear integrals will be presented. It can be viewed as projecting the points in the feature space onto a real axis through a nonlinear integral, and then using a one-dimensional classifier to classify these points optimally according to a certain criterion. Our feature attributes, which hold the discrete values A, C, G, or T, are numericalized to be virtual variables. All of these are realized under the guidance of an adaptive genetic algorithm [36]. The good performance of this method comes from the use of the fuzzy measure and the relevant nonlinear integral, since the nonadditivity of the fuzzy measure reflects the importance of the feature attributes, as well as their inherent interactions, toward the discrimination of the points. In fact, each feature attribute has its own importance index reflecting its amount of contribution toward the decision. Furthermore, the global contribution of several feature attributes to classification is not just the simple sum of the contribution of each feature to the decision, but may vary nonlinearly. A combination of the feature attributes may have a mutually restraining or a complementary synergy effect on their contributions toward the classification decision. Hence, the fuzzy measure defined on the power set of all feature attributes is a proper representation of the respective importance of the feature attributes and the interactions among them, and a relevant nonlinear integral is a good fusion tool to aggregate the information coming from the individual feature attributes and their combinations for the classification. The following are the details of these basic concepts and the mathematical model for the classification problem.

2.5.1 Fuzzy Measures and Nonlinear Integrals

Let X ¼ x1; x2; . . . ; xm be a nonempty finite set of featureattributes and P ðXÞ be the power set of X.

Definition 3.1. A fuzzy measure $\mu$ is a mapping from $\mathcal{P}(X)$ to $[0, \infty)$ satisfying the following conditions:

1. $\mu(\emptyset) = 0$, and
2. $A \subseteq B \Rightarrow \mu(A) \le \mu(B), \ \forall A, B \in \mathcal{P}(X)$.

The set function $\mu$ is nonadditive in general. If $\mu(X) = 1$, then $\mu$ is said to be regular.

To further understand the practical meaning of the fuzzy measure, let us consider the elements in a universal set $X$ as a set of predictive attributes used to predict a certain objective. Then, for each individual predictive attribute as well as each possible combination of the predictive attributes, a distinct value of a fuzzy measure is assigned to describe its influence on the objective. Due to the nonadditivity of the fuzzy measure, the influences of the predictive attributes on the objective are dependent in such a manner that their global contribution to the objective is not just the simple sum of their individual contributions.

Example 3.1. Assume that we have observed three symptoms of a patient and want to determine which disease he or she is suffering from. The symptoms are regarded as the information sources, which form the universal set denoted by $X = \{x_1, x_2, x_3\}$. Their individual as well as joint influences on the prediction of disease are specified by a fuzzy measure $\mu$ defined in Table 1.

Here, $\mu(\{x_2, x_3\}) > \mu(\{x_2\}) + \mu(\{x_3\})$ indicates that the joint contribution of $x_2$ and $x_3$ to the diagnosis is greater than the sum of their individual contributions. This shows that the interaction between $x_2$ and $x_3$ enhances the influence of each other. On the other hand, $\mu(\{x_1, x_2\}) < \mu(\{x_1\}) + \mu(\{x_2\})$ shows that $x_1$ and $x_2$ are restraining each other. Note that the essential properties of a fuzzy measure are monotonicity and vanishing at the empty set. Together, these imply that a fuzzy measure takes only nonnegative values.

However, the monotonicity and nonnegativity of the fuzzy measure are too restrictive for real applications. In this paper, we assume that $\mu$ is a signed fuzzy measure on $\mathcal{P}(X)$, i.e., $\mu : \mathcal{P}(X) \to (-\infty, +\infty)$ and $\mu(\emptyset) = 0$. For convenience, $\mu(\{x_1\}), \mu(\{x_2\}), \ldots, \mu(\{x_m\}), \mu(\{x_1, x_2\}), \ldots, \mu(\{x_1, x_2, \ldots, x_m\})$ are sometimes abbreviated as $\mu_1, \mu_2, \ldots, \mu_m, \mu_{12}, \ldots, \mu_{12 \cdots m}$, respectively.
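The two conditions of Definition 3.1, and the kind of nonadditive behavior discussed in Example 3.1, can be checked mechanically. The following Python sketch is illustrative only: the helper names (`powerset`, `is_fuzzy_measure`) and all numeric values are assumptions chosen to mimic the pattern described in the text, and they are not the values of Table 1.

```python
from itertools import combinations


def powerset(items):
    """Yield every subset of items as a frozenset, including the empty set."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def is_fuzzy_measure(mu, universe):
    """Check Definition 3.1: zero at the empty set, nonnegative, monotone."""
    subsets = list(powerset(universe))
    if mu[frozenset()] != 0:
        return False
    if any(mu[s] < 0 for s in subsets):
        return False
    # Monotonicity: A subset of B implies mu(A) <= mu(B).
    return all(mu[a] <= mu[b] for a in subsets for b in subsets if a <= b)


# Hypothetical values for illustration only (NOT the paper's Table 1).
X = ("x1", "x2", "x3")
mu = {
    frozenset(): 0.0,
    frozenset({"x1"}): 0.3,
    frozenset({"x2"}): 0.2,
    frozenset({"x3"}): 0.1,
    frozenset({"x1", "x2"}): 0.4,  # less than 0.3 + 0.2: mutual restraint
    frozenset({"x1", "x3"}): 0.4,
    frozenset({"x2", "x3"}): 0.6,  # more than 0.2 + 0.1: synergy
    frozenset({"x1", "x2", "x3"}): 1.0,  # mu(X) = 1, so mu is regular
}
```

With these hypothetical values the check succeeds. A signed fuzzy measure with a negative entry would fail it, which is consistent with the signed variant dropping the monotonicity and nonnegativity requirements.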

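Definition 3.2. Let $f$ be a nonnegative function on $X$. The nonlinear integral of $f$ with respect to $\mu$, $\int f \, d\mu$, is defined by

$$\int f \, d\mu = \int_0^{\infty} \mu(F_\alpha) \, d\alpha,$$

where $F_\alpha = \{x \mid f(x) \ge \alpha\}$, for any $\alpha \in [0, \infty)$, is the $\alpha$-level set of $f$.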

To calculate the value of the nonlinear integral of a given function $f$, the values of $f$, $\{f(x_1), f(x_2), \ldots, f(x_m)\}$, should first be arranged in nondecreasing order, that is,

$$f(x'_1) \le f(x'_2) \le \cdots \le f(x'_m),$$

where $(x'_1, x'_2, \ldots, x'_m)$ is a certain permutation of $(x_1, x_2, \ldots, x_m)$. Then, the value of the nonlinear integral can be obtained by computing

$$(c)\int f \, d\mu = \sum_{i=1}^{m} \left[ f(x'_i) - f(x'_{i-1}) \right] \mu(\{x'_i, x'_{i+1}, \ldots, x'_m\}),$$

where $f(x'_0) = 0$.
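The summation formula above translates directly into a few lines of code. The sketch below is a minimal Python illustration, assuming that $f$ is stored as a dictionary from attribute names to values and $\mu$ as a dictionary keyed by `frozenset`; the function name `nonlinear_integral` and the measure values are hypothetical and are not taken from the paper.

```python
def nonlinear_integral(f, mu):
    """Discrete nonlinear integral of f with respect to mu.

    f  : dict mapping attribute name -> nonnegative value f(x)
    mu : dict mapping frozenset of attribute names -> measure value
    """
    order = sorted(f, key=f.get)  # x'_1, ..., x'_m with nondecreasing f
    total = 0.0
    previous = 0.0  # plays the role of f(x'_0) = 0
    for i, x in enumerate(order):
        upper_set = frozenset(order[i:])  # {x'_i, x'_{i+1}, ..., x'_m}
        total += (f[x] - previous) * mu[upper_set]
        previous = f[x]
    return total


# Hypothetical measure on three attributes (illustration only).
mu = {
    frozenset({"x1"}): 0.3,
    frozenset({"x2"}): 0.2,
    frozenset({"x3"}): 0.1,
    frozenset({"x1", "x2"}): 0.4,
    frozenset({"x1", "x3"}): 0.4,
    frozenset({"x2", "x3"}): 0.6,
    frozenset({"x1", "x2", "x3"}): 1.0,
}

f = {"x1": 2.0, "x2": 1.0, "x3": 3.0}
Y = nonlinear_integral(f, mu)
# Terms: (1 - 0) * mu(X) + (2 - 1) * mu({x1, x3}) + (3 - 2) * mu({x3})
```

When $\mu$ happens to be additive, the same routine reduces to an ordinary weighted sum of the attribute values, which is a convenient sanity check for an implementation.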

2.5.2 Nonlinear Integral Projection Based on the Nonlinear Integral

We can build an aggregation tool that projects the feature space onto a real axis. Under the projection, each point in the feature space becomes a value of the virtual variable. A point $(f(x_1), f(x_2), \ldots, f(x_m))$, denoted simply by $(f_1, f_2, \ldots, f_m)$, in the feature space can be regarded as a function $f$ defined on $X$. It represents an observation of the feature attributes. Point $f$ is projected to $Y$, the value of the virtual variable, on a real axis through a nonlinear integral defined by

$$Y = \int f \, d\mu.$$

Once the value of $\mu$ is determined, we can calculate the virtual value $Y$ from $f$.

432 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. 2, MARCH/APRIL 2011

TABLE 1. An Example of Fuzzy Measure Defined on $X = \{x_1, x_2, x_3\}$


2.5.3 GA-Based Adaptive Classifier

The next step is to find an appropriate formula that projects the $m$-dimensional feature space onto a real axis $L$ such that each point $f = (f_1, f_2, \ldots, f_m)$ becomes a value of the virtual variable that is optimal with respect to the classification. In this way, each classification boundary is just a point on the real axis $L$.

The classification process can be divided into two parts for implementation:

Step 1. The nonlinear integral classifier depends on the fuzzy measure $\mu$, so the first step is to determine the optimal values of $\mu$ by using the GA tool. The fitness function comes from the linear classifier used in the second step, so this is an iterative process. The optimal fuzzy measure is the output passed to the next step.

Step 2. Once the fuzzy measure $\mu$ is determined, the virtual value can be obtained using the nonlinear integral. Then, we can classify these virtual values on the real axis using a linear classifier.

The remainder of this section addresses these two parts in turn.

GA-based learning of the fuzzy measure. Here, we discuss the optimization of the fuzzy measure $\mu$ under the criterion of minimizing the corresponding global misclassification rate, which is obtained in the second part above.

In our GA model, we use a variant of the original function $f$, $f' = a + bf$, where $a$ is a vector that shifts the coordinates of the data and $b$ is a vector that scales the values of the predictive attributes. Each chromosome represents the fuzzy measure vector $\mu$, the shifting vector $a$, and the scaling vector $b$. A signed fuzzy measure is 0 at the empty set, so that value need not be encoded. If there are $m$ attributes in the training data, a chromosome has $2^m - 1 + 2m$ genes, which are set to random real values at initialization. The genetic operations used are the traditional ones. At each generation, for each chromosome, all variables are fixed and the virtual values of all training data are calculated using the nonlinear integral. The fitness function is defined as follows:

$$\text{fitness} = \omega_1 \cdot \text{accuracy} + \omega_2 \cdot \text{sensitivity},$$

where $\omega_1$ and $\omega_2$ are adjustment parameters given by the user. Accuracy and sensitivity are determined in the second part of the model.
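The chromosome layout and the fitness function can be sketched as follows. This is a hedged Python illustration, not the Matlab implementation used in the paper: the gene ordering (measure values first, then $a$, then $b$), the function names, and the default weights are all assumptions made for the example.

```python
def chromosome_length(m):
    """Number of genes: 2**m - 1 measure values, m shifts, and m scales."""
    return (2 ** m - 1) + 2 * m


def decode(chromosome, m):
    """Split a flat gene list into (mu_values, a, b).

    The ordering of the three segments is an assumption for illustration.
    """
    n_mu = 2 ** m - 1
    mu_values = chromosome[:n_mu]
    a = chromosome[n_mu:n_mu + m]
    b = chromosome[n_mu + m:]
    return mu_values, a, b


def transform(f, a, b):
    """Apply f' = a + b * f attribute by attribute."""
    return [ai + bi * fi for fi, ai, bi in zip(f, a, b)]


def fitness(accuracy, sensitivity, w1=0.5, w2=0.5):
    """Weighted sum of accuracy and sensitivity (weights are user-chosen)."""
    return w1 * accuracy + w2 * sensitivity
```

For the five attributes used per data set, `chromosome_length(5)` gives 41 genes. Raising `w2` relative to `w1` steers the search toward more sensitive classifiers.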

Linear classifier for the virtual values. After determining the fuzzy measure $\mu$, the shifting vector $a$, the scaling vector $b$, and the respective classification function from the training data in the GA, points in the $m$-dimensional feature space are projected onto a real axis using the nonlinear integral.

We use Fisher's linear discriminant function to perform classification in this one-dimensional space [37], [38]. The positive and negative centroids of the projected data are determined by the following formulas:

$$m_{+} = \frac{\sum_{i : y_i = 1} x_i}{\sum_{i : y_i = 1} 1}, \qquad m_{-} = \frac{\sum_{i : y_i = -1} x_i}{\sum_{i : y_i = -1} 1}.$$

Ronald Fisher defined the scatter matrices as

$$S_{\pm} \equiv \sum_{x_i : y_i = \pm 1} (x_i - m_{\pm})(x_i - m_{\pm})'.$$

$S_W = S_{+} + S_{-}$ is called the Within-Class Scatter Matrix. Similarly, the Between-Class Scatter Matrix can be defined as

$$S_B \equiv (m_{+} - m_{-})(m_{+} - m_{-})'.$$

Hence, Fisher's discriminant criterion can be expressed equivalently as a ratio of two quadratic forms,

$$J(w) = \frac{w' S_B w}{w' S_W w},$$

in which $w$ represents the direction of the projection space, i.e., the one-dimensional space. We solve the optimization problem by maximizing $J(w)$. The optimal $w$ can be represented as $w = S_W^{-1}(m_{+} - m_{-})$. So, Fisher's discriminant function is formulated as

$$y = w\,(x - n_{+} m_{+} - n_{-} m_{-}),$$

in which $n_{+}$ and $n_{-}$ are the numbers of observations in the positive and negative classes, respectively. Finally, a threshold needs to be fixed in order to define a complete classifier.
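In the one-dimensional case the scatter matrices reduce to scalars, so the optimal direction can be computed by hand. The Python sketch below illustrates this under stated assumptions: the text leaves the threshold open, so the midpoint between the two projected centroids is used here purely as an illustrative choice, and the function names are hypothetical.

```python
def fisher_1d(values, labels):
    """Fit a one-dimensional Fisher discriminant.

    values : projected virtual values Y
    labels : +1 for the positive class, -1 for the negative class
    Returns (w, threshold). Assumes both classes are present and the
    within-class scatter is nonzero.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == -1]
    m_pos = sum(pos) / len(pos)  # positive centroid m_+
    m_neg = sum(neg) / len(neg)  # negative centroid m_-
    # Within-class scatter S_W = S_+ + S_- (a scalar in one dimension).
    s_w = sum((v - m_pos) ** 2 for v in pos) + sum((v - m_neg) ** 2 for v in neg)
    w = (m_pos - m_neg) / s_w  # w = S_W^{-1} (m_+ - m_-)
    # Assumption: threshold at the midpoint of the projected centroids.
    threshold = w * (m_pos + m_neg) / 2
    return w, threshold


def predict(value, w, threshold):
    """Classify one projected value as +1 or -1."""
    return 1 if w * value > threshold else -1
```

A brief usage example: fitting on the values `[1, 2, 3, 7, 8, 9]` with labels `[-1, -1, -1, 1, 1, 1]` places the decision boundary at the value 5, halfway between the centroids 2 and 8.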

3 METHODS

We applied EA-based Rule Learning [39] and Nonlinear Integral classifiers to classify the HBV DNA data into liver cancer (HCC) and normal (CON, control) classes, and then compared them with several classical classification methods, which include See5.0 (Decision Tree) [40], Neural Network [41], SVM [42], and Naïve Bayes [43]. As mentioned before, we conduct a detailed study on the Rule Learning and Nonlinear Integral classifiers separately. In this section, we give brief descriptions of these classical classification methods and of the data sets used. Then, the implementation details of the Nonlinear Integral (NI) classifier and the evaluation methodology are introduced.

3.1 Methods Description

The following paragraphs briefly describe the four classical methods that are compared with our new methods.

3.1.1 Decision Tree [40]

A decision tree is a tree-structured classifier. The Decision Tree method learns a decision tree using a recursive tree-growing process. Each test corresponding to an attribute is evaluated on the training data using a test criterion function. The test criterion function assigns each test a score based on how well it partitions the data set. The test with the highest score is selected and placed at the root of the tree. The subtrees of each node are then grown recursively by applying the same algorithm to the examples in each leaf. The algorithm terminates when the current node contains either all positive or all negative examples. We used the widely available package See5.0, which is a state-of-the-art Decision Tree classifier.

3.1.2 Neural Network [41]

An Artificial Neural Network (ANN), commonly just called a neural network (NN), is an interconnected group of artificial neurons that uses a mathematical or computational model for information processing based on a connectionist approach to computation. In most cases, an NN is an adaptive system that changes its structure or the weights of its interconnections based on external and internal information (stimuli) that flows through the network.

In more practical terms, NNs are nonlinear statistical data modeling tools for decision-making and classification. They can be used to model complex relationships between inputs and outputs or to find patterns in data. However, an NN is essentially a black box, and it is not easy to interpret how it functions.

3.1.3 Support Vector Machine [42]

SVMs are a set of related supervised learning methods used for classification and regression. Viewing input data as two sets of vectors in an n-dimensional space, an SVM constructs a separating hyperplane in that space, which maximizes the margin between the two data sets. To calculate the margin, two parallel hyperplanes are constructed, one on each side of the separating hyperplane, which are "pushed up against" the two data sets. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the neighboring data points of both classes, since, in general, the larger the margin, the smaller the generalization error of the classifier.

The original optimal hyperplane algorithm proposed by Vladimir Vapnik in 1963 was a linear classifier. The classification model produced by SVM (as described above) depends only on a subset of the training data, because the cost function for building the model does not care about training points that lie beyond the margin. In our paper, the software used is obtained from [42].

3.1.4 Naïve Bayes [43]

A Naïve Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naïve) independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model."

Depending on the precise nature of the probability model, naïve Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naïve Bayes models uses the method of maximum likelihood; in other words, one can work with the naïve Bayes model without believing in Bayesian probability or using any Bayesian method.

In spite of their oversimplified assumptions, Naïve Bayes classifiers often work much better in many complex real-world situations than one might expect. Recently, careful analysis of the Bayesian classification problem has shown that there are some theoretical reasons for the apparently unreasonable efficacy of Naïve Bayes classifiers [44]. An advantage of the Naïve Bayes classifier is that it requires only a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because the variables are assumed to be independent, only the variances of the variables for each class need to be determined and not the entire covariance matrix.

3.2 Implementation Details of Nonlinear Integral

To implement the learning algorithm of our new classifier based on nonlinear integrals, we use the GA tool in Matlab v7.2 combined with a program for Fisher's discriminant function [45]. All the parameters of the GA in our experiments are shown in Table 2. We set a generation limit of 100 as the stopping criterion.

3.3 Data Description

The data set contains 98 control patients and 100 HCC patients. The HBV DNA sequences were obtained specifically for this study from these patients, who were carefully selected by our medical experts to minimize demographic bias. There are four data sets corresponding to the different clusters, namely B, C1, C2, and C3. The numbers of patients for each data set are shown in Table 3, in which the last column represents the proportion of each data set. For each data set, an independent validation set is prepared to evaluate the performance of the classifiers. Table 4 shows the number of patients in the validation data sets.


TABLE 2. GATool Parameters in Matlab

TABLE 3. The Details of HBV Data Sets

TABLE 4. Summary of Validation Sets


3.4 Evaluation Methodology

In classifying an unknown case, depending on the class predicted by the classifier and the true class of the patient (Control or HCC), four possible types of results can be observed for the prediction as follows:

1. True positive: the patient has been predicted as positive (Cancer) and the patient has cancer.

2. False positive: the patient has been predicted as positive (Cancer) but the patient does not have cancer.

3. True negative: the patient has been predicted as negative (Control) and, indeed, the patient does not have cancer.

4. False negative: the patient has been predicted as negative (Control) but the patient has cancer.

Let TP, FP, TN, and FN, respectively, denote the number of true positives, false positives, true negatives, and false negatives. For each learning and evaluation experiment, Accuracy, Sensitivity, and Specificity, defined below, are used as the fitness or performance indicators of the classification:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}.$$
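These three indicators follow directly from the four counts. The short Python sketch below shows one way to compute them; the function names and the label strings `"HCC"` and `"CON"` are assumptions for illustration, and the example predictions are made up.

```python
def confusion_counts(y_true, y_pred, positive="HCC"):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def evaluate(tp, fp, tn, fn):
    """Accuracy, sensitivity, and specificity as defined in the text.

    Assumes at least one positive and one negative case are present,
    otherwise a division by zero occurs.
    """
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Made-up labels, for illustration only.
y_true = ["HCC", "HCC", "CON", "CON", "HCC"]
y_pred = ["HCC", "CON", "CON", "HCC", "HCC"]
scores = evaluate(*confusion_counts(y_true, y_pred))
```

The division-by-zero caveat in `evaluate` is one practical reason why each test fold needs cases from both classes.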

For screening tests, medical professionals usually prefer higher sensitivity; that is, lower accuracy and specificity are an acceptable trade-off for high sensitivity as long as the accuracy and specificity remain reasonable. This means that we would rather send more people for confirmation tests than miss any true cancer patients. In the data sets, all attributes are categorical. There are four symbolic values, A, C, G, and T, for each attribute. In order to use the nonlinear model, we use the simple integer values 0, 1, 2, and 3, respectively, as the numericalized initial values that represent the discrete values of the attributes.
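The numericalization step amounts to a fixed lookup table. The following minimal Python sketch illustrates it; the function name `encode_sites`, the use of 0-based positions, and the example sequence are assumptions made for the illustration.

```python
# Integer codes for the four nucleotides, in the order given in the text.
NUCLEOTIDE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_sites(sequence, sites):
    """Return the integer codes of the nucleotides at the given positions.

    sequence : DNA string over the alphabet A, C, G, T
    sites    : 0-based positions of the selected attributes (an assumption;
               site numbering in the data sets may differ)
    """
    return [NUCLEOTIDE_CODE[sequence[pos]] for pos in sites]


# Made-up sequence and site positions, for illustration only.
features = encode_sites("ACGTTGCA", [0, 3, 5])
```

These integer codes are only initial values; in the GA model each attribute is subsequently shifted and scaled by the learned vectors $a$ and $b$.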

We adopt the K-fold cross-validation method to make sure that the whole data set can be used as testing data in turn and that overtraining (overfitting) can be avoided. We randomly partition the $n$ data points into $K$ sets of size $n/K$; the classifier is trained on $(K - 1)$ sets and tested on the remaining set, and this is repeated $K$ times in turn. After $K$ runs, all data have been used for testing and the average can be computed to evaluate the performance. The K-fold method is repeated 10 times for each experiment ($10 \times K$ runs in total) to obtain an overall average performance.
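The repeated K-fold procedure can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the seed are assumptions, and the partition is purely random, so it does not enforce that every test fold contains cases of both classes.

```python
import random


def k_fold_indices(n, k, rng):
    """Yield (train, test) index lists for one random K-fold partition."""
    indices = list(range(n))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most 1
    for i in range(k):
        test = folds[i]
        train = [j for t, fold in enumerate(folds) if t != i for j in fold]
        yield train, test


def repeated_k_fold(n, k, repeats=10, seed=0):
    """Repeat the K-fold partition `repeats` times (repeats * k runs)."""
    rng = random.Random(seed)
    for _ in range(repeats):
        yield from k_fold_indices(n, k, rng)
```

For a data set of 26 sequences with `k=10` and `repeats=10`, the generator yields 100 train/test splits, with two or three sequences in each test fold.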

Our data sets are very small despite the fact that they come from one of the biggest single studies. For example, C1 contains DNA sequences from 26 individuals. We must ensure that the testing data set contains at least one case of each class. If the number (K) of splits is too large, the size of the testing set will be too small, and it may not even contain a positive case. On the other hand, if the number of splits is too small, it will result in small training sets, which may not contain sufficient information for training. So, we need to find a balance between the sizes of the training and testing sets in order to reduce the probability of overtraining (overfitting) and undertesting (i.e., not enough positive and negative examples for testing). We have tried several feasible K values for Nonlinear Integrals. The results are shown in Table 5. We can see that the testing accuracy and sensitivity are best with 10 folds. Consequently, we have chosen the 10-fold method for our experiments. This 10-fold methodology is applied to all experiments, including the classical classifiers used in our comparison.

4 EXPERIMENTAL RESULTS

In this section, we first present the results of EA-based Rule Learning [39] and Nonlinear Integral classifiers in classifying the HBV DNA data into liver cancer (HCC) and normal (CON, control) classes, and then compare them with several traditional classification methods, which include See5.0 (Decision Tree) [40], Neural Network [41], SVM [42], and Naïve Bayes [43]. As mentioned before, we do a detailed study on the Rule Learning and Nonlinear Integral classifiers separately because of the high interpretability of their models, which represent the knowledge acquired through the learning processes. Biochemists and doctors can see explicitly and clearly the influences of the mutated sites or markers and their potential interactions toward the formation of liver cancer.

TABLE 5. All Splits for Training and Testing on Nonlinear Integral Classifier

For each data set, we use five attributes, which include those selected by the Rule Learning method and, in some cases, are supplemented by those with the highest information gain obtained by Viewer [46], partially shown in Fig. 9. To reduce computational complexity, we reduce the number of attributes by applying this feature selection method. We compare the results of the Nonlinear Integral classifier with and without feature selection in Table 6, which shows that feature selection is very useful.
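Information gain for a single site can be computed from class entropies. The Python sketch below uses the standard entropy-based definition; it is an illustration and not the Viewer tool [46], and the function names and example data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(column, labels):
    """Reduction in class entropy obtained by splitting on one site.

    column : nucleotide observed at this site for each sequence
    labels : class of each sequence (e.g., "HCC" or "CON")
    """
    total = len(labels)
    remainder = 0.0
    for value in set(column):
        subset = [y for v, y in zip(column, labels) if v == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Made-up data, for illustration only.
labels = ["HCC", "HCC", "CON", "CON"]
informative_site = ["A", "A", "G", "G"]    # separates the classes perfectly
uninformative_site = ["A", "A", "A", "A"]  # carries no class information
```

Ranking all sites by this score in decreasing order and keeping the top few corresponds to the selection step described above.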

4.1 Comparison between NIC and RL

Table 7 shows the results of Rule Learning (RL) and the Nonlinear Integral Classifier (NIC) for each data set and for a validation set, which contains never-seen-before cases. Table 8 shows the comparison of our methods with several classical methods on data sets B, C1, C2, and C3.

In Table 7, the sensitivity results of NIC are higher than those of RL in most cases, and the other values are comparable. Since sensitivity is more important for doctors in diagnosis, the performance of NIC is considered to be better than that of RL. Furthermore, NIC can not only determine the important sites (markers) with regard to the diagnosis but also give their degrees of contribution as real values, which are meaningful in biomedical research. This is described in Section 4.4.

4.2 Results of Classifier Based on Nonlinear Integrals Compared with Other Methods

Table 8 shows the comparison results of the Nonlinear Integrals Classifier with five other algorithms: NN, Decision Tree (DT), Naïve Bayes (NB), SVM, and RL.

We run six sets of experiments for each classifier. The first set of experiments uses the top one site (the attribute with the highest information gain), the second set uses the top two sites, the third uses the top three sites, and so on. The results are evaluated mainly according to the test accuracy and sensitivity for the HBV data. Finally, the best result out of the six sets of experiments for each method is selected for comparison. Thus, Table 8 shows the optimal results of the GA-based NIC method against those of the other methods.

For the weighted average results in Table 8, the best ones are shown in bold. The weighted average is computed according to the number of cases in each data set. The sensitivity and accuracy of our NIC are better than those of most algorithms, or at least comparable. To compare the performance of all algorithms graphically, we plot all the results in Figs. 3, 4, 5, 6, and 7. In addition, we place all methods in an ROC space in Fig. 8 to help interpret the results in Table 8.
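The weighted average mentioned above weights each data set's score by its number of cases. A minimal Python sketch follows; the function name and all numbers are placeholders for illustration and are not results or counts from Tables 3 or 8.

```python
def weighted_average(scores, counts):
    """Average per-data-set scores, weighted by the number of cases."""
    total = sum(counts[name] for name in scores)
    return sum(scores[name] * counts[name] for name in scores) / total


# Placeholder scores and case counts (NOT values from the paper).
example_scores = {"set_1": 0.8, "set_2": 0.6}
example_counts = {"set_1": 30, "set_2": 10}
overall = weighted_average(example_scores, example_counts)
```

With these placeholder values the larger data set dominates, so the result lies closer to 0.8 than to 0.6.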


TABLE 6. Comparison Results of Nonlinear Integral with and without Feature Selection

TABLE 7. Results of RL and NIC for Each Data Set

TABLE 8. Comparison Results with Classical Methods for All Data Sets


4.3 Comments on Results

Our framework includes RL and NI. RL has slightly higher accuracy than NI, which means that RL can have higher prediction power. For doctors and clinicians, however, sensitivity is more important than accuracy, and NI is better than RL on sensitivity.

Compared with the four traditional methods, namely, NN, DT, NB, and SVM, NI shows the best diagnostic performance on the average evaluations. NI not only has comparable accuracy but can also show the interactions of attributes. How to identify the importance of attributes and their combinations is introduced in the next section.

4.4 To Identify Important Sites and Interactions among Them

Another important contribution of the nonlinear integral classifier is that we can find some significant sites (markers) and interactions among them in the sequences for further wet laboratory analyses. According to the definition of Nonlinear Integrals, for each data set, we can get a set of linear equations with the signed fuzzy measures as variables. A solution with the fewest nonzero values can be obtained by solving the linear equations based on L1-norm regularization [47].
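The equations are linear in the signed fuzzy measure because, for a fixed observation $f$, the nonlinear integral is a sum of measure values with known coefficients. The Python sketch below builds one such coefficient row; it is an illustration under stated assumptions: the column ordering of the subsets and the function names are hypothetical, the right-hand side of each equation is not specified in the text and is therefore omitted, and the L1-norm solver itself [47] is not shown.

```python
from itertools import combinations


def subset_index(m):
    """Map every nonempty subset of {0, ..., m-1} to a column index."""
    subsets = [
        frozenset(combo)
        for size in range(1, m + 1)
        for combo in combinations(range(m), size)
    ]
    return {subset: column for column, subset in enumerate(subsets)}


def coefficient_row(f, index):
    """Coefficients of the measure values in the nonlinear integral of f.

    f     : list of m nonnegative attribute values for one observation
    index : mapping from subsets to columns, as built by subset_index
    """
    row = [0.0] * len(index)
    order = sorted(range(len(f)), key=lambda i: f[i])  # nondecreasing f
    previous = 0.0
    for pos, i in enumerate(order):
        # Coefficient of mu({x'_pos, ..., x'_m}) is f(x'_pos) - f(x'_{pos-1}).
        row[index[frozenset(order[pos:])]] += f[i] - previous
        previous = f[i]
    return row
```

Each observation touches at most $m$ of the $2^m - 1$ unknowns, so with few sequences the system is underdetermined, which is one reason a sparsity-promoting criterion is a natural way to pick a solution.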

The respective potential sites in the sequences for $x_i$, ranked according to information gain computed by Viewer [46], are listed in Table 9. Fig. 9 is a screenshot from the Viewer for Data Set C1. The left column represents the site numbers and the right column gives the corresponding information gain values ranked in decreasing order.


Fig. 3. The comparison of all methods for B.

Fig. 4. The comparison of all methods for C1.

Fig. 5. The comparison of all methods for C2.

Fig. 6. The comparison of all methods for C3.

Fig. 7. The comparison of all methods as weighted average.

Fig. 8. Classifiers in ROC space.

TABLE 9. The Top 5 Sites No. of Sequences for Each Data Set

Fig. 9. The screenshot of Viewer in information gain order.


For the B, C1, C2, and C3 data sets, the top five sites are used to formulate the set of linear equations. We then obtained the solutions that have the fewest nonzero values and filtered out the positions with zero values. In Table 10, we show the importance and relevance of the individual sites and their interactions.

From Table 10, we can see that many sites of the sequence have no effect either individually or in combination with others. The sites with nonzero values are important for diagnosing the disease. These may be helpful for bioinformatics and medical research.

5 DISCUSSIONS AND FUTURE WORK

In this paper, a data mining framework for DNA sequence biological data sets has been presented. It has been applied to the Hepatitis B Virus DNA data sets, which are among the largest in the world and have been collected by our medical school specifically for this project. We have developed a framework for marker discovery. This framework incorporates two algorithms, NIC and RL. Both classifiers can explicitly give the importance of the markers and their interactions and have shown good performance in cancer prediction.

Moreover, the details of the new classification method based on the nonlinear integral have been presented. This method performs well using the fuzzy measure and the nonlinear integral, because the nonadditivity of the fuzzy measure reflects the importance of the individual feature attributes as well as their inherent interactions. Besides the high interpretability of the Nonlinear Integrals Classifier, the experimental results have shown that it is one of the best classifiers, especially in terms of sensitivity. It is very useful for preliminary diagnosis and screening tests of liver cancer caused by HBV. In our model, we use a GA for optimization, which provides multimodal solutions containing sets of best solutions. The final confirmation experiments, as in many other bioinformatics problems, need to be carried out by biochemists to identify and study the true markers. Finally, we have used a regularization method to get a solution with the fewest nonzero fuzzy measure values. It can provide some important individual key markers and combinations of key markers of the HBV DNA sequences. We believe that this information can help biochemists to do further research.

We hypothesize that the genomic makeup of HBV affects the carcinogenic potential of the virus. In this case-control study, we have demonstrated that some genotype-specific mutations are more commonly found among HCC patients than among their age- and gender-matched controls. These markers can therefore be used as biomarkers to stratify the cancer risk of chronic hepatitis B patients. Our findings have been validated by independent data sets in the validation process. To confirm the biological role of these mutations, further experimental work using in situ mutagenesis of replicative HBV clones on their carcinogenicity in animal and cell line models will be required.

However, even though we have generated one of the largest data sets, the sample sizes of the data sets are still small (fewer than 100) for each case. It is a challenge for the classifier based on the nonlinear integral to avoid overtraining.

ACKNOWLEDGMENTS

This research is partially supported by grants from the Research Grants Council of the Hong Kong SAR, China (Projects CUHK414107 and CUHK414708).

REFERENCES

[1] R.P. Beasley, L.Y. Hwang, C.C. Lin, and C.S. Chien, “Hepatocellular Carcinoma and Hepatitis B Virus. A Prospective Study of 22 707 Men in Taiwan,” Lancet, vol. 2, pp. 1129-1133, 1981.

[2] J.H. Kao, P.J. Chen, M.Y. Lai, and D.S. Chen, “Hepatitis B Genotypes Correlate with Clinical Outcome in Patients with Chronic Hepatitis B,” Gastroenterology, vol. 118, pp. 554-559, 2000.

[3] H.L.Y. Chan et al., “Genotype C Hepatitis B Virus Infection Is Associated with an Increased Risk of Hepatocellular Carcinoma,” Gut, vol. 53, pp. 1494-1498, 2004.

[4] H. Sumi, O. Yokosuka, N. Seki, M. Arai, F. Imazeki, T. Kurihara, T. Kanda, K. Fukai, M. Kato, and H. Saisho, “Influence of Hepatitis B Virus Genotypes on the Progression of Chronic Liver Disease,” Hepatology, vol. 37, pp. 19-26, 2003.

[5] M.F. Yuen, Y. Tanaka, M. Mizokami, J.C. Yuen, D.K. Wong, H.J. Yuan, S.M. Sum, A.O. Chan, B.C. Wong, and C.L. Lai, “Role of Hepatitis B Virus Genotypes Ba and C, Core Promoter and Precore Mutations on Hepatocellular Carcinoma: A Case Control Study,” Carcinogenesis, vol. 25, pp. 1593-1598, 2004.

[6] H.L.Y. Chan, C.H. Tse, E.Y.T. Ng, K.S. Leung, K.H. Lee, K.W. Tsui, and J.J.Y. Sung, “Phylogenetic, Virological and Clinical Characteristics of Genotype C Hepatitis B Virus with Tcc at Codon 15 of the Precore Region,” J. Clinical Microbiology, vol. 44, no. 3, pp. 681-687, 2006.


TABLE 10. The Signed Fuzzy Measure of Each Site Used in Each Data Set


[7] H.L.Y. Chan, S.K.W. Tsui, E.Y.T. NG, P.C.H. Tse, K.S. Leung, K.H.Lee, T. Mok, A. Bartholomeuz, T.C.C. Au, and J.J.Y. Song,“Epidemiological and Virological Characteristics of Two Sub-groups of Genotype C Hepatitis Virus,” J. Infectious Diseases,vol. 191, pp. 2022-2032, 2005.

[8] T. Laskus, L.-F. Wang, M.R.H. Vargas, and J. Cianciara,“Comparison of Hepatitis B Virus Core Promoter Sequences inPeripheral Blood Mononuclear Cells and Serum From Patientswith Hepatitis B,” J. General Virology, vol. 78, pp. 649-653, 1997.

[9] W.K. Keum, J.Y. Kim, J.Y. Kim, S.G. Chi, H.J. Woo, S.S. Kim, J. Ha,and I. Kang, “Heterogeneous HBV Mutants Coexist in KoreanHepatitis B Patients,” Experimental and Molecular Medicine, vol. 30,no. 2, pp. 115-122, June 1998.

[10] R.B. Potter and S. Draghici, “A Soft Approach to Predicting HIVDrug Resistance,” Proc. Pacific Symp. Biocomputing (PSB ’02), 2002.

[11] A. Ciancio, A. Smedile, and M. Rizzetto, “Identification of HBVDNA Sequences that Are Predictive of Response to LamivudineTherapy,” Hepatology, vol. 39, pp. 64-73, 2004.

[12] J.D. Thompson, D.G. Higgins, and T.J. Gibson, “CLUSTAL W:Improving the Sensitivity of Progressive Multiple SequenceAlignment through Sequence Weighting, Position Specific GapPenalties and Weight Matrix Choice,” Nucleic Acids Research,vol. 22, pp. 4673-4680, 1994.

[13] M. Kimura, “A Simple Method for Estimating Evolutionary Ratesof Base Substitutions through Comparative Studies of NucleotideSequences,” J. Molecular Evolution, vol. 16, pp. 111-120, 1980.

[14] N. Saitou and M. Nei, “The Neighbor-Joining Method: A NewMethod for Reconstructing Phylogenetic Trees,” Molecular Biologyand Evolution, vol. 4, pp. 406-425, 1987.

[15] S. Kumar, K. Tamura, and M. Nei, “MEGA3: Integrated Softwarefor Molecular Evolutionary Genetics Analysis and SequenceAlignment,” Brief Bioinformatics, vol. 5, pp. 150-163, 2004.

[16] T. Sakamoto, Y. Tanaka, J. Simonetti, C. Osiowy, M.L. Børresen, A.Koch, F. Kurbanov, M. Sugiyama, G.Y. Minuk, B.J. McMahon, T.Joh, and M. Mizokami, “Classification of Hepatitis B VirusGenotype B into 2 Major Types Based on Characterization of aNovel Subgenotype in Arctic Indigenous Populations,” J. InfectiousDiseases, vol. 196, pp. 1487-1492, 2007.

[17] E. Orito et al., “Geographic Distribution of Hepatitis B Virus(HBV) Genotype in Patients with Chronic HBV Infection inJapan,” Hepatology, vol. 34, pp. 590-594, 2001.

[18] F. Sugauchi, H. Kumada, H. Sakugawa, M. Komatsu, H. Niitsuma,H. Watanabe, Y. Akahane, H. Tokita, T. Kato, Y. Tanaka, E. Orito,R. Ueda, Y. Miyakawa, and M. Mizokami, “Two Subtypes ofGenotype B (Ba and Bj) of Hepatitis B Virus in Japan,” ClinicalInfectious Diseases, vol. 38, pp. 1222-1228, 2004.

[19] S.M. Owyer and J.G.M. Sim, “Relationships within and betweenGenotypes of Hepatitis B Virus at Points Across the Genome:Footprints of Recombination in Certain Isolates,” J. GeneralVirology, vol. 81, pp. 379-392, 2000.

[20] H. Almuallim and T. Dietterich, “Learning Boolean Concepts inthe Presence of Many Irrelevant Features,” Artificial Intelligence,vol. 69, nos. 1/2, pp. 179-305, 1994.

[21] M. Dash and H. Liu, “Feature Selection for Classification,”Intelligent Data Analysis, vol. 1, no. 3, pp. 131-156, 1997.

[22] G. John, R. Kohavi, and K. Pfleger, “Irrelevant Features and the Subset Selection Problem,” Proc. 11th Int’l Conf. Machine Learning, pp. 121-129, 1994.

[23] P. Langley, “Selection of Relevant Features in Machine Learning,” Proc. AAAI Fall Symp. Relevance, pp. 1-5, 1994.

[24] P. Pudil, J. Novovicova, and J. Kittler, “Floating Search Methods in Feature Selection,” Pattern Recognition Letters, vol. 15, pp. 1119-1125, Nov. 1994.

[25] A. Jain and D. Zongker, “Feature Selection: Evaluation, Application, and Small Sample Performance,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 2, pp. 153-158, Feb. 1997.

[26] T.M. Mitchell, Machine Learning. The McGraw-Hill Companies, Inc., 1997.

[27] E. Charniak, “Bayesian Networks without Tears,” AI Magazine, vol. 12, no. 4, pp. 50-63, 1991.

[28] D.M. Chickering, D. Heckerman, and D. Geiger, “Learning Bayesian Networks: The Combination of Knowledge and Statistical Data,” Machine Learning, vol. 20, pp. 197-243, 1995.

[29] W. Liu, J. Cheng, and A.B. David, “An Algorithm for Bayesian Belief Network Construction from Data,” Proc. Sixth Int’l Workshop Artificial Intelligence and Statistics, 1997.

[30] W. Banzhaf, P. Nordin, R. Keller, and F. Francone, Genetic Programming: An Introduction. Morgan Kaufmann, 1997.

[31] A.A. Freitas, “A Survey of Evolutionary Algorithms for Data Mining and Knowledge Discovery,” Advances in Evolutionary Computation, A. Ghosh and S. Tsutsui, eds., Springer-Verlag, 2002.

[32] M.L. Raymer, W.F. Punch, E.D. Goodman, L.A. Kuhn, and A.K. Jain, “Dimensionality Reduction Using Genetic Algorithms,” IEEE Trans. Evolutionary Computing, vol. 4, no. 2, pp. 164-171, July 2000.

[33] M.L. Wong and K.S. Leung, “Learning Recursive Functions from Noisy Examples Using Generic Genetic Programming,” Proc. First Ann. Conf., pp. 238-246, 1996.

[34] M.L. Wong and K.S. Leung, Data Mining Using Grammar Based Genetic Programming and Applications. Kluwer Academic Publishers, Jan. 2000.

[35] K.B. Xu, Z.Y. Wang, P.A. Heng, and K.S. Leung, “Classification by Nonlinear Integral Projections,” IEEE Trans. Fuzzy Systems, vol. 11, no. 2, pp. 187-201, Apr. 2003.

[36] Z.Y. Wang, K.S. Leung, and J. Wang, “A Genetic Algorithm for Determining Nonadditive Set Functions in Information Fusion,” Fuzzy Sets and Systems, vol. 102, pp. 463-469, 1999.

[37] G.J. McLachlan, Discriminant Analysis and Statistical Pattern Recognition. Wiley, 1992.

[38] S. Mika, A.J. Smola, and B. Schölkopf, “An Improved Training Algorithm for Fisher Kernel Discriminants,” Proc. Artificial Intelligence and Statistics (AISTATS ’01), T. Jaakkola and T. Richardson, eds., pp. 98-104, 2001.

[39] M.L. Wong and K.S. Leung, “Genetic Logic Programming and Applications,” IEEE Expert, vol. 10, no. 5, pp. 68-76, Oct. 1995.

[40] Data Mining Tools See5 and C5.0, Software, http://www.rulequest.com/see5-info.html, May 2006.

[41] SAS® Enterprise Miner (EM), http://www.sas.com/technologies/analytics/datamining/miner/, 2009.

[42] C.C. Chang and C.J. Lin, “LIBSVM: A Library for Support Vector Machines,” Software, http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.

[43] C. Borgelt, Bayes Classifier Induction, Software, http://fuzzy.cs.uni-magdeburg.de/~borgelt/bayes.html, 2009.

[44] H. Zhang, “The Optimality of Naive Bayes,” Proc. 17th Int’l Florida Artificial Intelligence Research Society (FLAIRS) Conf., 2004.

[45] C.M. Van Der Walt and E. Barnard, “Data Characteristics That Determine Classifier Performance,” Proc. 16th Ann. Symp. Pattern Recognition Assoc. of South Africa, pp. 160-165, http://www.patternrecognition.co.za, 2006.

[46] K.S. Leung, Y.T. Ng, K.H. Lee, L.Y. Chan, K.W. Tsui, T. Mok, C.H. Tse, and J. Sung, “Data Mining on DNA Sequences of Hepatitis B Virus by Nonlinear Integrals,” Proc. Taiwan-Japan Symp. Fuzzy Systems & Innovational Computing, Third Meeting (Keynote Speech), pp. 1-10, Aug. 2006.

[47] M.Y. Park and T. Hastie, “L1-Regularization Path Algorithm for Generalized Linear Models,” J. Royal Statistical Soc.: Series B (Statistical Methodology), vol. 69, no. 4, pp. 659-677, 2007.

Kwong-Sak Leung (M’77-SM’89) received the BSc (Eng.) and PhD degrees from the University of London, Queen Mary College, in 1977 and 1980, respectively. He was a senior engineer on contract R&D at ERA Technology and later joined the Central Electricity Generating Board to work on nuclear power station simulators in England. In 1985, he joined the Computer Science and Engineering Department at the Chinese University of Hong Kong, where he is currently a professor of computer science and engineering. His research interests include soft computing and bioinformatics including evolutionary computation, parallel computation, probabilistic search, information fusion and data mining, and fuzzy data and knowledge engineering. He has authored or coauthored more than 200 papers and two books in fuzzy logic and evolutionary computation. He has been a chair and member of many program and organizing committees of international conferences. He is on the Editorial Board of Fuzzy Sets and Systems and an associate editor of the International Journal of Intelligent Automation and Soft Computing. He is a senior member of the IEEE, a chartered engineer, a member of the IET and the ACM, a fellow of the HKIE, and a distinguished fellow of the HKCS in Hong Kong.



Kin Hong Lee received his degrees in computer science from the University of Manchester. Before he joined the Chinese University of Hong Kong in 1984, he had been with Burroughs Machines in Scotland and ICL in Manchester. He is now an associate professor in the Computer Science & Engineering Department at the Chinese University of Hong Kong. His current research interests include computer hardware and bioinformatics. He has published more than 60 papers in these two fields.

Jin-Feng Wang received the BS degree in computer science from Hebei Science and Technology University, Hebei, PR China, in 1999, and the MS degree in computer science from Hebei University, Hebei, PR China, in 2003. She is currently working toward the PhD degree at the Chinese University of Hong Kong. Her current research interests include nonlinear integrals and bioinformatics.

Eddie Y.T. Ng’s photo and bio are not available.

Henry L.Y. Chan received the graduate degree from The Chinese University of Hong Kong and completed training at the Prince of Wales Hospital, Hong Kong. He is currently a professor in the Department of Medicine and Therapeutics, the director of Cheng Suen Man Shook Center for Hepatitis Research, and the director of Center for Liver Health of the university. He is the president of the Hong Kong Association for the Study of Liver Diseases. He is serving as an editor of the Journal of Gastroenterology and Hepatology and a member of the editorial boards of Clinical Gastroenterology and Hepatology and Alimentary Pharmacology and Therapeutics. He has broad research interests, including virology of hepatitis B virus, natural history and treatment of chronic hepatitis B, liver fibrosis, hepatocellular carcinoma, and nonalcoholic fatty liver disease. He has published more than 150 peer-reviewed papers, six book chapters, and 135 conference abstracts.

Stephen K.W. Tsui is currently a professor in the Biochemistry Department of the Chinese University of Hong Kong. He is also the director of the Centre for Microbial Genomics and Proteomics and the Hong Kong Bioinformatics Centre. His major research interests are the genomics and bioinformatics of pathogenic viruses, including hepatitis B virus, SARS-coronavirus, and human immunodeficiency virus.

Tony S.K. Mok was trained at the University of Alberta and subsequently completed his fellowship in medical oncology at the Princess Margaret Hospital in Toronto. After working in a community-based oncology practice in Toronto for 7 years, he took up an academic position at the Department of Clinical Oncology at the Chinese University of Hong Kong. He has been very active in both clinical and laboratory research in the areas of lung cancer, liver cancer, and traditional Chinese medicine. He has contributed to over 100 international publications including abstracts, original articles, and book chapters. He speaks frequently at international conferences and is particularly active in China and the Asia Pacific region. He is the founder of the Lung Cancer Research Group, one of the first multicenter lung cancer study groups in the Asia Pacific region. He is the chairman of the Hong Kong Cancer Therapy Society and an executive committee member of the Chinese Society of Clinical Oncology. He has also founded the Cancer Patient Resource Center and the Cancer Information Hotline in Hong Kong. In 2002, he received the Lilly Oncology Pharmacogenomics Award in the United States.

Pete Chi-Hang Tse received the MPhil degree in zoology with specialization in molecular biology from The University of Hong Kong in 1997. He is now a research associate at the Institute of Digestive Diseases and Division of Gastroenterology and Hepatology, Department of Medicine and Therapeutics, Prince of Wales Hospital, The Chinese University of Hong Kong. His main research interests are molecular virology of hepatitis B virus and viral phylogenetics.

Joseph Jao-Yiu Sung received the MBBS degree from the University of Hong Kong in 1983. He was conferred the Doctor of Philosophy by the University of Calgary and Doctor of Medicine by the Chinese University of Hong Kong in 1991 and 1997, respectively. He has held fellowships from the Royal College of Physicians of Edinburgh, London, Thailand, & Glasgow, the American College of Gastroenterology, the Royal Australian College of Physicians, the American Gastroenterological Association, and the Hong Kong College of Physicians and Academy of Medicine.


440 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. 2, MARCH/APRIL 2011