
Research Article
Decision Support System (DSS) for Fraud Detection in Health Insurance Claims Using Genetic Support Vector Machines (GSVMs)
Hindawi Journal of Engineering, Volume 2019, Article ID 1432597, 19 pages, https://doi.org/10.1155/2019/1432597

Robert A. Sowah,1 Marcellinus Kuuboore,1 Abdul Ofoli,2 Samuel Kwofie,3 Louis Asiedu,4 Koudjo M. Koumadi,1 and Kwaku O. Apeadu1

1Department of Computer Engineering, University of Ghana, PMB 25, Legon, Accra, Ghana
2Electrical and Computer Engineering Department, University of Tennessee, Chattanooga, TN, USA
3Department of Biomedical Engineering, University of Ghana, Legon, Accra, Ghana
4Department of Statistics and Actuarial Science, University of Ghana, Legon, Accra, Ghana

Correspondence should be addressed to Robert A. Sowah; rasowah@ug.edu.gh

Received 25 January 2019; Accepted 1 August 2019; Published 2 September 2019

Academic Editor: Kamran Iqbal

Copyright © 2019 Robert A. Sowah et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Fraud in health insurance claims has become a significant problem whose rampant growth has deeply affected the global delivery of health services. In addition to the financial losses incurred, patients who genuinely need medical care suffer because service providers are not paid on time as a result of delays in the manual vetting of their claims and are therefore unwilling to continue offering their services. Health insurance claims fraud is committed through service providers, insurance subscribers, and insurance companies. The need for the development of a decision support system (DSS) for accurate, automated claim processing to offset the attendant challenges faced by the National Health Insurance Scheme cannot be overstated. This paper utilized the National Health Insurance Scheme claims dataset obtained from hospitals in Ghana for detecting health insurance fraud and other anomalies. Genetic support vector machines (GSVMs), a novel hybridized data mining and statistical machine learning tool, which provide a set of sophisticated algorithms for the automatic detection of fraudulent claims in these health insurance databases, are used. The experimental results have proven that the GSVM possessed better detection and classification performance when applied using SVM kernel classifiers. Three GSVM classifiers were evaluated and their results compared. Experimental results show a significant reduction in computational time on claims processing while increasing classification accuracy via the various SVM classifiers (linear (80.67%), polynomial (81.22%), and radial basis function (RBF) kernel (87.91%)).

1. Introduction

Low-income countries have made significant development policy frameworks for the sustainability of growth. These frameworks include healthcare delivery. Ghana is one of the countries which aspired to provide effective and efficient health care. In achieving this noble goal, the National Health Insurance Scheme (NHIS) was established by an Act of Parliament, Act 650, in 2003 [1].

The NHIS, as a social protection initiative, aims at providing financial risk protection against the cost of primary health care for residents of Ghana, and it has replaced the hitherto obnoxious cash-and-carry system of paying for health care at the point of receiving service. Since its introduction, the scheme has grown to become a significant instrument for financing healthcare delivery in Ghana. For effective and efficient implementation, NHIS introduced a tariff as a standardized primary fee for services rendered to its beneficiaries at their affiliated health institutions. This standardized tool was reviewed in January 2007 by the National Health Insurance Authority (NHIA), the governing body of NHIS, to develop a new tariff for the NHIS due to its expansion of service coverage. The new tariff was developed based on a GDRG (Ghana Diagnostic Related Group) system to include various clinical conditions and surgical procedures grouped under eleven Major Diagnostic Categories (MDC), or clinical specialties, namely, Adult Medicine, Pediatrics, Adult Surgery, Pediatric Surgery, Ear, Nose and Throat (ENT), Obstetrics and Gynecology, Dental, Ophthalmology, Orthopedics, Reconstructive Surgery, and Out-Patients' Department (OPD) [2]. These specialties provide a guide to the claim adjudication process and the operational mechanism for reporting claims, as well as determine the reimbursement process and create standards of operation between NHIS and service providers [2].

The GDRG code structure uses seven alphanumeric characters. The first four characters represent the MDC or clinical specialty. The next two characters are numbers representing the number of the GDRG within the MDC. The last character (A or C) represents the age category: an "A" represents those greater than or equal to 12 years, and C stands for those less than 12 years.
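To make this structure concrete, the following is a minimal sketch of how such a code could be decomposed programmatically; the example code string and the field names are hypothetical and are not taken from the NHIS dataset.

```python
def parse_gdrg(code: str) -> dict:
    """Decompose a seven-character GDRG code into its documented parts:
    a four-character MDC/specialty prefix, a two-digit GDRG number within
    the MDC, and a trailing age category ('A' for >= 12 years, 'C' for < 12)."""
    code = code.strip().upper()
    if len(code) != 7 or not code[4:6].isdigit() or code[6] not in ("A", "C"):
        raise ValueError(f"not a well-formed GDRG code: {code!r}")
    return {
        "mdc": code[:4],                  # clinical specialty prefix
        "gdrg_number": int(code[4:6]),    # GDRG number within the MDC
        "age_category": "adult (>= 12 years)" if code[6] == "A" else "child (< 12 years)",
    }

# "OPDC01A" is a made-up example string used only to show the structure
print(parse_gdrg("OPDC01A"))
```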

The World Health Organization (WHO) provided an International Classification of Diseases (ICD-10) to meet the requirements for claim submission [3, 4], but NHIS utilized the GDRG codes since it has full control over them. Hence, the GDRG codes are used to develop the fraud detection model.

A claim is a detailed invoice that service providers send to the health insurer, which shows exactly what services a patient or patients received at the point of healthcare service delivery. Claim processing is the major challenge of providers under the Health Insurance Scheme (HIS) globally due to the excessive fraud in submitted claims and gaming of the system through well-coordinated schemes to siphon money from its coffers [5-9].

Fraud in health care is classified into three categories, namely, (1) service provider (hospitals and physicians) fraud, (2) beneficiary (patients) fraud, and (3) insurer fraud [8, 10-13]. Several types of fraud schemes form the basis of this problem in health insurance programs worldwide. These are (1) billing for services not rendered (identity theft and phantom billing), (2) upcoding of services and items, (3) duplicate billing, (4) unbundling of claims (unbundling/creative billing), (5) medically unnecessary services (bill padding), (6) excessive services (bill padding), (7) kickbacks, (8) impersonation, (9) ganging, (10) illegal cash exchange for prescriptions, (11) frivolous use of service, (12) insurance carriers' fraud, (13) falsifying reimbursement, and (14) insurance subscribers' fraud, among others [9, 13-19].

It was estimated conservatively that at least 3%, or more than $60 billion, of the US's annual healthcare expenditure was lost due to fraud. Other estimates by government and law enforcement agencies placed this loss as high as 10%, or $170 billion [9, 12]. In addition to financial loss, fraud also severely hinders the US healthcare system from providing quality care to legitimate beneficiaries [9]. Hence, effective fraud detection is essential for improving the quality and reducing the cost of healthcare services.

The National Health Care Anti-Fraud Association report in [12, 20] intimated that healthcare fraud strips nearly $70 billion from the healthcare industry each year. In response to these realities, the Health Insurance Portability and Accountability Act of 1996 (HIPAA) specifically established healthcare fraud as a federal criminal offense, with the primary crime carrying a federal prison term of up to 10 years in addition to significant financial penalties [8, 21].

This paper presents the hybridized approach of combining genetic algorithms and support vector machines (GSVMs) to solve the health insurance claim classification problem and eliminate fraudulent claims while minimizing conversion and labour costs through automated claim processing. The significant contributions of this paper are as follows: (1) analysis of existing data mining and machine learning techniques (decision tree, Bayesian networks, Naïve-Bayes classifier, and support vector machines) for fraud detection; (2) development of a novel fraud detection model for insurance claims processing based on genetic support vector machines; (3) design, development, and deployment of a decision support system (DSS) which incorporates the fraudulent claim detection model, business intelligence, and knowledge representation for claims processing at NHIS, Ghana; (4) development of a user-friendly graphical user interface (GUI) for the intelligent fraud detection system; and (5) evaluation of the health insurance claims fraud detection system using Ghana National Health Insurance Subscribers' data from different hospitals.

The outline of the paper is as follows. Section 1 presents the introduction and problem statement with research objectives. Section 2 outlines the systematic literature review on various machine learning and data mining techniques for health insurance claims fraud detection. Section 3 gives the theoretical and mathematical foundations of genetic algorithms (GA), support vector machines (SVMs), and the hybrid genetic support vector machines (GSVMs) in combating this global phenomenon. Section 4 provides the proposed methodology for the GSVM fraud detection system, its design processes and development, and the implementation and testing of the genetic support vector machines, while Section 5 presents the key findings of the research with conclusions and recommendations for future work.

2. Literature Review

Researching the health insurance claims fraud domain requires a clear, distinctive view of what fraud is, because it is sometimes lumped together with abuse and waste. However, fraud and abuse refer to a situation where healthcare service is paid for but not provided, or reimbursement of funds is made to third-party insurance companies. Fraud and abuse are further explained as healthcare providers receiving kickbacks, patients seeking treatments that are potentially harmful to them (such as seeking drugs to satisfy addictions), and the prescription of services known to be unnecessary [12, 17-19]. Health insurance fraud is an intentional act of deceiving, concealing, or misrepresenting information that results in healthcare benefits being paid to an individual or group.

Health insurance fraud detection involves account auditing and detective investigation. Careful account auditing can reveal suspicious providers and policyholders. Ideally, it is best to audit all claims one by one. However, auditing all claims is not feasible by any practical means. Furthermore, it is challenging to audit providers without concrete clues. A practical approach is to develop shortlists for scrutiny and perform auditing on the providers and patients in the shortlists. Various analytical techniques can be employed in developing audit shortlists.

The most common fraud detection techniques reported in the literature include the use of machine learning, data mining, AI, and statistical methods. The most cost-saving model, using the Naïve-Bayes algorithm, was used to create a subsample of 20 claims consisting of 400 objects, where 50% of the objects were classified as fraud and the other 50% classified as legal, which eventually does not give a clear picture of the decision if compared to other classifiers [22].

The integration of multiple traditional methods has emerged as a new research area in combating fraud. This approach could be supervised, unsupervised, or both, with one method depending on the other for classification. One method may be used as a preprocessing step to modify the data in preparation for classification [9, 23, 24], or, at a lower level, the individual steps of the algorithms can be intertwined to create something fundamentally original. Hybrid methods can be used to tailor solutions to a particular problem domain. Different aspects of performance can be specifically targeted, including classification ability, ease of use, and computational efficiency [14].

Fuzzy logic was combined with neural networks to assess and automatically classify medical claims [14]. The concept of data warehousing for data mining purposes in health care was applied to develop an electronic fraud detection application that reviews service providers on behavioral heuristics and compares them to similar service providers. Australia's Health Insurance Commission has explored the online discounting learning algorithm to identify rare cases in pathology insurance data [10, 25-27].

Researchers in Taiwan developed a detection model based on process mining that systematically identified practices derived from clinical pathways to detect fraudulent claims [8].

Results published in [28, 29] used Benford's Law distributions to detect anomalies in claims reimbursements in Canada. Despite the detection of some anomalies and irregularities, the ability to identify suspected claims is very limited for health insurance claim fraud detection, since it applies to service providers with payer-fixed prices.

Neural networks were used to develop an application for detecting medical abuse and fraud for a private health insurance scheme in Chile [30]. The ability to process claims on a real-time basis accounts for the innovative nature of this method. The application of association rule mining to examine billing patterns within a particular specialist group, to detect suspicious claims and potentially fraudulent individuals, was incorporated in [9, 22, 30].

3. Mathematical Foundations for Genetic Support Vector Machines

In the 1960s, John Holland invented genetic algorithms, involving a simulation of Darwinian survival of the fittest as well as the processes of crossover, mutation, and inversion that occur in genetics. Holland's inversion demonstrated that, under certain assumptions, GA indeed achieves an optimal balance [31-34]. In contrast with evolution strategies and evolutionary programming, Holland's original goal was not to design algorithms to solve specific problems but rather to formally study the phenomenon of adaptation as it occurs in nature and to develop ways in which the mechanisms of natural adaptation might be imported into computer systems. Moreover, Holland was the first to attempt to put computational evolution on a firm theoretical footing [35].

Genetic algorithms operate through three main operators, namely, (1) reproduction, (2) crossover, and (3) mutation. A typical genetic algorithm requires (1) a genetic representation of the solution domain and (2) a fitness function to evaluate the solution domain [31-34].

Reproduction is controlled by the crossover and mutation operators. Crossover is the process whereby genes are selected from the parent chromosomes and new offspring are produced. A mutation is designed to add diversity to the population and ensure the possibility of exploring the entire search space. It replaces the values of some randomly selected genes of a chromosome with arbitrary new values [33, 35].

During the reproduction stage, an individual is assigned a fitness value derived from its raw performance measure given by the objective function.
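As a concrete illustration of these operators, the toy sketch below (a Python sketch, not the authors' MATLAB implementation) evolves a population of bit strings using roulette-wheel reproduction, single-point crossover, and bit-flip mutation; the population size, generation count, and operator probabilities mirror the GA settings reported later in Section 4.1, and the fitness function is a placeholder.

```python
import random

def genetic_algorithm(fitness, n_bits=10, pop_size=20, generations=20,
                      p_crossover=0.6, p_mutation=0.033):
    """Toy GA: fitness-proportionate (roulette-wheel) reproduction,
    single-point crossover, and bit-flip mutation on binary chromosomes."""
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(ind) for ind in pop]
        # Reproduction: parents drawn with probability proportional to fitness
        parents = random.choices(pop, weights=[s + 1e-9 for s in scores], k=pop_size)
        children = []
        for a, b in zip(parents[::2], parents[1::2]):
            if random.random() < p_crossover:             # crossover
                cut = random.randint(1, n_bits - 1)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            for child in (a, b):                          # mutation
                children.append([bit ^ 1 if random.random() < p_mutation else bit
                                 for bit in child])
        pop = children
    return max(pop, key=fitness)

# Example objective: maximize the number of ones in the chromosome
print(genetic_algorithm(fitness=sum))
```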

Support vector machine (SVM), as a statistical machine learning theory, was introduced in 1995 by Vapnik and Cortes as an alternative technique to polynomial, radial basis function, and multilayer perceptron classifiers, in which the weights of the neurons are found by solving a quadratic programming (QP) problem with linear inequality and equality constraints rather than by solving a nonconvex, unconstrained minimization problem [36-39]. As a novel machine learning technique for binary classification, regression analysis, face detection, text categorization in bioinformatics and data mining, and outlier detection, SVMs face challenges when the dataset is very large due to the dense nature and memory requirement of the quadratic form of the dataset. However, SVM is an excellent example of supervised learning that tries to maximize generalization by maximizing the margin and supports nonlinear separation using kernelization [40]. SVM tries to avoid overfitting and underfitting. The margin in SVM denotes the distance from the boundary to the closest data points in the feature space.

Given the claims training dataset x_i ∈ R^n in the feature space F, the linear hyperplane dividing it into two labelled classes y_i (fraud and legal) can be mathematically obtained as

ω^T x_i + b = 0,   ω ∈ R^n,  b ∈ R.                          (1)

Assume the training dataset is correctly classified, as shown in Figure 1.

This means that the SVC computes the hyperplane that maximizes the margin separating the classes (legal claims and fraud claims).


In the simplest linear form, an SVC is a hyperplane that separates the legal claims from the fraudulent claims with a maximum margin. Finding this hyperplane involves obtaining two hyperplanes parallel to it, as shown in Figure 1, each at an equal distance from the maximum-margin hyperplane. All the training data satisfy the constraints

ω^T x_i + b ≥ +1   for y_i = +1,
ω^T x_i + b ≤ −1   for y_i = −1,                              (2)

where ω is the normal to the hyperplane, |b|/‖ω‖ is the perpendicular distance from the hyperplane to the origin, and ‖ω‖ is the Euclidean norm of ω. The separating hyperplane is defined by the plane ω^T x_i + b = 0, and the constraints in (2) are combined to form

y_i(ω^T x_i + b) ≥ 1.                                         (3)

The pair of hyperplanes that gives the maximum margin (γ) can be found by minimizing ‖ω‖²/2 subject to the constraint in (3). This leads to a quadratic optimization problem formulated as

Minimize   f(ω, b) = ‖ω‖²/2
subject to  y_i(ω^T x_i + b) ≥ 1,  ∀ i = 1, ..., n.            (4)

This problem is reformulated by introducing Lagrange multipliers α_i (i = 1, ..., n), one for each constraint in (4). This results in the primal Lagrangian function

L_P(ω, b, α) = ‖ω‖²/2 + Σ_{i=1}^{n} α_i (1 − y_i(ω^T x_i + b)),   ∀ i = 1, ..., n.   (5)

Taking the partial derivatives of L_P(ω, b, α) with respect to ω and b, respectively, and applying duality theory yields

∂L_P/∂ω = 0  ⟹  ω = Σ_{i=1}^{n} α_i y_i x_i,
∂L_P/∂b = 0  ⟹  Σ_{i=1}^{n} α_i y_i = 0.                       (6)

The problem defined in (5) is a quadratic optimization (QP) problem. Maximizing the primal problem L_P with respect to α_i, subject to the constraints that the gradient of L_P with respect to ω and b vanishes and that α_i ≥ 0, gives the following two conditions:

ω = Σ_{i=1}^{n} α_i y_i x_i,
Σ_{i=1}^{n} α_i y_i = 0.                                       (7)

Substituting these constraints gives the dual formulation of the Lagrangian:

Maximize_α  L_D(ω, b, α) = Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j (x_i · x_j)
subject to  Σ_{i=1}^{n} α_i y_i = 0,   α_i ≥ 0,   i = 1, ..., n.   (8)

The values of α_i, ω, and b are obtained from the respective equations, namely,

ω = Σ_{i=1}^{n} α_i y_i x_i,
b = −(1/2) ( min_{i: y_i = +1} ω^T x_i + max_{i: y_i = −1} ω^T x_i ).   (9)

Also, the Lagrange multipliers satisfy the complementarity condition

α_i (1 − y_i(ω^T x_i + b)) = 0.                                (10)

Hence, this dual Lagrangian L_D is maximized with respect to its nonnegative α_i, giving a standard quadratic optimization problem. The training vectors x_i with nonzero Lagrange multipliers α_i satisfy

y_i(ω^T x_i + b) = 1,                                          (11)

and are called the support vectors (SVs). Although the SVM classifier described so far can only have a linear hyperplane as its decision surface, its formulation can be extended to build a nonlinear SVM. SVMs can also classify data that are not linearly separable by introducing a soft-margin hyperplane, as shown in Figure 2.

Introducing the slack variables ξ_i into the constraints yields

ω^T x_i + b ≥ +1 − ξ_i   for y_i = +1,
ω^T x_i + b ≤ −1 + ξ_i   for y_i = −1,
ξ_i ≥ 0,  ∀ i.                                                  (12)

Figure 1: Standard formulation of SVM (legitimate and fraudulent claims separated by the hyperplanes W^T ϕ(x) + b = −1, 0, +1, with support vectors, a misclassified point, slack values ξ, margin 2/√(W^T W), and kernel K(x_i, x_j) = ϕ^T(x_i) ϕ(x_j)).


These slack variables help to find the hyperplane that produces the minimum number of training errors. Modifying equation (4) to include the slack variables yields

Minimize_{ω, b, ξ_i}   ‖ω‖²/2 + C Σ_{i=1}^{n} ξ_i
subject to  y_i(ω^T x_i + b) ≥ 1 − ξ_i,   ξ_i ≥ 0.              (13)

The parameter C is a regularization parameter that trades off a wide margin against a small number of margin failures. The parameter C is finite; the larger the value of C, the more heavily errors are penalized.

The Karush-Kuhn-Tucker (KKT) conditions are necessary to ensure optimality of the solution to a nonlinear programming problem:

y_i(ω^T x_i + b) − 1 ≥ 0,   i = 1, 2, ..., n,  ∀ i,
α_i [y_i(ω^T x_i + b) − 1] = 0,   α_i ≥ 0,  ∀ i.                (14)

The KKT conditions for the primal problem are used in the nonseparable case, after which the primal Lagrangian becomes

L_P = ‖ω‖²/2 + C Σ_{i=1}^{n} ξ_i − Σ_{i=1}^{n} α_i (y_i(ω^T x_i + b) − 1 + ξ_i) − Σ_{i=1}^{n} β_i ξ_i,   (15)

with β_i as the Lagrange multipliers enforcing positivity of the slack variables ξ_i. Applying the KKT conditions to this primal problem yields

∂L_P/∂ω_u = ω_u − Σ_{i=1}^{n} α_i y_i x_{iu} = 0,
∂L_P/∂b = Σ_{i=1}^{n} α_i y_i = 0,
∂L_P/∂ξ_i = C − α_i − β_i = 0,
α_i (y_i(ω^T x_i + b) − 1 + ξ_i) = 0,
y_i(ω^T x_i + b) − 1 + ξ_i ≥ 0,
α_i, β_i, ξ_i ≥ 0,
i = 1, 2, ..., n  and  u = 1, 2, ..., d,                        (16)

where the parameter d represents the dimension of the dataset.

From the expressions obtained above after applying the KKT conditions, ξ_i = 0 for α_i < C, since β_i = C − α_i ≠ 0. This implies that any training point for which 0 < α_i < C can be taken to compute b, as it is a data point that does not cross the boundary. A training point with

α_i = 0,   y_i(ω^T x_i + b) − 1 + ξ_i > 0,                      (17)

does not participate in the derivation of the separating function, while for α_i = C and ξ_i > 0,

y_i(ω^T x_i + b) − 1 + ξ_i = 0.                                 (18)

A nonlinear SVM maps the training samples from the input space into a higher-dimensional feature space F via a kernel mapping function Φ. In the dual Lagrangian function, the inner products are replaced by the kernel function

(Φ(x_i) · Φ(x_j)) = k(x_i, x_j).                                (19)

Effective kernels allow the separating hyperplane to be found without high computational resources. The nonlinear SVM dual Lagrangian is

L_D(α) = Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j k(x_i, x_j),   (20)

subject to

Σ_{i=1}^{n} α_i y_i = 0,   0 ≤ α_i,   i = 1, ..., n.             (21)

Figure 2: Linear separating hyperplanes for the nonseparable case of SVC obtained by introducing the slack variable ξ (support vectors, a misclassified point, and regions with ξ = 0, ξ < 1, and ξ > 1 around the margin).


This is like that of the generalized linear case. The nonlinear SVM separating hyperplane is illustrated in Figure 3, with the support vectors, class labels, and margin. This model can be solved by the method of optimization used in the separable case. Therefore, the optimal hyperplane has the following form:

f(x) = Σ_{i=1}^{n} α_i y_i k(x_i, x) + b,                        (22)

where b is the offset of the decision boundary from the origin. Hence, classifying a newly arrived data point x implies that

g(x) = sign(f(x)).                                               (23)

However, feasible kernels must be symmetrical, i.e., the matrix K with components k(x_i, x_j) is positive semidefinite and satisfies Mercer's condition given in [39, 40]. The summarized kernel functions considered in this work are given in Table 1.

These kernels satisfy Mercer's condition, with the RBF or Gaussian kernel being the most widely used kernel function in the literature. The RBF kernel has the advantage of adding a single free parameter γ > 0, which controls the width of the RBF kernel as γ = 1/(2σ²), where σ² is the variance of the resulting Gaussian hypersphere. The linear kernel is given as k(x_i, x_j) = x_i · x_j. Consequently, the training of SVMs uses the solution of the QP optimization problem. The above mathematical formulations form the foundation for the development and deployment of genetic support vector machines as the decision support tool for detecting and classifying fraudulent health insurance claims. In recent times, the decision-making activities of knowledge-intensive enterprises depend holistically on the successful classification of data patterns, despite the time and computational resources required to achieve the results due to the complexity associated with the dataset and its size.
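For illustration only, the sketch below shows how the kernels in Table 1 are typically instantiated in a general-purpose SVM library (scikit-learn here, rather than the authors' MATLAB toolchain); the data are synthetic placeholders, and the RBF width is derived from the γ = 1/(2σ²) relation above, using the variance value reported in the experiments later in the paper.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                    # placeholder claim feature vectors
y = (X[:, 0] + X[:, 1] ** 2 > 1.0).astype(int)   # placeholder fraud/legal labels

sigma2 = 0.9                                     # assumed Gaussian kernel variance
classifiers = {
    "linear": SVC(kernel="linear", C=1.0),
    "cubic polynomial": SVC(kernel="poly", degree=3, coef0=1.0, C=1.0),
    "RBF": SVC(kernel="rbf", gamma=1.0 / (2.0 * sigma2), C=1.0),
}
for name, clf in classifiers.items():
    clf.fit(X, y)
    print(f"{name} kernel: training accuracy = {clf.score(X, y):.3f}")
```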

4. Methodology for GSVM Fraud Detection

The systematic approach adopted for the design and development of genetic support vector machines for health insurance claims fraud detection is presented in the conceptual framework in Figure 4 and the flow chart implementation in Figure 5.

The conceptual framework incorporates the design and development of the key algorithms that enable submitted claims data to be analysed and a model to be developed for testing and validation. The flow chart presents the algorithm implemented based on the theoretical foundations above, incorporating genetic algorithms and support vector machines, two useful machine learning algorithms necessary for fraud detection. Their combined use in the detection process generates accurate results. The methodology for the design and development of genetic support vector machines as presented above consists of three (3) significant steps, namely, (1) data preprocessing, (2) classification engine development, and (3) data postprocessing.

4.1. Data Preprocessing. Data preprocessing is the first significant stage in the development of the fraud detection system. This stage involves the use of data mining techniques to transform the data from its raw form into the required format to be used by the SVC for the detection and identification of health insurance claims fraud.

The data preprocessing stage involves the removal of unwanted customers and missing records and data smoothening. This is to make sure that only useful and relevant information is extracted for the next process.

Before the preprocessing, the data were imported from MS Excel CSV format into MySQL, into a created database called NHIS. The imported data include the electronic Health Insurance Claims (e-HIC) data and the HIC tariff datasets as tables imported into the NHIS database. The e-HIC data preprocessing involves the following steps: (1) claims data filtering and selection, (2) feature selection and extraction, and (3) feature adjustment.
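A minimal sketch of this import step, assuming hypothetical CSV exports and a local MySQL instance (the file names, table names, and credentials are placeholders, not the actual NHIS configuration):

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical file names, table names, and credentials; the actual NHIS
# database configuration is not published in the paper.
claims = pd.read_csv("ehic_claims.csv")      # electronic Health Insurance Claims export
tariffs = pd.read_csv("hic_tariffs.csv")     # approved GDRG tariffs

engine = create_engine("mysql+pymysql://user:password@localhost/NHIS")  # needs pymysql
claims.to_sql("e_hic", engine, if_exists="replace", index=False)
tariffs.to_sql("tariff", engine, if_exists="replace", index=False)
```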

Figure 3: Nonlinear separating hyperplane for the nonseparable case of SVM (classes 1 and 2, support vectors, hyperplane, and margin).

Table 1: Summarized kernel functions used.

Kernel name                  | Parameters      | Kernel function
Radial basis function (RBF)  | γ ∈ R           | k(x_i, x_j) = e^(−γ‖x_i − x_j‖²)
Polynomial function          | c ∈ R, d ∈ N    | k(x_i, x_j) = (x_i · x_j + c)^d


The WEKA machine learning and knowledge analysis environment was used for feature selection and extraction, while the data processing codes were written in the MATLAB technical computing environment. The developed MATLAB-based decision support engine was connected via MySQL using the script shown in Figure 6.

Preprocessing of the raw data involves claims cost validity checks. The tariff dataset consists of the approved tariffs for each diagnostic-related group, which were strictly enforced to clean the data before further processing. Claims are partitioned into two groups, namely, (1) claims with valid and approved costs within each DRG and (2) claims with invalid costs (those above the approved tariffs within each DRG).
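A sketch of this validity check with pandas, under the assumption that the claims and tariff tables share a GDRG code column and that the tariff table carries one approved amount per code (all column names are illustrative, not the actual NHIS schema):

```python
import pandas as pd

def partition_by_tariff(claims: pd.DataFrame, tariffs: pd.DataFrame):
    """Split claims into those priced within the approved GDRG tariff and
    those priced above it. Column names ('gdrg_code', 'total_bill',
    'approved_tariff') are illustrative placeholders."""
    merged = claims.merge(tariffs[["gdrg_code", "approved_tariff"]],
                          on="gdrg_code", how="left")
    valid = merged[merged["total_bill"] <= merged["approved_tariff"]]
    invalid = merged[merged["total_bill"] > merged["approved_tariff"]]
    return valid, invalid

# valid_claims, invalid_claims = partition_by_tariff(claims, tariffs)
```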

With the recent increase in the volume of the real dataset and the dimensionality of the claims data, there is an urgent need for faster, more reliable, and cost-effective data mining techniques for classification models. The data mining techniques require the extraction of a smaller and optimized set of features, which can be obtained by removing largely redundant, irrelevant, and unnecessary features for class prediction [41].

Feature selection algorithms are utilized to extract a minimal subset of attributes such that the resulting probability distribution of the data classes is close to the original distribution obtained using all attributes. Based on the idea of survival of the fittest, a new population is constructed to comply with the fittest rules in the current population as well as the offspring of these rules. Offspring are generated by applying genetic operators such as crossover and mutation. The process of offspring generation continues until it evolves a population N where every rule in N satisfies the fitness threshold. With an initial population of 20 instances, generation continued until the 20th generation with a crossover probability of 0.6 and a mutation probability of 0.033. The features selected based on genetic algorithms are "Attendance date", "Hospital code", "GDRG code", "Service bill", and "Drug bill". These are the features selected, extracted, and used as the basis for the optimization problem formulated below.

Minimize   Total_cost = f(S_bill, D_bill)
subject to  Σ_{i=1}^{n} S_bill,i ≤ G_tariff,  ∀ i,  i = 1, 2, ..., n,
            Σ_{j=1}^{n} D_bill,j ≤ D_tariff,  ∀ j,  j = 1, 2, ..., n,     (24)

where S_bill is the service bill and D_bill is the drug bill.
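Continuing the GA sketch from Section 3, one way to realize such a wrapper-style feature selection is to score each candidate feature mask by the cross-validated accuracy of an SVC restricted to those columns; this is an illustrative re-implementation in Python, not the WEKA/MATLAB routine used by the authors.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def subset_fitness(mask, X, y):
    """Score a binary feature mask by the cross-validated accuracy of an
    RBF SVC restricted to the selected columns; empty subsets are unfit.
    X and y stand for the preprocessed claims features and fraud/legal labels."""
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return 0.0
    return cross_val_score(SVC(kernel="rbf"), X[:, cols], y, cv=3).mean()

# Plugged into a GA loop such as the one sketched in Section 3 (population 20,
# 20 generations, crossover probability 0.6, mutation probability 0.033),
# the surviving mask plays the role of the selected feature subset.
```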

The GA e-HIC dataset is subjected to SVM training using 70% of the dataset, with 30% used for testing, as depicted in Figure 7.

The e-HIC dataset which passes the preprocessing stage, that is, the valid claims, was used for SVM training and testing. The best data, those that meet the genetic algorithm's criteria, are classified first. Each record of this dataset is classified as either "Fraudulent Bills" or "Legal Bills".

The same SVM training and testing dataset is applied to the SVM algorithm for its performance analysis. The inbuilt MATLAB code for SVM classifiers was integrated as one function for the linear, polynomial, and RBF kernels. The claim datasets were partitioned for classifier training, testing, and validation: 70% of the dataset was used for training and 30% for testing. The linear, polynomial, and radial basis function SVM classification kernels were used, with ten-fold cross validation for each kernel, and the results were averaged. For the polynomial classification kernel, a cubic polynomial was used. The RBF classification kernel used the SMO method [40].

Figure 4: Conceptual model for the design and development of the genetic support vector machines (valid claims pass through the fraud detection model and fraud detection classifier, which flags duplicated, upcoded, unbundled, and uncovered claims; successful classification yields a legal or fraudulent claim, otherwise GSVM optimization is repeated).


This method ensures the handling of large data sizes, as it performs data transformation through kernelization. After running many instances and varying the parameters for the RBF kernel, a variance of 0.9 gave better results, as it corresponded well with the datasets used for the classification. After each classification, the correct rate is calculated and the confusion matrix extracted. The confusion matrix gives a count of the true legal, true fraudulent, false legal, false fraudulent, and inconclusive bills:

(i) True legal bills: the number of "Legal Bills" which were correctly classified as "Legal Bills" by the classifier.

(ii) True fraudulent bills: the number of "Fraudulent Bills" which were correctly classified as "Fraudulent Bills" by the classifier.

(iii) False legal bills: the bills classified as "Legal Bills" even though they are not, that is, those wrongly classified as "Legal Bills" by the kernel used.

(iv) False fraudulent bills: the bills the classifier wrongly classified as fraudulent; the confusion matrix gives a count of these incorrectly classified bills.

Figure 5: Flow chart for the design and development of the genetic support vector machines (the data are split into training and testing sets; roulette-wheel feature subset selection, fitness evaluation, recombination, crossover, and mutation over the population and generations produce optimized SVM hyperparameters (C, γ) stored in PPD; the trained SVM classifier is validated and, on success, yields the fraud detection model).


(v) Inconclusive bills: these consist of nonclassified bills.

The correct rate is calculated as the total number of correctly classified bills, namely, the true legal bills and true fraudulent bills, divided by the total number of bills used for the classification:

correct rate = (number of TLB + number of TFB) / total number of bills (TB),     (25)

where TLB = True Legal Bills and TFB = True Fraudulent Bills;

accuracy = 1 − Error = (TP + TN)/(TP + TN + FP + FN) = Pr(C),                    (26)

the probability of a correct classification.

4.1.1. Sensitivity. This is the statistical measure of the proportion of actual fraudulent claims which are correctly detected:

sensitivity = TP/(TP + FN) = TP/P.                                               (27)

4.1.2. Specificity. This is the statistical measure of the proportion of negative (legitimate) claims which are correctly classified:

specificity = TN/(TN + FP) = TN/N.                                               (28)

4.2. GSVM Fraud Detection System Implementation and Testing. The decision support system comprises four main modules integrated together, namely, (1) algorithm implementation using the MATLAB technical computing platform, (2) development of the graphical user interface (GUI) for the HIC fraud detection system, which consists of uploading and processing of claims management, (3) system administrator management, and (4) postprocessing of detection and classification results.

Figure 6: MATLAB-based decision support engine connection to the database.

Figure 7: Data preprocessing for SVM training and testing (creation of the claims record database, claims filtering and selection, feature selection and extraction, feature adjustment, and data normalization, producing the GA e-HIC data and the SVM training and testing dataset).


The front end of the detection system was developed using XAMPP, a free and open-source cross-platform web server solution stack package developed by Apache Friends [42], consisting mainly of the Apache HTTP Server, MariaDB database, and interpreters for scripts written in the PHP and Perl programming languages [42]. XAMPP stands for Cross-Platform (X), Apache (A), MariaDB (M), PHP (P), and Perl (P). The Health Insurance Claims Fraud Detection System (HICFDS) was developed using the MATLAB technical computing environment, with the capability to connect to an external MySQL database, and a graphical user interface (GUI) for enhanced interactivity with users.

Figure 8: System implementation architecture for HICFDS (the NHIS claims dataset is uploaded in the developed GUI, passed to the model and GSVM algorithm engine with exploratory data analysis, and the detected results are written to an autocreated results database).

Figure 9: Detection results control portal interface.


The HICFDS consists of several functional components, namely, (1) a function for computing the descriptive statistics of raw and processed data, (2) a preprocessing wrapper function for data handling and processing, and (3) MATLAB functions for the GA optimization and SVM classification processes. The HICFDS components are depicted in Figure 8.

The results generated by the HICFDS are stored in a MySQL database. The results comprise three parts, which are the legitimate claims report, fraudulent claims, and statistics of the results. These results are shown in Figure 9. The developed GUI portal for the analysis of results obtained from the classification of the submitted health insurance claims is displayed in Figure 9. By clicking on the fraudulent button in the GUI, a pop-up menu generating the labelled Figure 10 is obtained for the claims dataset. It shows the grouping of detected fraudulent claim types in the datasets.

For each classifier, a 10-fold cross validation (CV) of the hyperparameters (C, γ) from the Patients Payment Data (PPD) was performed. The performance measured on GA optimization tested several hyperparameters for the optimal SVM. The SVC training aims for the best SVC parameters (C, γ) in building the HICFD classifier model. The developed classifier is evaluated using testing and validation data. The accuracy of the classifier is evaluated using cross validation (CV) to avoid overfitting of the SVC on the training data. The random search method was used for SVC parameter training, where exponentially growing sequences of the hyperparameters (C, γ), a practical way to identify suitable parameters, were used to identify the SVC parameters and obtain the best CV accuracy for the classifier claims data samples. Random search varies slightly from grid search: instead of searching over the entire grid, random search only evaluates a random sample of points on the grid. This makes random search computationally cheaper than grid search. Experimentally, 10-fold CV was used as the measure of the training accuracy, where 70% of each sample was used for training and the remaining 30% used for testing and validation.
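A sketch of this randomized (C, γ) search with a 70/30 split and 10-fold CV, using scikit-learn as a stand-in for the authors' MATLAB tuning (the search ranges and the synthetic data are illustrative assumptions):

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.svm import SVC

# Placeholder data standing in for the preprocessed claims features and labels
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

# Exponentially spread candidate values for C and gamma, scored with 10-fold CV
search = RandomizedSearchCV(
    SVC(kernel="rbf"),
    param_distributions={"C": loguniform(1e-2, 1e3), "gamma": loguniform(1e-4, 1e1)},
    n_iter=25, cv=10, random_state=1,
)
search.fit(X_train, y_train)
print("best (C, gamma):", search.best_params_)
print("held-out accuracy:", search.score(X_test, y_test))
```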

Figure 10: Fraud type distribution on the sample data sizes (grouping of detected fraud types, duplicated, uncovered service, overbilled, unbundled, upcoded, and impersonation claims, across the 100, 300, 500, 750, and 1000 claim datasets).

Table 2: Sample data size and the corresponding fraud types.

Fraud types              | 100 | 300 | 500 | 750 | 1000
Duplicate claims         |   2 |   4 |   4 |   4 |    0
Uncovered service claims |   4 |  56 |  65 | 109 |  406
Overbilling claims       |  44 |  60 |  91 | 121 |  202
Unbundled claims         |   0 |  18 |  10 |  54 |    0
Upcoded claims           |   2 |  34 |  50 | 122 |    6
Impersonation claims     |   0 |   2 |  10 |  23 |   34
Total suspected claims   |  52 | 174 | 230 | 433 |  648

Table 3: Summary performance metrics of SVM classifiers on sample sizes.

Kernel used           | Data size | Average accuracy rate (%) | Sensitivity (%) | Specificity (%)
Linear                | 100       | 71.43                     | 60.00           | 77.78
                      | 300       | 72.73                     | 84.21           | 0.00
                      | 500       | 91.80                     | 97.78           | 75.00
                      | 750       | 84.42                     | 95.00           | 47.06
                      | 1000      | 82.95                     | 85.42           | 80.00
Polynomial            | 100       | 71.43                     | 66.67           | 72.73
                      | 300       | 72.73                     | 88.24           | 20.00
                      | 500       | 96.72                     | 100.00          | 86.67
                      | 750       | 80.52                     | 96.36           | 40.91
                      | 1000      | 84.71                     | 83.67           | 86.11
Radial basis function | 100       | 71.43                     | 57.14           | 85.71
                      | 300       | 95.45                     | 95.00           | 100.00
                      | 500       | 99.18                     | 100.00          | 96.30
                      | 750       | 82.56                     | 96.88           | 40.91
                      | 1000      | 90.91                     | 100.00          | 82.98

Figure 11: Linear SVM on a sample claims dataset (legal bills (training and classified), fraudulent bills (training and classified), and support vectors).


4.3. Data Postprocessing: Validation of Classification Results. The classification accuracy on the testing data is a gauge to evaluate the ability of the HICFDS to detect and identify fraudulent claims. The testing data used to assess and evaluate the efficiency of the proposed HICFDS (classifier) are taken exclusively from the NHIS headquarters and cover different hospitals within the Greater Accra Region of Ghana. The sampled data with the corresponding fraud types after the analysis are shown in Table 2.

In evaluating the classifiers obtained with the analyzed methods, the most widely employed performance measures are used: accuracy, sensitivity, and specificity, with their concepts of True Legal (TP), False Fraudulent (FN), False Legal (FP), and True Fraudulent (TN). This classification is shown in Table 3.

The figures below show the SVC plots of the various classifiers (linear, polynomial, and RBF) on the claims datasets (Figures 11-13).

From the performance metrics and overall statistics presented in Table 4, it is observed that the support vector machine performs the best classification with an accuracy of 87.91% using the RBF kernel function, followed by the polynomial kernel with 81.22% accuracy, with the linear SVM emerging as the weakest classifier with an accuracy of 80.67%.

Figure 12: Polynomial SVM on a sample claims dataset (legal bills (training and classified), fraudulent bills (training and classified), and support vectors).

Figure 13: RBF SVM on a sample claims dataset (legal bills (training and classified), fraudulent bills (training and classified), and support vectors).

Table 4: Average performance analysis of SVM classifiers.

Description | Accuracy (%) | Sensitivity (%) | Specificity (%)
Linear      | 80.67        | 84.48           | 55.97
Polynomial  | 81.22        | 86.99           | 61.28
RBF         | 87.91        | 89.80           | 81.18

Table 5: Confusion matrix for SVM classifiers.

Description           | Data size | TP | TN | FP | FN | Correct rate
Linear                | 100       |  3 |  7 |  2 |  2 | 0.714
                      | 300       | 16 |  0 |  3 |  3 | 0.713
                      | 500       | 88 | 24 |  8 |  2 | 0.918
                      | 750       | 57 |  8 |  9 |  3 | 0.844
                      | 1000      | 41 | 32 |  8 |  7 | 0.830
Polynomial            | 100       |  2 |  8 |  3 |  1 | 0.714
                      | 300       | 15 |  1 |  4 |  2 | 0.723
                      | 500       | 92 | 26 |  4 |  0 | 0.967
                      | 750       | 53 | 91 | 13 |  2 | 0.805
                      | 1000      | 41 | 31 |  5 |  8 | 0.852
Radial basis function | 100       |  4 |  6 |  1 |  3 | 0.714
                      | 300       | 19 |  2 |  0 |  1 | 0.955
                      | 500       | 95 | 26 |  1 |  0 | 0.992
                      | 750       | 62 |  9 | 13 |  2 | 0.922
                      | 1000      | 41 | 39 |  8 |  0 | 0.919


The confusion matrix for the SVM classifiers is given in Table 5 and is utilized in the computation of the performance metrics of the SVM classifiers. For the purposes of statistical and machine learning classification tasks, a confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of a supervised learning algorithm.

Besides classification, the amount of time required to process the sample dataset is also an important consideration in this research. The comparison of computational times shows that an increase in the size of the sample dataset also increases the computational time needed to execute the process, regardless of the machine used, which is widely expected. This difference in time cost is mainly due to the training of the dataset. Thus, as global data warehouses grow, more computational resources will be needed in machine learning and data mining research pertaining to the detection of insurance fraud, as depicted in Figure 14, which relates the average computational time to the sample data size.

Figure 15 summarizes the fraudulent claims detected during the testing of the HICFD with the sample dataset used. As the sample data size increases, the number of suspected claims increases rapidly based on the various fraud types detected.

Benchmarking the HICFD analysis ensures an understanding of HIC outcomes. From the chart above, an increase in the claims dataset has a corresponding increase in the number of suspected claims. The graph in Figure 16 shows a sudden rise in the level of suspected claims on the tested 100-claim dataset, representing 52% of that sample, after which the proportion of suspected claims continues to increase slightly, to 58% on the tested data size of 300 claims.

Among these fraud types, the most frequent fraudulent act is uncovered services rendered to insurance subscribers by service providers. It accounts for 22% of the fraudulent claims, the most significant proportion of the total health insurance fraud on the total tested dataset. Consequently, overbilling of submitted claims is recorded as the second fraudulent claim type, representing 20% of the total sample dataset used for this research. This is caused by service providers billing for a service at more than the expected tariff for the required diagnoses. Listing and billing for a more complex or higher level of service are done by providers to unfairly boost their financial income flow within the legitimate claims.

Figure 14: Computational time (s) on the tested sample dataset against sample data size.

Figure 14 Computational time on the tested sample dataset

Figure 15: Detected fraud trend (number of suspected claims against sample data size) on the tested claims dataset.

Figure 16: Chart of types of fraudulent claims (duplicated, uncovered service, overbilled, unbundled, upcoded, and impersonation claims) across the 100, 300, 500, 750, and 1000 claim datasets.

Table 6: Cost analysis of tested claims dataset.

Sample data size | Raw cost of claims (R), GHC | Valid claims cost (V) | Deviation (R − V) | Percentage difference
100              |  20,791.83                  |  8,911.72             | 11,880.11         | 133.31
300              |  31,496.05                  | 15,622.70             | 15,873.35         | 101.60
500              |  58,218.65                  | 27,480.96             | 30,737.69         | 111.85
750              |  88,394.07                  | 31,091.58             | 57,302.49         | 184.30
1000             | 117,448.20                  | 47,943.38             | 69,504.82         | 144.97


Figure 17: Classification Learner App showing the various algorithms and percentage accuracies in MATLAB.

Table 7: Comparison of results of GSVM with decision trees and Naïve-Bayes.

Description of the algorithm used            | Claims dataset | Accuracy (%) | Average over datasets (%)
GSVM with radial basis function (RBF) kernel | 100            | 71.43        | 87.906
                                             | 300            | 95.45        |
                                             | 500            | 99.18        |
                                             | 750            | 82.56        |
                                             | 1000           | 90.91        |
Decision trees                               | 100            | 62.0         | 74.44
                                             | 300            | 78.0         |
                                             | 500            | 77.8         |
                                             | 750            | 82.7         |
                                             | 1000           | 71.7         |
Naïve-Bayes                                  | 100            | 50.0         | 59.1
                                             | 300            | 61.0         |
                                             | 500            | 56.8         |
                                             | 750            | 60.7         |
                                             | 1000           | 67.0         |


Moreover, some illicit service providers claim to have rendered costly services to insurance subscribers instead of providing more affordable ones. Claims prepared for expensive services rendered to insurance subscribers represent 8% of the fraudulent claims detected in the total sample dataset. Furthermore, 3.1% of service procedures that should be considered an integral part of a single procedure, known as unbundled claims, contributed to the fraudulent claims in the set of claims used as the test data. Due to the insecure process for quality delivery of healthcare service, insurance subscribers also contribute to the fraudulent claim types by loaning their ID cards to family members or third parties, who pretend to be the owners and request the HIS benefits in the healthcare sector. Duplicated claims, as part of the fraudulent acts, recorded the minimum rate of 0.5% contribution to fraudulent claims in the whole sample dataset.

As observed in Table 6, the cost of the claims bill increases proportionally with an increase in the sample size of the claims bill. This is consistent with an increase in fraudulent claims as sample size increases. From Table 6, we can see the various costs for each raw record (R) of the sample claim dataset, the valid claims bill (V) after processing the dataset, the variation in the claims bill (R − V), and their percentage representation as well. There is a 27% financial loss of the total submitted claim bills to insurance carriers. This loss is the highest rate of loss within the 750 datasets of submitted claims.

A summary of the results and a comparison with other machine learning algorithms, such as decision trees and Naïve-Bayes, is presented in Table 7.

The MATLAB Classification Learner App [43] was chosen to validate the results obtained above. It enables ease of comparison with the different classification algorithms implemented.

Figure 18: Algorithmic runs on the 500-claim dataset.


The data used for the GSVM were subsequently used in the Classification Learner App, as shown below. Figures 17 and 18 show the Classification Learner App with the various implemented algorithms and corresponding accuracies in the MATLAB technical computing language environment and the results obtained using the 500-claim dataset, respectively. Figures 19 and 20 depict the subsequent results when the 750- and 1000-claim datasets were utilized for the algorithmic runs and reproducible comparison, respectively. The summarized results and accuracies are illustrated in Table 7. The summarized results in Table 7 portray the effectiveness of our proposed approach of using genetic support vector machines (GSVMs) for fraud detection of insurance claims. From the results, it is evident that GSVM achieves a higher level of accuracy compared to decision trees and Naïve-Bayes.
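The same kind of side-by-side check can be reproduced outside MATLAB; the sketch below uses scikit-learn stand-ins for the three model families on synthetic placeholder data, so the numbers are illustrative and will not match Table 7.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5))                    # placeholder claims features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # placeholder fraud/legal labels

models = {
    "RBF SVC (GSVM-style)": SVC(kernel="rbf", C=1.0, gamma="scale"),
    "Decision tree": DecisionTreeClassifier(random_state=2),
    "Naive Bayes": GaussianNB(),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=10).mean()
    print(f"{name}: mean 10-fold CV accuracy = {acc:.3f}")
```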

5. Conclusions and Recommendations

This work aimed at developing a novel fraud detection model for insurance claims processing based on genetic support vector machines, which hybridizes and draws on the strengths of both genetic algorithms and support vector machines. The GSVM has been investigated and applied in the development of the HICFDS. This paper used GSVM for the detection of anomalies and the classification of health insurance claims into legitimate and fraudulent claims. SVMs have been considered preferable to other classification techniques due to several advantages. They enable separation (classification) of claims into legitimate and fraudulent using the soft margin, thus accommodating updates in the generalization performance of the HICFDS. Among other notable advantages, the model has a nonlinear dividing hyperplane, which prevails over the discrimination within the dataset.

Figure 19: Algorithmic runs on the 750-claim dataset.


The generalization ability to classify any newly arrived data was also considered over other classification techniques.

Thus, the fraud detection system provides a combination of two computational intelligence schemes and achieves higher fraud detection accuracy. The average classification accuracies achieved by the SVCs are 80.67%, 81.22%, and 87.91%, which show the performance capability of the SVC model. These classification accuracies are obtained due to the careful selection of the features for training and developing the model, as well as fine-tuning the SVCs' parameters using the V-fold cross-validation approach. These results are much better than those obtained using decision trees and Naïve-Bayes.

The average sample dataset testing results for the proposed SVCs vary due to the nature of the claims dataset used. This is noted in the cluster of the claims dataset (MDC specialty). When the sample dataset is heavily skewed toward one MDC specialty (e.g., OPDC), the performance of the SVCs could tune to one classifier, especially the linear SVM, as compared to others. Hence, the behaviour of the dataset has a significant impact on classification results.

Based on this work, the developed GSVM model was tested and validated using HIC data. The study sought to obtain the best performing classifier for analyzing the health insurance claims datasets for fraud. The RBF kernel was adjudged the best, with an average accuracy rate of 87.91%. The RBF kernel is therefore recommended.

Figure 20: Algorithmic runs on the 1000-claim dataset.


Data Availability

The data used in this study are available upon request. The data can be uploaded when required.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors of this paper wish to acknowledge the Carnegie Corporation of New York, through the University of Ghana under the UG-Carnegie Next Generation of Academics in Africa project, for organizing Write Shops that led to the timely completion of this paper.

Supplementary Materials

The supplementary material consists of an MS Excel file of data collected from some NHIS-approved hospitals in Ghana concerning insurance claims. It is the insurance claims dataset used for testing and implementation. (Supplementary Materials)

References

[1] Government of Ghana, National Health Insurance Act, Act 650, 2003, Ghana, 2003.
[2] Capitation, National Health Insurance Scheme, 2012, http://www.nhis.gov.gh/capitation.aspx.
[3] ICD-10 Version: 2016, http://apps.who.int/classifications/icd10/browse/2016/en.
[4] T. Olson, Examining the Transitional Impact of ICD-10 on Healthcare Fraud Detection, College of Saint Benedict/Saint John's University, Collegeville, MN, USA, 2015.
[5] News Ghana, NHIS Manager Arrested for Fraud | News Ghana, News Ghana, Accra, Ghana, 2014, https://www.newsghana.com.gh/nhis-manager-arrested-for-fraud.
[6] BioClaim, Fraud Files, http://www.bioclaim.com/Fraud-Files.
[7] Graphic Online, Ghana News: Dr Ametewee Defrauds NHIA of GH¢415,000—Graphic Online, Graphic Online, Accra, Ghana, 2015, http://www.graphic.com.gh/news/general-news/dr-ametewee-defrauds-nhia-of-gh-415-000.html.
[8] W.-S. Yang and S.-Y. Hwang, "A process-mining framework for the detection of healthcare fraud and abuse," Expert Systems with Applications, vol. 31, no. 1, pp. 56-68, 2006.
[9] G. C. van Capelleveen, Outlier Based Predictors for Health Insurance Fraud Detection within US Medicaid, University of Twente, Enschede, Netherlands, 2013.
[10] Y. Shan, D. W. Murray, and A. Sutinen, "Discovering inappropriate billings with local density-based outlier detection method," in Proceedings of the Eighth Australasian Data Mining Conference, vol. 101, pp. 93-98, Melbourne, Australia, December 2009.
[11] L. D. Weiss and M. K. Sparrow, "License to steal: how fraud bleeds America's health care system," Journal of Public Health Policy, vol. 22, no. 3, pp. 361-363, 2001.
[12] P. Travaille, R. M. Muller, D. Thornton, and J. Van Hillegersberg, "Electronic fraud detection in the US Medicaid healthcare program: lessons learned from other industries," in Proceedings of the 17th Americas Conference on Information Systems (AMCIS), pp. 1-11, Detroit, Michigan, August 2011.
[13] A. Abdallah, M. A. Maarof, and A. Zainal, "Fraud detection system: a survey," Journal of Network and Computer Applications, vol. 68, pp. 90-113, 2016.
[14] A. K. I. Hassan and A. Abraham, "Computational intelligence models for insurance fraud detection: a review of a decade of research," Journal of Network and Innovative Computing, vol. 1, pp. 341-347, 2013.
[15] E. Kirkos, C. Spathis, and Y. Manolopoulos, "Data mining techniques for the detection of fraudulent financial statements," Expert Systems with Applications, vol. 32, no. 4, pp. 995-1003, 2007.
[16] H. Joudaki, A. Rashidian, B. Minaei-Bidgoli et al., "Using data mining to detect health care fraud and abuse: a review of literature," Global Journal of Health Science, vol. 7, no. 1, pp. 194-202, 2015.
[17] V. Rawte and G. Anuradha, "Fraud detection in health insurance using data mining techniques," in Proceedings of the 2015 International Conference on Communication, Information & Computing Technology (ICCICT), pp. 1-5, Mumbai, India, January 2015.
[18] J. Li, K.-Y. Huang, J. Jin, and J. Shi, "A survey on statistical methods for health care fraud detection," Health Care Management Science, vol. 11, no. 3, pp. 275-287, 2008.
[19] Q. Liu and M. Vasarhelyi, "Healthcare fraud detection: a survey and a clustering model incorporating geo-location information," in Proceedings of the 29th World Continuous Auditing and Reporting Symposium, Brisbane, Australia, November 2013.
[20] T. Ekin, F. Leva, F. Ruggeri, and R. Soyer, "Application of Bayesian methods in detection of healthcare fraud," Chemical Engineering Transactions, vol. 33, pp. 151-156, 2013.
[21] Home—The NHCAA, https://www.nhcaa.org.
[22] S. Viaene, R. A. Derrig, and G. Dedene, "A case study of applying boosting naive Bayes to claim fraud diagnosis," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 5, pp. 612-620, 2004.
[23] Y. Singh and A. S. Chauhan, "Neural networks in data mining," Journal of Theoretical and Applied Information Technology, vol. 5, no. 1, pp. 37-42, 2009.
[24] D. Tomar and S. Agarwal, "A survey on data mining approaches for healthcare," International Journal of Bio-Science and Bio-Technology, vol. 5, no. 5, pp. 241-266, 2013.
[25] P. Vamplew, A. Stranieri, K.-L. Ong, P. Christen, and P. J. Kennedy, "Data mining and analytics 2011," in Proceedings of the Ninth Australasian Data Mining Conference (AusDM), Australian Computer Society, Ballarat, Australia, December 2011.
[26] K. S. Ng, Y. Shan, D. W. Murray et al., "Detecting non-compliant consumers in spatio-temporal health data: a case study from Medicare Australia," in Proceedings of the 2010 IEEE International Conference on Data Mining Workshops, pp. 613-622, Sydney, Australia, December 2010.
[27] J. F. Roddick, J. Li, P. Christen, and P. J. Kennedy, "Data mining and analytics 2008," in Proceedings of the 7th Australasian Data Mining Conference (AusDM 2008), vol. 87, pp. 105-110, Glenelg, South Australia, November 2008.
[28] C. Watrin, R. Struffert, and R. Ullmann, "Benford's Law: an instrument for selecting tax audit targets?," Review of Managerial Science, vol. 2, no. 3, pp. 219-237, 2008.
[29] F. Lu and J. E. Boritz, "Detecting fraud in health insurance data: learning to model incomplete Benford's Law distributions," in Machine Learning, J. Gama, R. Camacho, P. B. Brazdil, A. M. Jorge, and L. Torgo, Eds., pp. 633-640, Springer, Berlin, Heidelberg, 2005.
[30] P. Ortega, C. J. Figueroa, and G. A. Ruz, "A medical claim fraud/abuse detection system based on data mining: a case study in Chile," DMIN, vol. 6, pp. 26-29, 2006.
[31] T. Back, J. M. De Graaf, J. N. Kok, and W. A. Kosters, Theory of Genetic Algorithms, World Scientific Publishing, River Edge, NJ, USA, 2001.

[32] M Melanie An Introduction to Genetic Algorithms (e MITPress Cambridge MA USA 1st edition 1998

[33] D Goldberg Genetic Algorithms in Optimization Search andMachine Learning Addison-Wesley Reading MA USA1989

[34] J Wroblewski ldquo(eoretical foundations of order-based ge-netic algorithmsrdquo Fundamental Informaticae vol 28 no 3-4pp 423ndash430 1996

[35] J H Holland Adaptation in Natural and Artificial SystemsAn Introductory Analysis with Applications to Biology Con-trol and Artificial Intelligence MIT Press Cambridge MAUSA 1st edition 1992

[36] V N Vapnik Fe Nature of Statistical Learning FeorySpringer New York NY USA 2nd edition 2000

[37] J Salomon Support Vector Machines for Phoneme Classifi-cation University of Edinburgh Edinburgh UK 2001

[38] J Platt Sequential Minimal Optimization A Fast Algorithmfor Training Support Vector Machines Microsoft ResearchRedmond WA USA 1998

[39] J Platt ldquoUsing analytic QP and sparseness to speed training ofsupport vector machinesrdquo in Proceedings of the Advances inNeural Information Processing Systems Cambridge MAUSA 1999

[40] C-W Hsu C-C Chang and C-J Lin A Practical Guide toSupport Vector Classification Data Science Association Tai-pei Taiwan 2003

[41] I Guyon and A Elisseeff ldquoAn introduction to variable andfeature selectionrdquo Journal of Machine Learning Researchvol 3 pp 1157ndash1182 2003

[42] D Dvorski Installing Configuring and Developing withXAMPP Ski Canada Magazine Toronto Canada 2007

[43] MATLAB Classification Learner App MATLAB Version 2019aMathworks Computer Software Company Natick MS USA2019 httpwwwmathworkscomhelpstatsclassification-learner-apphtml

Journal of Engineering 19


Categories (MDC), or clinical specialties, namely, Adult Medicine, Pediatrics, Adult Surgery, Pediatric Surgery, Ear, Nose, and Throat (ENT), Obstetrics and Gynecology, Dental, Ophthalmology, Orthopedics, Reconstructive Surgery, and the Out-Patients' Department (OPD) [2]. These specialties guide the claim adjudication process and the operational mechanism for reporting claims; they also determine the reimbursement process and create standards of operation between the NHIS and service providers [2].

The GDRG code structure uses seven alphanumeric characters. The first four characters represent the MDC or clinical specialty. The next two characters are numbers representing the number of the GDRG within the MDC. The last character (A or C) represents the age category: "A" denotes patients aged 12 years or older, and "C" denotes those younger than 12 years.
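As a small illustration of this layout, the sketch below splits a GDRG code into its three fields. The example code `MEDI01A` is hypothetical and is used only to show the format described above.

```python
def parse_gdrg(code: str) -> dict:
    """Split a seven-character GDRG code into MDC, group number, and age category."""
    if len(code) != 7:
        raise ValueError("A GDRG code has exactly seven alphanumeric characters")
    mdc, number, age = code[:4], code[4:6], code[6]
    if age not in ("A", "C"):
        raise ValueError("Age category must be 'A' (>= 12 years) or 'C' (< 12 years)")
    return {"mdc": mdc, "gdrg_number": int(number), "age_category": age}

# Hypothetical example code, for format illustration only:
print(parse_gdrg("MEDI01A"))  # {'mdc': 'MEDI', 'gdrg_number': 1, 'age_category': 'A'}
```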

The World Health Organization (WHO) provides the International Classification of Diseases (ICD-10) to meet the requirements for claim submission [3, 4], but the NHIS uses the GDRG codes because it has full control over them. Hence, the GDRG codes are used to develop the fraud detection model.

A claim is a detailed invoice that service providers send to the health insurer, showing exactly what services a patient or patients received at the point of healthcare service delivery. Claim processing is a major challenge for providers under Health Insurance Schemes (HIS) globally because of the extent of fraud in submitted claims and the gaming of the system through well-coordinated schemes designed to siphon money from its coffers [5-9].

Fraud in health care falls into three categories, namely, (1) service provider (hospitals and physicians) fraud, (2) beneficiary (patient) fraud, and (3) insurer fraud [8, 10-13]. Several types of fraud schemes underlie this problem in health insurance programs worldwide: (1) billing for services not rendered (identity theft and phantom billing), (2) upcoding of services and items, (3) duplicate billing, (4) unbundling of claims (creative billing), (5) medically unnecessary services (bill padding), (6) excessive services (bill padding), (7) kickbacks, (8) impersonation, (9) ganging, (10) illegal cash exchange for prescriptions, (11) frivolous use of service, (12) insurance carriers' fraud, (13) falsification of reimbursement, and (14) insurance subscribers' fraud, among others [9, 13-19].

It has been conservatively estimated that at least 3%, or more than $60 billion, of the US's annual healthcare expenditure is lost to fraud; other estimates by government and law enforcement agencies place this loss as high as 10%, or $170 billion [9, 12]. In addition to the financial loss, fraud severely hinders the US healthcare system from providing quality care to legitimate beneficiaries [9]. Hence, effective fraud detection is essential for improving the quality and reducing the cost of healthcare services.

The National Health Care Anti-Fraud Association, as reported in [12, 20], intimated that healthcare fraud strips nearly $70 billion from the healthcare industry each year. In response to these realities, the Health Insurance Portability and Accountability Act of 1996 (HIPAA) specifically established healthcare fraud as a federal criminal offense, with the primary crime carrying a federal prison term of up to 10 years in addition to significant financial penalties [8, 21].

This paper presents a hybridized approach that combines genetic algorithms and support vector machines (GSVMs) to solve the health insurance claim classification problem and eliminate fraudulent claims while minimizing conversion and labour costs through automated claim processing. The significant contributions of this paper are as follows: (1) analysis of existing data mining and machine learning techniques (decision trees, Bayesian networks, the Naive Bayes classifier, and support vector machines) for fraud detection; (2) development of a novel fraud detection model for insurance claims processing based on genetic support vector machines; (3) design, development, and deployment of a decision support system (DSS) that incorporates the fraudulent claim detection model, business intelligence, and knowledge representation for claims processing at NHIS, Ghana; (4) development of a user-friendly graphical user interface (GUI) for the intelligent fraud detection system; and (5) evaluation of the health insurance claims fraud detection system using Ghana National Health Insurance subscribers' data from different hospitals.

The outline of the paper is as follows. Section 1 presents the introduction and problem statement with the research objectives. Section 2 outlines the systematic literature review of machine learning and data mining techniques for health insurance claims fraud detection. Section 3 gives the theoretical and mathematical foundations of genetic algorithms (GA), support vector machines (SVMs), and the hybrid genetic support vector machines (GSVMs) for combating this global phenomenon. Section 4 provides the proposed methodology for the GSVM fraud detection system and its design and development processes. Section 5 covers the design and implementation of the genetic support vector machines, while Section 6 presents the key findings of the research with conclusions and recommendations for future work.

2. Literature Review

Research in the health insurance claims fraud domain requires a clear, distinctive view of what fraud is, because it is often lumped together with abuse and waste. Fraud and abuse refer to situations where a healthcare service is paid for but not provided, or where funds are improperly reimbursed to third-party insurance companies. They also cover healthcare providers receiving kickbacks, patients seeking treatments that are potentially harmful to them (such as seeking drugs to satisfy addictions), and the prescription of services known to be unnecessary [12, 17-19]. Health insurance fraud is an intentional act of deceiving, concealing, or misrepresenting information that results in healthcare benefits being paid to an individual or group.

Health insurance fraud detection involves account auditing and detective investigation. Careful account auditing can reveal suspicious providers and policyholders. Ideally, it would be best to audit all claims one by one; however, auditing every claim is not feasible by any practical means. Furthermore, it is challenging to audit providers without concrete evidence. A practical approach is to develop shortlists for scrutiny and to audit the providers and patients on those shortlists. Various analytical techniques can be employed to develop such audit shortlists.

The most common fraud detection techniques reported in the literature include machine learning, data mining, AI, and statistical methods. The most cost-saving model, using the Naive Bayes algorithm, was applied to a 20% subsample of claims consisting of 400 objects, where 50% of the objects were classified as fraudulent and the other 50% as legal; this ultimately does not give a clear picture of the decision when compared with other classifiers [22].

The integration of multiple traditional methods has emerged as a new research area in combating fraud. Such an approach can be supervised, unsupervised, or both, with one method depending on another for classification. One method may be used as a preprocessing step that modifies the data in preparation for classification [9, 23, 24], or, at a lower level, the individual steps of the algorithms can be intertwined to create something fundamentally original. Hybrid methods can be used to tailor solutions to a particular problem domain, and different aspects of performance can be specifically targeted, including classification ability, ease of use, and computational efficiency [14].

Fuzzy logic has been combined with neural networks to assess and automatically classify medical claims [14]. The concept of data warehousing for data mining in health care was applied to develop an electronic fraud detection application that reviews service providers against behavioural heuristics and compares them with similar providers. Australia's Health Insurance Commission has explored the online discounting learning algorithm to identify rare cases in pathology insurance data [10, 25-27].

Researchers in Taiwan developed a detection model based on process mining that systematically identifies practices derived from clinical pathways to detect fraudulent claims [8].

Results published in [28, 29] used Benford's Law distributions to detect anomalies in claims reimbursements in Canada. Despite the detection of some anomalies and irregularities, the ability of this approach to identify suspect claims is of limited use for health insurance claims fraud detection, since it applies to service providers with payer-fixed prices.

Neural networks were used to develop an application for detecting medical abuse and fraud in a private health insurance scheme in Chile [30]; the ability to process claims in real time accounts for the innovative nature of this method. Association rule mining has also been applied to examine billing patterns within a particular specialist group in order to detect suspicious claims and potentially fraudulent individuals [9, 22, 30].

3. Mathematical Foundations for Genetic Support Vector Machines

In the 1960s, John Holland invented genetic algorithms, involving a simulation of Darwinian survival of the fittest together with the processes of crossover, mutation, and inversion that occur in natural genetics. Holland's work demonstrated that, under certain assumptions, a GA indeed achieves an optimal balance [31-34]. In contrast with evolution strategies and evolutionary programming, Holland's original goal was not to design algorithms to solve specific problems, but rather to formally study the phenomenon of adaptation as it occurs in nature and to develop ways in which the mechanisms of natural adaptation might be imported into computer systems. Moreover, Holland was the first to attempt to put computational evolution on a firm theoretical footing [35].

Genetic algorithms operate through three main operators, namely, (1) reproduction, (2) crossover, and (3) mutation. A typical genetic algorithm requires (1) a genetic representation of the solution domain and (2) a fitness function to evaluate the solution domain [31-34].

Reproduction is controlled by the crossover and mutation operators. Crossover is the process whereby genes are selected from the parent chromosomes and new offspring are produced. Mutation is designed to add diversity to the population and to ensure the possibility of exploring the entire search space; it replaces the values of some randomly selected genes of a chromosome with arbitrary new values [33, 35].

During the reproduction stage, an individual is assigned a fitness value derived from its raw performance measure, given by the objective function.
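To make the crossover and mutation operators concrete, here is a minimal sketch of one-point crossover and bitwise mutation on binary-string chromosomes; the chromosome length is an illustrative placeholder, while the mutation rate of 0.033 echoes the value used for feature selection later in the paper.

```python
import random

def one_point_crossover(parent_a: list, parent_b: list) -> tuple:
    """Exchange gene segments of two parents around a random cut point."""
    point = random.randint(1, len(parent_a) - 1)
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def mutate(chromosome: list, rate: float = 0.033) -> list:
    """Flip each bit with a small probability to keep diversity in the population."""
    return [1 - gene if random.random() < rate else gene for gene in chromosome]

# Illustrative 8-gene parents
a, b = [1, 1, 1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1, 1, 1]
child1, child2 = one_point_crossover(a, b)
print(mutate(child1), mutate(child2))
```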

The support vector machine (SVM), introduced as a statistical machine learning theory by Vapnik and Cortes in 1995, offers an alternative to polynomial, radial-basis-function, and multilayer-perceptron classifiers: the weights of the neurons are found by solving a quadratic programming (QP) problem with linear inequality and equality constraints rather than a nonconvex, unconstrained minimization problem [36-39]. As a machine learning technique for binary classification, regression analysis, face detection, text categorization, bioinformatics, data mining, and outlier detection, SVMs face challenges when the dataset is very large because of the dense nature and memory requirements of the quadratic form of the problem. Nevertheless, the SVM is an excellent example of supervised learning that tries to maximize generalization by maximizing the margin, supports nonlinear separation through kernelization [40], and seeks to avoid both overfitting and underfitting. The margin in an SVM denotes the distance from the decision boundary to the closest data points in the feature space.

Given the claims training dataset $x_i \in \mathbb{R}^{n} \subset F$ in the feature space $F$, the linear hyperplane dividing it into the two labelled classes $y_i$ (fraud and legal) is

$$\omega^{T}x_i + b = 0, \quad \omega \in \mathbb{R}^{n},\ b \in \mathbb{R}. \tag{1}$$

Assume the training dataset is correctly classified, as shown in Figure 1. This means that the SVC computes the hyperplane that maximizes the margin separating the classes (legal claims and fraudulent claims).


In its simplest linear form, an SVC is a hyperplane that separates the legal claims from the fraudulent claims with a maximum margin. Finding this hyperplane involves obtaining two hyperplanes parallel to it, as shown in Figure 1, each at an equal distance from it across the maximum margin, such that all the training data satisfy the constraints

$$\begin{cases} \omega^{T}x_i + b \ge +1 & \text{for } y_i = +1,\\ \omega^{T}x_i + b \le -1 & \text{for } y_i = -1, \end{cases} \tag{2}$$

where $\omega$ is the normal to the hyperplane, $|b|/\|\omega\|$ is the perpendicular distance from the hyperplane to the origin, and $\|\omega\|$ is the Euclidean norm of $\omega$. The separating hyperplane is defined by the plane $\omega^{T}x_i + b = 0$, and the constraints in (2) are combined to form

$$y_i\left(\omega^{T}x_i + b\right) \ge 1. \tag{3}$$

The pair of hyperplanes that gives the maximum margin is found by minimizing $\|\omega\|^{2}$ subject to the constraint in (3). This leads to the quadratic optimization problem

$$\text{Minimize } f(\omega, b) = \frac{\|\omega\|^{2}}{2} \quad \text{subject to } y_i\left(\omega^{T}x_i + b\right) \ge 1, \quad \forall i = 1, \ldots, n. \tag{4}$$

This problem is reformulated by introducing Lagrange multipliers $\alpha_i\ (i = 1, \ldots, n)$, one for each constraint, which gives the primal Lagrangian function

$$L_P(\omega, b, \alpha) = \frac{\|\omega\|^{2}}{2} - \sum_{i=1}^{n}\alpha_i\left(y_i\left(\omega^{T}x_i + b\right) - 1\right), \quad \forall i = 1, \ldots, n. \tag{5}$$

Taking the partial derivatives of $L_P(\omega, b, \alpha)$ with respect to $\omega$ and $b$ and applying duality theory yields

$$\frac{\partial L_P}{\partial \omega} = 0 \;\Longrightarrow\; \omega = \sum_{i=1}^{n}\alpha_i y_i x_i, \qquad \frac{\partial L_P}{\partial b} = 0 \;\Longrightarrow\; \sum_{i=1}^{n}\alpha_i y_i = 0. \tag{6}$$

The problem defined in (5) is a quadratic optimization (QP) problem. Maximizing the primal Lagrangian $L_P$ with respect to $\alpha_i$, subject to the constraints that its gradients with respect to $\omega$ and $b$ vanish and that $\alpha_i \ge 0$, gives the two conditions

$$\omega = \sum_{i=1}^{n}\alpha_i y_i x_i, \qquad \sum_{i=1}^{n}\alpha_i y_i = 0. \tag{7}$$

Substituting these constraints gives the dual formulation of the Lagrangian:

$$\max_{\alpha}\; L_D(\alpha) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\left(x_i \cdot x_j\right) \quad \text{subject to } \sum_{i=1}^{n}\alpha_i y_i = 0,\ \alpha_i \ge 0,\ i = 1, \ldots, n. \tag{8}$$

The values of $\omega$ and $b$ are then obtained from

$$\omega = \sum_{i=1}^{n}\alpha_i y_i x_i, \qquad b = \frac{1}{2}\left(\min_{i:\,y_i = +1}\omega^{T}x_i + \max_{i:\,y_i = -1}\omega^{T}x_i\right), \tag{9}$$

and the Lagrange multipliers satisfy the complementarity condition

$$\alpha_i\left(1 - y_i\left(\omega^{T}x_i + b\right)\right) = 0. \tag{10}$$

Hence, the dual Lagrangian $L_D$ is maximized with respect to its nonnegative $\alpha_i$, giving a standard quadratic optimization problem. The training vectors $x_i$ with nonzero Lagrange multipliers $\alpha_i$ are called support vectors; they satisfy

$$y_i\left(\omega^{T}x_i + b\right) = 1. \tag{11}$$

Equation (11) identifies the support vectors (SVs). Although this SVM classifier can only have a linear hyperplane as its decision surface, the formulation can be extended to build a nonlinear SVM. SVMs with nonlinear decision surfaces can classify nonlinearly separable data by introducing a soft-margin hyperplane, as shown in Figure 2. Introducing the slack variables $\xi_i$ into the constraints yields

$$\begin{cases} \omega^{T}x_i + b \ge +1 - \xi_i & \text{for } y_i = +1,\\ \omega^{T}x_i + b \le -1 + \xi_i & \text{for } y_i = -1,\\ \xi_i \ge 0 & \forall i. \end{cases} \tag{12}$$

Figure 1: Standard formulation of the SVM, showing legitimate and fraudulent claims, the separating hyperplanes $W^{T}\phi(x) + b = -1, 0, +1$, the support vectors, a misclassified point, the slack values ($\xi = 0$, $\xi < 1$, $\xi > 1$), the kernel $K(x_i, x_j) = \phi^{T}(x_i)\phi(x_j)$, and the margin $2/\sqrt{W^{T}W}$.


These slack variables help to find the hyperplane that allows the minimum number of training errors. Modifying equation (4) to include the slack variables yields

$$\min_{\omega,\, b,\, \xi_i}\; \frac{\|\omega\|^{2}}{2} + C\sum_{i=1}^{n}\xi_i \quad \text{subject to } y_i\left(\omega^{T}x_i + b\right) - 1 + \xi_i \ge 0,\ \xi_i \ge 0. \tag{13}$$

The parameter $C$ is a finite regularization parameter that trades off a wide margin against a small number of margin failures: the larger the value of $C$, the more heavily errors are penalized.

The Karush–Kuhn–Tucker (KKT) conditions are necessary to ensure optimality of the solution to a nonlinear programming problem:

$$y_i\left(\omega^{T}x_i + b\right) - 1 \ge 0, \qquad \alpha_i\left(y_i\left(\omega^{T}x_i + b\right) - 1\right) = 0, \qquad \alpha_i \ge 0, \quad i = 1, 2, \ldots, l,\ \forall i. \tag{14}$$

The KKT conditions for the primal problem are used in the nonseparable case, after which the primal Lagrangian becomes

$$L_P = \frac{\|\omega\|^{2}}{2} + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\left(y_i\left(\omega^{T}x_i + b\right) - 1 + \xi_i\right) - \sum_{i=1}^{n}\beta_i\xi_i, \tag{15}$$

with $\beta_i$ the Lagrange multipliers that enforce positivity of the slack variables $\xi_i$. Applying the KKT conditions to this primal problem yields

$$\begin{aligned} &\frac{\partial L_P}{\partial \omega_u} = \omega_u - \sum_{i=1}^{n}\alpha_i y_i x_{iu} = 0, \qquad \frac{\partial L_P}{\partial b} = -\sum_{i=1}^{n}\alpha_i y_i = 0, \qquad \frac{\partial L_P}{\partial \xi_i} = C - \alpha_i - \beta_i = 0,\\ &\alpha_i\left(y_i\left(\omega^{T}x_i + b\right) - 1 + \xi_i\right) = 0, \qquad y_i\left(\omega^{T}x_i + b\right) - 1 + \xi_i \ge 0,\\ &\alpha_i,\ \beta_i,\ \xi_i \ge 0, \qquad i = 1, 2, \ldots, n, \quad u = 1, 2, \ldots, d, \end{aligned} \tag{16}$$

where the parameter $d$ represents the dimension of the dataset.

Observing the expressions obtained after applying the KKT conditions, $\xi_i = 0$ whenever $\alpha_i < C$, since then $\beta_i = C - \alpha_i \ne 0$. This implies that any training point for which $0 < \alpha_i < C$ lies on the margin and can be used to compute $b$. Points with

$$\alpha_i = 0, \qquad y_i\left(\omega^{T}x_i + b\right) - 1 + \xi_i > 0 \tag{17}$$

do not participate in the derivation of the separating function, while points with $\alpha_i = C$ and $\xi_i > 0$ satisfy

$$y_i\left(\omega^{T}x_i + b\right) - 1 + \xi_i = 0. \tag{18}$$

A nonlinear SVM maps the training samples from the input space into a higher-dimensional feature space via a kernel mapping function $\Phi$. In the dual Lagrangian, the inner products are replaced by the kernel function

$$\Phi(x_i) \cdot \Phi(x_j) = k\left(x_i, x_j\right). \tag{19}$$

Effective kernels allow the separating hyperplane to be found without high computational cost. The nonlinear SVM dual Lagrangian is

$$L_D(\alpha) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\, k\left(x_i, x_j\right), \tag{20}$$

subject to

$$\sum_{i=1}^{n}\alpha_i y_i = 0, \qquad 0 \le \alpha_i, \quad i = 1, \ldots, n. \tag{21}$$

Figure 2: Linear separating hyperplanes for the nonseparable case of the SVC obtained by introducing the slack variable $\xi$ (support vectors, a misclassified point, and the regions $\xi = 0$, $\xi < 1$, $\xi > 1$ around the margin).


This is similar to the generalized linear case. The nonlinear SVM separating hyperplane is illustrated in Figure 3, with the support vectors, class labels, and margin. The model can be solved by the same optimization method as in the separable case; the optimal hyperplane therefore has the form

$$f(x) = \sum_{i=1}^{n}\alpha_i y_i\, k\left(x_i, x\right) + b, \tag{22}$$

where $b$ is the offset of the decision boundary from the origin. A newly arrived data point $x$ is then classified as

$$g(x) = \operatorname{sign}\left(f(x)\right). \tag{23}$$

Feasible kernels must be symmetric; that is, the matrix $K$ with components $k(x_i, x_j)$ must be positive semidefinite and satisfy Mercer's condition [39, 40]. The kernel functions considered in this work are summarized in Table 1. These kernels satisfy Mercer's condition. The RBF (Gaussian) kernel, the most widely used kernel in the literature, has the advantage of a single free parameter $\gamma > 0$ that controls its width, with $\gamma = 1/(2\sigma^{2})$, where $\sigma^{2}$ is the variance of the resulting Gaussian hypersphere. The linear kernel is $k(x_i, x_j) = x_i \cdot x_j$. The training of the SVMs then amounts to solving the QP optimization problem. The above mathematical formulations form the foundation for the development and deployment of genetic support vector machines as the decision support tool for detecting and classifying fraudulent health insurance claims. Decision-making in knowledge-intensive enterprises depends heavily on the successful classification of data patterns, despite the time and computational resources required as datasets grow in complexity and size.
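For readers who want to see these kernels side by side, here is a small sketch of the linear, polynomial, and RBF kernels of Table 1, including the $\gamma = 1/(2\sigma^{2})$ parameterization; the specific parameter values (a cubic polynomial and $\sigma^{2} = 0.9$, both mentioned later in the paper) are used here purely for illustration.

```python
import numpy as np

def linear_kernel(xi, xj):
    """k(xi, xj) = xi . xj"""
    return np.dot(xi, xj)

def polynomial_kernel(xi, xj, c=1.0, d=3):
    """k(xi, xj) = (xi . xj + c)^d; a cubic polynomial (d = 3) is used later in the paper."""
    return (np.dot(xi, xj) + c) ** d

def rbf_kernel(xi, xj, sigma2=0.9):
    """k(xi, xj) = exp(-gamma * ||xi - xj||^2) with gamma = 1 / (2 * sigma^2)."""
    gamma = 1.0 / (2.0 * sigma2)
    return np.exp(-gamma * np.linalg.norm(np.asarray(xi) - np.asarray(xj)) ** 2)

x1, x2 = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(linear_kernel(x1, x2), polynomial_kernel(x1, x2), rbf_kernel(x1, x2))
```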

4. Methodology for GSVM Fraud Detection

The systematic approach adopted for the design and development of genetic support vector machines for health insurance claims fraud detection is presented in the conceptual framework of Figure 4 and the flow chart of Figure 5.

The conceptual framework incorporates the design and development of the key algorithms that allow submitted claims data to be analysed and a model to be developed for testing and validation. The flow chart presents the algorithm implemented, based on the theoretical foundations above, by combining genetic algorithms and support vector machines, two machine learning techniques whose combined use in the detection process generates accurate results. The methodology for the design and development of the genetic support vector machines consists of three significant steps, namely, (1) data preprocessing, (2) classification engine development, and (3) data postprocessing.

4.1. Data Preprocessing. Data preprocessing is the first significant stage in the development of the fraud detection system. This stage uses data mining techniques to transform the data from its raw form into the format required by the SVC for the detection and identification of health insurance claims fraud.

The data preprocessing stage involves the removal of unwanted customers and missing records as well as data smoothing, to make sure that only useful and relevant information is passed to the next process.

Before preprocessing, the data were imported from MS Excel/CSV format into MySQL, into a database called NHIS. The imported data comprise the electronic Health Insurance Claims (e-HIC) data and the HIC tariff dataset, loaded as tables into NHIS. The e-HIC data preprocessing involves the following steps: (1) claims data filtering and selection, (2) feature selection and extraction, and (3) feature adjustment.
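A minimal sketch of this import step is shown below, using pandas and SQLAlchemy; the table contents, column names, and connection target are placeholders rather than the project's actual configuration (an SQLite file is used so the sketch runs without a database server, while the production system targets MySQL).

```python
import pandas as pd
from sqlalchemy import create_engine

# Tiny placeholder extracts of the e-HIC claims and tariff tables (illustrative values only).
claims = pd.DataFrame({
    "claim_id": [1, 2],
    "gdrg_code": ["MEDI01A", "SURG02C"],
    "service_bill": [35.0, 120.0],
    "drug_bill": [10.0, 22.5],
})
tariffs = pd.DataFrame({"gdrg_code": ["MEDI01A", "SURG02C"], "approved_tariff": [40.0, 150.0]})

# The real system connects to MySQL, e.g. create_engine("mysql+pymysql://user:pwd@host/NHIS");
# SQLite is used here only so the sketch is runnable as-is.
engine = create_engine("sqlite:///nhis_sketch.db")
claims.to_sql("e_hic_claims", con=engine, if_exists="replace", index=False)
tariffs.to_sql("hic_tariffs", con=engine, if_exists="replace", index=False)
```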

Figure 3: Nonlinear separating hyperplane for the nonseparable case of the SVM (classes 1 and 2, support vectors, hyperplane, and margin).

Table 1: Summarized kernel functions used.

Kernel name | Parameters | Kernel function
Radial basis function (RBF) | $\gamma \in \mathbb{R}$ | $k(x_i, x_j) = e^{-\gamma\|x_i - x_j\|^{2}}$
Polynomial function | $c \in \mathbb{R}$, $d \in \mathbb{N}$ | $k(x_i, x_j) = (x_i \cdot x_j + c)^{d}$


The WEKA machine learning and knowledge analysis environment was used for feature selection and extraction, while the data processing code was written in the MATLAB technical computing environment. The developed MATLAB-based decision support engine was connected to MySQL using the script shown in Figure 6.

Preprocessing of the raw data involves claims cost validity checks. The tariff dataset consists of the approved tariffs for each diagnostic-related group, which were strictly enforced to clean the data before further processing. Claims are partitioned into two groups, namely, (1) claims with valid, approved costs within each DRG and (2) claims with invalid costs (those above the approved tariffs within each DRG).
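A minimal sketch of this validity check is shown below, assuming a pandas DataFrame of claims with hypothetical column names (`gdrg_code`, `claim_cost`) and a dictionary of approved tariffs per G-DRG; in the real system these tables come from the NHIS MySQL database.

```python
import pandas as pd

# Hypothetical claims and tariff data; the actual tables come from the NHIS database.
claims = pd.DataFrame({
    "claim_id": [1, 2, 3],
    "gdrg_code": ["MEDI01A", "MEDI01A", "SURG02C"],
    "claim_cost": [35.0, 90.0, 120.0],
})
approved_tariff = {"MEDI01A": 40.0, "SURG02C": 150.0}

# Partition claims into those within the approved tariff and those above it.
claims["tariff"] = claims["gdrg_code"].map(approved_tariff)
valid_claims = claims[claims["claim_cost"] <= claims["tariff"]]
invalid_claims = claims[claims["claim_cost"] > claims["tariff"]]
print(len(valid_claims), "valid,", len(invalid_claims), "with costs above the approved tariff")
```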

With the recent increase in the volume and dimensionality of real claims data, there is an urgent need for faster, more reliable, and cost-effective data mining techniques for building classification models. These techniques require the extraction of a smaller, optimized set of features, obtained by removing largely redundant, irrelevant, and unnecessary features for class prediction [41].

Feature selection algorithms are used to extract a minimal subset of attributes such that the resulting probability distribution of the data classes is close to the original distribution obtained using all attributes. Based on the idea of survival of the fittest, a new population is constructed to comply with the fittest rules in the current population, as well as with the offspring of these rules. Offspring are generated by applying genetic operators such as crossover and mutation, and generation continues until a population N evolves in which every rule satisfies the fitness threshold. With an initial population of 20 individuals, generation continued to the 20th generation with a crossover probability of 0.6 and a mutation probability of 0.033. The features selected by the genetic algorithm are "Attendance date," "Hospital code," "GDRG code," "Service bill," and "Drug bill." These are the features selected, extracted, and used as the basis for the optimization problem formulated below; a sketch of the selection procedure is given after the formulation.

$$\begin{aligned} \text{Minimize } & \text{Total}_{\text{cost}} = f\left(S_{\text{bill}}, D_{\text{bill}}\right)\\ \text{subject to } & \sum_{i=1}^{n} S_{i,\text{bill}} \le G_{\text{tariff}}, \quad \forall i = 1, 2, \ldots, n,\\ & \sum_{j=1}^{n} D_{j,\text{bill}} \le D_{\text{tariff}}, \quad \forall j = 1, 2, \ldots, n, \end{aligned} \tag{24}$$

where $S_{\text{bill}}$ is the service bill and $D_{\text{bill}}$ is the drug bill.
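The sketch below illustrates the wrapper-style GA feature selection described above under the stated settings (population 20, 20 generations, crossover probability 0.6, mutation probability 0.033, roulette-wheel selection), using scikit-learn's `SVC` cross-validation accuracy as the fitness. The data, column layout, and the use of 3-fold CV inside the fitness are illustrative assumptions rather than the authors' MATLAB/WEKA code.

```python
import random
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

POP_SIZE, GENERATIONS, P_CROSS, P_MUT = 20, 20, 0.6, 0.033

def fitness(mask, X, y):
    """Fitness = cross-validated SVC accuracy on the selected feature subset (3-fold to keep the sketch light)."""
    if not any(mask):
        return 0.0
    cols = [i for i, bit in enumerate(mask) if bit]
    return cross_val_score(SVC(kernel="rbf"), X[:, cols], y, cv=3).mean()

def roulette_select(population, scores):
    """Roulette-wheel selection: probability proportional to fitness."""
    total = sum(scores) or 1e-9
    pick, running = random.uniform(0, total), 0.0
    for individual, score in zip(population, scores):
        running += score
        if running >= pick:
            return individual
    return population[-1]

def evolve(X, y, n_features):
    """Evolve binary feature masks and return the best one found."""
    population = [[random.randint(0, 1) for _ in range(n_features)] for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        scores = [fitness(ind, X, y) for ind in population]
        next_pop = []
        while len(next_pop) < POP_SIZE:
            p1, p2 = roulette_select(population, scores), roulette_select(population, scores)
            if random.random() < P_CROSS:                      # one-point crossover
                cut = random.randint(1, n_features - 1)
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            next_pop += [[1 - g if random.random() < P_MUT else g for g in c] for c in (p1, p2)]
        population = next_pop[:POP_SIZE]
    scores = [fitness(ind, X, y) for ind in population]
    return population[int(np.argmax(scores))]

# Placeholder data: 8 candidate claim attributes, binary fraud labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))
y = (X[:, 1] + X[:, 4] > 0).astype(int)
print("selected feature mask:", evolve(X, y, n_features=8))
```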

The GA-processed e-HIC dataset is subjected to SVM training using 70% of the dataset, with 30% reserved for testing, as depicted in Figure 7.

The e-HIC records that pass the preprocessing stage, that is, the valid claims, were used for SVM training and testing. The best data, those that meet the genetic algorithm's criteria, are classified first, and each record of this dataset is classified as either a "Fraudulent Bill" or a "Legal Bill."

The same SVM training and testing dataset is applied to the SVM algorithm for its performance analysis. The inbuilt MATLAB code for SVM classifiers was integrated as one function for the linear, polynomial, and RBF kernels. The claims datasets were partitioned for classifier training, testing, and validation: 70% of the dataset was used for training and 30% for testing. The linear, polynomial, and radial basis function classification kernels were used, with ten-fold cross-validation for each kernel, and the results were averaged. For the polynomial kernel a cubic polynomial was used, and the RBF kernel used the SMO method [40]. This method ensures the handling of large datasets, since it performs the data transformation through kernelization.
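A condensed sketch of this training and evaluation loop is given below, using scikit-learn in place of the authors' MATLAB implementation. The 70/30 split, 10-fold cross-validation, cubic polynomial, and the RBF width derived from the reported variance of 0.9 follow the description above, while the feature matrix `X` and labels `y` are placeholders standing in for the preprocessed claims table.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score

# Placeholder claim features (e.g., service bill, drug bill) and labels: 1 = fraudulent, 0 = legal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

kernels = {
    "linear": SVC(kernel="linear"),
    "polynomial (cubic)": SVC(kernel="poly", degree=3),
    "RBF": SVC(kernel="rbf", gamma=1.0 / (2 * 0.9)),  # gamma = 1/(2*sigma^2) with sigma^2 = 0.9
}
for name, clf in kernels.items():
    cv_acc = cross_val_score(clf, X_train, y_train, cv=10).mean()   # 10-fold CV on the training split
    clf.fit(X_train, y_train)
    test_acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: CV accuracy {cv_acc:.3f}, test accuracy {test_acc:.3f}")
    print(confusion_matrix(y_test, clf.predict(X_test)))
```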

Figure 4: Conceptual model for the design and development of the genetic support vector machines (valid claims pass to the fraud detection model and fraud detection classifier, which flags duplicated, upcoded, unbundled, and uncovered claims and labels each claim as legal or fraudulent; unsuccessful runs return to GSVM optimization).


After running many instances and varying the parameters of the RBF kernel, a variance of 0.9 gave the best results, as it corresponded well with the datasets used for classification. After each classification the correct rate is calculated and the confusion matrix is extracted. The confusion matrix gives counts of the true legal, true fraudulent, false legal, false fraudulent, and inconclusive bills:

(i) True legal bills: the number of "Legal Bills" correctly classified as "Legal Bills" by the classifier.
(ii) True fraudulent bills: the number of "Fraudulent Bills" correctly classified as "Fraudulent Bills" by the classifier.
(iii) False legal bills: bills classified as "Legal Bills" even though they are not, that is, bills wrongly classified as "Legal Bills" by the kernel used.
(iv) False fraudulent bills: bills wrongly classified as fraudulent; the confusion matrix also counts these incorrectly classified bills.
(v) Inconclusive bills: bills that were not classified.

Figure 5: Flow chart for the design and development of the genetic support vector machines (the training and testing datasets, restricted to the selected feature subset, feed a GA loop of fitness evaluation, roulette-wheel selection, recombination, crossover, and mutation that returns the optimized SVM hyperparameters (C, γ), the trained SVM classifier, and the fraud detection model).

The correct rate is calculated as the total number of correctly classified bills, namely, the true legal bills and true fraudulent bills, divided by the total number of bills used for the classification:

$$\text{correct rate} = \frac{\text{TLB} + \text{TFB}}{\text{TB}}, \tag{25}$$

where TLB denotes the true legal bills, TFB the true fraudulent bills, and TB the total number of bills.

The overall accuracy, the probability of a correct classification, is

$$\text{accuracy} = 1 - \text{Error} = \frac{TP + TN}{TP + TN + FP + FN} = \Pr(C). \tag{26}$$

4.1.1. Sensitivity. This is the statistical measure of the proportion of actual fraudulent claims that are correctly detected:

$$\text{sensitivity} = \frac{TP}{TP + FN} = \frac{TP}{P}. \tag{27}$$

4.1.2. Specificity. This is the statistical measure of the proportion of negative (non-fraudulent) claims that are correctly classified:

$$\text{specificity} = \frac{TN}{TN + FP} = \frac{TN}{N}. \tag{28}$$
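As a quick illustration of equations (25)-(28), the helper below computes these measures from the four confusion-matrix counts; the sample numbers are the RBF results on the 500-claim dataset reported later in Table 5.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy (correct rate), sensitivity, and specificity from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,   # equation (26)
        "sensitivity": tp / (tp + fn),   # equation (27): TP / (TP + FN)
        "specificity": tn / (tn + fp),   # equation (28): TN / (TN + FP)
    }

# RBF kernel on the 500-claim dataset (Table 5): TP = 95, TN = 26, FP = 1, FN = 0
print(classification_metrics(95, 26, 1, 0))
# accuracy ~ 0.992, sensitivity = 1.0, specificity ~ 0.963
```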

4.2. GSVM Fraud Detection System Implementation and Testing. The decision support system comprises four main modules integrated together, namely, (1) algorithm implementation on the MATLAB technical computing platform, (2) a graphical user interface (GUI) for the HIC fraud detection system, covering the uploading and processing of claims, (3) system administrator management, and (4) postprocessing of the detection and classification results.

Figure 6: MATLAB-based decision support engine connection to the database.

Figure 7: Data preprocessing for SVM training and testing (creation of the claims record database, claims filtering and selection, feature selection and extraction, feature adjustment, and data normalization produce the GA e-HIC data and the SVM training and testing datasets).


The front end of the detection system was developed using XAMPP, a free and open-source cross-platform web server solution stack package developed by Apache Friends [42], consisting mainly of the Apache HTTP Server, the MariaDB database, and interpreters for scripts written in the PHP and Perl programming languages [42]. XAMPP stands for Cross-Platform (X), Apache (A), MariaDB (M), PHP (P), and Perl (P). The Health Insurance Claims Fraud Detection System (HICFDS) itself was developed in the MATLAB technical computing environment, with the capability to connect to an external MySQL database and with a graphical user interface (GUI) for enhanced interactivity with users.

Figure 8: System implementation architecture for the HICFDS (the NHIS claims dataset is uploaded through the developed GUI, analysed by the GSVM model and detection engine after exploratory data analysis, and the detected results are written automatically to a MySQL results database).

Figure 9: Detection results control portal interface.


The HICFDS consists of several functional components, namely, (1) a function for computing the descriptive statistics of the raw and processed data, (2) a preprocessing wrapper function for data handling and processing, and (3) MATLAB functions for the GA optimization and SVM classification processes. The HICFDS components are depicted in Figure 8.

The results generated by the HICFDS are stored in the MySQL database. They comprise three parts: the legitimate claims report, the fraudulent claims, and the statistics of the results, as shown in Figure 9. The developed GUI portal for analysing the results obtained from the classification of the submitted health insurance claims is also displayed in Figure 9; clicking the fraudulent button in the GUI produces the chart shown in Figure 10 for the claims dataset, which shows the grouping of detected fraudulent claim types in the datasets.

For each classifier, a 10-fold cross-validation (CV) of the hyperparameters (C, γ) from the Patients Payment Data (PPD) was performed. The GA optimization tested several hyperparameter settings in search of the optimal SVM: the SVC training aims for the best parameters (C, γ) for building the HICFDS classifier model. The developed classifier is evaluated on testing and validation data, and its accuracy is assessed using cross-validation to avoid overfitting the SVC to the training data. A random search was used for SVC parameter training, in which exponentially growing sequences of the hyperparameters (C, γ), a practical way to identify suitable parameters, were sampled to find the settings giving the best CV accuracy on the claims data samples. Random search differs slightly from grid search: instead of searching over the entire grid, it evaluates only a random sample of points on the grid, which makes it computationally cheaper. Experimentally, 10-fold CV was used as the measure of training accuracy, with 70% of each sample used for training and the remaining 30% for testing and validation.
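The sketch below illustrates such a random search over exponentially spaced (C, γ) values with 10-fold cross-validation, in the spirit of the procedure described above and of the practical guide in [40]; the search ranges, sample count, and placeholder data are illustrative choices, and scikit-learn stands in for the MATLAB implementation.

```python
import random
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def random_search_svc(X, y, n_samples=20, seed=1):
    """Randomly sample (C, gamma) from exponentially growing grids; keep the best 10-fold CV accuracy."""
    rng = random.Random(seed)
    c_grid = [2.0 ** k for k in range(-5, 16, 2)]        # C in 2^-5 ... 2^15
    gamma_grid = [2.0 ** k for k in range(-15, 4, 2)]    # gamma in 2^-15 ... 2^3
    best_score, best_params = -1.0, None
    for _ in range(n_samples):
        C, gamma = rng.choice(c_grid), rng.choice(gamma_grid)
        score = cross_val_score(SVC(kernel="rbf", C=C, gamma=gamma), X, y, cv=10).mean()
        if score > best_score:
            best_score, best_params = score, (C, gamma)
    return best_params, best_score

# Placeholder claim features and labels, standing in for the PPD-derived training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
y = (X[:, 0] - X[:, 2] > 0).astype(int)
print(random_search_svc(X, y))
```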

Figure 10: Fraud type distribution across the sample data sizes (counts of duplicated, uncovered-service, overbilled, unbundled, upcoded, and impersonation claims for the 100-, 300-, 500-, 750-, and 1000-claim datasets).

Table 2: Sample data sizes and the corresponding fraud types.

Fraud type | 100 | 300 | 500 | 750 | 1000
Duplicate claims | 2 | 4 | 4 | 4 | 0
Uncovered service claims | 4 | 56 | 65 | 109 | 406
Overbilling claims | 44 | 60 | 91 | 121 | 202
Unbundled claims | 0 | 18 | 10 | 54 | 0
Upcoded claims | 2 | 34 | 50 | 122 | 6
Impersonation claims | 0 | 2 | 10 | 23 | 34
Total suspected claims | 52 | 174 | 230 | 433 | 648

Table 3: Summary performance metrics of the SVM classifiers on the sample sizes.

Kernel | Data size | Average accuracy (%) | Sensitivity (%) | Specificity (%)
Linear | 100 | 71.43 | 60.00 | 77.78
Linear | 300 | 72.73 | 84.21 | 0.00
Linear | 500 | 91.80 | 97.78 | 75.00
Linear | 750 | 84.42 | 95.00 | 47.06
Linear | 1000 | 82.95 | 85.42 | 80.00
Polynomial | 100 | 71.43 | 66.67 | 72.73
Polynomial | 300 | 72.73 | 88.24 | 20.00
Polynomial | 500 | 96.72 | 100.00 | 86.67
Polynomial | 750 | 80.52 | 96.36 | 40.91
Polynomial | 1000 | 84.71 | 83.67 | 86.11
Radial basis function | 100 | 71.43 | 57.14 | 85.71
Radial basis function | 300 | 95.45 | 95.00 | 100.00
Radial basis function | 500 | 99.18 | 100.00 | 96.30
Radial basis function | 750 | 82.56 | 96.88 | 40.91
Radial basis function | 1000 | 90.91 | 100.00 | 82.98

Figure 11: Linear SVM on a sample claims dataset (legal bills and fraudulent bills, training and classified points, and support vectors).


4.3. Data Postprocessing: Validation of Classification Results. The classification accuracy on the testing data is the gauge used to evaluate the ability of the HICFDS to detect and identify fraudulent claims. The testing data used to assess and evaluate the efficiency of the proposed HICFDS (classifier) were taken exclusively from the NHIS headquarters and cover different hospitals within the Greater Accra Region of Ghana. The sampled data with the corresponding fraud types found by the analysis are shown in Table 2.

In evaluating the classifiers obtained with the analyzed methods, the most widely employed performance measures are used: accuracy, sensitivity, and specificity, based on the counts of true legal (TP), false fraudulent (FN), false legal (FP), and true fraudulent (TN) claims. These metrics are shown in Table 3.

The figures below show the SVC plots of the various classifiers (linear, polynomial, and RBF) on the claims datasets (Figures 11-13).

From the performance metrics and overall statistics presented in Table 4, it is observed that the support vector machine performs best with the RBF kernel function, with an accuracy of 87.91%, followed by the polynomial kernel with 81.22% accuracy, leaving the linear SVM as the weakest classifier with an accuracy of 80.67%.

Figure 12: Polynomial SVM on a sample claims dataset (legal bills and fraudulent bills, training and classified points, and support vectors).

Figure 13: RBF SVM on a sample claims dataset (legal bills and fraudulent bills, training and classified points, and support vectors).

Table 4: Average performance of the SVM classifiers.

Kernel | Accuracy (%) | Sensitivity (%) | Specificity (%)
Linear | 80.67 | 84.48 | 55.97
Polynomial | 81.22 | 86.99 | 61.28
RBF | 87.91 | 89.80 | 81.18

Table 5: Confusion matrices for the SVM classifiers.

Kernel | Data size | TP | TN | FP | FN | Correct rate
Linear | 100 | 3 | 7 | 2 | 2 | 0.714
Linear | 300 | 16 | 0 | 3 | 3 | 0.713
Linear | 500 | 88 | 24 | 8 | 2 | 0.918
Linear | 750 | 57 | 8 | 9 | 3 | 0.844
Linear | 1000 | 41 | 32 | 8 | 7 | 0.830
Polynomial | 100 | 2 | 8 | 3 | 1 | 0.714
Polynomial | 300 | 15 | 1 | 4 | 2 | 0.723
Polynomial | 500 | 92 | 26 | 4 | 0 | 0.967
Polynomial | 750 | 53 | 91 | 13 | 2 | 0.805
Polynomial | 1000 | 41 | 31 | 5 | 8 | 0.852
Radial basis function | 100 | 4 | 6 | 1 | 3 | 0.714
Radial basis function | 300 | 19 | 2 | 0 | 1 | 0.955
Radial basis function | 500 | 95 | 26 | 1 | 0 | 0.992
Radial basis function | 750 | 62 | 9 | 13 | 2 | 0.922
Radial basis function | 1000 | 41 | 39 | 8 | 0 | 0.919


The confusion matrices for the SVM classifiers, given in Table 5, are utilized in the computation of the performance metrics of the classifiers. For statistical and machine learning classification tasks, a confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of a supervised learning algorithm.

Besides classification accuracy, the amount of time required to process the sample dataset is also an important consideration in this research. The comparison of computational times shows that increasing the size of the sample dataset also increases the computational time needed to execute the process, regardless of the machine used, which is widely expected. This difference in time cost arises mainly from training on the dataset. Thus, as global data warehouses grow, more computational resources will be needed in machine learning and data mining research pertaining to the detection of insurance fraud, as depicted in Figure 14, which relates the average computational time to the sample data size.

Figure 15 summarizes the fraudulent claims detected during the testing of the HICFDS with the sample datasets used. As the sample data size increases, the number of suspected claims rises rapidly across the various fraud types detected.

Benchmarking the HICFDS analysis aids the understanding of the HIC outcomes. An increase in the claims dataset size has a corresponding increase in the number of suspected claims. The graph in Figure 16 shows a sharp rise in the level of suspected claims for the tested 100-claim dataset, representing 52% of the sample, after which the proportion of suspected claims increases slightly, by 2%, to 58% for the tested data size of 300 claims.

Among these fraud types, the most frequent fraudulent act is uncovered services rendered to insurance subscribers by service providers; it accounts for 22% of the fraudulent claims, the most significant proportion of the total health insurance fraud in the tested dataset. Overbilling of submitted claims is the second most common fraudulent claim type, representing 20% of the total sample dataset used in this research. It is caused by service providers billing for a service at more than the expected tariff for the required diagnoses; listing and billing for a more complex or higher level of service is done by providers to boost their financial income unfairly through otherwise legitimate claims.

Figure 14: Computational time on the tested sample datasets (average computation time in seconds versus sample data size).

Figure 15: Detected fraud trend on the tested claims datasets (number of suspected claims versus sample data size).

Figure 16: Chart of the types of fraudulent claims (duplicated, uncovered-service, overbilled, unbundled, upcoded, and impersonation claims across the 100-, 300-, 500-, 750-, and 1000-claim datasets).

Table 6: Cost analysis of the tested claims datasets.

Sample data size | Raw cost of claims (R), GHC | Valid claims cost (V) | Deviation (R − V) | Percentage difference (%)
100 | 2079183 | 891172 | 1188011 | 133.31
300 | 3149605 | 1562270 | 1587335 | 101.60
500 | 5821865 | 2748096 | 3073769 | 111.85
750 | 8839407 | 3109158 | 5730249 | 184.30
1000 | 11744820 | 4794338 | 6950482 | 144.97


Figure 17: Classification Learner App showing the various algorithms and percentage accuracies in MATLAB.

Table 7: Comparison of the results of the GSVM with decision trees and Naive Bayes.

Algorithm | Accuracy per dataset, % (100 / 300 / 500 / 750 / 1000 claims) | Average over the datasets (%)
GSVM with radial basis function (RBF) kernel | 71.43 / 95.45 / 99.18 / 82.56 / 90.91 | 87.906
Decision trees | 62 / 78 / 77.8 / 82.7 / 71.7 | 74.44
Naive Bayes | 50 / 61 / 56.8 / 60.7 / 67 | 59.1


Moreover, some illicit service providers claim to have rendered costly services to insurance subscribers instead of the more affordable ones actually provided. Claims prepared for such expensive services represent 8% of the fraudulent claims detected in the total sample dataset. Furthermore, 31% of the fraudulent claims in the test data involve billing separately for service procedures that should be considered an integral part of a single procedure, known as unbundled claims. Because of weaknesses in the process for quality delivery of healthcare services, insurance subscribers also contribute to fraud by loaning their ID cards to family members or third parties who pretend to be the owners and request HIS benefits in the healthcare sector. Duplicated claims recorded the minimum rate, contributing 0.5% of the fraudulent claims in the whole sample dataset.

As observed in Table 6, the cost of the claims bill increases proportionally with the sample size, which is consistent with the increase in fraudulent claims as the sample size grows. Table 6 lists, for each sample, the raw cost of the claims (R), the cost of the valid claims after processing (V), the deviation (R − V), and its percentage representation. There is a 27% financial loss of the total submitted claim bills to insurance carriers; this loss is highest within the 750-claim dataset of submitted claims.
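The percentage column in Table 6 is consistent with computing the deviation relative to the valid claims cost; a one-line check using the 100-claim row is shown below (decimal placement of the GHC amounts is assumed, which does not affect the ratio).

```python
raw, valid = 20791.83, 8911.72           # 100-claim row of Table 6 (GHC), decimal placement assumed
deviation = raw - valid                   # 11880.11
percentage_difference = 100 * deviation / valid
print(round(deviation, 2), round(percentage_difference, 2))  # 11880.11, 133.31
```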

A summary of the results and a comparison with other machine learning algorithms, namely decision trees and Naive Bayes, is presented in Table 7.

The MATLAB Classification Learner App [43] was chosen to validate the results obtained above, as it enables easy comparison with the different classification algorithms implemented.

Figure 18: Algorithmic runs on the 500-claim dataset.


The data used for the GSVM were subsequently run through the Classification Learner App, as shown below.

Figures 17 and 18 show the Classification Learner App with the various implemented algorithms and corresponding accuracies in the MATLAB technical computing environment and the results obtained using the 500-claim dataset, respectively. Figures 19 and 20 depict the corresponding results when the 750- and 1000-claim datasets were used for the algorithmic runs and reproducible comparison. The summarized results and accuracies are given in Table 7; they portray the effectiveness of our proposed approach of using genetic support vector machines (GSVMs) for fraud detection in insurance claims. From the results, it is evident that the GSVM achieves a higher level of accuracy than decision trees and Naive Bayes.

5. Conclusions and Recommendations

This work aimed at developing a novel fraud detection model for insurance claims processing based on genetic support vector machines, which hybridize and draw on the strengths of both genetic algorithms and support vector machines. The GSVM has been investigated and applied in the development of the HICFDS. This paper used the GSVM for the detection of anomalies and the classification of health insurance claims into legitimate and fraudulent claims. SVMs were preferred to other classification techniques because of several advantages: they separate (classify) claims into legitimate and fraudulent using the soft margin, thereby accommodating updates in the generalization performance of the HICFDS.

Figure 19: Algorithmic runs on the 750-claim dataset.


Among their other notable advantages, they provide a nonlinear dividing hyperplane, which prevails over the discrimination within the dataset, and their ability to generalize to newly arrived data was considered ahead of other classification techniques.

Thus, the fraud detection system combines two computational intelligence schemes and achieves higher fraud detection accuracy. The average classification accuracies achieved by the SVCs are 80.67%, 81.22%, and 87.91%, which show the performance capability of the SVC models. These classification accuracies were obtained through careful selection of the features for training and developing the model, as well as fine-tuning of the SVCs' parameters using the V-fold cross-validation approach. These results are much better than those obtained using decision trees and Naive Bayes.

The average testing results of the proposed SVCs vary across the sample datasets because of the nature of the claims data, as seen in the clustering of the claims dataset by MDC specialty. When the sample dataset is heavily skewed toward one MDC specialty (e.g., OPDC), the performance of the SVCs can favour one classifier, especially the linear SVM, compared with the others. Hence, the behaviour of the dataset has a significant impact on the classification results.

In this work, the developed GSVM model was tested and validated using HIC data. The study sought to obtain the best performing classifier for analysing health insurance claims datasets for fraud. The RBF kernel was adjudged the best, with an average accuracy rate of 87.91%, and is therefore recommended.

Figure 20: Algorithmic runs on the 1000-claim dataset.


Data Availability

The data used in this study are available upon request. The data can be uploaded when required.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors of this paper wish to acknowledge the Carnegie Corporation of New York, through the University of Ghana under the UG-Carnegie Next Generation of Academics in Africa project, for organizing Write Shops that led to the timely completion of this paper.

Supplementary Materials

The supplementary material consists of an MS Excel file of data collected from some NHIS-approved hospitals in Ghana concerning insurance claims. It is the insurance claims dataset used for testing and implementation. (Supplementary Materials)

References

[1] Government of Ghana, National Health Insurance Act, Act 650, Ghana, 2003.
[2] Capitation, National Health Insurance Scheme, 2012, http://www.nhis.gov.gh/capitation.aspx.
[3] ICD-10 Version: 2016, http://apps.who.int/classifications/icd10/browse/2016/en.
[4] T. Olson, Examining the Transitional Impact of ICD-10 on Healthcare Fraud Detection, College of Saint Benedict/Saint John's University, Collegeville, MN, USA, 2015.
[5] News Ghana, NHIS Manager Arrested for Fraud, News Ghana, Accra, Ghana, 2014, https://www.newsghana.com.gh/nhis-manager-arrested-for-fraud.
[6] BioClaim, Fraud Files, http://www.bioclaim.com/Fraud-Files.
[7] Graphic Online, Dr. Ametewee Defrauds NHIA of GH¢415,000, Graphic Online, Accra, Ghana, 2015, http://www.graphic.com.gh/news/general-news/dr-ametewee-defrauds-nhia-of-gh-415-000.html.
[8] W.-S. Yang and S.-Y. Hwang, "A process-mining framework for the detection of healthcare fraud and abuse," Expert Systems with Applications, vol. 31, no. 1, pp. 56-68, 2006.
[9] G. C. van Capelleveen, Outlier Based Predictors for Health Insurance Fraud Detection within US Medicaid, University of Twente, Enschede, Netherlands, 2013.
[10] Y. Shan, D. W. Murray, and A. Sutinen, "Discovering inappropriate billings with local density-based outlier detection method," in Proceedings of the Eighth Australasian Data Mining Conference, vol. 101, pp. 93-98, Melbourne, Australia, December 2009.
[11] L. D. Weiss and M. K. Sparrow, "License to steal: how fraud bleeds America's health care system," Journal of Public Health Policy, vol. 22, no. 3, pp. 361-363, 2001.
[12] P. Travaille, R. M. Muller, D. Thornton, and J. Van Hillegersberg, "Electronic fraud detection in the US Medicaid healthcare program: lessons learned from other industries," in Proceedings of the 17th Americas Conference on Information Systems (AMCIS), pp. 1-11, Detroit, Michigan, August 2011.
[13] A. Abdallah, M. A. Maarof, and A. Zainal, "Fraud detection system: a survey," Journal of Network and Computer Applications, vol. 68, pp. 90-113, 2016.
[14] A. K. I. Hassan and A. Abraham, "Computational intelligence models for insurance fraud detection: a review of a decade of research," Journal of Network and Innovative Computing, vol. 1, pp. 341-347, 2013.
[15] E. Kirkos, C. Spathis, and Y. Manolopoulos, "Data mining techniques for the detection of fraudulent financial statements," Expert Systems with Applications, vol. 32, no. 4, pp. 995-1003, 2007.
[16] H. Joudaki, A. Rashidian, B. Minaei-Bidgoli et al., "Using data mining to detect health care fraud and abuse: a review of literature," Global Journal of Health Science, vol. 7, no. 1, pp. 194-202, 2015.
[17] V. Rawte and G. Anuradha, "Fraud detection in health insurance using data mining techniques," in Proceedings of the 2015 International Conference on Communication, Information & Computing Technology (ICCICT), pp. 1-5, Mumbai, India, January 2015.
[18] J. Li, K.-Y. Huang, J. Jin, and J. Shi, "A survey on statistical methods for health care fraud detection," Health Care Management Science, vol. 11, no. 3, pp. 275-287, 2008.
[19] Q. Liu and M. Vasarhelyi, "Healthcare fraud detection: a survey and a clustering model incorporating geo-location information," in Proceedings of the 29th World Continuous Auditing and Reporting Symposium, Brisbane, Australia, November 2013.
[20] T. Ekin, F. Leva, F. Ruggeri, and R. Soyer, "Application of Bayesian methods in detection of healthcare fraud," Chemical Engineering Transactions, vol. 33, pp. 151-156, 2013.
[21] Home, The NHCAA, https://www.nhcaa.org.
[22] S. Viaene, R. A. Derrig, and G. Dedene, "A case study of applying boosting naive Bayes to claim fraud diagnosis," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 5, pp. 612-620, 2004.
[23] Y. Singh and A. S. Chauhan, "Neural networks in data mining," Journal of Theoretical and Applied Information Technology, vol. 5, no. 1, pp. 37-42, 2009.
[24] D. Tomar and S. Agarwal, "A survey on data mining approaches for healthcare," International Journal of Bio-Science and Bio-Technology, vol. 5, no. 5, pp. 241-266, 2013.
[25] P. Vamplew, A. Stranieri, K.-L. Ong, P. Christen, and P. J. Kennedy, "Data mining and analytics 2011," in Proceedings of the Ninth Australasian Data Mining Conference (AusDM), Australian Computer Society, Ballarat, Australia, December 2011.
[26] K. S. Ng, Y. Shan, D. W. Murray et al., "Detecting non-compliant consumers in spatio-temporal health data: a case study from Medicare Australia," in Proceedings of the 2010 IEEE International Conference on Data Mining Workshops, pp. 613-622, Sydney, Australia, December 2010.
[27] J. F. Roddick, J. Li, P. Christen, and P. J. Kennedy, "Data mining and analytics 2008," in Proceedings of the 7th Australasian Data Mining Conference (AusDM 2008), vol. 87, pp. 105-110, Glenelg, South Australia, November 2008.
[28] C. Watrin, R. Struffert, and R. Ullmann, "Benford's Law: an instrument for selecting tax audit targets," Review of Managerial Science, vol. 2, no. 3, pp. 219-237, 2008.
[29] F. Lu and J. E. Boritz, "Detecting fraud in health insurance data: learning to model incomplete Benford's Law distributions," in Machine Learning, J. Gama, R. Camacho, P. B. Brazdil, A. M. Jorge, and L. Torgo, Eds., pp. 633-640, Springer, Berlin, Heidelberg, 2005.

18 Journal of Engineering

[30] P Ortega C J Figueroa and G A Ruz ldquoA medical claimfraudabuse detection system based on data mining a casestudy in Chilerdquo DMIN vol 6 pp 26ndash29 2006

[31] T Back J M De Graaf J N Kok and W A Kosters Feoryof Genetic Algorithms World Scientific Publishing RiverEdge NJ USA 2001

[32] M Melanie An Introduction to Genetic Algorithms (e MITPress Cambridge MA USA 1st edition 1998

[33] D Goldberg Genetic Algorithms in Optimization Search andMachine Learning Addison-Wesley Reading MA USA1989

[34] J Wroblewski ldquo(eoretical foundations of order-based ge-netic algorithmsrdquo Fundamental Informaticae vol 28 no 3-4pp 423ndash430 1996

[35] J H Holland Adaptation in Natural and Artificial SystemsAn Introductory Analysis with Applications to Biology Con-trol and Artificial Intelligence MIT Press Cambridge MAUSA 1st edition 1992

[36] V N Vapnik Fe Nature of Statistical Learning FeorySpringer New York NY USA 2nd edition 2000

[37] J Salomon Support Vector Machines for Phoneme Classifi-cation University of Edinburgh Edinburgh UK 2001

[38] J Platt Sequential Minimal Optimization A Fast Algorithmfor Training Support Vector Machines Microsoft ResearchRedmond WA USA 1998

[39] J Platt ldquoUsing analytic QP and sparseness to speed training ofsupport vector machinesrdquo in Proceedings of the Advances inNeural Information Processing Systems Cambridge MAUSA 1999

[40] C-W Hsu C-C Chang and C-J Lin A Practical Guide toSupport Vector Classification Data Science Association Tai-pei Taiwan 2003

[41] I Guyon and A Elisseeff ldquoAn introduction to variable andfeature selectionrdquo Journal of Machine Learning Researchvol 3 pp 1157ndash1182 2003

[42] D Dvorski Installing Configuring and Developing withXAMPP Ski Canada Magazine Toronto Canada 2007

[43] MATLAB Classification Learner App MATLAB Version 2019aMathworks Computer Software Company Natick MS USA2019 httpwwwmathworkscomhelpstatsclassification-learner-apphtml

Journal of Engineering 19

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 3: DecisionSupportSystem(DSS)forFraudDetectioninHealth ...downloads.hindawi.com/journals/je/2019/1432597.pdf · ResearchArticle DecisionSupportSystem(DSS)forFraudDetectioninHealth InsuranceClaimsUsingGeneticSupportVector

auditing all claims is not feasible by any practical means. Furthermore, it is challenging to audit providers without concrete smoking-gun clues. A practical approach is to develop shortlists for scrutiny and to audit the providers and patients on those shortlists. Various analytical techniques can be employed in developing audit shortlists.

The most common fraud detection techniques reported in the literature include machine learning, data mining, AI, and statistical methods. The most cost-saving model, using the Naïve–Bayes algorithm, was used to create a subsample of 20 claims consisting of 400 objects, where 50% of the objects were classified as fraud and the other 50% as legal, which ultimately does not give a clear picture of the decision when compared with other classifiers [22].

The integration of multiple traditional methods has emerged as a new research area in combating fraud. This approach could be supervised, unsupervised, or both, with one method depending on the other for classification. One method may be used as a preprocessing step to modify the data in preparation for classification [9, 23, 24], or, at a lower level, the individual steps of the algorithms can be intertwined to create something fundamentally original. Hybrid methods can be used to tailor solutions to a particular problem domain, and different aspects of performance can be specifically targeted, including classification ability, ease of use, and computational efficiency [14].

Fuzzy logic was combined with neural networks to assess and automatically classify medical claims [14]. The concept of data warehousing for data mining purposes in health care was applied to develop an electronic fraud detection application that reviews service providers on behavioural heuristics and compares them with similar service providers. Australia's Health Insurance Commission has explored the online discounting learning algorithm to identify rare cases in pathology insurance data [10, 25–27].

Researchers in Taiwan developed a detection model based on process mining that systematically identified practices derived from clinical pathways to detect fraudulent claims [8].

Results published in [28, 29] used Benford's Law distributions to detect anomalies in claims reimbursements in Canada. Despite the detection of some anomalies and irregularities, the ability to identify suspect claims is very limited for health insurance claim fraud detection, since the approach applies to service providers with payer-fixed prices.

Neural networks were used to develop an application for detecting medical abuse and fraud for a private health insurance scheme in Chile [30]. The ability to process claims on a real-time basis accounts for the innovative nature of this method. The application of association rule mining to examine billing patterns within a particular specialist group, in order to detect suspicious claims and potentially fraudulent individuals, was incorporated in [9, 22, 30].

3. Mathematical Foundations for Genetic Support Vector Machines

In the 1960s, John Holland invented genetic algorithms, involving a simulation of Darwinian survival of the fittest together with the processes of crossover, mutation, and inversion that occur in natural genetics. Holland's analysis demonstrated that, under certain assumptions, the GA indeed achieves an optimal balance [31–34]. In contrast with evolution strategies and evolutionary programming, Holland's original goal was not to design algorithms to solve specific problems but rather to formally study the phenomenon of adaptation as it occurs in nature and to develop ways in which the mechanisms of natural adaptation might be imported into computer systems. Moreover, Holland was the first to attempt to put computational evolution on a firm theoretical footing [35].

Genetic algorithms operate through three main operators, namely, (1) reproduction, (2) crossover, and (3) mutation. A typical genetic algorithm requires (1) a genetic representation of the solution domain and (2) a fitness function to evaluate the solution domain [31–34].

Reproduction is controlled by the crossover and mutation operators. Crossover is the process whereby genes are selected from the parent chromosomes and new offspring are produced. Mutation is designed to add diversity to the population and to ensure the possibility of exploring the entire search space; it replaces the values of some randomly selected genes of a chromosome with arbitrary new values [33, 35].

During the reproduction stage, an individual is assigned a fitness value derived from its raw performance measure given by the objective function.
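As a concrete illustration of these three operators, the MATLAB sketch below runs a single GA generation on a binary population. The population size (20), crossover probability (0.6), and mutation probability (0.033) follow the settings reported later in Section 4.1; the chromosome length and the placeholder fitness function are assumptions for illustration only, not the claims-specific objective used in this work.

% Minimal sketch of one GA generation: roulette-wheel selection,
% single-point crossover, and bit-flip mutation on binary chromosomes.
popSize = 20; nGenes = 5; pc = 0.6; pm = 0.033;
pop     = randi([0 1], popSize, nGenes);     % random initial population
fitness = sum(pop, 2) + 1;                   % placeholder fitness (must be positive)

% Reproduction: fitness-proportionate (roulette-wheel) selection
cumP    = cumsum(fitness / sum(fitness));
parents = zeros(size(pop));
for k = 1:popSize
    parents(k, :) = pop(find(cumP >= rand, 1, 'first'), :);
end

% Crossover: single-point, applied to consecutive pairs with probability pc
offspring = parents;
for k = 1:2:popSize-1
    if rand < pc
        cut = randi(nGenes - 1);                       % crossover point
        offspring(k,   cut+1:end) = parents(k+1, cut+1:end);
        offspring(k+1, cut+1:end) = parents(k,   cut+1:end);
    end
end

% Mutation: flip each gene independently with probability pm
mask      = rand(size(offspring)) < pm;
offspring = double(xor(offspring, mask));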

The support vector machine (SVM), a statistical machine learning technique, was introduced in 1995 by Vapnik and Cortes as an alternative to polynomial, radial basis function, and multilayer perceptron classifiers, in which the weights of the network are found by solving a quadratic programming (QP) problem with linear inequality and equality constraints rather than by solving a nonconvex, unconstrained minimization problem [36–39]. As a machine learning technique used for binary classification, regression analysis, face detection, text categorization, bioinformatics, data mining, and outlier detection, SVMs face challenges when the dataset is very large because of the dense nature and memory requirements of the quadratic form of the problem. Nevertheless, the SVM is an excellent example of supervised learning that tries to maximize generalization by maximizing the margin and supports nonlinear separation using kernelization [40]. The SVM tries to avoid both overfitting and underfitting. The margin in an SVM denotes the distance from the decision boundary to the closest data points in the feature space.

Given the claims training dataset corresponding to $x_i \in \mathbb{R}^n$ in the feature space $F$, the linear hyperplane dividing the data into the two labelled classes $y_i$ (fraud and legal) can be mathematically obtained as

$$\omega^T x_i + b = 0, \qquad \omega \in \mathbb{R}^n,\; b \in \mathbb{R}. \qquad (1)$$

Assume the training dataset is correctly classified, as shown in Figure 1. This means that the SVC computes the hyperplane that maximizes the margin separating the two classes (legal claims and fraudulent claims).


In the simplest linear form, an SVC is a hyperplane that separates the legal claims from the false claims with a maximum margin. Finding this hyperplane involves obtaining two hyperplanes parallel to it, as shown in Figure 1, with an equal distance to the maximum margin, provided that all the training data satisfy the following constraints:

$$\omega^T x_i + b \ge +1 \quad \text{for } y_i = +1, \qquad \omega^T x_i + b \le -1 \quad \text{for } y_i = -1, \qquad (2)$$

where $\omega$ is the normal to the hyperplane, $|b|/\|\omega\|$ is the perpendicular distance from the hyperplane to the origin, and $\|\omega\|$ is the Euclidean norm of $\omega$. The separating hyperplane is defined by the plane $\omega^T x_i + b = 0$, and the constraints in (2) are combined to form

$$y_i\left(\omega^T x_i + b\right) \ge 1. \qquad (3)$$

The pair of hyperplanes that gives the maximum margin can be found by minimizing $\|\omega\|^2$ subject to the constraint in (3). This leads to a quadratic optimization problem formulated as

$$\text{Minimize } f(\omega, b) = \frac{\|\omega\|^2}{2} \quad \text{subject to } y_i\left(\omega^T x_i + b\right) \ge 1, \quad \forall i = 1, \ldots, n. \qquad (4)$$

This problem is reformulated by introducing Lagrange multipliers $\alpha_i\ (i = 1, \ldots, n)$, one for each constraint, and subtracting the corresponding constraint terms from the objective function. This results in the primal Lagrangian function

$$L_P(\omega, b, \alpha) = \frac{\|\omega\|^2}{2} + \sum_{i=1}^{n} \alpha_i\left(1 - y_i\left(\omega^T x_i + b\right)\right), \quad \forall i = 1, \ldots, n. \qquad (5)$$

Taking the partial derivatives of $L_P(\omega, b, \alpha)$ with respect to $\omega$ and $b$, respectively, and applying duality theory yields

$$\frac{\partial L_P}{\partial \omega} = 0 \;\Longrightarrow\; \omega = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad \frac{\partial L_P}{\partial b} = 0 \;\Longrightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0. \qquad (6)$$

The problem defined in (5) is a quadratic optimization (QP) problem. Maximizing the primal problem $L_P$ with respect to $\alpha_i$, subject to the constraints that the gradient of $L_P$ with respect to $\omega$ and $b$ vanishes and that $\alpha_i \ge 0$, gives the following two conditions:

$$\omega = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0. \qquad (7)$$

Substituting these constraints gives the dual formulation of the Lagrangian:

$$\underset{\alpha}{\text{Maximize}}\; L_D(\omega, b, \alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \left(x_i \cdot x_j\right) \quad \text{subject to } \sum_{i=1}^{n} \alpha_i y_i = 0,\; \alpha_i \ge 0,\; i = 1, \ldots, n. \qquad (8)$$

The values of $\alpha_i$, $\omega$, and $b$ are obtained from the following equations:

$$\omega = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad b = -\frac{1}{2}\left(\min_{i:\, y_i = +1} \omega^T x_i + \max_{i:\, y_i = -1} \omega^T x_i\right). \qquad (9)$$

Also, the Lagrange multipliers satisfy

$$\alpha_i\left(1 - y_i\left(\omega^T x_i + b\right)\right) = 0. \qquad (10)$$

Hence, this dual Lagrangian $L_D$ is maximized with respect to its nonnegative $\alpha_i$, giving a standard quadratic optimization problem. The training vectors with nonzero Lagrange multipliers $\alpha_i$ are called support vectors; for an input data point $x_i$ with nonzero $\alpha_i$,

$$y_i\left(\omega^T x_i + b\right) = 1. \qquad (11)$$

The equation above characterizes the support vectors (SVs). Although the SVM classifier described so far can only have a linear hyperplane as its decision surface, its formulation can be extended to build a nonlinear SVM. SVMs can also classify data that are not linearly separable by introducing a soft-margin hyperplane, as shown in Figure 2.

Introducing the slack variable into the constraints yields

$$\omega^T x_i + b \ge 1 - \xi_i \quad \text{for } y_i = +1, \qquad \omega^T x_i + b \le -1 + \xi_i \quad \text{for } y_i = -1, \qquad \xi_i \ge 0 \;\; \forall i. \qquad (12)$$

Figure 1: Standard formulation of the SVM, showing the legitimate and fraudulent claim classes separated by the hyperplanes $W^T\phi(x) + b = -1, 0, +1$, the margin $2/\sqrt{w^T w}$, the kernel $K(x_i, x_j) = \phi^T(x_i)\phi(x_j)$, the support vectors, the slack values $\xi$, and a misclassified point.


These slack variables help to find the hyperplane that yields the minimum number of training errors. Modifying equation (4) to include the slack variables yields

$$\underset{\omega,\, b,\, \xi_i}{\text{Minimize}}\; \frac{\|\omega\|^2}{2} + C \sum_{i=1}^{n} \xi_i \quad \text{subject to } y_i\left(\omega^T x_i + b\right) - 1 + \xi_i \ge 0,\; \xi_i \ge 0. \qquad (13)$$

The parameter $C$ is a finite regularization parameter that trades off a wide margin against a small number of margin failures; the larger the value of $C$, the heavier the penalty on errors.

The Karush–Kuhn–Tucker (KKT) conditions are necessary to ensure optimality of the solution to a nonlinear programming problem:

$$y_i\left(\omega^T x_i + b\right) - 1 \ge 0, \quad i = 1, 2, \ldots, l,\; \forall i, \qquad \alpha_i\left(y_i\left(\omega^T x_i + b\right) - 1\right) = 0, \quad \alpha_i \ge 0,\; \forall i. \qquad (14)$$

The KKT conditions for the primal problem are used in the nonseparable case, after which the primal Lagrangian becomes

$$L_P = \frac{\|\omega\|^2}{2} + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i\left(y_i\left(\omega^T x_i + b\right) - 1 + \xi_i\right) - \sum_{i=1}^{n} \beta_i \xi_i. \qquad (15)$$

With $\beta_i$ as the Lagrange multipliers enforcing positivity of the slack variables $\xi_i$, applying the KKT conditions to the primal problem yields

$$\begin{aligned}
\frac{\partial L_P}{\partial \omega_u} &= \omega_u - \sum_{i=1}^{n} \alpha_i y_i x_{iu} = 0,\\
\frac{\partial L_P}{\partial b} &= \sum_{i=1}^{n} \alpha_i y_i = 0,\\
\frac{\partial L_P}{\partial \xi_i} &= C - \alpha_i - \beta_i = 0,\\
\alpha_i\left(y_i\left(\omega^T x_i + b\right) - 1 + \xi_i\right) &= 0,\\
y_i\left(\omega^T x_i + b\right) - 1 + \xi_i &\ge 0,\\
\alpha_i,\ \beta_i,\ \xi_i \ge 0,\quad C &\ge 0,\\
i = 1, 2, \ldots, n \;\text{ and }\; u &= 1, 2, \ldots, d,
\end{aligned} \qquad (16)$$

where the parameter $d$ represents the dimension of the dataset.

From the expressions obtained after applying the KKT conditions, $\xi_i = 0$ for $\alpha_i < C$, since $\beta_i = C - \alpha_i \ne 0$. This implies that any training point for which $0 < \alpha_i < C$ can be used to compute $b$, as such a point does not cross the margin boundary. A point with

$$\alpha_i = 0, \qquad y_i\left(\omega^T x_i + b\right) - 1 + \xi_i > 0 \qquad (17)$$

does not participate in the derivation of the separating function, whereas a point with $\alpha_i = C$ and $\xi_i > 0$ satisfies

$$y_i\left(\omega^T x_i + b\right) - 1 + \xi_i = 0. \qquad (18)$$

A nonlinear SVM maps the training samples from the input space into a higher-dimensional feature space $F$ via a kernel mapping function. In the dual Lagrangian function, the inner products are replaced by the kernel function:

$$\left(\Phi(x_i) \cdot \Phi(x_j)\right) = k\left(x_i, x_j\right). \qquad (19)$$

Effective kernels make it possible to find the separating hyperplane without excessive computational resources. The nonlinear SVM dual Lagrangian is

$$L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, k\left(x_i, x_j\right), \qquad (20)$$

subject to

$$\sum_{i=1}^{n} \alpha_i y_i = 0, \qquad 0 \le \alpha_i, \quad i = 1, \ldots, n. \qquad (21)$$

Figure 2: Linear separating hyperplanes for the nonseparable case of the SVC, obtained by introducing the slack variable $\xi$ (showing the support vectors, the margin, and a misclassified point with $\xi > 1$).


This is similar to the generalized linear case. The nonlinear SVM separating hyperplane is illustrated in Figure 3 together with the support vectors, class labels, and margin. This model can be solved by the same optimization method as in the separable case. Therefore, the optimal hyperplane has the following form:

$$f(x) = \sum_{i=1}^{n} \alpha_i y_i\, k\left(x_i, x\right) + b, \qquad (22)$$

where $b$ is the offset of the decision boundary from the origin. Hence, classifying a newly arrived data point $x$ reduces to evaluating

$$g(x) = \operatorname{sign}(f(x)). \qquad (23)$$

However, feasible kernels must be symmetric; that is, the matrix $K$ with components $k(x_i, x_j)$ must be positive semidefinite and satisfy Mercer's condition given in [39, 40]. The kernel functions considered in this work are summarized in Table 1.

These kernels satisfy Mercer's condition, the RBF (Gaussian) kernel being the most widely used kernel function in the literature. The RBF kernel has the advantage of adding a single free parameter $\gamma > 0$, which controls the width of the kernel through $\gamma = 1/(2\sigma^2)$, where $\sigma^2$ is the variance of the resulting Gaussian hypersphere. The linear kernel is given as $k(x_i, x_j) = x_i \cdot x_j$. Consequently, training the SVMs reduces to solving the QP optimization problem. The above mathematical formulations form the foundation for the development and deployment of genetic support vector machines as the decision support tool for detecting and classifying fraudulent health insurance claims. In recent times, the decision-making activities of knowledge-intensive enterprises have depended heavily on the successful classification of data patterns, despite the time and computational resources required to achieve the results owing to the complexity and size of the datasets.
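To make the kernel choices above concrete, the MATLAB sketch below fits the linear, polynomial, and RBF SVM classifiers on a labelled claims matrix. The variables X (one claim per row, selected features as columns) and y (+1 legal, -1 fraudulent), the box constraint value, and the mapping of the paper's γ onto MATLAB's KernelScale are assumptions for illustration, not the authors' code.

% Hedged sketch using fitcsvm (Statistics and Machine Learning Toolbox).
C = 1;                     % box constraint (regularization parameter C); illustrative value
% Assumed mapping: gamma = 1/(2*sigma^2) and KernelScale = 1/sqrt(gamma);
% with sigma^2 = 0.9 (the variance reported in Section 4.2), ks = sqrt(1.8).
ks = sqrt(2*0.9);

svmLinear = fitcsvm(X, y, 'KernelFunction', 'linear', 'BoxConstraint', C);
svmPoly   = fitcsvm(X, y, 'KernelFunction', 'polynomial', ...
                    'PolynomialOrder', 3, 'BoxConstraint', C);   % cubic polynomial kernel
svmRbf    = fitcsvm(X, y, 'KernelFunction', 'rbf', ...
                    'BoxConstraint', C, 'KernelScale', ks);

% 10-fold cross-validated accuracy, as used later for model evaluation
cvRbf  = crossval(svmRbf, 'KFold', 10);
accRbf = 1 - kfoldLoss(cvRbf);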

4. Methodology for GSVM Fraud Detection

The systematic approach adopted for the design and development of the genetic support vector machines for health insurance claims fraud detection is presented in the conceptual framework in Figure 4 and in the flow chart implementation in Figure 5.

The conceptual framework incorporates the design and development of the key algorithms that enable submitted claims data to be analysed and a model to be developed for testing and validation. The flow chart presents the algorithm implemented based on the theoretical foundations, incorporating genetic algorithms and support vector machines, two useful machine learning algorithms necessary for fraud detection; their combined use in the detection process generates accurate results. The methodology for the design and development of the genetic support vector machines, as presented above, consists of three (3) significant steps, namely, (1) data preprocessing, (2) classification engine development, and (3) data postprocessing.

4.1. Data Preprocessing. Data preprocessing is the first significant stage in the development of the fraud detection system. This stage involves the use of data mining techniques to transform the data from their raw form into the format required by the SVC for the detection and identification of health insurance claims fraud.

The data preprocessing stage involves the removal of unwanted customers and missing records and the smoothing of the data. This ensures that only useful and relevant information is passed on to the next process.

Before preprocessing, the data were imported from MS Excel (CSV format) into a MySQL database called NHIS. The imported data include the electronic Health Insurance Claims (e-HIC) data and the HIC tariff datasets, stored as tables in the NHIS database. The e-HIC data preprocessing involves the following steps: (1) claims data filtering and selection, (2) feature selection and extraction, and (3) feature adjustment.

Figure 3: Nonlinear separating hyperplane for the nonseparable case of the SVM, showing the two classes, the support vectors, and the margin.

Table 1: Summarized kernel functions used.

Kernel name                   Parameters                              Kernel function
Radial basis function (RBF)   $\gamma \in \mathbb{R}$                 $k(x_i, x_j) = e^{-\gamma\|x_i - x_j\|^2}$
Polynomial function           $c \in \mathbb{R},\ d \in \mathbb{N}$   $k(x_i, x_j) = (x_i \cdot x_j + c)^d$


The WEKA machine learning and knowledge analysis environment was used for feature selection and extraction, while the data processing code was written in the MATLAB technical computing environment. The developed MATLAB-based decision support engine was connected to MySQL using the script shown in Figure 6.

Preprocessing of the raw data involves claims cost validity checks. The tariff dataset consists of the approved tariffs for each diagnostic-related group (DRG), which were strictly enforced to clean the data before further processing. Claims are partitioned into two groups, namely, (1) claims with valid and approved costs within each DRG and (2) claims with invalid costs (those above the approved tariffs within each DRG).

With the recent increase in the volume and dimensionality of real claims data, there is an urgent need for faster, more reliable, and cost-effective data mining techniques for classification models. These techniques require the extraction of a smaller, optimized set of features, obtained by removing largely redundant, irrelevant, and unnecessary features for the class prediction [41].

Feature selection algorithms are utilized to extract a minimal subset of attributes such that the resulting probability distribution of the data classes is close to the original distribution obtained using all attributes. Based on the idea of survival of the fittest, a new population is constructed to comply with the fittest rules in the current population as well as the offspring of these rules. Offspring are generated by applying genetic operators such as crossover and mutation. The process of offspring generation continues until it evolves a population N in which every rule satisfies the fitness threshold. With an initial population of 20 instances, generation continued until the 20th generation, with a crossover probability of 0.6 and a mutation probability of 0.033. The features selected by the genetic algorithm are "Attendance date," "Hospital code," "GDRG code," "Service bill," and "Drug bill." These are the features selected, extracted, and used as the basis for the optimization problem formulated below:

$$\begin{aligned}
\text{Minimize } & \text{Total cost} = f\left(S_{\text{bill}}, D_{\text{bill}}\right)\\
\text{subject to } & \sum_{i=1}^{n} S_{i,\text{bill}} \le G_{\text{tariff}}, \quad \forall i,\ i = 1, 2, \ldots, n,\\
& \sum_{j=1}^{n} D_{j,\text{bill}} \le D_{\text{tariff}}, \quad \forall j,\ j = 1, 2, \ldots, n,
\end{aligned} \qquad (24)$$

where $S_{\text{bill}}$ is the service bill and $D_{\text{bill}}$ is the drug bill.
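A minimal MATLAB sketch of this tariff-validity screen is given below. The file names and column names (GDRGCode, ServiceBill, DrugBill, ServiceTariff, DrugTariff) are hypothetical stand-ins for the NHIS schema, which is not reproduced here; the logic simply flags claims whose service or drug bill exceeds the approved tariff for its diagnostic group.

% Hedged sketch of the claims cost validity check of equation (24).
claims  = readtable('ehic_claims.csv');    % assumed export of the e-HIC claims table
tariffs = readtable('gdrg_tariffs.csv');   % assumed export of the approved tariff table

T = innerjoin(claims, tariffs, 'Keys', 'GDRGCode');      % attach tariffs to each claim
isValid = T.ServiceBill <= T.ServiceTariff & T.DrugBill <= T.DrugTariff;

validClaims   = T(isValid,  :);   % passed on to SVM training and testing
invalidClaims = T(~isValid, :);   % flagged before classification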

The GA-selected e-HIC dataset is subjected to SVM training using 70% of the dataset, with the remaining 30% used for testing, as depicted in Figure 7.

The e-HIC dataset that passes the preprocessing stage, that is, the valid claims, was used for SVM training and testing. The best data, those that meet the genetic algorithm's criteria, are classified first. Each record of this dataset is classified as either "Fraudulent Bills" or "Legal Bills."

The same SVM training and testing datasets are applied to the SVM algorithm for its performance analysis. The inbuilt MATLAB code for the SVM classifiers was integrated as one function for the linear, polynomial, and RBF kernels. The claims datasets were partitioned for classifier training, testing, and validation: 70% of the dataset was used for training and 30% for testing. The linear, polynomial, and radial basis function SVM classification kernels were used, with ten-fold cross validation for each kernel, and the results were averaged. For the polynomial classification kernel, a cubic polynomial was used. The RBF classification kernel used the SMO method [40].

Figure 4: Conceptual model for the design and development of the genetic support vector machines (valid claims pass through the GSVM optimization and the fraud detection model, and the fraud detection classifier labels each claim as legal or as a duplicated, upcoded, unbundled, or uncovered fraudulent claim).


This method ensures the handling of large data sizes, as it performs the data transformation through kernelization. After running many instances and varying the RBF parameters, a variance of 0.9 gave better results, as it corresponded well with the datasets used for the classification. After each classification, the correct rate is calculated and the confusion matrix is extracted. The confusion matrix gives a count of the true legal, true fraudulent, false legal, false fraudulent, and inconclusive bills:

(i) True legal bills: the number of "Legal Bills" correctly classified as "Legal Bills" by the classifier.

(ii) True fraudulent bills: the number of "Fraudulent Bills" correctly classified as "Fraudulent Bills" by the classifier.

(iii) False legal bills: bills classified as "Legal Bills" even though they are not, that is, bills wrongly classified as "Legal Bills" by the kernel used.

(iv) False fraudulent bills: bills wrongly classified as fraudulent by the classifier; the confusion matrix gives a count of these wrongly or incorrectly classified bills.

Figure 5: Flow chart for the design and development of the genetic support vector machines (the training and testing datasets undergo roulette-wheel feature subset selection, fitness evaluation, and recombination through crossover and mutation over successive generations and populations; the optimized SVM hyperparameters (C, γ) are stored in the PPD and used to train and validate the SVM classifier that forms the fraud detection model).


(v) Inconclusive bills: these consist of nonclassified bills.

The correct rate is calculated as the total number of correctly classified bills, namely, the true legal bills and true fraudulent bills, divided by the total number of bills used for the classification:

$$\text{correct rate} = \frac{\text{number of TLB} + \text{number of TFB}}{\text{total number of bills (TB)}}, \qquad (25)$$

where TLB = True Legal Bills and TFB = True Fraudulent Bills, and

$$\text{accuracy} = (1 - \text{Error}) = \frac{TP + TN}{TP + TN + FP + FN} = \Pr(C), \qquad (26)$$

the probability of a correct classification.

4.1.1. Sensitivity. This is the statistical measure of the proportion of actual fraudulent claims that are correctly detected:

$$\text{sensitivity} = \frac{TP}{TP + FN} = \frac{TP}{P}. \qquad (27)$$

4.1.2. Specificity. This is the statistical measure of the proportion of negative (legal) claims that are correctly classified:

$$\text{specificity} = \frac{TN}{TN + FP} = \frac{TN}{N}. \qquad (28)$$
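The short MATLAB sketch below shows how these measures can be computed from predicted and true label vectors, treating fraudulent claims as the positive class in line with the sensitivity definition above. The variable names yTest and yPred and the +1/-1 label coding are assumptions for illustration, not part of the original implementation.

% Counts of the four outcome types (assumed labels: +1 legal, -1 fraudulent)
trueLegal  = sum(yPred == +1 & yTest == +1);   % legal bills correctly classified
trueFraud  = sum(yPred == -1 & yTest == -1);   % fraudulent bills correctly classified
falseLegal = sum(yPred == +1 & yTest == -1);   % fraudulent bills classified as legal
falseFraud = sum(yPred == -1 & yTest == +1);   % legal bills classified as fraudulent

totalBills  = trueLegal + trueFraud + falseLegal + falseFraud;
correctRate = (trueLegal + trueFraud) / totalBills;    % equations (25)-(26)
sensitivity = trueFraud / (trueFraud + falseLegal);    % equation (27)
specificity = trueLegal / (trueLegal + falseFraud);    % equation (28)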

4.2. GSVM Fraud Detection System Implementation and Testing. The decision support system comprises four main integrated modules, namely, (1) algorithm implementation using the MATLAB technical computing platform, (2) a graphical user interface (GUI) for the HIC fraud detection system, covering the uploading and processing of claims management, (3) system administrator management, and (4) postprocessing of the detection and classification results.

Figure 6: MATLAB-based decision support engine connection to the database.

Figure 7: Data preprocessing for SVM training and testing (creation of the claims record database, claims filtering and selection, feature selection and extraction, feature adjustment, and data normalization, yielding the GA e-HIC data and the SVM training and testing dataset).


The front end of the detection system was developed using XAMPP, a free and open-source cross-platform web server solution stack package developed by Apache Friends [42], consisting mainly of the Apache HTTP Server, the MariaDB database, and interpreters for scripts written in the PHP and Perl programming languages [42]. XAMPP stands for Cross-Platform (X), Apache (A), MariaDB (M), PHP (P), and Perl (P). The Health Insurance Claims Fraud Detection System (HICFDS) was developed in the MATLAB technical computing environment, with the capability to connect to an external database and a graphical user interface (GUI) for enhanced interactivity with users.

Figure 8: System implementation architecture for the HICFDS (the NHIS claims dataset is uploaded through the developed GUI, processed by the model and GSVM algorithm engine with exploratory data analysis, and the detected results are autocreated in a MySQL results database).

Figure 9: Detection results control portal interface.


The HICFDS consists of several functional components, namely, (1) a function for computing the descriptive statistics of raw and processed data, (2) a preprocessing wrapper function for data handling and processing, and (3) MATLAB functions for the GA optimization and SVM classification processes. The HICFDS components are depicted in Figure 8.

The results generated by the HICFDS are stored in a MySQL database. The results comprise three parts: the legitimate claims report, the fraudulent claims, and the statistics of the results. The developed GUI portal for analyzing the results obtained from the classification of the submitted health insurance claims is displayed in Figure 9. Clicking the fraudulent button in the GUI produces a pop-up menu that generates Figure 10 for the claims dataset, showing the grouping of detected fraudulent claim types in the datasets.

For each classifier, a 10-fold cross validation (CV) over the hyperparameters (C, γ) was performed on the Patients Payment Data (PPD). The GA optimization evaluated several hyperparameter settings in search of the optimal SVM. The SVC training aims at the best SVC parameters (C, γ) for building the HICFD classifier model. The developed classifier is evaluated using the testing and validation data, and its accuracy is assessed using cross validation (CV) to avoid overfitting of the SVC to the training data. The random search method was used for SVC parameter tuning, where exponentially growing sequences of the hyperparameters (C, γ), a practical way to identify suitable parameters, were used to obtain the best CV accuracy on the classifier claims data samples. Random search differs slightly from grid search: instead of searching over the entire grid, it evaluates only a random sample of points on the grid, which makes it computationally cheaper than a grid search. Experimentally, 10-fold CV was used as the measure of training accuracy, where 70% of each sample was used for training and the remaining 30% was used for testing and validation.
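A hedged MATLAB sketch of this random search is shown below: exponentially growing candidates for (C, γ) are sampled, each pair is scored by 10-fold cross-validated accuracy on the training split, and the best pair is retained for the final RBF classifier. The search ranges, the number of trials, and the variables Xtrain and yTrain are illustrative assumptions rather than the authors' exact settings.

% Random search over exponentially growing (C, gamma) candidates
rng(1);                                    % for reproducibility of the sketch
nTrials = 20;
bestAcc = -Inf;
for t = 1:nTrials
    C     = 2^(randi([-5 15]));            % candidate box constraint
    gamma = 2^(randi([-15 3]));            % candidate RBF width parameter
    mdl   = fitcsvm(Xtrain, yTrain, 'KernelFunction', 'rbf', ...
                    'BoxConstraint', C, 'KernelScale', 1/sqrt(gamma));
    acc   = 1 - kfoldLoss(crossval(mdl, 'KFold', 10));   % 10-fold CV accuracy
    if acc > bestAcc
        bestAcc = acc; bestC = C; bestGamma = gamma;
    end
end
finalModel = fitcsvm(Xtrain, yTrain, 'KernelFunction', 'rbf', ...
                     'BoxConstraint', bestC, 'KernelScale', 1/sqrt(bestGamma));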

Figure 10: Grouping of detected fraud types (duplicated, uncovered service, overbilled, unbundled, upcoded, and impersonation claims) in the 100-, 300-, 500-, 750-, and 1000-claim sample datasets.

Table 2: Sample data size and the corresponding fraud types.

Fraud types                 100   300   500   750   1000
Duplicate claims              2     4     4     4      0
Uncovered service claims      4    56    65   109    406
Overbilling claims           44    60    91   121    202
Unbundled claims              0    18    10    54      0
Upcoded claims                2    34    50   122      6
Impersonation claims          0     2    10    23     34
Total suspected claims       52   174   230   433    648

Table 3: Summary performance metrics of the SVM classifiers on the sample sizes.

Kernel used             Data size   Average accuracy rate (%)   Sensitivity (%)   Specificity (%)
Linear                  100         71.43                       60.00             77.78
                        300         72.73                       84.21             0.00
                        500         91.80                       97.78             75.00
                        750         84.42                       95.00             47.06
                        1000        82.95                       85.42             80.00
Polynomial              100         71.43                       66.67             72.73
                        300         72.73                       88.24             20.00
                        500         96.72                       100.00            86.67
                        750         80.52                       96.36             40.91
                        1000        84.71                       83.67             86.11
Radial basis function   100         71.43                       57.14             85.71
                        300         95.45                       95.00             100.00
                        500         99.18                       100.00            96.30
                        750         82.56                       96.88             40.91
                        1000        90.91                       100.00            82.98

Figure 11: Linear SVM on a sample claims dataset (legal and fraudulent bills used for training, the corresponding classified bills, and the support vectors).


4.3. Data Postprocessing: Validation of Classification Results. The classification accuracy on the testing data is a gauge of the ability of the HICFDS to detect and identify fraudulent claims. The testing data used to assess and evaluate the efficiency of the proposed HICFDS (classifier) are taken exclusively from NHIS headquarters and cover different hospitals within the Greater Accra Region of Ghana. The sampled data with the corresponding fraud types after the analysis are shown in Table 2.

In evaluating the classifiers obtained with the analyzed methods, the most widely employed performance measures are used: accuracy, sensitivity, and specificity, based on the counts of True Legal (TP), False Fraudulent (FN), False Legal (FP), and True Fraudulent (TN) bills. This classification is shown in Table 3.

The figures below show the SVC plots for the various classifiers (linear, polynomial, and RBF) on the claims datasets (Figures 11–13).

From the performance metrics and overall statistics presented in Table 4, it is observed that the support vector machine performs best with a classification accuracy of 87.91% using the RBF kernel function, followed by the

Figure 12: Polynomial SVM on a sample claims dataset (legal and fraudulent bills used for training, the corresponding classified bills, and the support vectors).

Figure 13: RBF SVM on a sample claims dataset (legal and fraudulent bills used for training, the corresponding classified bills, and the support vectors).

Table 4: Average performance analysis of the SVM classifiers.

Description   Accuracy (%)   Sensitivity (%)   Specificity (%)
Linear        80.67          84.48             55.97
Polynomial    81.22          86.99             61.28
RBF           87.91          89.80             81.18

Table 5: Confusion matrix for the SVM classifiers.

Description             Data size   TP   TN   FP   FN   Correct rate
Linear                  100          3    7    2    2   0.714
                        300         16    0    3    3   0.713
                        500         88   24    8    2   0.918
                        750         57    8    9    3   0.844
                        1000        41   32    8    7   0.830
Polynomial              100          2    8    3    1   0.714
                        300         15    1    4    2   0.723
                        500         92   26    4    0   0.967
                        750         53   91   13    2   0.805
                        1000        41   31    5    8   0.852
Radial basis function   100          4    6    1    3   0.714
                        300         19    2    0    1   0.955
                        500         95   26    1    0   0.992
                        750         62    9   13    2   0.922
                        1000        41   39    8    0   0.919


polynomial kernel with 81.22% accuracy, the linear SVM emerging as the least-performing classifier with an accuracy of 80.67%. The confusion matrix for the SVM classifiers is given in Table 5 and is utilized in the computation of the performance metrics of the SVM classifiers. For statistical and machine learning classification tasks, a confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of a supervised learning algorithm.

Besides classification accuracy, the amount of time required to process the sample dataset is also an important consideration in this research. The comparison of computational times shows that an increase in the size of the sample dataset also increases the computational time needed to execute the process, regardless of the machine used, which is widely expected. This difference in time cost is due mainly to training on the dataset. Thus, as global data warehouses grow, more computational resources will be needed in machine learning and data mining research on the detection of insurance fraud, as depicted in Figure 14, which relates the average computational time to the sample data size.

Figure 15 summarizes the fraudulent claims detected during the testing of the HICFD with the sample datasets used. As the sample data size increases, the number of suspected claims increases rapidly across the various fraud types detected.

Benchmarking the HICFD analysis aids understanding of the HIC outcomes. From the chart, an increase in the claims dataset size brings a corresponding increase in the number of suspected claims. The graph in Figure 16 shows a sharp rise in the level of suspected claims for the 100-claim test dataset, representing 52% of that sample, after which the proportion of suspected claims increases slightly, by 2%, to 58% for the test data size of 300 claims.

Among these fraud types, the most frequent fraudulent act is uncovered services rendered to insurance subscribers by service providers. It accounts for 22% of the fraudulent claims, the most significant proportion of the total health insurance fraud in the tested dataset. Overbilling of submitted claims is the second most frequent fraud type, representing 20% of the total sample dataset used for this research. It is caused by service providers billing for a service at more than the expected tariff for the required diagnoses. Providers list and bill for a more complex or higher level of service to unfairly boost their financial income within otherwise legitimate claims.

Figure 14: Average computational time (in seconds) against sample data size for the tested sample datasets.

Figure 15: Detected fraud trend (number of suspected claims against sample data size) on the tested claims datasets.

Figure 16: Chart of the types of fraudulent claims (duplicated, uncovered service, overbilled, unbundled, upcoded, and impersonation) detected in the 100-, 300-, 500-, 750-, and 1000-claim datasets.

Table 6: Cost analysis of the tested claims datasets.

Sample data size   Raw cost of claims (R), GH¢   Valid claims cost (V), GH¢   Deviation (R–V), GH¢   Percentage difference (%)
100                 20,791.83                      8,911.72                    11,880.11              133.31
300                 31,496.05                     15,622.70                    15,873.35              101.60
500                 58,218.65                     27,480.96                    30,737.69              111.85
750                 88,394.07                     31,091.58                    57,302.49              184.30
1000               117,448.20                     47,943.38                    69,504.82              144.97


Figure 17: Classification Learner App showing the various algorithms and percentage accuracies in MATLAB.

Table 7: Comparison of the results of GSVM with decision trees and Naïve–Bayes.

Description of the algorithm used              Claims dataset   Accuracy obtained (%)   Average over the datasets (%)
GSVM with radial basis function (RBF) kernel   100              71.43                   87.906
                                               300              95.45
                                               500              99.18
                                               750              82.56
                                               1000             90.91
Decision trees                                 100              62.0                    74.44
                                               300              78.0
                                               500              77.8
                                               750              82.7
                                               1000             71.7
Naïve–Bayes                                    100              50.0                    59.1
                                               300              61.0
                                               500              56.8
                                               750              60.7
                                               1000             67.0


Moreover, some illicit service providers claim to have rendered costly services to insurance subscribers instead of the more affordable ones actually provided. Claims prepared for expensive services rendered to insurance subscribers represent 8% of the fraudulent claims detected in the total sample dataset. Furthermore, unbundled claims, in which service procedures that should be considered an integral part of a single procedure are billed separately, contributed 3.1% of the fraudulent claims in the test data. Owing to insecure processes for the quality delivery of healthcare services, insurance subscribers also contribute to fraudulent claims by loaning their ID cards to family members or third parties, who pretend to be the owners and request HIS benefits in the healthcare sector. Duplicated claims recorded the minimum contribution to fraudulent claims, at 0.5% of the whole sample dataset.

As observed in Table 6, the cost of the claims bill increases proportionally with an increase in the sample size. This is consistent with the increase in fraudulent claims as the sample size increases. Table 6 shows the raw cost (R) of each sample claims dataset, the valid claims cost (V) after processing, the deviation in the claims bill (R–V), and its percentage representation. There is a 27% financial loss on the total submitted claim bills to insurance carriers; this loss rate is highest within the 750-claim dataset of submitted claims.

A summary of the results and a comparison with other machine learning algorithms, namely, decision trees and Naïve–Bayes, are presented in Table 7.

The MATLAB Classification Learner App [43] was chosen to validate the results obtained above. It enables easy comparison among the different classification

Figure 18: Algorithmic runs on the 500-claim dataset.


algorithms implemented. The data used for the GSVM were subsequently used in the Classification Learner App, as shown below.

Figures 17 and 18 show the Classification Learner App with the various implemented algorithms and corresponding accuracies in the MATLAB technical computing environment and the results obtained using the 500-claim dataset, respectively. Figures 19 and 20 depict the subsequent results when the 750- and 1000-claim datasets were utilized for the algorithmic runs and reproducible comparison, respectively. The summarized results and accuracies are illustrated in Table 7 and portray the effectiveness of our proposed approach of using genetic support vector machines (GSVMs) for fraud detection in insurance claims. From the results, it is evident that the GSVM achieves a higher level of accuracy than decision trees and Naïve–Bayes.

5. Conclusions and Recommendations

This work aimed at developing a novel fraud detection model for insurance claims processing based on genetic support vector machines, which hybridizes and draws on the strengths of both genetic algorithms and support vector machines. The GSVM has been investigated and applied in the development of the HICFDS. This paper used the GSVM for the detection of anomalies and the classification of health insurance claims into legitimate and fraudulent claims. SVMs have been considered preferable to other classification techniques owing to several advantages. They enable the separation (classification) of claims into legitimate and fraudulent using the soft margin, thus accommodating updates in the generalization performance of the HICFDS. Among its other notable advantages, it has a nonlinear dividing

Figure 19: Algorithmic runs on the 750-claim dataset.


hyperplane, which captures the discrimination within the dataset. The ability to generalize to newly arrived data for classification was also considered an advantage over other classification techniques.

Thus, the fraud detection system combines two computational intelligence schemes and achieves higher fraud detection accuracy. The average classification accuracies achieved by the SVCs are 80.67%, 81.22%, and 87.91%, which show the performance capability of the SVC model. These classification accuracies are obtained thanks to the careful selection of the features for training and developing the model, as well as fine-tuning of the SVCs' parameters using the V-fold cross-validation approach. These results are much better than those obtained using decision trees and Naïve–Bayes.

The average testing results for the proposed SVCs vary with the nature of the claims dataset used. This is noted in the clustering of the claims dataset (MDC specialty). When the sample dataset is heavily skewed toward one MDC specialty (e.g., OPDC), the performance of the SVCs may favour one classifier, especially the linear SVM, compared with the others. Hence, the behaviour of the dataset has a significant impact on the classification results.

Based on this work, the developed GSVM model was tested and validated using HIC data. The study sought to obtain the best-performing classifier for analyzing health insurance claims datasets for fraud. The RBF kernel was adjudged the best, with an average accuracy rate of 87.91%, and is therefore recommended.

Figure 20: Algorithmic runs on the 1000-claim dataset.


Data Availability

The data used in this study are available upon request. The data can be uploaded when required.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors wish to acknowledge the Carnegie Corporation of New York, through the University of Ghana under the UG-Carnegie Next Generation of Academics in Africa project, for organizing Write Shops that led to the timely completion of this paper.

Supplementary Materials

The material consists of MS Excel file data collected from some NHIS-approved hospitals in Ghana concerning insurance claims. It is the insurance claims dataset used for testing and implementation. (Supplementary Materials)

References

[1] Government of Ghana, National Health Insurance Act, Act 650, Ghana, 2003.

[2] Capitation, National Health Insurance Scheme, 2012, http://www.nhis.gov.gh/capitation.aspx.

[3] ICD-10 Version: 2016, http://apps.who.int/classifications/icd10/browse/2016/en.

[4] T. Olson, Examining the Transitional Impact of ICD-10 on Healthcare Fraud Detection, College of Saint Benedict/Saint John's University, Collegeville, MN, USA, 2015.

[5] News Ghana, NHIS Manager Arrested for Fraud | News Ghana, News Ghana, Accra, Ghana, 2014, https://www.newsghana.com.gh/nhis-manager-arrested-for-fraud.

[6] BioClaim Files, http://www.bioclaim.com/Fraud-Files.

[7] Graphic Online, Ghana News: Dr. Ametewee Defrauds NHIA of GH¢415,000—Graphic Online, Graphic Online, Accra, Ghana, 2015, http://www.graphic.com.gh/news/general-news/dr-ametewee-defrauds-nhia-of-gh-415-000.html.

[8] W.-S. Yang and S.-Y. Hwang, "A process-mining framework for the detection of healthcare fraud and abuse," Expert Systems with Applications, vol. 31, no. 1, pp. 56–68, 2006.

[9] G. C. van Capelleveen, Outlier Based Predictors for Health Insurance Fraud Detection within US Medicaid, University of Twente, Enschede, Netherlands, 2013.

[10] Y. Shan, D. W. Murray, and A. Sutinen, "Discovering inappropriate billings with local density-based outlier detection method," in Proceedings of the Eighth Australasian Data Mining Conference, vol. 101, pp. 93–98, Melbourne, Australia, December 2009.

[11] L. D. Weiss and M. K. Sparrow, "License to steal: how fraud bleeds America's health care system," Journal of Public Health Policy, vol. 22, no. 3, pp. 361–363, 2001.

[12] P. Travaille, R. M. Muller, D. Thornton, and J. Van Hillegersberg, "Electronic fraud detection in the US Medicaid healthcare program: lessons learned from other industries," in Proceedings of the 17th Americas Conference on Information Systems (AMCIS), pp. 1–11, Detroit, Michigan, August 2011.

[13] A. Abdallah, M. A. Maarof, and A. Zainal, "Fraud detection system: a survey," Journal of Network and Computer Applications, vol. 68, pp. 90–113, 2016.

[14] A. K. I. Hassan and A. Abraham, "Computational intelligence models for insurance fraud detection: a review of a decade of research," Journal of Network and Innovative Computing, vol. 1, pp. 341–347, 2013.

[15] E. Kirkos, C. Spathis, and Y. Manolopoulos, "Data mining techniques for the detection of fraudulent financial statements," Expert Systems with Applications, vol. 32, no. 4, pp. 995–1003, 2007.

[16] H. Joudaki, A. Rashidian, B. Minaei-Bidgoli et al., "Using data mining to detect health care fraud and abuse: a review of literature," Global Journal of Health Science, vol. 7, no. 1, pp. 194–202, 2015.

[17] V. Rawte and G. Anuradha, "Fraud detection in health insurance using data mining techniques," in Proceedings of the 2015 International Conference on Communication, Information & Computing Technology (ICCICT), pp. 1–5, Mumbai, India, January 2015.

[18] J. Li, K.-Y. Huang, J. Jin, and J. Shi, "A survey on statistical methods for health care fraud detection," Health Care Management Science, vol. 11, no. 3, pp. 275–287, 2008.

[19] Q. Liu and M. Vasarhelyi, "Healthcare fraud detection: a survey and a clustering model incorporating geo-location information," in Proceedings of the 29th World Continuous Auditing and Reporting Symposium, Brisbane, Australia, November 2013.

[20] T. Ekin, F. Leva, F. Ruggeri, and R. Soyer, "Application of Bayesian methods in detection of healthcare fraud," Chemical Engineering Transactions, vol. 33, pp. 151–156, 2013.

[21] Home—The NHCAA, https://www.nhcaa.org.

[22] S. Viaene, R. A. Derrig, and G. Dedene, "A case study of applying boosting naive Bayes to claim fraud diagnosis," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 5, pp. 612–620, 2004.

[23] Y. Singh and A. S. Chauhan, "Neural networks in data mining," Journal of Theoretical and Applied Information Technology, vol. 5, no. 1, pp. 37–42, 2009.

[24] D. Tomar and S. Agarwal, "A survey on data mining approaches for healthcare," International Journal of Bio-Science and Bio-Technology, vol. 5, no. 5, pp. 241–266, 2013.

[25] P. Vamplew, A. Stranieri, K.-L. Ong, P. Christen, and P. J. Kennedy, "Data mining and analytics 2011," in Proceedings of the Ninth Australasian Data Mining Conference (AusDM'11), Australian Computer Society, Ballarat, Australia, December 2011.

[26] K. S. Ng, Y. Shan, D. W. Murray et al., "Detecting non-compliant consumers in spatio-temporal health data: a case study from Medicare Australia," in Proceedings of the 2010 IEEE International Conference on Data Mining Workshops, pp. 613–622, Sydney, Australia, December 2010.

[27] J. F. Roddick, J. Li, P. Christen, and P. J. Kennedy, "Data mining and analytics 2008," in Proceedings of the 7th Australasian Data Mining Conference (AusDM 2008), vol. 87, pp. 105–110, Glenelg, South Australia, November 2008.

[28] C. Watrin, R. Struffert, and R. Ullmann, "Benford's Law: an instrument for selecting tax audit targets," Review of Managerial Science, vol. 2, no. 3, pp. 219–237, 2008.

[29] F. Lu and J. E. Boritz, "Detecting fraud in health insurance data: learning to model incomplete Benford's Law distributions," in Machine Learning, J. Gama, R. Camacho, P. B. Brazdil, A. M. Jorge, and L. Torgo, Eds., pp. 633–640, Springer, Berlin, Heidelberg, 2005.


[30] P. Ortega, C. J. Figueroa, and G. A. Ruz, "A medical claim fraud/abuse detection system based on data mining: a case study in Chile," DMIN, vol. 6, pp. 26–29, 2006.

[31] T. Back, J. M. De Graaf, J. N. Kok, and W. A. Kosters, Theory of Genetic Algorithms, World Scientific Publishing, River Edge, NJ, USA, 2001.

[32] M. Melanie, An Introduction to Genetic Algorithms, The MIT Press, Cambridge, MA, USA, 1st edition, 1998.

[33] D. Goldberg, Genetic Algorithms in Optimization, Search and Machine Learning, Addison-Wesley, Reading, MA, USA, 1989.

[34] J. Wroblewski, "Theoretical foundations of order-based genetic algorithms," Fundamenta Informaticae, vol. 28, no. 3-4, pp. 423–430, 1996.

[35] J. H. Holland, Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence, MIT Press, Cambridge, MA, USA, 1st edition, 1992.

[36] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 2nd edition, 2000.

[37] J. Salomon, Support Vector Machines for Phoneme Classification, University of Edinburgh, Edinburgh, UK, 2001.

[38] J. Platt, Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines, Microsoft Research, Redmond, WA, USA, 1998.

[39] J. Platt, "Using analytic QP and sparseness to speed training of support vector machines," in Proceedings of the Advances in Neural Information Processing Systems, Cambridge, MA, USA, 1999.

[40] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, A Practical Guide to Support Vector Classification, Data Science Association, Taipei, Taiwan, 2003.

[41] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.

[42] D. Dvorski, Installing, Configuring, and Developing with XAMPP, Ski Canada Magazine, Toronto, Canada, 2007.

[43] MATLAB Classification Learner App, MATLAB Version 2019a, MathWorks, Natick, MA, USA, 2019, http://www.mathworks.com/help/stats/classification-learner-app.html.

Journal of Engineering 19

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 4: DecisionSupportSystem(DSS)forFraudDetectioninHealth ...downloads.hindawi.com/journals/je/2019/1432597.pdf · ResearchArticle DecisionSupportSystem(DSS)forFraudDetectioninHealth InsuranceClaimsUsingGeneticSupportVector

In the simplest linear form an SVC is a hyperplane thatseparates the legal claims from the false claims with amaximum margin Finding this hyperplane involvesobtaining two hyperplanes parallel to it as shown in Figure 1above with an equal distance to the maximum margin If allthe training dataset satisfies the constraints as follows

ωTxi + ble 1 foryi +1

ωTxi + bge minus 1 foryi minus 1

⎧⎨

⎩ (2)

where ω is the normal to the hyperplane |b|ω is theperpendicular distance from the hyperplane to the originand ω is the Euclidean norm of ω (e separating hy-perplane is defined by the plane ωTxi + b 0 and the aboveconstraints in (2) are combined to form

yi ωTxi + b1113872 1113873ge 1 (3)

(e pair of the hyperplanes that gives the maximummargin (c) can be found by minimizing ω2 subject toconstraint in (9) (is leads to a quadratic optimizationproblem formulated as

Minimize f(ω b) ω2

2

subject to yi ωTxi + b( 1113857ge 1 forall i 1 n

(4)

(is problem is reformulated by introducing Lagrangemultipliers αi(i 1 n ) for each constraint and sub-tracting them from the function f(x) ωTxi + b

(is results in establishing the primal Lagrangianfunction

LP(ω b α) ||ω||2

2+ 1113944

n

i1αi i yi ωT

xi + b1113966 11139671113872 11138731113872 1113873

foralli 1 n

(5)

Taking the partial derivatives of LP(ω b α) with respectto ω bamp α respectively and applying the duality theoryyields

zLP

zω 0⟹ω 1113944

n

i1αiyixi

zLP

zb 0⟹ b 1113944

n

i1αiyi

(6)

(e problem defined in (5) is a quadratic optimization(QP) problem Maximizing the primal problem LP withrespect to αi subject to the constraints that the gradient of LP

with respect to w and b vanish and that αi ge 0 gives thefollowing two conditions

ω 1113944

n

i1αiyixi

1113944

n

i1αiyi 0

(7)

Substituting these constraints gives the dual formulationof the Lagrangian

Maximizeα

LD(ω b α) 1113944n

i1αi minus

12

1113944

n

i11113944

n

j1αiαjyiyj xixj1113872 1113873

subject to 1113944

n

i1αiyi 0 αi ge 0 i 1 n

(8)

But the values of αi ω and b are obtained from theserespective equations namely

ω 1113944n

i1αiyixi

b 12

Miniyi +1ωT

xi + Maxiyi minus 1ωT

xi1113872 1113873

(9)

Also the Lagrange multiplier is computed using

αi 1 minus yi ωTxi + b1113872 11138731113872 1113873 0 (10)

Hence, this dual Lagrangian $L_D$ is maximized with respect to its nonnegative $\alpha_i$, which gives a standard quadratic optimization problem. The training vectors with nonzero Lagrange multipliers $\alpha_i$ are called support vectors (SVs) and satisfy

$$y_i\left(\omega^T x_i + b\right) = 1. \qquad (11)$$

Although the SVM classifier described above can only have a linear hyperplane as its decision surface, its formulation can be extended to build a nonlinear SVM. SVMs with nonlinear decision surfaces can classify nonlinearly separable data by introducing a soft-margin hyperplane, as shown in Figure 2.

Introducing the slack variables $\xi_i$ into the constraints yields

$$\omega^T x_i + b \ge +1 - \xi_i \quad \text{for } y_i = +1,$$
$$\omega^T x_i + b \le -1 + \xi_i \quad \text{for } y_i = -1,$$
$$\xi_i \ge 0, \quad \forall i. \qquad (12)$$

Figure 1: Standard formulation of the SVM, showing the separating hyperplane $W^T\phi(x) + b = 0$, the margin hyperplanes $W^T\phi(x) + b = \pm 1$, the margin $2/\sqrt{W^T W}$, the kernel $K(x_i, x_j) = \phi^T(x_i)\phi(x_j)$, the support vectors, the legitimate and fraudulent claims, and the slack values ($\xi = 0$, $\xi < 1$, $\xi > 1$) of misclassified points.

These slack variables help to find the hyperplane that yields the minimum number of training errors. Modifying equation (4) to include the slack variables yields

$$\min_{\omega,\, b,\, \xi_i}\ \frac{\|\omega\|^2}{2} + C\sum_{i=1}^{n}\xi_i$$
$$\text{subject to } y_i\left(\omega^T x_i + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0. \qquad (13)$$

The parameter C is a finite regularization parameter that trades off a wide margin against a small number of margin failures; the larger the value of C, the more heavily margin violations are penalized.

The Karush-Kuhn-Tucker (KKT) conditions are necessary to ensure optimality of the solution to a nonlinear programming problem:

$$y_i\left(\omega^T x_i + b\right) - 1 \ge 0, \quad i = 1, 2, \ldots, l, \quad \forall i,$$
$$\alpha_i\left(y_i\left(\omega^T x_i + b\right) - 1\right) = 0, \quad \alpha_i \ge 0, \quad \forall i. \qquad (14)$$

The KKT conditions for the primal problem are used in the nonseparable case, after which the primal Lagrangian becomes

$$L_P = \frac{\|\omega\|^2}{2} + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\left(y_i\left(\omega^T x_i + b\right) - 1 + \xi_i\right) - \sum_{i=1}^{n}\beta_i\xi_i, \qquad (15)$$

with $\beta_i$ as the Lagrange multipliers that enforce positivity of the slack variables $\xi_i$. Applying the KKT conditions to this primal problem yields

$$\frac{\partial L_P}{\partial \omega_u} = \omega_u - \sum_{i=1}^{n}\alpha_i y_i x_{iu} = 0,$$
$$\frac{\partial L_P}{\partial b} = \sum_{i=1}^{n}\alpha_i y_i = 0,$$
$$\frac{\partial L_P}{\partial \xi_i} = C - \alpha_i - \beta_i = 0,$$
$$\alpha_i\left(y_i\left(\omega^T x_i + b\right) - 1 + \xi_i\right) = 0,$$
$$y_i\left(\omega^T x_i + b\right) - 1 + \xi_i \ge 0,$$
$$\alpha_i, \beta_i, \xi_i \ge 0, \quad i = 1, 2, \ldots, n, \quad u = 1, 2, \ldots, d, \qquad (16)$$

where the parameter $d$ represents the dimension of the dataset.

Observing the expressions obtained above after applying the KKT conditions yields $\xi_i = 0$ for $\alpha_i < C$, since $\beta_i = C - \alpha_i \ne 0$. This implies that any training point for which $0 < \alpha_i < C$ lies exactly on the margin and can be used to compute $b$. A training point with

$$\alpha_i = 0, \quad y_i\left(\omega^T x_i + b\right) - 1 + \xi_i > 0 \qquad (17)$$

does not participate in the derivation of the separating function, whereas a point with $\alpha_i = C$ and $\xi_i > 0$ satisfies

$$y_i\left(\omega^T x_i + b\right) - 1 + \xi_i = 0. \qquad (18)$$

The nonlinear SVM maps the training samples from the input space into a higher-dimensional feature space via a mapping function $\Phi$. In the dual Lagrangian function, the inner products are replaced by the kernel function

$$\left(\Phi(x_i)\cdot\Phi(x_j)\right) = k\left(x_i, x_j\right). \qquad (19)$$

Effective kernels allow the separating hyperplane to be found without excessive computational resources. The nonlinear SVM dual Lagrangian is

$$L_D(\alpha) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\, k\!\left(x_i, x_j\right) \qquad (20)$$

subject to

$$\sum_{i=1}^{n}\alpha_i y_i = 0, \quad 0 \le \alpha_i, \quad i = 1, \ldots, n. \qquad (21)$$

Figure 2: Linear separating hyperplanes for the nonseparable case of the SVC, obtained by introducing the slack variable ($\xi$); the plot shows the support vectors, the margin, and a misclassified point with $\xi > 1$.

This is like the generalized linear case. The nonlinear SVM separating hyperplane is illustrated in Figure 3, with the support vectors, class labels, and margin. The model can be solved by the same optimization method as in the separable case. Therefore, the optimal hyperplane has the following form:

$$f(x) = \sum_{i=1}^{n}\alpha_i y_i\, k\left(x_i, x\right) + b, \qquad (22)$$

where $b$ is the offset of the decision boundary from the origin. Hence, classifying a newly arrived data point $x$ implies that

$$g(x) = \operatorname{sign}(f(x)). \qquad (23)$$

However, feasible kernels must be symmetric, i.e., the matrix $K$ with components $k(x_i, x_j)$ must be positive semidefinite and satisfy Mercer's condition [39, 40]. The kernel functions considered in this work are summarized in Table 1.

These kernels satisfy Mercer's condition, with the RBF (Gaussian) kernel being the most widely used kernel function in the literature. The RBF kernel has the advantage of a single free parameter $\gamma > 0$, which controls the width of the kernel as $\gamma = 1/(2\sigma^2)$, where $\sigma^2$ is the variance of the resulting Gaussian hypersphere. The linear kernel is given as $k(x_i, x_j) = x_i \cdot x_j$. Consequently, training the SVMs amounts to solving the QP optimization problem. The above mathematical formulations form the foundation for the development and deployment of genetic support vector machines as the decision support tool for detecting and classifying fraudulent health insurance claims. In recent times, the decision-making activities of knowledge-intensive enterprises have depended heavily on the successful classification of data patterns, despite the time and computational resources required owing to the complexity and size of the datasets.
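For illustration only (the study's implementation was in MATLAB), the following minimal Python sketch shows how the kernels of Table 1 and the decision function of equations (22) and (23) could be evaluated from stored support vectors; the parameter values and function names are assumptions, not code from the paper.

```python
import numpy as np

def rbf_kernel(xi, xj, gamma=0.5):
    # k(xi, xj) = exp(-gamma * ||xi - xj||^2), with gamma = 1 / (2 * sigma^2)
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def polynomial_kernel(xi, xj, c=1.0, d=3):
    # k(xi, xj) = (xi . xj + c)^d  (a cubic polynomial is used later in the paper)
    return (np.dot(xi, xj) + c) ** d

def decision_function(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    # f(x) = sum_i alpha_i * y_i * k(x_i, x) + b   (equation (22))
    return sum(a * y * kernel(sv, x)
               for sv, a, y in zip(support_vectors, alphas, labels)) + b

def classify(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    # g(x) = sign(f(x))   (equation (23)): one sign for legal, the other for fraudulent claims
    return np.sign(decision_function(x, support_vectors, alphas, labels, b, kernel))
```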

4. Methodology for GSVM Fraud Detection

The systematic approach adopted for the design and development of genetic support vector machines for health insurance claims fraud detection is presented in the conceptual framework in Figure 4 and the flow chart implementation in Figure 5.

The conceptual framework incorporates the design and development of the key algorithms that enable submitted claims data to be analysed and a model to be developed for testing and validation. The flow chart presents the algorithm implemented, based on theoretical foundations, by incorporating genetic algorithms and support vector machines, two machine learning algorithms useful for fraud detection; their combined use in the detection process generates accurate results. The methodology for the design and development of the genetic support vector machines consists of three (3) significant steps, namely, (1) data preprocessing, (2) classification engine development, and (3) data postprocessing.

4.1. Data Preprocessing

Data preprocessing is the first significant stage in the development of the fraud detection system. This stage involves the use of data mining techniques to transform the data from its raw form into the format required by the SVC for the detection and identification of health insurance claims fraud.

The data preprocessing stage involves the removal of unwanted customers and missing records as well as data smoothing. This ensures that only useful and relevant information is extracted for the next process.

Before preprocessing, the data were imported from MS Excel CSV format into MySQL, into a database called NHIS. The imported data include the electronic Health Insurance Claims (e-HIC) data and the HIC tariff datasets as tables in the NHIS database. The e-HIC data preprocessing involves the following steps: (1) claims data filtering and selection, (2) feature selection and extraction, and (3) feature adjustment.
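The import and cleaning steps described above are not listed as code in the paper; the following is a rough sketch under assumed file names, table names, column names, and credentials of how CSV exports could be loaded into a MySQL database named NHIS.

```python
import pandas as pd
from sqlalchemy import create_engine

# Illustrative sketch only; file names, table names, and the connection string
# are assumptions, not taken from the paper.
engine = create_engine("mysql+pymysql://user:password@localhost/NHIS")

# Load the exported claims and tariff CSV files.
claims = pd.read_csv("e_hic_claims.csv")
tariffs = pd.read_csv("hic_tariffs.csv")

# Basic cleaning: drop records with missing key fields and exact duplicates.
key_fields = ["Attendance date", "Hospital code", "GDRG code", "Service bill", "Drug bill"]
claims = claims.dropna(subset=key_fields).drop_duplicates()

# Write both datasets into the NHIS database as tables.
claims.to_sql("e_hic", engine, if_exists="replace", index=False)
tariffs.to_sql("tariff", engine, if_exists="replace", index=False)
```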

Figure 3: Nonlinear separating hyperplane for the nonseparable case of the SVM, showing the two classes, the support vectors, and the margin.

Table 1: Summarized kernel functions used.

Kernel name                    Parameters        Kernel function
Radial basis function (RBF)    γ ∈ R             k(x_i, x_j) = exp(−γ‖x_i − x_j‖²)
Polynomial function            c ∈ R, d ∈ N      k(x_i, x_j) = (x_i · x_j + c)^d


The WEKA machine learning and knowledge analysis environment was used for feature selection and extraction, while the data processing code was written in the MATLAB technical computing environment. The developed MATLAB-based decision support engine was connected to MySQL using the script shown in Figure 6.

Preprocessing of the raw data involves claims cost validity checks. The tariff dataset consists of the approved tariffs for each diagnostic-related group (DRG), which were strictly enforced to clean the data before further processing. Claims are partitioned into two groups, namely, (1) claims with valid, approved costs within each DRG and (2) claims with invalid costs (those above the approved tariffs within each DRG).
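As an illustrative sketch of this claims-cost validity check (column names such as "GDRG code", "Service tariff", and "Drug tariff" are assumptions, not taken from the paper), the partition into valid and invalid claims could look like this:

```python
import pandas as pd

def partition_by_tariff(claims: pd.DataFrame, tariffs: pd.DataFrame):
    """Split claims into valid and invalid groups by comparing billed amounts
    against the approved tariff of the claim's diagnostic-related group."""
    merged = claims.merge(tariffs, on="GDRG code", how="left")
    within_tariff = (merged["Service bill"] <= merged["Service tariff"]) & \
                    (merged["Drug bill"] <= merged["Drug tariff"])
    valid_claims = merged[within_tariff]
    invalid_claims = merged[~within_tariff]   # billed above the approved tariff
    return valid_claims, invalid_claims
```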

With the recent increase in the volume of real datasets and the dimensionality of the claims data, there is an urgent need for faster, more reliable, and cost-effective data mining techniques for classification models. These techniques require the extraction of a smaller, optimized set of features, obtained by removing largely redundant, irrelevant, and unnecessary features for class prediction [41].

Feature selection algorithms are utilized to extract a minimal subset of attributes such that the resulting probability distribution of the data classes is close to the original distribution obtained using all attributes. Based on the idea of survival of the fittest, a new population is constructed from the fittest rules in the current population together with offspring of these rules. Offspring are generated by applying genetic operators such as crossover and mutation, and offspring generation continues until the process evolves a population N in which every rule satisfies the fitness threshold. With an initial population of 20 individuals, generation continued until the 20th generation, with a crossover probability of 0.6 and a mutation probability of 0.033. The features selected by the genetic algorithm are "Attendance date", "Hospital code", "GDRG code", "Service bill", and "Drug bill". These are the features selected, extracted, and used as the basis for the optimization problem formulated below and for the selection sketch that follows it.

$$\text{Minimize } \mathrm{Total}_{\mathrm{cost}} = f\!\left(S_{\mathrm{bill}}, D_{\mathrm{bill}}\right),$$
$$\text{subject to } \sum_{i=1}^{n} S_{\mathrm{bill}}^{i} \le G_{\mathrm{tariff}}, \quad \forall i,\ i = 1, 2, \ldots, n,$$
$$\sum_{j=1}^{n} D_{\mathrm{bill}}^{j} \le D_{\mathrm{tariff}}, \quad \forall j,\ j = 1, 2, \ldots, n, \qquad (24)$$

where $S_{\mathrm{bill}}$ is the service bill and $D_{\mathrm{bill}}$ is the drug bill.
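A minimal sketch of the GA-driven feature subset selection described above (population of 20, 20 generations, crossover probability 0.6, mutation probability 0.033, roulette-wheel selection) is given below. It is not the paper's MATLAB/WEKA implementation, and the use of a cross-validated SVM score as the fitness function is an assumption.

```python
import random
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def fitness(mask, X, y):
    # Fitness of a feature subset: mean 10-fold CV accuracy of an SVM on those features.
    if not any(mask):
        return 0.0
    cols = [i for i, keep in enumerate(mask) if keep]
    return cross_val_score(SVC(kernel="rbf"), X[:, cols], y, cv=10).mean()

def roulette_select(population, scores):
    # Roulette-wheel selection: pick an individual with probability proportional to fitness.
    total = sum(scores)
    weights = [s / total for s in scores] if total > 0 else None
    return random.choices(population, weights=weights, k=1)[0]

def ga_feature_selection(X, y, pop_size=20, generations=20, p_cross=0.6, p_mut=0.033):
    n_features = X.shape[1]
    population = [[random.random() < 0.5 for _ in range(n_features)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(ind, X, y) for ind in population]
        new_population = []
        while len(new_population) < pop_size:
            p1 = roulette_select(population, scores)
            p2 = roulette_select(population, scores)
            child = list(p1)
            if random.random() < p_cross:                 # one-point crossover
                cut = random.randrange(1, n_features)
                child = p1[:cut] + p2[cut:]
            child = [not g if random.random() < p_mut else g for g in child]  # bit-flip mutation
            new_population.append(child)
        population = new_population
    scores = [fitness(ind, X, y) for ind in population]
    return population[int(np.argmax(scores))]             # best feature mask found
```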

The GA-selected e-HIC dataset is subjected to SVM training using 70% of the dataset, with 30% reserved for testing, as depicted in Figure 7.

The e-HIC dataset that passes the preprocessing stage, that is, the valid claims, was used for SVM training and testing. The best data, those that meet the genetic algorithm's criteria, are classified first. Each record of this dataset is classified as either "Fraudulent Bills" or "Legal Bills".
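The training procedure described below (70%/30% split, ten-fold cross validation, and linear, cubic polynomial, and RBF kernels with a variance of 0.9) was implemented with MATLAB's built-in SVM functions. The following scikit-learn sketch is only an illustrative equivalent; the γ value derived from the reported variance via γ = 1/(2σ²) is an assumption.

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

def train_svm_classifiers(X, y, seed=0):
    # X: GA-selected feature matrix; y: labels for "Legal" vs. "Fraudulent" bills.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.7, random_state=seed)

    classifiers = {
        "linear": SVC(kernel="linear"),
        "polynomial": SVC(kernel="poly", degree=3),       # a cubic polynomial was used
        "rbf": SVC(kernel="rbf", gamma=1.0 / (2 * 0.9)),   # gamma = 1/(2*sigma^2), sigma^2 = 0.9
    }
    results = {}
    for name, clf in classifiers.items():
        cv_acc = cross_val_score(clf, X_train, y_train, cv=10).mean()  # ten-fold CV on training data
        clf.fit(X_train, y_train)
        results[name] = {"cv_accuracy": cv_acc, "test_accuracy": clf.score(X_test, y_test)}
    return results
```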

The same SVM training and testing datasets are applied to the SVM algorithm for its performance analysis. The inbuilt MATLAB code for SVM classifiers was integrated as one function for the linear, polynomial, and RBF kernels. The claims datasets were partitioned for classifier training, testing, and validation: 70% of the dataset was used for training and 30% for testing. The linear, polynomial, and radial basis function SVM classification kernels were used, with ten-fold cross validation for each kernel, and the results were averaged. For the polynomial classification kernel, a cubic polynomial was used. The RBF classification kernel used the SMO method [40], which ensures the handling of large data sizes as it performs the data transformation through kernelization.

Figure 4: Conceptual model for the design and development of the genetic support vector machines (valid claims flow into the fraud detection model and classifier, which, after GSVM optimization, labels each claim as legal or fraudulent and groups fraudulent claims into duplicated, upcoded, unbundled, and uncovered claims).

After running many instances and varying parameters for the RBF kernel, a variance of 0.9 gave better results, as it corresponded well with the datasets used for classification. After each classification, the correct rate is calculated and the confusion matrix extracted. The confusion matrix gives a count of the true legal, true fraudulent, false legal, false fraudulent, and inconclusive bills:

(i) True legal bills: the number of "Legal Bills" which were correctly classified as "Legal Bills" by the classifier.

(ii) True fraudulent bills: the number of "Fraudulent Bills" which were correctly classified as "Fraudulent Bills" by the classifier.

(iii) False legal bills: bills classified as "Legal Bills" even though they are not, that is, bills wrongly classified as "Legal Bills" by the kernel used.

(iv) False fraudulent bills: bills wrongly classified as fraudulent; the confusion matrix gives a count of these incorrectly classified bills.

Figure 5: Flow chart for the design and development of the genetic support vector machines (the GA loop of population generation, fitness evaluation, roulette-wheel feature subset selection, recombination, crossover, and mutation produces optimized SVM hyperparameters (C, γ), stored in the PPD, which are used with the training and testing feature subsets to train and validate the SVM classifier that forms the fraud detection model).

(v) Inconclusive bills: bills that could not be classified.

The correct rate is calculated as the total number of correctly classified bills, namely, the true legal bills and true fraudulent bills, divided by the total number of bills used for the classification:

$$\text{correct rate} = \frac{\text{number of TLB} + \text{number of TFB}}{\text{total number of bills (TB)}}, \qquad (25)$$

where TLB = True Legal Bills and TFB = True Fraudulent Bills.

$$\text{accuracy} = (1 - \text{Error}) = \frac{TP + TN}{TP + TN + FP + FN} = \Pr(C), \qquad (26)$$

the probability of a correct classification.

4.1.1. Sensitivity. This is the statistical measure of the proportion of actual fraudulent claims which are correctly detected:

$$\text{sensitivity} = \frac{TP}{TP + FN} = \frac{TP}{P}. \qquad (27)$$

4.1.2. Specificity. This is the statistical measure of the proportion of negative (nonfraudulent) claims which are correctly classified:

$$\text{specificity} = \frac{TN}{TN + FP} = \frac{TN}{N}. \qquad (28)$$
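As a quick check of equations (25)-(28), the sketch below computes these measures from confusion-matrix counts; the example values are taken from the RBF row for the 500-claim dataset in Table 5 and reproduce the reported 99.2% correct rate.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the measures of equations (25)-(28) from confusion-matrix counts,
    using the TP/TN/FP/FN convention of Table 5."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total                          # equation (26), also the correct rate (25)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0    # equation (27)
    specificity = tn / (tn + fp) if (tn + fp) else 0.0    # equation (28)
    return {"accuracy": accuracy, "sensitivity": sensitivity, "specificity": specificity}

# Example: RBF kernel, 500-claim dataset (TP=95, TN=26, FP=1, FN=0) from Table 5.
print(classification_metrics(95, 26, 1, 0))   # accuracy ~0.992, sensitivity 1.0, specificity ~0.963
```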

4.2. GSVM Fraud Detection System Implementation and Testing

The decision support system comprises four main modules integrated together, namely, (1) algorithm implementation using the MATLAB technical computing platform, (2) development of a graphical user interface (GUI) for the HIC fraud detection system, which covers the uploading and processing of claims management, (3) system administrator management, and (4) postprocessing of the detection and classification results.

Figure 6: MATLAB-based decision support engine connection to the database.

Figure 7: Data preprocessing for SVM training and testing (creation of the claims record database, claims filtering and selection, feature selection and extraction, feature adjustment, and data normalization produce the GA e-HIC data and the SVM training and testing datasets).

The front end of the detection system was developed using XAMPP, a free and open-source cross-platform web server solution stack package developed by Apache Friends [42], consisting mainly of the Apache HTTP Server, the MariaDB database, and interpreters for scripts written in the PHP and Perl programming languages [42]. XAMPP stands for Cross-Platform (X), Apache (A), MariaDB (M), PHP (P), and Perl (P). The Health Insurance Claims Fraud Detection System (HICFDS) was developed in the MATLAB technical computing environment with the capability to connect to an external MySQL database and a graphical user interface (GUI) for enhanced interactivity with users.

Figure 8: System implementation architecture for the HICFDS (the NHIS claims dataset is uploaded through the developed GUI, passed through exploratory data analysis and the GSVM model engine, and the detected results are stored in an autocreated results database).

Figure 9: Detection results control portal interface.

The HICFDS consists of several functional components, namely, (1) a function for computing the descriptive statistics of raw and processed data, (2) a preprocessing wrapper function for data handling and processing, and (3) MATLAB functions for the GA optimization and SVM classification processes. The HICFDS components are depicted in Figure 8.

The results generated by the HICFDS are stored in a MySQL database. The results comprise three parts: the legitimate claims report, the fraudulent claims, and the statistics of the results. These results are shown in Figure 9, which displays the developed GUI portal for the analysis of results obtained from the classification of the submitted health insurance claims. Clicking the fraudulent button in the GUI opens a pop-up menu that generates Figure 10 for the claims dataset, showing the grouping of detected fraudulent claim types in the datasets.

For each classifier, a 10-fold cross validation (CV) of the hyperparameters (C, γ) from the Patients Payment Data (PPD) was performed. The performance measured during GA optimization tested several hyperparameters for the optimal SVM. The SVC training aims for the best SVC parameters (C, γ) in building the HICFDS classifier model. The developed classifier is evaluated using testing and validation data, and its accuracy is assessed using cross validation (CV) to avoid overfitting the SVC during training. The random search method was used for SVC parameter training, where exponentially growing sequences of hyperparameters (C, γ), a practical way to identify suitable parameters, were used to obtain the best CV accuracy on the claims data samples. Random search varies slightly from grid search: instead of searching over the entire grid, random search only evaluates a random sample of points on the grid, which makes it computationally cheaper than a grid search. Experimentally, 10-fold CV was used as the measure of the training accuracy, where 70% of each sample was used for training and the remaining 30% for testing and validation.
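The random search over exponentially growing (C, γ) sequences with 10-fold CV described above was carried out in MATLAB; the following scikit-learn sketch is an illustrative analogue, and the search ranges and iteration count are assumptions.

```python
from scipy.stats import loguniform
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV

def tune_rbf_svm(X_train, y_train, n_iter=25, seed=0):
    # Exponentially distributed candidate values for C and gamma (assumed ranges).
    param_distributions = {
        "C": loguniform(2 ** -5, 2 ** 15),
        "gamma": loguniform(2 ** -15, 2 ** 3),
    }
    search = RandomizedSearchCV(
        SVC(kernel="rbf"),
        param_distributions=param_distributions,
        n_iter=n_iter,          # only a random sample of grid points is evaluated
        cv=10,                  # 10-fold cross validation on the training split
        random_state=seed,
    )
    search.fit(X_train, y_train)
    return search.best_params_, search.best_score_
```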

Figure 10: Fraud type distribution across the sample data sizes (counts of duplicated, uncovered service, overbilled, unbundled, upcoded, and impersonation claims for the 100-, 300-, 500-, 750-, and 1000-claim datasets).

Table 2: Sample data size and the corresponding fraud types.

Fraud types                    100    300    500    750    1000
Duplicate claims                 2      4      4      4       0
Uncovered service claims         4     56     65    109     406
Overbilling claims              44     60     91    121     202
Unbundled claims                 0     18     10     54       0
Upcoded claims                   2     34     50    122       6
Impersonation claims             0      2     10     23      34
Total suspected claims          52    174    230    433     648

Table 3 Summary performance metrics of SVM classifiers onsamples sizes

Description

Kernelsused Data size

Averageaccuracyrate ()

Sensitivity() Specificity ()

Linear

100 7143 6000 7778300 7273 8421 000500 9180 9778 7500750 8442 9500 47061000 8295 8542 8000

Polynomial

100 7143 6667 7273300 7273 8824 2000500 9672 10000 8667750 8052 9636 40911000 8471 8367 8611

Radialbasisfunction

100 7143 5714 8571300 9545 9500 10000500 9918 10000 9630750 8256 9688 40911000 9091 10000 8298

Figure 11: Linear SVM on a sample claims dataset (legal and fraudulent bills, training and classified, with the support vectors highlighted).

4.3. Data Postprocessing: Validation of Classification Results

The classification accuracy on the testing data is used to gauge the ability of the HICFDS to detect and identify fraudulent claims. The testing data used to assess and evaluate the efficiency of the proposed HICFDS (classifier) were taken exclusively from NHIS headquarters and cover different hospitals within the Greater Accra Region of Ghana. The sampled data with the corresponding fraud types after the analysis are shown in Table 2.

In evaluating the classifiers obtained with the analyzed methods, the most widely employed performance measures are used: accuracy, sensitivity, and specificity, based on the counts of True Legal (TP), False Fraudulent (FN), False Legal (FP), and True Fraudulent (TN) bills. This classification is shown in Table 3.

The figures below show the SVC plots for the various classifiers (linear, polynomial, and RBF) on the claims datasets (Figures 11-13).

Figure 12: Polynomial SVM on a sample claims dataset (legal and fraudulent bills, training and classified, with the support vectors highlighted).

Figure 13: RBF SVM on a sample claims dataset (legal and fraudulent bills, training and classified, with the support vectors highlighted).

Table 4: Average performance analysis of the SVM classifiers.

Description    Accuracy (%)   Sensitivity (%)   Specificity (%)
Linear            80.67            84.48             55.97
Polynomial        81.22            86.99             61.28
RBF               87.91            89.80             81.18

Table 5: Confusion matrix for the SVM classifiers.

Description               Data size   TP   TN   FP   FN   Correct rate (%)
Linear                         100      3    7    2    2        71.4
                               300     16    0    3    3        71.3
                               500     88   24    8    2        91.8
                               750     57    8    9    3        84.4
                              1000     41   32    8    7        83.0
Polynomial                     100      2    8    3    1        71.4
                               300     15    1    4    2        72.3
                               500     92   26    4    0        96.7
                               750     53    9   13    2        80.5
                              1000     41   31    5    8        85.2
Radial basis function          100      4    6    1    3        71.4
                               300     19    2    0    1        95.5
                               500     95   26    1    0        99.2
                               750     62    9   13    2        92.2
                              1000     41   39    8    0        91.9

From the performance metrics and overall statistics presented in Table 4, it is observed that the support vector machine performs the best classification with an accuracy of 87.91% using the RBF kernel function, followed by the polynomial kernel with 81.22% accuracy, with the linear SVM emerging as the worst-performing classifier at an accuracy of 80.67%. The confusion matrix for the SVM classifiers is given in Table 5, which was utilized in the computation of the performance metrics of the SVM classifiers. For statistical and machine learning classification tasks, a confusion matrix, also known as an error matrix, is a table layout that allows visualization of the performance of a supervised learning algorithm.

Besides classification, the amount of time required to process the sample dataset is also an important consideration in this research. The comparison of computational times shows that an increase in the size of the sample dataset also increases the computational time needed to execute the process, regardless of the machine used, which is widely expected. This difference in time cost is due mainly to training on the dataset. Thus, as the global data warehouse grows, more computational resources will be needed for machine learning and data mining research pertaining to the detection of insurance fraud, as depicted in Figure 14, which relates the average computational time to the sample data size.

Figure 15 summarizes the fraudulent claims detected during the testing of the HICFDS with the sample datasets used. As the sample data size increases, the number of suspected claims increases rapidly across the various fraud types detected.

Benchmarking the HICFDS analysis ensures an understanding of the HIC outcomes. From the chart, an increase in the claims dataset has a corresponding increase in the number of suspected claims. The graph in Figure 16 shows a sudden rise in the level of suspected claims for the 100-claim test dataset, representing 52% of the sample dataset, after which the proportion of suspected claims increases slightly by 2%, to 58%, for the 300-claim test dataset.

Among these fraud types, the most frequent fraudulent act is uncovered services rendered to insurance subscribers by service providers. It accounts for 22% of the fraudulent claims, the most significant proportion of the total health insurance fraud in the tested dataset. Overbilling of submitted claims is the second most common fraudulent claim type, representing 20% of the total sample dataset used in this research. It is caused by service providers billing for a service at more than the expected tariff for the required diagnoses. Listing and billing for a more complex or higher level of service are done by providers to boost their financial income unfairly through otherwise legitimate claims.

Figure 14: Average computational time on the tested sample datasets.

Figure 15: Detected fraud trend (number of suspected claims) on the tested claims datasets.

Figure 16: Chart of the types of fraudulent claims (duplicated, uncovered service, overbilled, unbundled, upcoded, and impersonation claims) for the 100-, 300-, 500-, 750-, and 1000-claim datasets.

Table 6: Cost analysis of the tested claims datasets.

Sample data size   Raw cost of claims (R), GH¢   Valid claims cost (V)   Deviation (R − V)   Percentage difference (%)
100                         2079183                     891172                1188011               133.31
300                         3149605                    1562270                1587335               101.60
500                         5821865                    2748096                3073769               111.85
750                         8839407                    3109158                5730249               184.30
1000                       11744820                    4794338                6950482               144.97

Figure 17: Classification Learner App showing the various algorithms and percentage accuracies in MATLAB.

Table 7: Comparison of the results of the GSVM with decision trees and Naïve-Bayes.

Description of the algorithm used              Claims dataset   Accuracy obtained (%)   Average over datasets (%)
GSVM with radial basis function (RBF) kernel         100              71.43                    87.91
                                                     300              95.45
                                                     500              99.18
                                                     750              82.56
                                                    1000              90.91
Decision trees                                       100              62.0                     74.44
                                                     300              78.0
                                                     500              77.8
                                                     750              82.7
                                                    1000              71.7
Naïve-Bayes                                          100              50.0                     59.10
                                                     300              61.0
                                                     500              56.8
                                                     750              60.7
                                                    1000              67.0

Moreover, some illicit service providers claim to have rendered costly services to insurance subscribers instead of providing more affordable ones. Claims prepared for expensive services rendered to insurance subscribers represent 8% of the fraudulent claims detected in the total sample dataset. Furthermore, claims for service procedures that should be considered an integral part of a single procedure, known as unbundled claims, contributed 31% of the fraudulent claims in the test dataset. Owing to the insecure process for quality delivery of healthcare services, insurance subscribers also contribute to fraudulent claims by loaning their ID cards to family members or third parties, who pretend to be the owners and request HIS benefits in the healthcare sector. Duplicated claims recorded the minimum rate, contributing 0.5% of the fraudulent claims in the whole sample dataset.

As observed in Table 6, the cost of the claims bills increases proportionally with an increase in the sample size of the claims bills. This is consistent with the increase in fraudulent claims as the sample size increases. Table 6 shows the cost of each raw record (R) of the sampled claims dataset, the valid claims bill (V) after processing the dataset, the variation in the claims bill (R − V), and their percentage representation. There is a 27% financial loss of the total submitted claims bills to insurance carriers; this loss is highest within the 750-claim dataset of submitted claims.

A summary of the results and a comparison with other machine learning algorithms, such as decision trees and Naïve-Bayes, are presented in Table 7.

The MATLAB Classification Learner App [43] was chosen to validate the results obtained above. It enables easy comparison among the different classification algorithms implemented. The data used for the GSVM were subsequently used in the Classification Learner App, as shown below.

Figure 18: Algorithmic runs on the 500-claim dataset.

Figures 17 and 18 show the Classification Learner App with the various implemented algorithms and corresponding accuracies in the MATLAB technical computing environment and the results obtained using the 500-claim dataset, respectively. Figures 19 and 20 depict the corresponding results when the 750- and 1000-claim datasets were used for the algorithmic runs and reproducibility comparison, respectively. The summarized results and accuracies are illustrated in Table 7 and portray the effectiveness of our proposed approach of using genetic support vector machines (GSVMs) for fraud detection in insurance claims. From the results, it is evident that the GSVM achieves a higher level of accuracy compared with decision trees and Naïve-Bayes.

5. Conclusions and Recommendations

This work aimed at developing a novel fraud detection model for insurance claims processing based on genetic support vector machines, which hybridize and draw on the strengths of both genetic algorithms and support vector machines. The GSVM has been investigated and applied in the development of the HICFDS. This paper used the GSVM for the detection of anomalies and the classification of health insurance claims into legitimate and fraudulent claims. SVMs have been considered preferable to other classification techniques owing to several advantages: they enable separation (classification) of claims into legitimate and fraudulent using the soft margin, thus accommodating updates in the generalization performance of the HICFDS, and they provide a nonlinear dividing hyperplane, which improves discrimination within the dataset. Their generalization ability on newly arrived data was also considered an advantage over other classification techniques.

Figure 19: Algorithmic runs on the 750-claim dataset.

Thus, the fraud detection system combines two computational intelligence schemes and achieves higher fraud detection accuracy. The average classification accuracies achieved by the SVCs are 80.67%, 81.22%, and 87.91%, which demonstrate the performance capability of the SVC model. These classification accuracies are obtained through careful selection of the features for training and developing the model, as well as fine-tuning of the SVCs' parameters using the V-fold cross-validation approach. These results are much better than those obtained using decision trees and Naïve-Bayes.

The average sample dataset testing results for the proposed SVCs vary owing to the nature of the claims dataset used. This is noted in the clustering of the claims dataset (MDC specialty): when the sample dataset is heavily skewed toward one MDC specialty (e.g., OPDC), the performance of the SVCs could favour one classifier, especially the linear SVM, compared with the others. Hence, the behaviour of the dataset has a significant impact on the classification results.

Based on this work, the developed GSVM model was tested and validated using HIC data. The study sought to obtain the best-performing classifier for analyzing health insurance claims datasets for fraud. The RBF kernel was adjudged the best, with an average accuracy rate of 87.91%, and is therefore recommended.

Figure 20: Algorithmic runs on the 1000-claim dataset.

Data Availability

The data used in this study are available upon request. The data can be uploaded when required.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors of this paper wish to acknowledge the Carnegie Corporation of New York, through the University of Ghana, under the UG-Carnegie Next Generation of Academics in Africa project, for organizing Write Shops that led to the timely completion of this paper.

Supplementary Materials

The material consists of an MS Excel file of data collected from some NHIS-approved hospitals in Ghana concerning insurance claims. It is the insurance claims dataset used for testing and implementation. (Supplementary Materials)

References

[1] Government of Ghana, National Health Insurance Act, Act 650, 2003, Ghana, 2003.
[2] Capitation, National Health Insurance Scheme, 2012, http://www.nhis.gov.gh/capitation.aspx.
[3] ICD-10 Version: 2016, http://apps.who.int/classifications/icd10/browse/2016/en.
[4] T. Olson, Examining the Transitional Impact of ICD-10 on Healthcare Fraud Detection, College of Saint Benedict/Saint John's University, Collegeville, MN, USA, 2015.
[5] News Ghana, "NHIS Manager Arrested for Fraud," News Ghana, Accra, Ghana, 2014, https://www.newsghana.com.gh/nhis-manager-arrested-for-fraud.
[6] BioClaim, Fraud Files, http://www.bioclaim.com/Fraud-Files.
[7] Graphic Online, "Dr. Ametewee Defrauds NHIA of GH¢415,000," Graphic Online, Accra, Ghana, 2015, http://www.graphic.com.gh/news/general-news/dr-ametewee-defrauds-nhia-of-gh-415-000.html.
[8] W.-S. Yang and S.-Y. Hwang, "A process-mining framework for the detection of healthcare fraud and abuse," Expert Systems with Applications, vol. 31, no. 1, pp. 56-68, 2006.
[9] G. C. van Capelleveen, Outlier Based Predictors for Health Insurance Fraud Detection within US Medicaid, University of Twente, Enschede, Netherlands, 2013.
[10] Y. Shan, D. W. Murray, and A. Sutinen, "Discovering inappropriate billings with local density-based outlier detection method," in Proceedings of the Eighth Australasian Data Mining Conference, vol. 101, pp. 93-98, Melbourne, Australia, December 2009.
[11] L. D. Weiss and M. K. Sparrow, "License to steal: how fraud bleeds America's health care system," Journal of Public Health Policy, vol. 22, no. 3, pp. 361-363, 2001.
[12] P. Travaille, R. M. Muller, D. Thornton, and J. Van Hillegersberg, "Electronic fraud detection in the US Medicaid healthcare program: lessons learned from other industries," in Proceedings of the 17th Americas Conference on Information Systems (AMCIS), pp. 1-11, Detroit, Michigan, August 2011.
[13] A. Abdallah, M. A. Maarof, and A. Zainal, "Fraud detection system: a survey," Journal of Network and Computer Applications, vol. 68, pp. 90-113, 2016.
[14] A. K. I. Hassan and A. Abraham, "Computational intelligence models for insurance fraud detection: a review of a decade of research," Journal of Network and Innovative Computing, vol. 1, pp. 341-347, 2013.
[15] E. Kirkos, C. Spathis, and Y. Manolopoulos, "Data mining techniques for the detection of fraudulent financial statements," Expert Systems with Applications, vol. 32, no. 4, pp. 995-1003, 2007.
[16] H. Joudaki, A. Rashidian, B. Minaei-Bidgoli et al., "Using data mining to detect health care fraud and abuse: a review of literature," Global Journal of Health Science, vol. 7, no. 1, pp. 194-202, 2015.
[17] V. Rawte and G. Anuradha, "Fraud detection in health insurance using data mining techniques," in Proceedings of the 2015 International Conference on Communication, Information & Computing Technology (ICCICT), pp. 1-5, Mumbai, India, January 2015.
[18] J. Li, K.-Y. Huang, J. Jin, and J. Shi, "A survey on statistical methods for health care fraud detection," Health Care Management Science, vol. 11, no. 3, pp. 275-287, 2008.
[19] Q. Liu and M. Vasarhelyi, "Healthcare fraud detection: a survey and a clustering model incorporating geo-location information," in Proceedings of the 29th World Continuous Auditing and Reporting Symposium, Brisbane, Australia, November 2013.
[20] T. Ekin, F. Leva, F. Ruggeri, and R. Soyer, "Application of Bayesian methods in detection of healthcare fraud," Chemical Engineering Transactions, vol. 33, pp. 151-156, 2013.
[21] Home, The NHCAA, https://www.nhcaa.org.
[22] S. Viaene, R. A. Derrig, and G. Dedene, "A case study of applying boosting naive Bayes to claim fraud diagnosis," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 5, pp. 612-620, 2004.
[23] Y. Singh and A. S. Chauhan, "Neural networks in data mining," Journal of Theoretical and Applied Information Technology, vol. 5, no. 1, pp. 37-42, 2009.
[24] D. Tomar and S. Agarwal, "A survey on data mining approaches for healthcare," International Journal of Bio-Science and Bio-Technology, vol. 5, no. 5, pp. 241-266, 2013.
[25] P. Vamplew, A. Stranieri, K.-L. Ong, P. Christen, and P. J. Kennedy, "Data mining and analytics 2011," in Proceedings of the Ninth Australasian Data Mining Conference (AusDM'11), Australian Computer Society, Ballarat, Australia, December 2011.
[26] K. S. Ng, Y. Shan, D. W. Murray et al., "Detecting non-compliant consumers in spatio-temporal health data: a case study from Medicare Australia," in Proceedings of the 2010 IEEE International Conference on Data Mining Workshops, pp. 613-622, Sydney, Australia, December 2010.
[27] J. F. Roddick, J. Li, P. Christen, and P. J. Kennedy, "Data mining and analytics 2008," in Proceedings of the 7th Australasian Data Mining Conference (AusDM 2008), vol. 87, pp. 105-110, Glenelg, South Australia, November 2008.
[28] C. Watrin, R. Struffert, and R. Ullmann, "Benford's Law: an instrument for selecting tax audit targets," Review of Managerial Science, vol. 2, no. 3, pp. 219-237, 2008.
[29] F. Lu and J. E. Boritz, "Detecting fraud in health insurance data: learning to model incomplete Benford's Law distributions," in Machine Learning, J. Gama, R. Camacho, P. B. Brazdil, A. M. Jorge, and L. Torgo, Eds., pp. 633-640, Springer, Berlin, Heidelberg, 2005.
[30] P. Ortega, C. J. Figueroa, and G. A. Ruz, "A medical claim fraud/abuse detection system based on data mining: a case study in Chile," DMIN, vol. 6, pp. 26-29, 2006.
[31] T. Bäck, J. M. De Graaf, J. N. Kok, and W. A. Kosters, Theory of Genetic Algorithms, World Scientific Publishing, River Edge, NJ, USA, 2001.
[32] M. Melanie, An Introduction to Genetic Algorithms, The MIT Press, Cambridge, MA, USA, 1st edition, 1998.
[33] D. Goldberg, Genetic Algorithms in Optimization, Search and Machine Learning, Addison-Wesley, Reading, MA, USA, 1989.
[34] J. Wroblewski, "Theoretical foundations of order-based genetic algorithms," Fundamenta Informaticae, vol. 28, no. 3-4, pp. 423-430, 1996.
[35] J. H. Holland, Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence, MIT Press, Cambridge, MA, USA, 1st edition, 1992.
[36] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 2nd edition, 2000.
[37] J. Salomon, Support Vector Machines for Phoneme Classification, University of Edinburgh, Edinburgh, UK, 2001.
[38] J. Platt, Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines, Microsoft Research, Redmond, WA, USA, 1998.
[39] J. Platt, "Using analytic QP and sparseness to speed training of support vector machines," in Proceedings of the Advances in Neural Information Processing Systems, Cambridge, MA, USA, 1999.
[40] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, A Practical Guide to Support Vector Classification, Data Science Association, Taipei, Taiwan, 2003.
[41] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.
[42] D. Dvorski, Installing, Configuring, and Developing with XAMPP, Ski Canada Magazine, Toronto, Canada, 2007.
[43] MATLAB Classification Learner App, MATLAB Version 2019a, MathWorks, Natick, MA, USA, 2019, http://www.mathworks.com/help/stats/classification-learner-app.html.

Journal of Engineering 19

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 5: DecisionSupportSystem(DSS)forFraudDetectioninHealth ...downloads.hindawi.com/journals/je/2019/1432597.pdf · ResearchArticle DecisionSupportSystem(DSS)forFraudDetectioninHealth InsuranceClaimsUsingGeneticSupportVector

(ese slack variables help to find the hyperplane thatprovides the minimum number of training errorsModifying equation (4) to include the slack variableyields

Minimizeαb ξi

||ω||2

2+ C 1113944

n

i1ξi

subject to αi 1 minus yi ωTxi + b( 1113857( 1113857 + ξi minus 1ge 0 ξi ge 0

(13)

(eparameter C is a regularization parameter that tradesoff the wide margin with a small number of margin failures(e parameter C is finite (e larger the C value the moresignificant the error

(e KarushndashKuhnndashTucker (KKT) conditions are nec-essary to ensure optimality of the solution to a nonlinearprogramming problem

yi ωTxi + b1113872 11138731113872 1113873 minus 1ge 0 i 1 2 l foralli

αi yi ωTxi + b1113872 11138731113872 1113873 minus 1 0 αi ge 0 foralli

(14)

(e KKT conditions for the primal problem are used inthe nonseparable case after which the primal Lagrangianbecomes

LP ||ω||2

2+ c 1113944

n

i1ξi minus 1113944

j

i1αi yi ωT

xi + b1113966 11139671113872 1113873 minus 1 + ξi1113872 1113873

minus 1113944n

i1βiξi

(15)

With βi as the Lagrange multipliers to enforce positivityof the slack variables (ξi) and applying the KKT conditionsto the primal problem yields

zLP

zωu

ωu minus 1113944n

i1αiyixiu 0

zLP

zwu

C minus αiyi 0

zLP

zξi

C minus αi minus βi 0

αi yi ωTxi + b1113966 11139671113872 1113873 minus 1 + ξi 0

yi ωTxi + b1113872 1113873 minus 1 + ξi ge 0

αi βi ξi ge 0 C ξi ge 0

i 1 2 n and u 1 2 d

zLP

zωu

ωu minus 1113944n

i1αiyixiu 0

zLP

zwu

C minus αiyi 0

(16)

where the parameter d represents the dimension of thedataset

Observing the expressions obtained above after applyingKKTconditions yields ξi 0 for αi ltC since βi C minus αi ne 0(is implies that any training point for which 0 lt αi ltC willbe taken to compute for b as a data point that does not crossthe boundary

αi 0

yi ωTxi + b1113872 1113873 minus 1 + ξi gt 0

(17)

(is does not participate in the derivation of the sepa-rating function with αi C and ξi gt 0

αi 0

yi ωTxi + b1113872 1113873 minus 1 + ξi 0

(18)

Nonlinear SVM maps the training samples from theinput space into a higher-dimensional feature space via akernel mapping function F In the dual Lagrangian functionthe inner products are replaced by the kernel function

Φ xi( 1113857 middotΦ xj1113872 11138731113872 1113873 k xi xj1113872 1113873 (19)

Effective kernels are used in finding the separating hy-perplane without high computational resources (e non-linear SVM dual Lagrangian

LD(α) 1113944

n

i1αi minus

12

1113944

n

i11113944

n

j1αiαjyiyjk xi middot xj1113872 1113873 (20)

subject to

1113944

n

i1αiyi 0 0le αi i 1 n (21)

ξ = 0

Support vectorSupport vector

ξ lt 1

b

W

Misclassifiedpoint

ξ gt 1 Margin

Figure 2 Linear separating hyperplanes for the nonseparable caseof SVC by introducing the slack variable (ξ)

Journal of Engineering 5

(is is like that of the generalized linear case(e nonlinear SVM separating hyperplane is illustrated

in Figure 3 with the support vectors class labels andmargin(is model can be solved by the method of optimization

in the separable case (erefore the optimal hyperplane hasthe following form

f(x) 1113944n

i1αiyik xi x( 1113857 + b (22)

where b is the decision boundary from the origin Henceseparating newly arrived dataset x implies that

g(x) sign(f(x)) (23)

However feasible kernels must be symmetrical ie thematrix K with the component k(xi xj) is positive semi-definite and satisfies Mercerrsquos condition given in [39 40](e summarized kernel functions considered in this workare given in Table 1

(ese kernels satisfied Mercerrsquos condition with RBFor Gaussian kernel which is the widely used kernelfunction from the literature (e RBF has an advantage ofadding a single free parameter cgt 0 which controls thewidth of the RBF kernel as c 12σ2 where σ2 is thevariance of the resulting Gaussian hypersphere (elinear kernel is given as k(xi xj) xi middot xj Consequentlythe training of SVMs used the solution of the QP opti-mization problem (e above mathematical formulationsform the foundation for the development and de-ployment of genetic support vector machines as thedecision support tool for detecting and classifying healthinsurance fraudulent claims In recent times decision-making activities of knowledge-intensive enterprisesdepend holistically on the successful classification of datapatterns despite time and computational resources re-quired to achieve the results due to the complexity as-sociated with the dataset and its size

4 Methodology for GSVM Fraud Detection

(e systematic approach adopted for the design and de-velopment of genetic support vector machines for healthinsurance claims fraud detection is presented in the

conceptual framework in Figure 4 and the flow chartimplementation in Figure 5

(e conceptual framework incorporates the designand development of key algorithms that enable submittedclaims data to be analysed and a model to be developedfor testing and validation (e flow chart presents thealgorithm implemented based on theoretical foundationsin incorporating genetic algorithms and support vectormachines two useful machine learning algorithms nec-essary for fraud detection (eir combined use in thedetection process generates accurate results (e meth-odology for the design and development of geneticsupport vector machines as presented above consists ofthree (3) significant steps namely (1) data preprocessing(2) classification engine development and (3) datapostprocessing

41 Data Preprocessing (e data preprocessing is the firstsignificant stage in the development of the fraud detectionsystem(is stage involves the use of data mining techniquesto transform the data from its raw form into the requiredformat to be used by the SVC for the detection and iden-tification of health insurance claims fraud

(e data preprocessing stage involves the removal ofunwanted customers missing records and data smooth-ening (is is to make sure that only useful and relevantinformation is extracted for the next process

Before the preprocessing the data were imported fromMS Excel CSV format into MySQL to a created databasecalled NHIS (e imported data include the electronicHealth Insurance Claims (e-HIC) data and the HIC tariffdatasets as tables imported into the NHIS (e e-HIC datapreprocessing involves the following steps (1) claims datafiltering and selection (2) feature selection and extractionand (3) feature adjustment

Class 1Margin

Class 2

Support vectors

Hyperplane

Figure 3 Nonlinear separating hyperplane for the nonseparable case of SVM

Table 1 Summarized kernel functions used

Kernel name Parameters Kernel functionRadial basis function(RBF) c isin R k(xi xj) eminus cxi minus xj2

Polynomial function c isin R d isin N k(xi xj) (xi middot xj + c)d

6 Journal of Engineering

(e WEKA machine learning and knowledge analysisenvironment were used for feature selection and extractionwhile the data processing codes are written in the MATLABtechnical computing environment (e developed MAT-LAB-based decision support engine was connected viaMYSQL using the script shown in Figure 6

Preprocessing of the raw data involves claims cost val-idity checks(e tariff dataset consists of the approved tariffsfor each diagnostic-related group which was strictlyenforced to clean the data before further processing Claimsare partitioned into two namely (1) claims with the validand approved cost within each DRG and (2) claims withinvalid costs (those above the approved tariffs within eachDRG)

With the recent increase in the volume of real datasetand dimensionality of the claims data there is the urgentneed for a faster more reliable and cost-effective datamining technique for classification models (e data miningtechniques require the extraction of a smaller and optimizedset of features that can be obtained by removing largelyredundant irrelevant and unnecessary features for the classprediction [41]

Feature selection algorithms are utilized to extract aminimal subset of attributes such that the resulting prob-ability distribution of data classes is close to the originaldistribution obtained using all attributes Based on the ideaof survival of the fittest a new population is constructed tocomply with fittest rules in the current population as well asthe offspring of these rules Offsprings are generated byapplying genetic operators such as crossover and mutation(e process of offspring generation continues until it evolvesa population N where every rule in N satisfies the fitnessthreshold With an initial population of 20 instances gen-eration continued till the 20th generation with crossoverprobability of 06 and mutation probability of 0033(e selected features based on genetic algorithms are

ldquoAttendance daterdquo ldquoHospital coderdquo ldquoGDRG coderdquo ldquoServicebillrdquo and ldquoDrug billrdquo (ese are the features selectedextracted and used as the basis for the optimization problemformulated below

Minimize Totalcos t f Sbill Dbill( 1113857

subject to 1113944

n

i1Sibill1113872 1113873le Gtariff forall i i 1 2 n

1113944

n

i1Dibill1113872 1113873le Dtariff forallj j 1 2 n

Sbill is the Service bill

Dbill is theDrug bill(24)

(e GA e-HIC dataset is subjected to SVM trainingusing 70 of the dataset and 30 for testing as depicted inFigure 7

(e e-HIC dataset which passes the preprocessing stagethat is the valid claims was used for SVM training andtesting(e best data those that meet the genetic algorithmrsquoscriteria are classified first Each record of this dataset isclassified as either ldquoFraudulent Billsrdquo or ldquoLegal Billsrdquo

(e same SVM training and the testing dataset areapplied to the SVM algorithm for its performance analysis(e inbuilt MATLAB code for SVM classifiers was in-tegrated as one function for linear polynomial and RBFkernels (e claim datasets were partitioned for the classifiertraining testing and validation 70 of the dataset was usedfor training and 30 used for testing (e linear poly-nomial and radial basis function SVM classification kernelswere used with ten-fold cross validation for each kernel andthe results averaged For the polynomial classification kernela cubic polynomial was used (e RBF classification kernelused the SMO method [40] (is method ensures the

Fraud detection classifier

Duplicated claims Upcoded claims Unbundled claims Uncovered claims

SuccessLegal claim Fraudulent claim

End GSVMoptimization

NoYes

Valid claims

Fraud detectionmodel

Figure 4 Conceptual model design and development of the genetic support vector machines

Journal of Engineering 7

handling of large data sizes as it does data transformationthrough kernelization After running many instances andvarying parameters for RBF a variance of 09 gave betterresults as it corresponded well with the datasets for theclassification After each classification the correct rate iscalculated and the confusion matrix extracted (e confu-sion matrix gives a count for the true legal true fraudulentfalse legal false fraudulent and inconclusive bills

(i) True legal bills this consists of the number of ldquoLegalBillsrdquo which were correctly classified as ldquoLegal Billsrdquoby the classifier

(ii) True fraudulent bills this consists of the number ofldquoFraudulent Billsrdquo which were correctly classified asldquoFraudulent Billsrdquo by the classifier

(iii) False legal bills this consists of the bills classified asldquoLegal Billsrdquo even though they are not (at is theseare wrongly classified as ldquoLegal Billsrdquo by the kernelused

(iv) False fraudulent bills the classifier also wronglyclassified bills as fraudulent (e confusion matrixgives a count of these wrongly or incorrectly clas-sified bills

Data

Validation

Feature subset selection byRoulette wheel

Training SVM classifier

Trained SVMclassifier

Success

Evaluate fitness

Recombination

Crossover

Mutation

Population

Generation

Optimized SVMhyperparameter (C y) store in PPD

Frauddetection

model

Yes

No

Training dataset Testing dataset

Testing datasetwith feature

selection subset

Training datasetwith feature

selection subset

Figure 5 Flow chart for design and development of the genetic support vector machines

8 Journal of Engineering

(v) Inconclusive bills these consist of nonclassifiedbills

(e correct rate this is calculated as the total number ofcorrectly classified bills namely the true legal bills and truefraudulent bills divided by the total number of bills used forthe classification

correct rate number of TLB + number of TFB

total number of bills (TB) (25)

where TLB True Legal Bills TFB True Fraudulent Bills

accuracy (1 minus Error) TP + TN

TP + TN + FP + FN Pr(C)

(26)

(e probability of a correct classification

411 Sensitivity (is is the statistical measure of the pro-portion of actual fraudulent claims which are correctlydetected

sensitivity TP

TP + FNTPPP

(27)

412 Specificity (is is the statistical measure of the pro-portion of negative fraudulent claims which are correctlyclassified

specificity TN

TN + FP

TNNP

(28)

42 GSVM Fraud Detection System Implementation andTesting (e decision support system comprises four mainmodules integrated together namely (1) algorithm imple-mentation using MATLAB technical computing platform(2) development of graphical user interface (GUI) for theHIC fraud detection system which consists of uploading andprocessing of claims management (3) system administratormanagement and (4) postprocessing of detection andclassification results

Figure 6 MATLAB-based decision support engine connection to the database

SVMtraining

and testingdataset

GAe-HICdata

Creation ofclaims record

database

Claimsfiltering and

selection

Featureselection

andextraction

Featureadjustment

Datanormalization

Datapreprocessing

Figure 7 Data preprocessing for SVM training and testing

Journal of Engineering 9

(e front end of the detection system was developedusing XAMPP a free and open-source cross-platform webserver solution stack package developed by Apache Friends[42] consisting mainly of the Apache HTTP ServerMariaDB database and interpreters for scripts written in the

PHP and Perl programming languages [42] XAMPP standsfor Cross-Platform (X) Apache (A) MariaDB (M) PHP (P)and Perl (P) (e Health Insurance Claims Fraud DetectionSystem (HICFDS) was developed using MATLAB technicalcomputing environment with the capability to connect to an

MySQL database

Autocreation of resultsdatabase

Data flowchart for NHIS claims fraud detectionsystem process

Upload in GUI

Model amp GSVMalgorithm

Detected results

DEVELOPED GUI

Engine

NHIS claims dataset

NHIS FRAUD DETECTION SYSTEM USING GSVM

Exploratorydata

analysis

Figure 8 System implementation architecture for HICFDS

Figure 9 Detection results control portal interface

10 Journal of Engineering

external database and a graphical user interface (GUI) forenhanced interactivity with users (e HICFDS consists ofseveral functional components namely (1) function forcomputing the descriptive statistics of raw and processeddata (2) preprocessing wrapper function for data handlingand processing and (3) MATLAB functions for GA Opti-mization and SVM Classification processes (e HICFDScomponents are depicted in Figure 8

(e results generated by the HICFDS are stored inMYSQL database (e results comprise three parts whichare legitimate claims report fraudulent claims and statisticsof the results (ese results are shown in Figure 9 (edeveloped GUI portal for the analysis of results obtainedfrom the classification of the submitted health insuranceclaims is displayed in Figure 9 By clicking on the fraudulentbutton in the GUI a pop-up menu generating the labelledFigure 10 is obtained for the claims dataset It shows thegrouping of detected fraudulent claim types in the datasets

For each classifier, a 10-fold cross validation (CV) of the hyperparameters (C, γ) from the Patients Payment Data (PPD) was performed. The performance measured during GA optimization tested several hyperparameters for the optimal SVM. The SVC training aims for the best SVC parameters (C, γ) in building the HICFDS classifier model. The developed classifier is evaluated using testing and validation data, and its accuracy is evaluated using cross validation (CV) to avoid overfitting of the SVC during training. The random search method was used for SVC parameter training, where exponentially growing sequences of the hyperparameters (C, γ), a practical way to identify suitable parameters, were used to select the SVC parameters and obtain the best CV accuracy on the claims data samples. Random search differs slightly from grid search: instead of searching over the entire grid, random search only evaluates a random sample of points on the grid, which makes it computationally cheaper than a grid search. Experimentally, 10-fold CV was used as the measure of the training accuracy, where 70% of each sample was used for training and the remaining 30% for testing and validation.
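A minimal sketch of this random search is given below; it is not the authors' code. It assumes a preprocessed feature matrix X and label vector y, draws (C, kernel width) candidates from exponentially growing ranges, and scores each pair by 10-fold cross-validation accuracy with an RBF-kernel SVC (in fitcsvm, the RBF width is controlled through the KernelScale parameter).

```matlab
% Random search over exponentially growing (C, kernel scale) candidates,
% scored by 10-fold cross-validation accuracy (assumed inputs X and y).
rng(1);                                     % reproducible random draws
bestAcc = -Inf;
for t = 1:20
    C     = 2^randi([-5, 15]);              % exponentially growing C values
    sigma = 2^randi([-15, 3]);              % exponentially growing RBF widths
    mdl   = fitcsvm(X, y, 'KernelFunction', 'rbf', ...
                    'BoxConstraint', C, 'KernelScale', sigma);
    acc   = 1 - kfoldLoss(crossval(mdl, 'KFold', 10));
    if acc > bestAcc
        bestAcc = acc;  bestC = C;  bestSigma = sigma;
    end
end
fprintf('Best 10-fold CV accuracy %.2f%% at C = %g, kernel scale = %g\n', ...
        100*bestAcc, bestC, bestSigma);
```

A 70/30 hold-out split for final testing and validation can be produced in the same environment with, for example, cvpartition(numel(y), 'HoldOut', 0.3).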

Figure 10: Fraud type distribution across the sample data sizes (duplicated, uncovered service, overbilled, unbundled, upcoded, and impersonation claims for the 100, 300, 500, 750, and 1000 claim datasets).

Table 2: Sample data size and the corresponding fraud types.

| Fraud types | 100 | 300 | 500 | 750 | 1000 |
|---|---|---|---|---|---|
| Duplicate claims | 2 | 4 | 4 | 4 | 0 |
| Uncovered service claims | 4 | 56 | 65 | 109 | 406 |
| Overbilling claims | 44 | 60 | 91 | 121 | 202 |
| Unbundled claims | 0 | 18 | 10 | 54 | 0 |
| Upcoded claims | 2 | 34 | 50 | 122 | 6 |
| Impersonation claims | 0 | 2 | 10 | 23 | 34 |
| Total suspected claims | 52 | 174 | 230 | 433 | 648 |

Table 3: Summary performance metrics of the SVM classifiers on the sample sizes.

| Kernel used | Data size | Average accuracy rate (%) | Sensitivity (%) | Specificity (%) |
|---|---|---|---|---|
| Linear | 100 | 71.43 | 60.00 | 77.78 |
| Linear | 300 | 72.73 | 84.21 | 0.00 |
| Linear | 500 | 91.80 | 97.78 | 75.00 |
| Linear | 750 | 84.42 | 95.00 | 47.06 |
| Linear | 1000 | 82.95 | 85.42 | 80.00 |
| Polynomial | 100 | 71.43 | 66.67 | 72.73 |
| Polynomial | 300 | 72.73 | 88.24 | 20.00 |
| Polynomial | 500 | 96.72 | 100.00 | 86.67 |
| Polynomial | 750 | 80.52 | 96.36 | 40.91 |
| Polynomial | 1000 | 84.71 | 83.67 | 86.11 |
| Radial basis function | 100 | 71.43 | 57.14 | 85.71 |
| Radial basis function | 300 | 95.45 | 95.00 | 100.00 |
| Radial basis function | 500 | 99.18 | 100.00 | 96.30 |
| Radial basis function | 750 | 82.56 | 96.88 | 40.91 |
| Radial basis function | 1000 | 90.91 | 100.00 | 82.98 |

Figure 11: Linear SVM on a sample claims dataset (legal and fraudulent bills, training and classified, with the support vectors shown).


4.3. Data Postprocessing: Validation of Classification Results. The classification accuracy on the testing data is a gauge to evaluate the ability of the HICFDS to detect and identify fraudulent claims. The testing data used to assess and evaluate the efficiency of the proposed HICFDS (classifier) are taken exclusively from NHIS headquarters and cover different hospitals within the Greater Accra Region of Ghana. The sampled data with the corresponding fraud types after the analysis are shown in Table 2.

In evaluating the classifiers obtained with the analyzed methods, the most widely employed performance measures are used: accuracy, sensitivity, and specificity, built on the concepts of True Legal (TP), False Fraudulent (FN), False Legal (FP), and True Fraudulent (TN) bills. This classification is shown in Table 3.

The figures below show the SVC plots for the various classifiers (linear, polynomial, and RBF) on the claims datasets (Figures 11–13).

From the performance metrics and overall statistics presented in Table 4, it is observed that the support vector machine performs the best classification with an accuracy of 87.91% using the RBF kernel function, followed by the polynomial kernel with 81.22% accuracy, with the linear SVM emerging as the least performing classifier at an accuracy of 80.67%.

Figure 12: Polynomial SVM on a sample claims dataset.

Figure 13: RBF SVM on a sample claims dataset.

Table 4: Average performance analysis of the SVM classifiers.

| Kernel | Accuracy (%) | Sensitivity (%) | Specificity (%) |
|---|---|---|---|
| Linear | 80.67 | 84.48 | 55.97 |
| Polynomial | 81.22 | 86.99 | 61.28 |
| RBF | 87.91 | 89.80 | 81.18 |

Table 5: Confusion matrix for the SVM classifiers.

| Kernel | Data size | TP | TN | FP | FN | Correct rate |
|---|---|---|---|---|---|---|
| Linear | 100 | 3 | 7 | 2 | 2 | 0.714 |
| Linear | 300 | 16 | 0 | 3 | 3 | 0.713 |
| Linear | 500 | 88 | 24 | 8 | 2 | 0.918 |
| Linear | 750 | 57 | 8 | 9 | 3 | 0.844 |
| Linear | 1000 | 41 | 32 | 8 | 7 | 0.830 |
| Polynomial | 100 | 2 | 8 | 3 | 1 | 0.714 |
| Polynomial | 300 | 15 | 1 | 4 | 2 | 0.723 |
| Polynomial | 500 | 92 | 26 | 4 | 0 | 0.967 |
| Polynomial | 750 | 53 | 9 | 13 | 2 | 0.805 |
| Polynomial | 1000 | 41 | 31 | 5 | 8 | 0.852 |
| Radial basis function | 100 | 4 | 6 | 1 | 3 | 0.714 |
| Radial basis function | 300 | 19 | 2 | 0 | 1 | 0.955 |
| Radial basis function | 500 | 95 | 26 | 1 | 0 | 0.992 |
| Radial basis function | 750 | 62 | 9 | 13 | 2 | 0.922 |
| Radial basis function | 1000 | 41 | 39 | 8 | 0 | 0.919 |


The confusion matrix for the SVM classifiers is given in Table 5, which is utilized in the computation of the performance metrics of the SVM classifiers. For statistical and machine learning classification tasks, a confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of a supervised learning algorithm.
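As an illustration of how the entries of Table 5 yield the metrics in Tables 3 and 4, the sketch below computes the confusion matrix and the derived measures. The variable names yTest and yPred are assumptions (true and predicted labels coded as 'Legal'/'Fraudulent'), and the mapping follows the paper's naming, where TP denotes true legal and TN true fraudulent bills.

```matlab
% Confusion matrix and derived metrics for one classifier run
% (yTest and yPred are assumed label vectors, 'Legal' or 'Fraudulent').
cm = confusionmat(yTest, yPred, 'Order', {'Legal', 'Fraudulent'});
TP = cm(1,1);   % true legal bills
FN = cm(1,2);   % legal bills wrongly flagged (false fraudulent)
FP = cm(2,1);   % fraudulent bills passed as legal (false legal)
TN = cm(2,2);   % true fraudulent bills

accuracy    = (TP + TN) / (TP + TN + FP + FN);   % correct rate in Table 5
sensitivity = TP / (TP + FN);
specificity = TN / (TN + FP);
```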

Besides classification accuracy, the amount of time required to process the sample dataset is also an important consideration in this research. The compared computational times show that an increase in the size of the sample dataset also increases the computational time needed to execute the process, regardless of the machine used, which is widely expected. This difference in time cost is due mainly to the training on the dataset. Thus, as the global data warehouse grows, more computational resources will be needed in machine learning and data mining research pertaining to the detection of insurance fraud, as depicted in Figure 14, which relates the average computational time to the sample data size.
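In principle, a curve such as the one in Figure 14 can be reproduced with a simple timing loop like the sketch below; this is an illustration under assumed inputs X and y, not the authors' benchmarking script.

```matlab
% Record training time of the RBF classifier for increasing sample sizes.
sampleSizes = [100 300 500 750 1000];
elapsed = zeros(size(sampleSizes));
for k = 1:numel(sampleSizes)
    idx = randperm(size(X, 1), sampleSizes(k));   % random subsample of claims
    tic;
    fitcsvm(X(idx, :), y(idx), 'KernelFunction', 'rbf');
    elapsed(k) = toc;
end
plot(sampleSizes, elapsed, '-o');
xlabel('Sample data size'); ylabel('Time (s)');
```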

Figure 15 summarizes the fraudulent claims detected during the testing of the HICFDS with the sample datasets used. As the sample data size increases, the number of suspected claims increases rapidly, based on the various fraud types detected.

Benchmarking the HICFDS analysis ensures an understanding of the HIC outcomes. An increase in the claims dataset size has a corresponding increase in the number of suspected claims. The graph in Figure 16 shows a sudden rise in the level of suspected claims for the tested 100-claim dataset, representing 52% of that sample, after which the proportion of suspected claims continues to increase slightly, by 2%, making up 58% of the tested data size of 300 claims.

Among these fraud types, the most frequent fraudulent act is uncovered services rendered to insurance subscribers by service providers. It accounts for 22% of the fraudulent claims, the most significant proportion of the total health insurance fraud in the total tested dataset. Overbilling of submitted claims is recorded as the second most frequent fraudulent claims type, representing 20% of the total sample dataset used for this research. This is caused by service providers billing for a service at more than the expected tariff for the required diagnoses. Listing and billing for a more complex or higher level of service are done by providers to unfairly boost their financial income flow within otherwise legitimate claims.

Figure 14: Average computational time (s) on the tested sample datasets.

Figure 15: Detected fraud trend on the tested claims datasets (suspected claims versus sample data size).

Figure 16: Chart of the types of fraudulent claims (duplicated, uncovered service, overbilled, unbundled, upcoded, and impersonation claims) across the 100, 300, 500, 750, and 1000 claim datasets.

Table 6: Cost analysis of the tested claims datasets.

| Sample data size | Raw cost of claims (R), GH¢ | Valid claims cost (V), GH¢ | Deviation (R−V), GH¢ | Percentage difference (%) |
|---|---|---|---|---|
| 100 | 20,791.83 | 8,911.72 | 11,880.11 | 133.31 |
| 300 | 31,496.05 | 15,622.70 | 15,873.35 | 101.60 |
| 500 | 58,218.65 | 27,480.96 | 30,737.69 | 111.85 |
| 750 | 88,394.07 | 31,091.58 | 57,302.49 | 184.30 |
| 1000 | 117,448.20 | 47,943.38 | 69,504.82 | 144.97 |
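The percentage-difference column appears to express the deviation relative to the valid claims cost; for example, for the 100-claim sample:

\[
\frac{R-V}{V}\times 100\% \;=\; \frac{20{,}791.83 - 8{,}911.72}{8{,}911.72}\times 100\% \;\approx\; 133.31\%.
\]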


Figure 17: Classification Learner App showing the various algorithms and percentage accuracies in MATLAB.

Table 7: Comparison of the results of the GSVM with decision trees and Naïve–Bayes.

| Algorithm used | Claims dataset | Accuracy with the corresponding dataset (%) | Average over the different datasets (%) |
|---|---|---|---|
| GSVM with radial basis function (RBF) kernel | 100 | 71.43 | 87.906 |
| | 300 | 95.45 | |
| | 500 | 99.18 | |
| | 750 | 82.56 | |
| | 1000 | 90.91 | |
| Decision trees | 100 | 62 | 74.44 |
| | 300 | 78 | |
| | 500 | 77.8 | |
| | 750 | 82.7 | |
| | 1000 | 71.7 | |
| Naïve–Bayes | 100 | 50 | 59.1 |
| | 300 | 61 | |
| | 500 | 56.8 | |
| | 750 | 60.7 | |
| | 1000 | 67 | |


Moreover, some illicit service providers claim to have rendered costly services to insurance subscribers instead of the more affordable ones actually provided. Claims prepared on such expensive services rendered to insurance subscribers represent 8% of the fraudulent claims detected in the total sample dataset. Furthermore, 3.1% of the fraudulent claims in the test data arose from separately billing service procedures that should be considered an integral part of a single procedure, known as unbundled claims. Due to the insecure process for quality delivery of healthcare services, insurance subscribers also contribute to fraudulent claims by loaning their ID cards to family members or third parties, who pretend to be the owners and request health insurance benefits in the healthcare sector. Duplicated claims recorded the minimum contribution to fraudulent claims, at 0.5% of the whole sample dataset.

As observed in Table 6, the cost of the claims bill increases proportionally with an increase in the sample size of the claims bill. This is consistent with the increase in fraudulent claims as the sample size increases. Table 6 lists the raw cost (R) of each sample claims dataset, the valid claims cost (V) after processing, the deviation in the claims bill (R−V), and its percentage representation. There is a 27% financial loss on the total submitted claim bills to insurance carriers; this rate of loss is highest within the 750-claim dataset of submitted claims.

A summary of the results and a comparison with other machine learning algorithms, namely decision trees and Naïve–Bayes, are presented in Table 7.

The MATLAB Classification Learner App [43] was chosen to validate the results obtained above. It enables easy comparison with the different classification algorithms implemented. The data used for the GSVM were subsequently used in the Classification Learner App, as shown below.

Figure 18: Algorithmic runs on the 500-claim dataset.

Figures 17 and 18 show the Classification Learner App with the various implemented algorithms and corresponding accuracies in the MATLAB technical computing environment and the results obtained using the 500-claim dataset, respectively. Figures 19 and 20 depict the subsequent results when the 750- and 1000-claim datasets were utilized for the algorithmic runs and reproducible comparison, respectively. The summarized results and accuracies are illustrated in Table 7, which portrays the effectiveness of our proposed approach of using genetic support vector machines (GSVMs) for fraud detection in insurance claims. From the results, it is evident that the GSVM achieves a higher level of accuracy compared with decision trees and Naïve–Bayes.
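The same comparison can also be scripted directly, without the Classification Learner GUI, using MATLAB's built-in fitcsvm, fitctree, and fitcnb functions. The sketch below, with assumed inputs X and y, scores each model by 10-fold cross-validation accuracy; it is an illustration rather than the exact procedure behind Table 7.

```matlab
% Benchmark an RBF-kernel SVM against a decision tree and Naive Bayes
% on the same GA-selected features (assumed inputs X and y).
models = { ...
    @() fitcsvm(X, y, 'KernelFunction', 'rbf', 'KernelScale', 'auto'), ...
    @() fitctree(X, y), ...
    @() fitcnb(X, y)};
names  = {'SVM (RBF kernel)', 'Decision tree', 'Naive Bayes'};
for k = 1:numel(models)
    acc = 1 - kfoldLoss(crossval(models{k}(), 'KFold', 10));   % 10-fold CV accuracy
    fprintf('%-18s 10-fold CV accuracy: %.2f%%\n', names{k}, 100*acc);
end
```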

5. Conclusions and Recommendations

This work aimed at developing a novel fraud detection model for insurance claims processing based on genetic support vector machines, which hybridizes and draws on the strengths of both genetic algorithms and support vector machines. The GSVM has been investigated and applied in the development of the HICFDS. This paper used the GSVM for the detection of anomalies and the classification of health insurance claims into legitimate and fraudulent claims. SVMs have been considered preferable to other classification techniques due to several advantages. They enable the separation (classification) of claims into legitimate and fraudulent using the soft margin, thus accommodating updates in the generalization performance of the HICFDS. Among its other notable advantages, the SVM has a nonlinear dividing hyperplane, which handles the discrimination within the dataset, and its ability to generalize to newly arrived data for classification favoured it over other classification techniques.

Figure 19: Algorithmic runs on the 750-claim dataset.

Thus, the fraud detection system combines two computational intelligence schemes and achieves higher fraud detection accuracy. The average classification accuracies achieved by the SVCs are 80.67%, 81.22%, and 87.91%, which show the performance capability of the SVC model. These classification accuracies are obtained due to the careful selection of the features for training and developing the model as well as the fine-tuning of the SVCs' parameters using the V-fold cross-validation approach. These results are much better than those obtained using decision trees and Naïve–Bayes.

The average sample dataset testing results for the proposed SVCs vary due to the nature of the claims dataset used. This is noted in the clustering of the claims dataset (MDC specialty). When the sample dataset is heavily skewed towards one MDC specialty (e.g., OPDC), the performance of the SVCs could favour one classifier, especially the linear SVM, as compared with the others. Hence, the behaviour of the dataset has a significant impact on the classification results.

Based on this work, the developed GSVM model was tested and validated using HIC data. The study sought to obtain the best performing classifier for analyzing the health insurance claims datasets for fraud. The RBF kernel was adjudged the best, with an average accuracy rate of 87.91%, and is therefore recommended.

Figure 20: Algorithmic runs on the 1000-claim dataset.


Data Availability

The data used in this study are available upon request. The data can be uploaded when required.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors of this paper wish to acknowledge the Carnegie Corporation of New York, through the University of Ghana under the UG-Carnegie Next Generation of Academics in Africa project, for organizing Write Shops that led to the timely completion of this paper.

Supplementary Materials

The material consists of MS Excel file data collected from some NHIS-approved hospitals in Ghana concerning insurance claims. It is the insurance claims dataset used for testing and implementation. (Supplementary Materials)

References

[1] Government of Ghana, National Health Insurance Act, Act 650, 2003, Ghana, 2003.
[2] Capitation, National Health Insurance Scheme, 2012, http://www.nhis.gov.gh/capitation.aspx.
[3] ICD-10 Version: 2016, http://apps.who.int/classifications/icd10/browse/2016/en.
[4] T. Olson, Examining the Transitional Impact of ICD-10 on Healthcare Fraud Detection, College of Saint Benedict/Saint John's University, Collegeville, MN, USA, 2015.
[5] News Ghana, NHIS Manager Arrested for Fraud | News Ghana, News Ghana, Accra, Ghana, 2014, https://www.newsghana.com.gh/nhis-manager-arrested-for-fraud.
[6] BioClaim, Files, http://www.bioclaim.com/Fraud-Files.
[7] Graphic Online, Ghana news: Dr Ametewee Defrauds NHIA of GH¢415,000—Graphic Online, Graphic Online, Accra, Ghana, 2015, http://www.graphic.com.gh/news/general-news/dr-ametewee-defrauds-nhia-of-gh-415-000.html.
[8] W.-S. Yang and S.-Y. Hwang, "A process-mining framework for the detection of healthcare fraud and abuse," Expert Systems with Applications, vol. 31, no. 1, pp. 56–68, 2006.
[9] G. C. van Capelleveen, Outlier Based Predictors for Health Insurance Fraud Detection within US Medicaid, University of Twente, Enschede, Netherlands, 2013.
[10] Y. Shan, D. W. Murray, and A. Sutinen, "Discovering inappropriate billings with local density-based outlier detection method," in Proceedings of the Eighth Australasian Data Mining Conference, vol. 101, pp. 93–98, Melbourne, Australia, December 2009.
[11] L. D. Weiss and M. K. Sparrow, "License to steal: how fraud bleeds America's health care system," Journal of Public Health Policy, vol. 22, no. 3, pp. 361–363, 2001.
[12] P. Travaille, R. M. Muller, D. Thornton, and J. Van Hillegersberg, "Electronic fraud detection in the US Medicaid healthcare program: lessons learned from other industries," in Proceedings of the 17th Americas Conference on Information Systems (AMCIS), pp. 1–11, Detroit, Michigan, August 2011.
[13] A. Abdallah, M. A. Maarof, and A. Zainal, "Fraud detection system: a survey," Journal of Network and Computer Applications, vol. 68, pp. 90–113, 2016.
[14] A. K. I. Hassan and A. Abraham, "Computational intelligence models for insurance fraud detection: a review of a decade of research," Journal of Network and Innovative Computing, vol. 1, pp. 341–347, 2013.
[15] E. Kirkos, C. Spathis, and Y. Manolopoulos, "Data mining techniques for the detection of fraudulent financial statements," Expert Systems with Applications, vol. 32, no. 4, pp. 995–1003, 2007.
[16] H. Joudaki, A. Rashidian, B. Minaei-Bidgoli et al., "Using data mining to detect health care fraud and abuse: a review of literature," Global Journal of Health Science, vol. 7, no. 1, pp. 194–202, 2015.
[17] V. Rawte and G. Anuradha, "Fraud detection in health insurance using data mining techniques," in Proceedings of the 2015 International Conference on Communication, Information & Computing Technology (ICCICT), pp. 1–5, Mumbai, India, January 2015.
[18] J. Li, K.-Y. Huang, J. Jin, and J. Shi, "A survey on statistical methods for health care fraud detection," Health Care Management Science, vol. 11, no. 3, pp. 275–287, 2008.
[19] Q. Liu and M. Vasarhelyi, "Healthcare fraud detection: a survey and a clustering model incorporating geo-location information," in Proceedings of the 29th World Continuous Auditing and Reporting Symposium, Brisbane, Australia, November 2013.
[20] T. Ekin, F. Leva, F. Ruggeri, and R. Soyer, "Application of Bayesian methods in detection of healthcare fraud," Chemical Engineering Transactions, vol. 33, pp. 151–156, 2013.
[21] Home—The NHCAA, https://www.nhcaa.org/.
[22] S. Viaene, R. A. Derrig, and G. Dedene, "A case study of applying boosting naive Bayes to claim fraud diagnosis," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 5, pp. 612–620, 2004.
[23] Y. Singh and A. S. Chauhan, "Neural networks in data mining," Journal of Theoretical and Applied Information Technology, vol. 5, no. 1, pp. 37–42, 2009.
[24] D. Tomar and S. Agarwal, "A survey on data mining approaches for healthcare," International Journal of Bio-Science and Bio-Technology, vol. 5, no. 5, pp. 241–266, 2013.
[25] P. Vamplew, A. Stranieri, K.-L. Ong, P. Christen, and P. J. Kennedy, "Data mining and analytics 2011," in Proceedings of the Ninth Australasian Data Mining Conference (AusDM'11), Australian Computer Society, Ballarat, Australia, December 2011.
[26] K. S. Ng, Y. Shan, D. W. Murray et al., "Detecting non-compliant consumers in spatio-temporal health data: a case study from Medicare Australia," in Proceedings of the 2010 IEEE International Conference on Data Mining Workshops, pp. 613–622, Sydney, Australia, December 2010.
[27] J. F. Roddick, J. Li, P. Christen, and P. J. Kennedy, "Data mining and analytics 2008," in Proceedings of the 7th Australasian Data Mining Conference (AusDM 2008), vol. 87, pp. 105–110, Glenelg, South Australia, November 2008.
[28] C. Watrin, R. Struffert, and R. Ullmann, "Benford's Law: an instrument for selecting tax audit targets," Review of Managerial Science, vol. 2, no. 3, pp. 219–237, 2008.
[29] F. Lu and J. E. Boritz, "Detecting fraud in health insurance data: learning to model incomplete Benford's Law distributions," in Machine Learning, J. Gama, R. Camacho, P. B. Brazdil, A. M. Jorge, and L. Torgo, Eds., pp. 633–640, Springer, Berlin, Heidelberg, 2005.
[30] P. Ortega, C. J. Figueroa, and G. A. Ruz, "A medical claim fraud/abuse detection system based on data mining: a case study in Chile," DMIN, vol. 6, pp. 26–29, 2006.
[31] T. Back, J. M. De Graaf, J. N. Kok, and W. A. Kosters, Theory of Genetic Algorithms, World Scientific Publishing, River Edge, NJ, USA, 2001.
[32] M. Melanie, An Introduction to Genetic Algorithms, The MIT Press, Cambridge, MA, USA, 1st edition, 1998.
[33] D. Goldberg, Genetic Algorithms in Optimization, Search and Machine Learning, Addison-Wesley, Reading, MA, USA, 1989.
[34] J. Wroblewski, "Theoretical foundations of order-based genetic algorithms," Fundamenta Informaticae, vol. 28, no. 3-4, pp. 423–430, 1996.
[35] J. H. Holland, Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence, MIT Press, Cambridge, MA, USA, 1st edition, 1992.
[36] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 2nd edition, 2000.
[37] J. Salomon, Support Vector Machines for Phoneme Classification, University of Edinburgh, Edinburgh, UK, 2001.
[38] J. Platt, Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines, Microsoft Research, Redmond, WA, USA, 1998.
[39] J. Platt, "Using analytic QP and sparseness to speed training of support vector machines," in Proceedings of the Advances in Neural Information Processing Systems, Cambridge, MA, USA, 1999.
[40] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, A Practical Guide to Support Vector Classification, Data Science Association, Taipei, Taiwan, 2003.
[41] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.
[42] D. Dvorski, Installing, Configuring, and Developing with XAMPP, Ski Canada Magazine, Toronto, Canada, 2007.
[43] MATLAB Classification Learner App, MATLAB Version 2019a, MathWorks Computer Software Company, Natick, MA, USA, 2019, http://www.mathworks.com/help/stats/classification-learner-app.html.




42 GSVM Fraud Detection System Implementation andTesting (e decision support system comprises four mainmodules integrated together namely (1) algorithm imple-mentation using MATLAB technical computing platform(2) development of graphical user interface (GUI) for theHIC fraud detection system which consists of uploading andprocessing of claims management (3) system administratormanagement and (4) postprocessing of detection andclassification results

Figure 6 MATLAB-based decision support engine connection to the database

SVMtraining

and testingdataset

GAe-HICdata

Creation ofclaims record

database

Claimsfiltering and

selection

Featureselection

andextraction

Featureadjustment

Datanormalization

Datapreprocessing

Figure 7 Data preprocessing for SVM training and testing

Journal of Engineering 9

(e front end of the detection system was developedusing XAMPP a free and open-source cross-platform webserver solution stack package developed by Apache Friends[42] consisting mainly of the Apache HTTP ServerMariaDB database and interpreters for scripts written in the

PHP and Perl programming languages [42] XAMPP standsfor Cross-Platform (X) Apache (A) MariaDB (M) PHP (P)and Perl (P) (e Health Insurance Claims Fraud DetectionSystem (HICFDS) was developed using MATLAB technicalcomputing environment with the capability to connect to an

MySQL database

Autocreation of resultsdatabase

Data flowchart for NHIS claims fraud detectionsystem process

Upload in GUI

Model amp GSVMalgorithm

Detected results

DEVELOPED GUI

Engine

NHIS claims dataset

NHIS FRAUD DETECTION SYSTEM USING GSVM

Exploratorydata

analysis

Figure 8 System implementation architecture for HICFDS

Figure 9 Detection results control portal interface

10 Journal of Engineering

external database and a graphical user interface (GUI) forenhanced interactivity with users (e HICFDS consists ofseveral functional components namely (1) function forcomputing the descriptive statistics of raw and processeddata (2) preprocessing wrapper function for data handlingand processing and (3) MATLAB functions for GA Opti-mization and SVM Classification processes (e HICFDScomponents are depicted in Figure 8

(e results generated by the HICFDS are stored inMYSQL database (e results comprise three parts whichare legitimate claims report fraudulent claims and statisticsof the results (ese results are shown in Figure 9 (edeveloped GUI portal for the analysis of results obtainedfrom the classification of the submitted health insuranceclaims is displayed in Figure 9 By clicking on the fraudulentbutton in the GUI a pop-up menu generating the labelledFigure 10 is obtained for the claims dataset It shows thegrouping of detected fraudulent claim types in the datasets

For each classifier a 10-fold cross validation (CV) ofhyperparameters (C γ) from Patients Payment Data (PPD)was performed (e performance measured on GA optimi-zation tested several hyperparameters for the optimal SVM(e SVC training aims for the best SVC parameters (C c) inbuilding the HICFD classifier model (e developed classifieris evaluated using testing and validation data (e accuracy ofthe classifier is evaluated using cross validation (CV) to avoidoverfitting of SVC during training data (e random searchmethod was used for SVC parameter training where expo-nentially growing sequences of hyperparameters (C c) as apractical method to identify suitable parameters were used toidentify SVC parameters and obtain the best CV accuracy forthe classifier claims data samples Random search slightlyvaries from grid search Instead of searching over the entiregrid random search only evaluates a random sample of pointson the grid (is makes the random search a computational

method cheaper than a grid search Experimentally 10-foldCV was used as the measure of the training accuracy where70 of each sample was used for training and the remaining30 used for testing and validation

0Duplicated

claimsUncovered

serviceclaims

Grouping of detected frauds type in sample datasets

Overbilledclaims

Unbundledclaims

Fraud types

Sam

ple d

atas

et

Upcodedclaims

Impersonationclaims

50

100

150

200

250

300

350

400

100 dataset

300 dataset

500 dataset

750 dataset

1000 dataset

Figure 10 Fraud type distribution on the sample data sizes

Table 2 Sample data size and the corresponding fraud types

Fraud typesSample data size

100 300 500 750 1000Duplicate claims 2 4 4 4 0Uncovered service claims 4 56 65 109 406Overbilling claims 44 60 91 121 202Unbundled claims 0 18 10 54 0Upcoded claims 2 34 50 122 6Impersonation claims 0 2 10 23 34Total suspected claims 52 174 230 433 648

Table 3 Summary performance metrics of SVM classifiers onsamples sizes

Description

Kernelsused Data size

Averageaccuracyrate ()

Sensitivity() Specificity ()

Linear

100 7143 6000 7778300 7273 8421 000500 9180 9778 7500750 8442 9500 47061000 8295 8542 8000

Polynomial

100 7143 6667 7273300 7273 8824 2000500 9672 10000 8667750 8052 9636 40911000 8471 8367 8611

Radialbasisfunction

100 7143 5714 8571300 9545 9500 10000500 9918 10000 9630750 8256 9688 40911000 9091 10000 8298

00 20 40 60 80 100 120 140 160

10

20

30

40

50

60

70

80

90

100

Legal bills (training)Legal bills (classified)Fraudulent bills (training)

Fraudulent bills (classified)Support vectors

Figure 11 Linear SVM on a sample claims dataset

Journal of Engineering 11

43 Data Postprocessing Validation of Classification Results(e classification accuracy of the testing data is a gauge toevaluate the ability of the HICFDS to detect and identifyfraudulent claims (e testing data used to assess and

evaluate the efficiency of the proposed HICFDS (classifier)are taken exclusively from NHIS headquarters and coversdifferent hospitals within the Greater Accra Region ofGhana Sampled data with the corresponding fraud typesafter the analysis are shown in Table 2

In evaluating the classifiers obtained with the analyzedmethods the most widely employed performance measuresare used accuracy sensitivity and specificity with theirconcepts of True Legal (TP) False Fraudulent (FN) FalseLegal (FP) and True Fraudulent (TN) (is classification isshown in Table 3

(e figures below show the SVC plots on the variousclassifiers (linear polynomial and RBF) on the claimsdatasets (Figures 11ndash13)

From the performance metrics and overall statisticspresented in Table 4 it is observed that the support vectormachine performs better classification with an accuracy of8791 using the RBF kernel function followed by the

00 20 40 60 80 100 120 140 160

10

20

30

40

50

60

70

80

90

100

Legal bills (training)Legal bills (classified)Fraudulent bills (training)

Fraudulent bills (classified)Support vectors

Figure 12 Polynomial SVM on a sample claims dataset

00 20 40 60 80 100 120 140 160

10

20

30

40

50

60

70

80

90

100

Legal bills (training)Legal bills (classified)Fraudulent bills (training)

Fraudulent bills (classified)Support vectors

Figure 13 RBF SVM on a sample claims dataset

Table 4 Averages performance analysis of SVM classifiers

Description Accuracy Sensitivity SpecificityLinear 8067 8448 5597Polynomial 8122 8699 6128RBF 8791 8980 8118

Table 5 Confusion matrix for SVM classifiers

Description Data size TP TN FP FN Correct rate

Linear

100 3 7 2 2 714300 16 0 3 3 713500 88 24 8 2 918750 57 8 9 3 8441000 41 32 8 7 830

Polynomial

100 2 8 3 1 714300 15 1 4 2 723500 92 26 4 0 967750 53 91 13 2 8051000 41 31 5 8 852

Radial basis function

100 4 6 1 3 714300 19 2 0 1 955500 95 26 1 0 992750 62 9 13 2 9221000 41 39 8 0 919

12 Journal of Engineering

polynomial kernel with 8122 accuracy and hence linearSVM emerging as the least performance classifier with anaccuracy of 8067 (e confusion matrix for the SSVMclassifiers is given in Table 5 where i utilized in the com-putation of the performance metric of the SVM classifiersFor the purpose of statistical and machine learning classi-fication tasks a confusion matrix also known as an errormatrix is a specific table layout that allows visualization ofthe performance of a supervised learning algorithm

Besides classification the amount of time required inprocessing the sample dataset is also an important con-sideration in this research From the above the comparedcomputational time shows that increase in the size of thesample dataset also increases the computational time neededto execute the process regardless of the machine used whichis widely expected(is difference in time costs is merely dueto the cause of training the dataset (us as global datawarehouse increases more computational resources will beneeded in machine learning and data mining researchpertaining to the detection of insurance fraud as depicted inFigure 14 relating the average computational time andsample data

Figure 15 summarizes the fraudulent claims detectedduring the testing of the HICFD with the sample datasetused As the sample data size increases the number ofsuspected claims increases rapidly based on the variousfraudulent types detected

Benchmarking HICFD analysis ensures understandingof HIC outcomes From the chart above an increase in theclaims dataset has a corresponding increase in the number ofsuspected claims(e graph in Figure 16 shows a sudden risein the level of suspected claims on tested 100 datasets rep-resenting 52 of the sample dataset after which it continuesto increase slightly on the suspected numbers of claims by2 to make up 58 on the tested data size of 300 claims

Among these fraud types the most frequent fraudulentact is uncovered services rendered to insurance subscribersby service providers It accounts for 22 of the fraudulentclaims as to the most significant proportion of the totalhealth insurance fraud on the total tested dataset Conse-quently overbilling of submitted claims is recorded as thesecond fraudulent claims type representing 20 of the totalsample dataset used for this research (is is caused byservice providers billing for a service greater than the ex-pected tariff to the required diagnoses Listing and billing fora more complex or higher level of service by providers aredone to boost their financial income flow unfairly in thelegitimate claims

500Average computational time

400

300

200

Tim

e (s)

1000 200 400 600

Sample data size800 1000

Figure 14 Computational time on the tested sample dataset

800

Susp

ecte

d cl

aim

s

600

400

200

0

Sample data size0 200 400 600 800 1000

Figure 15 Detected fraud trend on the tested claims dataset

0100 dataset

2 4 4

5660

1834

2

65

91

50

10 104

44

0 2 0300 dataset 500 dataset 750 dataset 1000 dataset

50

100

150

200

250

300

350

400

Duplicated claimsUncovered serviceOverbilled claims

UnbundledUpcodedImpersonation

121122109

4

54

2300 6

34

202

406

Figure 16 Chart of types of fraudulent claims

Table 6 Cost analysis of tested claims dataset

Sampledatasize

Raw costof claims

(R)GHC

Validclaimscost (V)

Deviation(RndashV)

Percentagedifference

100 2079183 891172 1188011 13331300 3149605 156227 1587335 10160500 5821865 2748096 3073769 11185750 8839407 3109158 5730249 184301000 1174482 4794338 6950482 14497

Journal of Engineering 13

Figure 17 Classification Learner App showing the various algorithms and percentage accuracies in MATLAB

Table 7 Comparison of results of GSVM with decision trees and NaıvendashBayes

Description of the algorithm used Claims dataset Accuracy obtained with the correspondingdataset

Average value over differentdatasets

GSVM with radial basis function (RBF) kernel

100 7143

87906300 9545500 9918750 82561000 9091

Decision trees

100 62

7444300 78500 778750 8271000 717

NaıvendashBayes

100 50

591300 61500 568750 6071000 67

14 Journal of Engineering

Moreover some illicit service providers claim to haverendered service to insurance subscribers on costly servicesinstead of providing more affordable ones Claims preparedon expensive service rendered to insurance subscribersrepresent 8 of the fraudulent claims detected on the totalsample dataset Furthermore 31 of service procedure thatshould be considered an integral part of a single procedureknown as the unbundle claims contributed to the fraudulentclaims of the set of claims dataset used as the test data Due tothe insecure process for quality delivery of healthcare ser-vice insurance subscribers are also contributing to thefraudulent type of claims by loaning their ID cards to familymembers of the third party who pretend to be owners andrequest for the HIS benefits in the healthcare sector Du-plicated claims as part of the fraudulent act recorded theminimum rate of 05 of contribution to fraudulent claimsin the whole sample dataset

As observed in Table 6 the cost of the claims bill in-creases proportionally with an increase in the sample size ofthe claims bill (is is consistent with an increase infraudulent claims as sample size increases From Table 6 wecan see the various costs for each raw record (R) of sampleclaim dataset Valid claims bill after processing dataset thevariation in the claims bill (RndashV) and their percentagerepresentation as well are illustrated in Table 6 (ere is a27 financial loss of the total submitted claim bills to in-surance carriers(is loss is the highest rate of loss within the750 datasets of submitted claims

Summary of results and comparison with other machinelearning algorithms such as decision trees and NaıvendashBayesis presented in Table 7

(e MATLAB Classification Learner App [43] waschosen to validate the results obtained above It enables easeof comparison with the different methods of classification

Figure 18 Algorithmic runs on 500-claim dataset

Journal of Engineering 15

algorithms implemented (e data used for the GSVM weresubsequently used in the Classification Learner App asshown below

Figures 17 and 18 show the classification learner appwith the various implemented algorithms and corre-sponding accuracies in MATLAB technical computinglanguage environment and the results obtained using the500-claim dataset respectively Figures 19 and 20 depict thesubsequent results when the 750- and 1000-claim datasetswere utilized for the algorithmic runs and reproduciblecomparison respectively (e summarized results and ac-curacies are illustrated in Table 7 (e summarized results inTable 7 portray the effectiveness of our proposed approach ofusing the genetic support vector machines (GSVMs) forfraud detection of insurance claims From the result it isevident that GSVM achieves a higher level of accuracycompared to decision trees and NaıvendashBayes

5 Conclusions and Recommendations

(is work aimed at developing a novel fraud detectionmodel for insurance claims processing based on geneticsupport vector machines which hybridizes and draws onthe strengths of both genetic algorithms and supportvector machines (e GSVM has been investigated andapplied in the development of HICFDS (is paper usedGSVM for detection of anomalies and classification ofhealth insurance claims into legitimate and fraudulentclaims SVMs have been considered preferable to otherclassification techniques due to several advantages (eyenable separation (classification) of claims into legitimateand fraudulent using the soft margin thus accommodatingupdates in the generalization performance of HICFDSWith other notable advantages it has a nonlinear dividing

Figure 19 Algorithmic runs on 750-claim dataset

16 Journal of Engineering

hyperplane which prevails over the discrimination withinthe dataset (e generalization ability of any newly arriveddata for classification was considered over other classifi-cation techniques

(us the fraud detection system provides a combinationof two computational intelligence schemes and achieveshigher fraud detection accuracy (e average classificationaccuracies achieved by the SVCs are 8067 8122 and8791 which show the performance capability of the SVCsmodel(ese classification accuracies are obtained due to thecareful selection of the features for training and developingthe model as well as fine-tuning the SVCsrsquo parameters usingtheV-fold cross-validation approach(ese results are muchbetter than those obtained using decision trees andNaıvendashBayes

(e average sample dataset testing results for theproposed SVCs vary due to the nature of the claims dataset

used (is is noted in the cluster of the claims dataset(MDC specialty) When the sample dataset is muchskewed to one MDC specialty (eg OPDC) the perfor-mance of the SVCs could tune to one classifier especiallythe linear SVM as compared to others Hence the be-haviour of the dataset has a significant impact on clas-sification results

Based on this work the developed GSVM model wastested and validated using HIC data (e study sought toobtain the best performing classifier for analyzing the healthinsurance claims datasets for fraud (e RBF kernel wasadjudged the best with an average accuracy rate of 8791(e RBF kernel is therefore recommended

Figure 20 Algorithmic runs on the 1000-claim dataset

Journal of Engineering 17

Data Availability

(e data used in this study are available upon request (edata can be uploaded when required

Conflicts of Interest

(e authors declare that they have no conflicts of interest

Acknowledgments

(e authors of this paper wish to acknowledge the CarnegieCorporation of New York through the University of Ghanaunder the UG-Carnegie Next Generation of Academics inAfrica project for organizing Write Shops that led to thetimely completion of this paper

Supplementary Materials

(e material consists of MS Excel file data collected fromsome NHIS-approved hospitals in Ghana concerning in-surance claims Its insurance claims dataset used for testingand implementation (Supplementary Materials)

References

[1] G of Ghana National Health Insurance Act Act 650 2003Ghana 2003

[2] Capitation National Health Insurance Scheme 2012 httpwwwnhisgovghcapitationaspx

[3] ICD-10 Version2016 httpappswhointclassificationsicd10browse2016en

[4] T Olson Examining the Transitional Impact of ICD-10 onHealthcare Fraud Detection College of Saint BenedictSaintJohnrsquos University Collegeville MN USA 2015

[5] News Ghana NHIS Manager Arrested for Fraud | NewsGhana News Ghana Accra Ghana 2014 httpswwwnewsghanacomghnhis-manager-arrested-for-fraud

[6] BioClaim Files httpwwwbioclaimcomFraud-Files[7] Graphics Online Ghana news Dr Ametewee Defrauds NHIA

of GHcent415000mdashGraphic Online Graphics Online AccraGhana 2015 httpwwwgraphiccomghnewsgeneral-newsdr-ametewee-defrauds-nhia-of-gh-415-000html

[8] W-S Yang and S-Y Hwang ldquoA process-mining frameworkfor the detection of healthcare fraud and abuserdquo ExpertSystems with Applications vol 31 no 1 pp 56ndash68 2006

[9] G C van Capelleveen Outlier Based Predictors for HealthInsurance Fraud Detection within US Medicaid University ofTwente Enschede Netherlands 2013

[10] Y Shan D W Murray and A Sutinen ldquoDiscovering in-appropriate billings with local density-based outlier detectionmethodrdquo in Proceedings of the Eighth Australasian DataMining Conference vol 101 pp 93ndash98 Melbourne AustraliaDecember 2009

[11] L D Weiss and M K Sparrow ldquoLicense to steal how fraudbleeds Americarsquos health care systemrdquo Journal of Public HealthPolicy vol 22 no 3 pp 361ndash363 2001

[12] P Travaille RMMuller D(ornton and J VanHillegersbergldquoElectronic fraud detection in the US Medicaid healthcareprogram lessons learned from other industriesrdquo in Proceedingsof the 17th Americas Conference on Information Systems(AMCIS) pp 1ndash11 Detroit Michigan August 2011

[13] A Abdallah M A Maarof and A Zainal ldquoFraud detectionsystem a surveyrdquo Journal of Network and Computer Appli-cations vol 68 pp 90ndash113 2016

[14] A K I Hassan and A Abraham ldquoComputational intelligencemodels for insurance fraud detection a review of a decade ofresearchrdquo Journal of Network and Innovative Computingvol 1 pp 341ndash347 2013

[15] E Kirkos C Spathis and Y Manolopoulos ldquoData Miningtechniques for the detection of fraudulent financial state-mentsrdquo Expert Systems with Applications vol 32 no 4pp 995ndash1003 2007

[16] H Joudaki A Rashidian B Minaei-Bidgoli et al ldquoUsing datamining to detect health care fraud and abuse a review ofliteraturerdquo Global Journal of Health Science vol 7 no 1pp 194ndash202 2015



The WEKA machine learning and knowledge analysis environment was used for feature selection and extraction, while the data processing code was written in the MATLAB technical computing environment. The developed MATLAB-based decision support engine was connected to the MySQL database using the script shown in Figure 6.
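The following is a minimal sketch, not the script of Figure 6, of how such a MATLAB-to-MySQL connection can be made with the Database Toolbox (a suitable MySQL JDBC/ODBC driver is assumed); the database name, credentials, table, and column names are placeholders:

```matlab
% Hedged sketch: connect the MATLAB decision support engine to a MySQL
% claims database and pull the claim fields used later for classification.
% 'nhis_claims', 'dss_user', 'secret', and the table/column names are
% placeholders, not the names used by the authors.
conn = database('nhis_claims', 'dss_user', 'secret', ...
                'Vendor', 'MySQL', 'Server', 'localhost');
claims = fetch(conn, ['SELECT attendance_date, hospital_code, gdrg_code, ' ...
                      'service_bill, drug_bill FROM claims']);
close(conn);   % release the connection when done
```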

Preprocessing of the raw data involves claims cost validity checks. The tariff dataset consists of the approved tariffs for each diagnostic-related group, which were strictly enforced to clean the data before further processing. Claims are partitioned into two groups, namely, (1) claims with a valid, approved cost within each DRG and (2) claims with invalid costs (those above the approved tariffs within each DRG).
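A minimal sketch of this validity check is shown below; it assumes the claims and tariffs have already been loaded into MATLAB tables named claims and tariffs with the column names used here (these names are illustrative, not the authors'):

```matlab
% Hedged sketch: split claims into valid and invalid groups by comparing each
% claim's billed cost against the approved tariff for its GDRG.
approved = containers.Map(tariffs.gdrg_code, tariffs.approved_tariff);
cost     = claims.service_bill + claims.drug_bill;          % total billed cost
limit    = cellfun(@(g) approved(g), claims.gdrg_code);     % approved tariff per claim
isValid  = cost <= limit;                                   % within the approved DRG tariff
validClaims   = claims(isValid,  :);    % passed on to feature selection and SVM
invalidClaims = claims(~isValid, :);    % flagged: cost above the approved tariff
```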

With the recent increase in the volume and dimensionality of real claims data, there is an urgent need for faster, more reliable, and cost-effective data mining techniques for building classification models. These techniques require the extraction of a smaller, optimized set of features, which can be obtained by removing largely redundant, irrelevant, and unnecessary features for the class prediction [41].

Feature selection algorithms are utilized to extract a minimal subset of attributes such that the resulting probability distribution of the data classes is close to the original distribution obtained using all attributes. Based on the idea of survival of the fittest, a new population is constructed from the fittest rules in the current population together with the offspring of these rules. Offspring are generated by applying genetic operators such as crossover and mutation, and the process continues until it evolves a population N in which every rule satisfies the fitness threshold. With an initial population of 20 instances, generation continued up to the 20th generation, with a crossover probability of 0.6 and a mutation probability of 0.033. The features selected by the genetic algorithm are "Attendance date," "Hospital code," "GDRG code," "Service bill," and "Drug bill." These are the features selected, extracted, and used as the basis for the optimization problem formulated below.
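A minimal sketch of such a GA-driven feature subset selection loop is given below (roulette-wheel selection, single-point crossover, and bit-flip mutation, with the population size, generation count, and crossover/mutation probabilities reported above); it is not the authors' implementation and assumes a feature matrix X and label vector y are already in the workspace:

```matlab
% Hedged sketch of GA-based feature subset selection; fitness is the 10-fold
% cross-validated accuracy of an RBF SVM trained on the selected columns.
function best = gaFeatureSelect(X, y)
    nFeat = size(X, 2);
    popSize = 20; nGen = 20; pc = 0.6; pm = 0.033;    % values reported in the paper
    pop = rand(popSize, nFeat) > 0.5;                 % random binary chromosomes
    fit = zeros(popSize, 1);
    for g = 1:nGen
        for i = 1:popSize, fit(i) = fitness(pop(i,:), X, y); end
        fit = max(fit, eps);                          % guard against all-zero fitness
        % Roulette-wheel selection proportional to fitness
        p = cumsum(fit) / sum(fit);
        idx = zeros(popSize, 1);
        for i = 1:popSize, idx(i) = find(rand <= p, 1); end
        parents = pop(idx, :);
        % Single-point crossover with probability pc
        for k = 1:2:popSize-1
            if rand < pc
                cut = randi(nFeat - 1);
                tmp = parents(k, cut+1:end);
                parents(k, cut+1:end)   = parents(k+1, cut+1:end);
                parents(k+1, cut+1:end) = tmp;
            end
        end
        pop = xor(parents, rand(popSize, nFeat) < pm);  % bit-flip mutation
    end
    for i = 1:popSize, fit(i) = fitness(pop(i,:), X, y); end
    [~, b] = max(fit);
    best = pop(b, :);                                   % logical mask of selected features
end

function f = fitness(mask, X, y)
    if ~any(mask), f = 0; return; end
    mdl = fitcsvm(X(:, mask), y, 'KernelFunction', 'rbf', 'Standardize', true);
    f = 1 - kfoldLoss(crossval(mdl, 'KFold', 10));      % 10-fold CV accuracy
end
```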

$$
\begin{aligned}
\text{Minimize}\quad & \text{Total}_{\text{cost}} = f\left(S_{\text{bill}}, D_{\text{bill}}\right)\\
\text{subject to}\quad & \sum_{i=1}^{n} S_{i,\text{bill}} \le G_{\text{tariff}}, \qquad \forall i,\ i = 1, 2, \ldots, n,\\
& \sum_{j=1}^{n} D_{j,\text{bill}} \le D_{\text{tariff}}, \qquad \forall j,\ j = 1, 2, \ldots, n,
\end{aligned}
\tag{24}
$$

where $S_{\text{bill}}$ is the service bill and $D_{\text{bill}}$ is the drug bill.

The GA e-HIC dataset is subjected to SVM training using 70% of the dataset, with the remaining 30% used for testing, as depicted in Figure 7.

The e-HIC dataset that passes the preprocessing stage, that is, the valid claims, was used for SVM training and testing. The best data, those that meet the genetic algorithm's criteria, are classified first. Each record of this dataset is classified as either "Fraudulent Bills" or "Legal Bills."

Figure 4: Conceptual model design and development of the genetic support vector machines.
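A minimal sketch of this training and testing step is shown below, using the 70/30 split mentioned above and the three kernels detailed in the next paragraph; the variable names features and labels (assumed to be a numeric matrix and a categorical label vector) are placeholders:

```matlab
% Hedged sketch: hold-out split and the three SVM kernels (linear, cubic
% polynomial, RBF) used to label claims as "Fraudulent Bills" or "Legal Bills".
cv  = cvpartition(labels, 'HoldOut', 0.30);            % 70% training, 30% testing
Xtr = features(training(cv), :);  ytr = labels(training(cv));
Xte = features(test(cv), :);      yte = labels(test(cv));

linMdl  = fitcsvm(Xtr, ytr, 'KernelFunction', 'linear');
polyMdl = fitcsvm(Xtr, ytr, 'KernelFunction', 'polynomial', 'PolynomialOrder', 3);
rbfMdl  = fitcsvm(Xtr, ytr, 'KernelFunction', 'rbf');

% Hold-out accuracy for each kernel (labels assumed categorical)
acc = @(m) mean(predict(m, Xte) == yte);
fprintf('linear %.3f  polynomial %.3f  rbf %.3f\n', acc(linMdl), acc(polyMdl), acc(rbfMdl));
```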

The same SVM training and testing datasets are applied to the SVM algorithm for its performance analysis. The inbuilt MATLAB code for the SVM classifiers was integrated as one function for the linear, polynomial, and RBF kernels. The claims datasets were partitioned for classifier training, testing, and validation: 70% of the dataset was used for training and 30% for testing. The linear, polynomial, and radial basis function SVM classification kernels were used, with ten-fold cross-validation for each kernel, and the results were averaged. For the polynomial classification kernel, a cubic polynomial was used. The RBF classification kernel used the SMO method [40]. This method ensures the handling of large data sizes, as it performs the data transformation through kernelization. After running many instances and varying the RBF parameters, a variance of 0.9 gave better results, as it corresponded well with the datasets used for the classification. After each classification, the correct rate is calculated and the confusion matrix is extracted. The confusion matrix gives a count of the true legal, true fraudulent, false legal, false fraudulent, and inconclusive bills:

(i) True legal bills: the number of "Legal Bills" that were correctly classified as "Legal Bills" by the classifier.

(ii) True fraudulent bills: the number of "Fraudulent Bills" that were correctly classified as "Fraudulent Bills" by the classifier.

(iii) False legal bills: the bills classified as "Legal Bills" even though they are not, that is, bills wrongly classified as "Legal Bills" by the kernel used.

(iv) False fraudulent bills: the bills that the classifier wrongly classified as fraudulent; the confusion matrix gives a count of these incorrectly classified bills.

(v) Inconclusive bills: these consist of non-classified bills.

Figure 5: Flow chart for the design and development of the genetic support vector machines.

The correct rate: this is calculated as the total number of correctly classified bills, namely, the true legal bills and the true fraudulent bills, divided by the total number of bills used for the classification:

$$\text{correct rate} = \frac{\text{number of TLB} + \text{number of TFB}}{\text{total number of bills (TB)}}, \tag{25}$$

where TLB = True Legal Bills and TFB = True Fraudulent Bills.

$$\text{accuracy} = 1 - \text{Error} = \frac{TP + TN}{TP + TN + FP + FN} = \Pr(C), \tag{26}$$

the probability of a correct classification.

4.1.1. Sensitivity. This is the statistical measure of the proportion of actual fraudulent claims that are correctly detected:

$$\text{sensitivity} = \frac{TP}{TP + FN} = \frac{TP}{P}. \tag{27}$$

4.1.2. Specificity. This is the statistical measure of the proportion of negative (non-fraudulent) claims that are correctly classified:

$$\text{specificity} = \frac{TN}{TN + FP} = \frac{TN}{N}. \tag{28}$$
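A minimal sketch of computing these metrics from predicted and true labels (assumed cell arrays of character vectors yhat and ytrue, with "Fraudulent" treated as the positive class) is:

```matlab
% Hedged sketch: build the 2x2 confusion matrix and evaluate equations (26)-(28).
cm = confusionmat(ytrue, yhat, 'Order', {'Fraudulent', 'Legal'});
TP = cm(1,1);  FN = cm(1,2);     % fraudulent bills: detected / missed
FP = cm(2,1);  TN = cm(2,2);     % legal bills: misflagged / correctly classified
accuracy    = (TP + TN) / (TP + TN + FP + FN);   % equation (26)
sensitivity = TP / (TP + FN);                    % equation (27)
specificity = TN / (TN + FP);                    % equation (28)
```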

4.2. GSVM Fraud Detection System Implementation and Testing. The decision support system comprises four main modules integrated together, namely, (1) algorithm implementation using the MATLAB technical computing platform, (2) development of a graphical user interface (GUI) for the HIC fraud detection system, which handles the uploading and processing of claims, (3) system administrator management, and (4) postprocessing of the detection and classification results.

Figure 6: MATLAB-based decision support engine connection to the database.

Figure 7: Data preprocessing for SVM training and testing.


The front end of the detection system was developed using XAMPP, a free and open-source cross-platform web server solution stack package developed by Apache Friends [42], consisting mainly of the Apache HTTP Server, the MariaDB database, and interpreters for scripts written in the PHP and Perl programming languages [42]. XAMPP stands for Cross-Platform (X), Apache (A), MariaDB (M), PHP (P), and Perl (P). The Health Insurance Claims Fraud Detection System (HICFDS) was developed in the MATLAB technical computing environment with the capability to connect to an external database and a graphical user interface (GUI) for enhanced interactivity with users. The HICFDS consists of several functional components, namely, (1) a function for computing the descriptive statistics of raw and processed data, (2) a preprocessing wrapper function for data handling and processing, and (3) MATLAB functions for the GA optimization and SVM classification processes. The HICFDS components are depicted in Figure 8.

Figure 8: System implementation architecture for the HICFDS.

Figure 9: Detection results control portal interface.

The results generated by the HICFDS are stored in a MySQL database. The results comprise three parts, namely, the legitimate claims report, the fraudulent claims, and the statistics of the results, as shown in Figure 9. The developed GUI portal for the analysis of the results obtained from the classification of the submitted health insurance claims is also displayed in Figure 9. Clicking the fraudulent button in the GUI opens a pop-up menu that generates Figure 10 for the claims dataset, showing the grouping of detected fraudulent claim types in the datasets.

For each classifier, a 10-fold cross-validation (CV) of the hyperparameters (C, γ) from the Patients Payment Data (PPD) was performed. The performance measured during GA optimization tested several hyperparameters for the optimal SVM. The SVC training aims for the best SVC parameters (C, γ) in building the HICFD classifier model. The developed classifier is evaluated using testing and validation data, and its accuracy is assessed using cross-validation (CV) to avoid overfitting of the SVC to the training data. The random search method was used for SVC parameter training, where exponentially growing sequences of the hyperparameters (C, γ) were used as a practical way to identify suitable SVC parameters and obtain the best CV accuracy for the claims data samples. Random search varies slightly from grid search: instead of searching over the entire grid, random search evaluates only a random sample of points on the grid, which makes it computationally cheaper than a grid search. Experimentally, 10-fold CV was used as the measure of the training accuracy, where 70% of each sample was used for training and the remaining 30% for testing and validation.
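A minimal sketch of this random search over exponentially growing (C, γ) sequences with 10-fold CV is shown below; the grids, trial count, and variable names X and y are illustrative assumptions, not the authors' settings:

```matlab
% Hedged sketch: random search for RBF-SVM hyperparameters (C, gamma).
rng(1);                                    % for reproducibility
Cgrid     = 2.^(-5:2:15);                  % exponentially growing sequence for C
gammaGrid = 2.^(-15:2:3);                  % exponentially growing sequence for gamma
nTrials   = 20;                            % random sample of points on the grid
best = struct('acc', 0, 'C', NaN, 'gamma', NaN);
for t = 1:nTrials
    C     = Cgrid(randi(numel(Cgrid)));
    gamma = gammaGrid(randi(numel(gammaGrid)));
    % In fitcsvm's Gaussian kernel, gamma corresponds to 1/KernelScale^2
    mdl = fitcsvm(X, y, 'KernelFunction', 'rbf', ...
                  'BoxConstraint', C, 'KernelScale', 1/sqrt(gamma));
    acc = 1 - kfoldLoss(crossval(mdl, 'KFold', 10));   % 10-fold CV accuracy
    if acc > best.acc
        best = struct('acc', acc, 'C', C, 'gamma', gamma);
    end
end
disp(best)                                 % best (C, gamma) found and its CV accuracy
```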

Figure 10: Fraud type distribution on the sample data sizes.

Table 2: Sample data size and the corresponding fraud types.

Fraud types                 Sample data size
                            100    300    500    750    1000
Duplicate claims              2      4      4      4       0
Uncovered service claims      4     56     65    109     406
Overbilling claims           44     60     91    121     202
Unbundled claims              0     18     10     54       0
Upcoded claims                2     34     50    122       6
Impersonation claims          0      2     10     23      34
Total suspected claims       52    174    230    433     648

Table 3: Summary performance metrics of the SVM classifiers on the sample sizes.

Kernel used              Data size   Average accuracy (%)   Sensitivity (%)   Specificity (%)
Linear                   100         71.43                   60.00             77.78
                         300         72.73                   84.21              0.00
                         500         91.80                   97.78             75.00
                         750         84.42                   95.00             47.06
                         1000        82.95                   85.42             80.00
Polynomial               100         71.43                   66.67             72.73
                         300         72.73                   88.24             20.00
                         500         96.72                  100.00             86.67
                         750         80.52                   96.36             40.91
                         1000        84.71                   83.67             86.11
Radial basis function    100         71.43                   57.14             85.71
                         300         95.45                   95.00            100.00
                         500         99.18                  100.00             96.30
                         750         82.56                   96.88             40.91
                         1000        90.91                  100.00             82.98

Figure 11: Linear SVM on a sample claims dataset.


4.3. Data Postprocessing: Validation of Classification Results. The classification accuracy on the testing data is a gauge to evaluate the ability of the HICFDS to detect and identify fraudulent claims. The testing data used to assess and evaluate the efficiency of the proposed HICFDS (classifier) are taken exclusively from the NHIS headquarters and cover different hospitals within the Greater Accra Region of Ghana. The sampled data with the corresponding fraud types after the analysis are shown in Table 2.

In evaluating the classifiers obtained with the analyzed methods, the most widely employed performance measures are used, namely, accuracy, sensitivity, and specificity, built on the concepts of True Legal (TP), False Fraudulent (FN), False Legal (FP), and True Fraudulent (TN). This classification is shown in Table 3.

The figures below show the SVC plots for the various classifiers (linear, polynomial, and RBF) on the claims datasets (Figures 11-13).

From the performance metrics and overall statistics presented in Table 4, it is observed that the support vector machine performs the best classification with an accuracy of 87.91% using the RBF kernel function, followed by the polynomial kernel with 81.22% accuracy, while the linear SVM emerges as the weakest classifier with an accuracy of 80.67%.

Figure 12: Polynomial SVM on a sample claims dataset.

Figure 13: RBF SVM on a sample claims dataset.

Table 4: Average performance analysis of the SVM classifiers.

Description   Accuracy (%)   Sensitivity (%)   Specificity (%)
Linear        80.67          84.48             55.97
Polynomial    81.22          86.99             61.28
RBF           87.91          89.80             81.18

Table 5: Confusion matrix for the SVM classifiers.

Description             Data size   TP   TN   FP   FN   Correct rate
Linear                  100          3    7    2    2   0.714
                        300         16    0    3    3   0.713
                        500         88   24    8    2   0.918
                        750         57    8    9    3   0.844
                        1000        41   32    8    7   0.830
Polynomial              100          2    8    3    1   0.714
                        300         15    1    4    2   0.723
                        500         92   26    4    0   0.967
                        750         53   91   13    2   0.805
                        1000        41   31    5    8   0.852
Radial basis function   100          4    6    1    3   0.714
                        300         19    2    0    1   0.955
                        500         95   26    1    0   0.992
                        750         62    9   13    2   0.922
                        1000        41   39    8    0   0.919


The confusion matrix for the SVM classifiers is given in Table 5 and is used in the computation of the performance metrics of the SVM classifiers. For statistical and machine learning classification tasks, a confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of a supervised learning algorithm.

Besides classification, the amount of time required to process the sample dataset is also an important consideration in this research. The comparison of computational times shows that an increase in the size of the sample dataset also increases the computational time needed to execute the process, regardless of the machine used, which is widely expected. This difference in time cost is due mainly to training on the dataset. Thus, as the global data warehouse grows, more computational resources will be needed in machine learning and data mining research pertaining to the detection of insurance fraud, as depicted in Figure 14, which relates the average computational time to the sample data size.

Figure 15 summarizes the fraudulent claims detected during the testing of the HICFD with the sample datasets used. As the sample data size increases, the number of suspected claims increases rapidly across the various fraud types detected.

Benchmarking the HICFD analysis ensures an understanding of the HIC outcomes. From the chart, an increase in the claims dataset has a corresponding increase in the number of suspected claims. The graph in Figure 16 shows a sudden rise in the level of suspected claims on the tested 100-claim dataset, representing 52% of the sample, after which the proportion of suspected claims increases slightly, by 2%, to 58% on the tested data size of 300 claims.

Among these fraud types, the most frequent fraudulent act is uncovered services rendered to insurance subscribers by service providers. It accounts for 22% of the fraudulent claims, the most significant proportion of the total health insurance fraud in the total tested dataset. Overbilling of submitted claims is the second most frequent fraud type, representing 20% of the total sample dataset used in this research. It is caused by service providers billing for a service at a rate greater than the expected tariff for the required diagnoses. Listing and billing for a more complex or higher level of service are done by providers to boost their financial income flow unfairly within otherwise legitimate claims.

Figure 14: Computational time on the tested sample datasets.

Figure 15: Detected fraud trend on the tested claims datasets.

Figure 16: Chart of the types of fraudulent claims.

Table 6: Cost analysis of the tested claims datasets (GH¢).

Sample data size   Raw cost of claims (R)   Valid claims cost (V)   Deviation (R-V)   Percentage difference (%)
100                 20,791.83                 8,911.72              11,880.11         133.31
300                 31,496.05                15,622.70              15,873.35         101.60
500                 58,218.65                27,480.96              30,737.69         111.85
750                 88,394.07                31,091.58              57,302.49         184.30
1000               117,448.20                47,943.38              69,504.82         144.97


Figure 17: Classification Learner App showing the various algorithms and percentage accuracies in MATLAB.

Table 7: Comparison of results of the GSVM with decision trees and Naïve Bayes.

Description of the algorithm used              Claims dataset   Accuracy (%)   Average over datasets (%)
GSVM with radial basis function (RBF) kernel   100              71.43          87.906
                                               300              95.45
                                               500              99.18
                                               750              82.56
                                               1000             90.91
Decision trees                                 100              62             74.44
                                               300              78
                                               500              77.8
                                               750              82.7
                                               1000             71.7
Naïve Bayes                                    100              50             59.1
                                               300              61
                                               500              56.8
                                               750              60.7
                                               1000             67


Moreover, some illicit service providers claim to have rendered costly services to insurance subscribers instead of the more affordable ones actually provided. Claims prepared for expensive services rendered to insurance subscribers represent 8% of the fraudulent claims detected in the total sample dataset. Furthermore, 31% of the fraudulent claims are unbundled claims, in which service procedures that should be considered an integral part of a single procedure are billed separately. Due to the insecure process for quality delivery of healthcare services, insurance subscribers also contribute to fraudulent claims by loaning their ID cards to family members or third parties who pretend to be the owners and request health insurance benefits in the healthcare sector. Duplicated claims recorded the minimum contribution to fraudulent claims, at 0.5% of the whole sample dataset.

As observed in Table 6, the cost of the claims bill increases proportionally with an increase in the sample size of the claims bill. This is consistent with the increase in fraudulent claims as the sample size increases. Table 6 shows the raw cost (R) of each sample claims dataset, the valid claims cost (V) after processing the dataset, the deviation in the claims bill (R-V), and its percentage representation. There is a 27% financial loss of the total submitted claim bills to insurance carriers, and this loss is highest within the 750-claim dataset of submitted claims.

A summary of the results and a comparison with other machine learning algorithms, such as decision trees and Naïve Bayes, is presented in Table 7.

The MATLAB Classification Learner App [43] was chosen to validate the results obtained above. It enables easy comparison with the different classification algorithms implemented. The data used for the GSVM were subsequently used in the Classification Learner App, as shown below.

Figure 18: Algorithmic runs on the 500-claim dataset.
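The same comparison can also be run from the command line; the sketch below (assumed variables features and labels, with a plain RBF SVM standing in for the full GSVM pipeline) cross-validates the three model families compared in Table 7:

```matlab
% Hedged sketch: 10-fold cross-validated accuracy of an RBF SVM, a decision
% tree, and Naive Bayes on the same claims data, mirroring Table 7's comparison.
models = {
    fitcsvm(features, labels, 'KernelFunction', 'rbf', 'Standardize', true)
    fitctree(features, labels)
    fitcnb(features, labels)
};
names = {'RBF SVM', 'Decision tree', 'Naive Bayes'};
for k = 1:numel(models)
    acc = 1 - kfoldLoss(crossval(models{k}, 'KFold', 10));
    fprintf('%-13s accuracy: %5.2f%%\n', names{k}, 100 * acc);
end
```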

Figures 17 and 18 show the Classification Learner App with the various implemented algorithms and corresponding accuracies in the MATLAB technical computing language environment and the results obtained using the 500-claim dataset, respectively. Figures 19 and 20 depict the corresponding results when the 750- and 1000-claim datasets were used for the algorithmic runs and reproducible comparison, respectively. The summarized results and accuracies are given in Table 7 and portray the effectiveness of our proposed approach of using genetic support vector machines (GSVMs) for fraud detection in insurance claims. From the results, it is evident that the GSVM achieves a higher level of accuracy compared to decision trees and Naïve Bayes.

5. Conclusions and Recommendations

This work aimed at developing a novel fraud detection model for insurance claims processing based on genetic support vector machines, which hybridizes and draws on the strengths of both genetic algorithms and support vector machines. The GSVM has been investigated and applied in the development of the HICFDS. This paper used the GSVM for the detection of anomalies and the classification of health insurance claims into legitimate and fraudulent claims. SVMs have been considered preferable to other classification techniques due to several advantages. They enable the separation (classification) of claims into legitimate and fraudulent using the soft margin, thus accommodating updates in the generalization performance of the HICFDS. Among other notable advantages, the SVM has a nonlinear dividing hyperplane, which prevails over the discrimination within the dataset, and its generalization ability on newly arrived data was considered superior to that of other classification techniques.

Figure 19: Algorithmic runs on the 750-claim dataset.

Thus, the fraud detection system combines two computational intelligence schemes and achieves higher fraud detection accuracy. The average classification accuracies achieved by the SVCs are 80.67%, 81.22%, and 87.91%, which show the performance capability of the SVC model. These classification accuracies are obtained thanks to the careful selection of the features for training and developing the model, as well as the fine-tuning of the SVC parameters using the V-fold cross-validation approach. These results are much better than those obtained using decision trees and Naïve Bayes.

The average sample dataset testing results for the proposed SVCs vary due to the nature of the claims dataset used. This is noted in the clustering of the claims dataset (MDC specialty). When the sample dataset is heavily skewed toward one MDC specialty (e.g., OPDC), the performance of the SVCs can favour one classifier, especially the linear SVM, compared to the others. Hence, the behaviour of the dataset has a significant impact on the classification results.

Based on this work, the developed GSVM model was tested and validated using HIC data. The study sought to obtain the best performing classifier for analyzing health insurance claims datasets for fraud. The RBF kernel was adjudged the best, with an average accuracy rate of 87.91%, and is therefore recommended.

Figure 20: Algorithmic runs on the 1000-claim dataset.


Data Availability

The data used in this study are available upon request and can be provided when required.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors wish to acknowledge the Carnegie Corporation of New York, through the University of Ghana under the UG-Carnegie Next Generation of Academics in Africa project, for organizing Write Shops that led to the timely completion of this paper.

Supplementary Materials

The supplementary material consists of an MS Excel file of data collected from some NHIS-approved hospitals in Ghana concerning insurance claims. It is the insurance claims dataset used for testing and implementation. (Supplementary Materials)

References

[1] Government of Ghana, National Health Insurance Act, Act 650, Ghana, 2003.
[2] Capitation, National Health Insurance Scheme, 2012, http://www.nhis.gov.gh/capitation.aspx.
[3] ICD-10 Version: 2016, http://apps.who.int/classifications/icd10/browse/2016/en.
[4] T. Olson, Examining the Transitional Impact of ICD-10 on Healthcare Fraud Detection, College of Saint Benedict/Saint John's University, Collegeville, MN, USA, 2015.
[5] News Ghana, "NHIS Manager Arrested for Fraud," News Ghana, Accra, Ghana, 2014, https://www.newsghana.com.gh/nhis-manager-arrested-for-fraud.
[6] BioClaim, Fraud Files, http://www.bioclaim.com/Fraud-Files.
[7] Graphic Online, "Dr Ametewee Defrauds NHIA of GH¢415,000," Graphic Online, Accra, Ghana, 2015, http://www.graphic.com.gh/news/general-news/dr-ametewee-defrauds-nhia-of-gh-415-000.html.
[8] W.-S. Yang and S.-Y. Hwang, "A process-mining framework for the detection of healthcare fraud and abuse," Expert Systems with Applications, vol. 31, no. 1, pp. 56–68, 2006.
[9] G. C. van Capelleveen, Outlier Based Predictors for Health Insurance Fraud Detection within US Medicaid, University of Twente, Enschede, Netherlands, 2013.
[10] Y. Shan, D. W. Murray, and A. Sutinen, "Discovering inappropriate billings with local density-based outlier detection method," in Proceedings of the Eighth Australasian Data Mining Conference, vol. 101, pp. 93–98, Melbourne, Australia, December 2009.
[11] L. D. Weiss and M. K. Sparrow, "License to steal: how fraud bleeds America's health care system," Journal of Public Health Policy, vol. 22, no. 3, pp. 361–363, 2001.
[12] P. Travaille, R. M. Muller, D. Thornton, and J. Van Hillegersberg, "Electronic fraud detection in the US Medicaid healthcare program: lessons learned from other industries," in Proceedings of the 17th Americas Conference on Information Systems (AMCIS), pp. 1–11, Detroit, Michigan, August 2011.
[13] A. Abdallah, M. A. Maarof, and A. Zainal, "Fraud detection system: a survey," Journal of Network and Computer Applications, vol. 68, pp. 90–113, 2016.
[14] A. K. I. Hassan and A. Abraham, "Computational intelligence models for insurance fraud detection: a review of a decade of research," Journal of Network and Innovative Computing, vol. 1, pp. 341–347, 2013.
[15] E. Kirkos, C. Spathis, and Y. Manolopoulos, "Data mining techniques for the detection of fraudulent financial statements," Expert Systems with Applications, vol. 32, no. 4, pp. 995–1003, 2007.
[16] H. Joudaki, A. Rashidian, B. Minaei-Bidgoli et al., "Using data mining to detect health care fraud and abuse: a review of literature," Global Journal of Health Science, vol. 7, no. 1, pp. 194–202, 2015.
[17] V. Rawte and G. Anuradha, "Fraud detection in health insurance using data mining techniques," in Proceedings of the 2015 International Conference on Communication, Information & Computing Technology (ICCICT), pp. 1–5, Mumbai, India, January 2015.
[18] J. Li, K.-Y. Huang, J. Jin, and J. Shi, "A survey on statistical methods for health care fraud detection," Health Care Management Science, vol. 11, no. 3, pp. 275–287, 2008.
[19] Q. Liu and M. Vasarhelyi, "Healthcare fraud detection: a survey and a clustering model incorporating geo-location information," in Proceedings of the 29th World Continuous Auditing and Reporting Symposium, Brisbane, Australia, November 2013.
[20] T. Ekin, F. Leva, F. Ruggeri, and R. Soyer, "Application of Bayesian methods in detection of healthcare fraud," Chemical Engineering Transactions, vol. 33, pp. 151–156, 2013.
[21] Home, The NHCAA, https://www.nhcaa.org.
[22] S. Viaene, R. A. Derrig, and G. Dedene, "A case study of applying boosting naive Bayes to claim fraud diagnosis," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 5, pp. 612–620, 2004.
[23] Y. Singh and A. S. Chauhan, "Neural networks in data mining," Journal of Theoretical and Applied Information Technology, vol. 5, no. 1, pp. 37–42, 2009.
[24] D. Tomar and S. Agarwal, "A survey on data mining approaches for healthcare," International Journal of Bio-Science and Bio-Technology, vol. 5, no. 5, pp. 241–266, 2013.
[25] P. Vamplew, A. Stranieri, K.-L. Ong, P. Christen, and P. J. Kennedy, "Data mining and analytics 2011," in Proceedings of the Ninth Australasian Data Mining Conference (AusDM 2011), Australian Computer Society, Ballarat, Australia, December 2011.
[26] K. S. Ng, Y. Shan, D. W. Murray et al., "Detecting non-compliant consumers in spatio-temporal health data: a case study from Medicare Australia," in Proceedings of the 2010 IEEE International Conference on Data Mining Workshops, pp. 613–622, Sydney, Australia, December 2010.
[27] J. F. Roddick, J. Li, P. Christen, and P. J. Kennedy, "Data mining and analytics 2008," in Proceedings of the 7th Australasian Data Mining Conference (AusDM 2008), vol. 87, pp. 105–110, Glenelg, South Australia, November 2008.
[28] C. Watrin, R. Struffert, and R. Ullmann, "Benford's Law: an instrument for selecting tax audit targets," Review of Managerial Science, vol. 2, no. 3, pp. 219–237, 2008.
[29] F. Lu and J. E. Boritz, "Detecting fraud in health insurance data: learning to model incomplete Benford's Law distributions," in Machine Learning, J. Gama, R. Camacho, P. B. Brazdil, A. M. Jorge, and L. Torgo, Eds., pp. 633–640, Springer, Berlin, Heidelberg, 2005.
[30] P. Ortega, C. J. Figueroa, and G. A. Ruz, "A medical claim fraud/abuse detection system based on data mining: a case study in Chile," DMIN, vol. 6, pp. 26–29, 2006.
[31] T. Back, J. M. De Graaf, J. N. Kok, and W. A. Kosters, Theory of Genetic Algorithms, World Scientific Publishing, River Edge, NJ, USA, 2001.
[32] M. Melanie, An Introduction to Genetic Algorithms, The MIT Press, Cambridge, MA, USA, 1st edition, 1998.
[33] D. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, Reading, MA, USA, 1989.
[34] J. Wroblewski, "Theoretical foundations of order-based genetic algorithms," Fundamenta Informaticae, vol. 28, no. 3-4, pp. 423–430, 1996.
[35] J. H. Holland, Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence, MIT Press, Cambridge, MA, USA, 1st edition, 1992.
[36] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 2nd edition, 2000.
[37] J. Salomon, Support Vector Machines for Phoneme Classification, University of Edinburgh, Edinburgh, UK, 2001.
[38] J. Platt, Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines, Microsoft Research, Redmond, WA, USA, 1998.
[39] J. Platt, "Using analytic QP and sparseness to speed training of support vector machines," in Proceedings of Advances in Neural Information Processing Systems, Cambridge, MA, USA, 1999.
[40] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, A Practical Guide to Support Vector Classification, Data Science Association, Taipei, Taiwan, 2003.
[41] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.
[42] D. Dvorski, Installing, Configuring, and Developing with XAMPP, Ski Canada Magazine, Toronto, Canada, 2007.
[43] MATLAB Classification Learner App, MATLAB Version 2019a, MathWorks, Natick, MA, USA, 2019, http://www.mathworks.com/help/stats/classification-learner-app.html.


18 Journal of Engineering

[30] P Ortega C J Figueroa and G A Ruz ldquoA medical claimfraudabuse detection system based on data mining a casestudy in Chilerdquo DMIN vol 6 pp 26ndash29 2006

[31] T Back J M De Graaf J N Kok and W A Kosters Feoryof Genetic Algorithms World Scientific Publishing RiverEdge NJ USA 2001

[32] M Melanie An Introduction to Genetic Algorithms (e MITPress Cambridge MA USA 1st edition 1998

[33] D Goldberg Genetic Algorithms in Optimization Search andMachine Learning Addison-Wesley Reading MA USA1989

[34] J Wroblewski ldquo(eoretical foundations of order-based ge-netic algorithmsrdquo Fundamental Informaticae vol 28 no 3-4pp 423ndash430 1996

[35] J H Holland Adaptation in Natural and Artificial SystemsAn Introductory Analysis with Applications to Biology Con-trol and Artificial Intelligence MIT Press Cambridge MAUSA 1st edition 1992

[36] V N Vapnik Fe Nature of Statistical Learning FeorySpringer New York NY USA 2nd edition 2000

[37] J Salomon Support Vector Machines for Phoneme Classifi-cation University of Edinburgh Edinburgh UK 2001

[38] J Platt Sequential Minimal Optimization A Fast Algorithmfor Training Support Vector Machines Microsoft ResearchRedmond WA USA 1998

[39] J Platt ldquoUsing analytic QP and sparseness to speed training ofsupport vector machinesrdquo in Proceedings of the Advances inNeural Information Processing Systems Cambridge MAUSA 1999

[40] C-W Hsu C-C Chang and C-J Lin A Practical Guide toSupport Vector Classification Data Science Association Tai-pei Taiwan 2003

[41] I Guyon and A Elisseeff ldquoAn introduction to variable andfeature selectionrdquo Journal of Machine Learning Researchvol 3 pp 1157ndash1182 2003

[42] D Dvorski Installing Configuring and Developing withXAMPP Ski Canada Magazine Toronto Canada 2007

[43] MATLAB Classification Learner App MATLAB Version 2019aMathworks Computer Software Company Natick MS USA2019 httpwwwmathworkscomhelpstatsclassification-learner-apphtml


(v) Inconclusive bills: these consist of unclassified bills.

The correct rate: this is calculated as the total number of correctly classified bills, namely the true legal bills and the true fraudulent bills, divided by the total number of bills used for the classification:

\[ \text{correct rate} = \frac{\text{number of TLB} + \text{number of TFB}}{\text{total number of bills (TB)}} \quad (25) \]

where TLB = True Legal Bills and TFB = True Fraudulent Bills.

\[ \text{accuracy} = 1 - \text{Error} = \frac{TP + TN}{TP + TN + FP + FN} = \Pr(C) \quad (26) \]

that is, the probability of a correct classification.

4.1.1. Sensitivity. This is the statistical measure of the proportion of actual fraudulent claims which are correctly detected:

\[ \text{sensitivity} = \frac{TP}{TP + FN} = \frac{TP}{P} \quad (27) \]

4.1.2. Specificity. This is the statistical measure of the proportion of negative (non-fraudulent) claims which are correctly classified:

\[ \text{specificity} = \frac{TN}{TN + FP} = \frac{TN}{N} \quad (28) \]
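To make these definitions concrete, the following minimal Python sketch (illustrative only; the authors' implementation is in MATLAB) computes the correct rate, sensitivity, and specificity directly from the TP, TN, FP, and FN counts used in equations (25)–(28).

```python
def classification_metrics(tp, tn, fp, fn):
    """Metrics of equations (25)-(28), using the paper's TP/TN/FP/FN convention."""
    correct_rate = (tp + tn) / (tp + tn + fp + fn)   # equations (25)/(26)
    sensitivity = tp / (tp + fn)                     # equation (27)
    specificity = tn / (tn + fp)                     # equation (28)
    return correct_rate, sensitivity, specificity

# Example with the RBF kernel counts for the 500-claim test split in Table 5
# (TP = 95, TN = 26, FP = 1, FN = 0): returns approximately (0.992, 1.000, 0.963).
print(classification_metrics(95, 26, 1, 0))
```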

4.2. GSVM Fraud Detection System Implementation and Testing. The decision support system comprises four main modules integrated together, namely (1) algorithm implementation using the MATLAB technical computing platform, (2) development of a graphical user interface (GUI) for the HIC fraud detection system, which covers the uploading and processing of claims, (3) system administrator management, and (4) postprocessing of the detection and classification results.

Figure 6: MATLAB-based decision support engine connection to the database.

Figure 7: Data preprocessing for SVM training and testing (stages shown: e-HIC data, creation of the claims record database, claims filtering and selection, feature selection and extraction, feature adjustment, data normalization, and the GA/SVM training and testing dataset).


The front end of the detection system was developed using XAMPP, a free and open-source cross-platform web server solution stack package developed by Apache Friends [42], consisting mainly of the Apache HTTP Server, the MariaDB database, and interpreters for scripts written in the PHP and Perl programming languages [42]. XAMPP stands for Cross-Platform (X), Apache (A), MariaDB (M), PHP (P), and Perl (P). The Health Insurance Claims Fraud Detection System (HICFDS) itself was developed in the MATLAB technical computing environment, with the capability to connect to an external MySQL database and with a graphical user interface (GUI) for enhanced interactivity with users.

Figure 8: System implementation architecture for HICFDS (data flow for the NHIS claims fraud detection system using GSVM, with components for the NHIS claims dataset, upload in the GUI, exploratory data analysis, the model and GSVM algorithm engine, the detected results, and autocreation of the results database).

Figure 9: Detection results control portal interface.

The HICFDS consists of several functional components, namely (1) a function for computing the descriptive statistics of raw and processed data, (2) a preprocessing wrapper function for data handling and processing, and (3) MATLAB functions for the GA optimization and SVM classification processes. The HICFDS components are depicted in Figure 8.

The results generated by the HICFDS are stored in a MySQL database. The results comprise three parts, which are the legitimate claims report, the fraudulent claims, and the statistics of the results. These results are shown in Figure 9. The developed GUI portal for the analysis of the results obtained from the classification of the submitted health insurance claims is displayed in Figure 9. By clicking on the fraudulent button in the GUI, a pop-up menu generating the labelled Figure 10 is obtained for the claims dataset; it shows the grouping of the detected fraudulent claim types in the datasets.

For each classifier, a 10-fold cross-validation (CV) of the hyperparameters (C, γ) from the Patients Payment Data (PPD) was performed. The GA optimization evaluated several hyperparameter settings in search of the optimal SVM. The SVC training aims for the best SVC parameters (C, γ) for building the HICFD classifier model, and the developed classifier is evaluated using testing and validation data. The accuracy of the classifier is estimated using cross-validation to avoid overfitting the SVC to the training data. The random search method was used for SVC parameter training, where exponentially growing sequences of the hyperparameters (C, γ) were used as a practical way to identify suitable SVC parameters and obtain the best CV accuracy on the claims data samples. Random search differs slightly from grid search: instead of searching over the entire grid, random search evaluates only a random sample of points on the grid, which makes it computationally cheaper than a grid search. Experimentally, 10-fold CV was used as the measure of the training accuracy, where 70% of each sample was used for training and the remaining 30% was used for testing and validation.
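As an illustration of this tuning strategy only (the authors' implementation is MATLAB-based, and X and y below are placeholders for the preprocessed claims features and labels), the following Python sketch samples (C, γ) pairs from exponentially growing ranges, scores each candidate with 10-fold cross-validation on the 70% training portion, and reports the accuracy of the best model on the 30% held-out portion.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

def random_search_svc(X, y, n_trials=30, seed=0):
    """Random search over exponentially spaced (C, gamma) with 10-fold CV."""
    rng = np.random.default_rng(seed)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    best_params, best_cv = None, -np.inf
    for _ in range(n_trials):
        C = 2.0 ** rng.uniform(-5, 15)       # exponentially growing C range (assumed bounds)
        gamma = 2.0 ** rng.uniform(-15, 3)   # exponentially growing gamma range (assumed bounds)
        cv_acc = cross_val_score(SVC(kernel="rbf", C=C, gamma=gamma),
                                 X_tr, y_tr, cv=10).mean()
        if cv_acc > best_cv:
            best_params, best_cv = (C, gamma), cv_acc
    C, gamma = best_params
    model = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_tr, y_tr)
    return model, best_cv, model.score(X_te, y_te)   # model, CV accuracy, test accuracy
```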

Figure 10: Fraud type distribution over the sample data sizes, showing the grouping of detected fraud types (duplicated, uncovered service, overbilled, unbundled, upcoded, and impersonation claims) in the 100, 300, 500, 750, and 1000 claim datasets.

Table 2: Sample data size and the corresponding fraud types (counts per sample data size: 100 / 300 / 500 / 750 / 1000).

Duplicate claims: 2 / 4 / 4 / 4 / 0
Uncovered service claims: 4 / 56 / 65 / 109 / 406
Overbilling claims: 44 / 60 / 91 / 121 / 202
Unbundled claims: 0 / 18 / 10 / 54 / 0
Upcoded claims: 2 / 34 / 50 / 122 / 6
Impersonation claims: 0 / 2 / 10 / 23 / 34
Total suspected claims: 52 / 174 / 230 / 433 / 648

Table 3: Summary performance metrics of the SVM classifiers on the sample sizes (data size: average accuracy rate (%) / sensitivity (%) / specificity (%)).

Linear kernel
100: 71.43 / 60.00 / 77.78
300: 72.73 / 84.21 / 0.00
500: 91.80 / 97.78 / 75.00
750: 84.42 / 95.00 / 47.06
1000: 82.95 / 85.42 / 80.00

Polynomial kernel
100: 71.43 / 66.67 / 72.73
300: 72.73 / 88.24 / 20.00
500: 96.72 / 100.00 / 86.67
750: 80.52 / 96.36 / 40.91
1000: 84.71 / 83.67 / 86.11

Radial basis function kernel
100: 71.43 / 57.14 / 85.71
300: 95.45 / 95.00 / 100.00
500: 99.18 / 100.00 / 96.30
750: 82.56 / 96.88 / 40.91
1000: 90.91 / 100.00 / 82.98

Figure 11: Linear SVM on a sample claims dataset, showing training and classified legal bills, training and classified fraudulent bills, and the support vectors.


4.3. Data Postprocessing: Validation of Classification Results. The classification accuracy on the testing data is a gauge for evaluating the ability of the HICFDS to detect and identify fraudulent claims. The testing data used to assess and evaluate the efficiency of the proposed HICFDS (classifier) are taken exclusively from the NHIS headquarters and cover different hospitals within the Greater Accra Region of Ghana. The sampled data with the corresponding fraud types after the analysis are shown in Table 2.

In evaluating the classifiers obtained with the analyzed methods, the most widely employed performance measures are used: accuracy, sensitivity, and specificity, built on the counts of True Legal (TP), False Fraudulent (FN), False Legal (FP), and True Fraudulent (TN) claims. This classification is shown in Table 3.

The figures show the SVC plots for the various classifiers (linear, polynomial, and RBF) on the claims datasets (Figures 11–13).

Figure 12: Polynomial SVM on a sample claims dataset (same legend as Figure 11).

Figure 13: RBF SVM on a sample claims dataset (same legend as Figure 11).

Table 4: Average performance analysis of the SVM classifiers (accuracy (%) / sensitivity (%) / specificity (%)).

Linear: 80.67 / 84.48 / 55.97
Polynomial: 81.22 / 86.99 / 61.28
RBF: 87.91 / 89.80 / 81.18

Table 5: Confusion matrix for the SVM classifiers (data size: TP / TN / FP / FN / correct rate).

Linear kernel
100: 3 / 7 / 2 / 2 / 0.714
300: 16 / 0 / 3 / 3 / 0.713
500: 88 / 24 / 8 / 2 / 0.918
750: 57 / 8 / 9 / 3 / 0.844
1000: 41 / 32 / 8 / 7 / 0.830

Polynomial kernel
100: 2 / 8 / 3 / 1 / 0.714
300: 15 / 1 / 4 / 2 / 0.723
500: 92 / 26 / 4 / 0 / 0.967
750: 53 / 9 / 13 / 2 / 0.805
1000: 41 / 31 / 5 / 8 / 0.852

Radial basis function kernel
100: 4 / 6 / 1 / 3 / 0.714
300: 19 / 2 / 0 / 1 / 0.955
500: 95 / 26 / 1 / 0 / 0.992
750: 62 / 9 / 13 / 2 / 0.922
1000: 41 / 39 / 8 / 0 / 0.919


From the performance metrics and overall statistics presented in Table 4, it is observed that the support vector machine performs the best classification with an accuracy of 87.91% using the RBF kernel function, followed by the polynomial kernel with 81.22% accuracy, with the linear SVM emerging as the lowest-performing classifier with an accuracy of 80.67%. The confusion matrix for the SVM classifiers is given in Table 5 and is utilized in the computation of the performance metrics of the SVM classifiers. For statistical and machine learning classification tasks, a confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of a supervised learning algorithm.
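For readers reproducing this tabulation, the counts in Table 5 can be obtained from per-claim predictions as in the short sketch below (the helper name and the labeling convention, with 1 denoting a legal bill and 0 a fraudulent bill to match the paper's True Legal/True Fraudulent terminology, are assumptions of this illustration).

```python
from sklearn.metrics import confusion_matrix

def table5_row(y_true, y_pred):
    """Counts for one Table 5 row; label 1 = legal bill, label 0 = fraudulent bill."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    correct_rate = (tp + tn) / (tp + tn + fp + fn)   # the "correct rate" column
    return tp, tn, fp, fn, correct_rate
```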

Besides classification accuracy, the amount of time required to process the sample dataset is also an important consideration in this research. The comparison of computational times shows that an increase in the size of the sample dataset also increases the computational time needed to execute the process, regardless of the machine used, which is widely expected. This difference in time cost is due to training on the larger dataset. Thus, as global data warehouses grow, more computational resources will be needed in machine learning and data mining research pertaining to the detection of insurance fraud, as depicted in Figure 14, which relates the average computational time to the sample data size.

Figure 15 summarizes the fraudulent claims detected during the testing of the HICFD with the sample datasets used. As the sample data size increases, the number of suspected claims increases rapidly across the various fraud types detected.

Benchmarking the HICFD analysis ensures an understanding of the HIC outcomes. As the chart shows, an increase in the claims dataset has a corresponding increase in the number of suspected claims. The graph in Figure 16 shows a sudden rise in the level of suspected claims for the tested 100-claim dataset, representing 52% of that sample, after which the proportion of suspected claims increases slightly by 2%, to 58%, for the tested data size of 300 claims.

Among these fraud types, the most frequent fraudulent act is uncovered services rendered to insurance subscribers by service providers. It accounts for 22% of the fraudulent claims, the most significant proportion of the total health insurance fraud in the tested dataset. Overbilling of submitted claims is the second most common fraudulent claims type, representing 20% of the total sample dataset used for this research. It is caused by service providers billing for a service at more than the expected tariff for the required diagnoses. Listing and billing for a more complex or higher level of service are done by providers to unfairly boost their financial income flow within otherwise legitimate claims.

Figure 14: Computational time on the tested sample datasets (average computational time in seconds versus sample data size).

Figure 15: Detected fraud trend on the tested claims datasets (number of suspected claims versus sample data size).

Figure 16: Chart of the types of fraudulent claims (duplicated, uncovered service, overbilled, unbundled, upcoded, and impersonation claims) per sample dataset.

Table 6: Cost analysis of the tested claims datasets (sample data size: raw cost of claims R (GHC) / valid claims cost V (GHC) / deviation R − V (GHC) / percentage difference (%)).

100: 20,791.83 / 8,911.72 / 11,880.11 / 133.31
300: 31,496.05 / 15,622.70 / 15,873.35 / 101.60
500: 58,218.65 / 27,480.96 / 30,737.69 / 111.85
750: 88,394.07 / 31,091.58 / 57,302.49 / 184.30
1000: 117,448.20 / 47,943.38 / 69,504.82 / 144.97
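The derivation of the last two columns is not spelled out in the text, but it is consistent with the tabulated values: the deviation is the raw cost minus the valid cost, and the percentage difference expresses that deviation relative to the valid claims cost,

\[ \text{deviation} = R - V, \qquad \text{percentage difference} = \frac{R - V}{V} \times 100\%. \]

For the 750-claim dataset, for example, \(88{,}394.07 - 31{,}091.58 = 57{,}302.49\) and \(57{,}302.49 / 31{,}091.58 \times 100\% \approx 184.30\%\).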


Figure 17: Classification Learner App showing the various algorithms and percentage accuracies in MATLAB.

Table 7: Comparison of the results of the GSVM with decision trees and Naïve–Bayes (claims dataset: accuracy (%) obtained with the corresponding dataset; average value over the different datasets).

GSVM with radial basis function (RBF) kernel
100: 71.43
300: 95.45
500: 99.18
750: 82.56
1000: 90.91
Average: 87.906

Decision trees
100: 62
300: 78
500: 77.8
750: 82.7
1000: 71.7
Average: 74.44

Naïve–Bayes
100: 50
300: 61
500: 56.8
750: 60.7
1000: 67
Average: 59.1


Moreover, some illicit service providers claim to have rendered costly services to insurance subscribers instead of the more affordable ones actually provided. Claims prepared for expensive services supposedly rendered to insurance subscribers represent 8% of the fraudulent claims detected in the total sample dataset. Furthermore, 3.1% of the total dataset consists of service procedures that should be considered an integral part of a single procedure, known as unbundled claims, which contributed to the fraudulent claims in the set used as test data. Due to the insecure process for quality delivery of healthcare services, insurance subscribers also contribute to fraudulent claims by loaning their ID cards to family members or third parties, who pretend to be the owners and request health insurance benefits in the healthcare sector. Duplicated claims recorded the minimum rate, contributing 0.5% of the fraudulent claims in the whole sample dataset.

As observed in Table 6, the cost of the claims bill increases proportionally with an increase in the sample size of the claims bill. This is consistent with the increase in fraudulent claims as the sample size increases. Table 6 lists, for each sample, the raw cost (R) of the claims dataset, the valid claims cost (V) after processing the dataset, the deviation in the claims bill (R − V), and its percentage representation. There is a 27% financial loss of the total submitted claims bill to insurance carriers; this loss is the highest rate of loss, occurring within the 750-claim dataset of submitted claims.

A summary of the results and a comparison with other machine learning algorithms, such as decision trees and Naïve–Bayes, are presented in Table 7.

The MATLAB Classification Learner App [43] was chosen to validate the results obtained above. It enables an easy comparison among the different classification algorithms implemented.

Figure 18: Algorithmic runs on the 500-claim dataset.


The data used for the GSVM were subsequently used in the Classification Learner App, as shown below.

Figures 17 and 18 show the Classification Learner App with the various implemented algorithms and corresponding accuracies in the MATLAB technical computing environment and the results obtained using the 500-claim dataset, respectively. Figures 19 and 20 depict the subsequent results when the 750- and 1000-claim datasets were utilized for the algorithmic runs and reproducible comparison, respectively. The summarized results and accuracies are given in Table 7 and portray the effectiveness of our proposed approach of using genetic support vector machines (GSVMs) for the fraud detection of insurance claims. From the results, it is evident that the GSVM achieves a higher level of accuracy compared with decision trees and Naïve–Bayes.
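A comparable cross-check can be scripted outside MATLAB; the hedged Python sketch below compares an RBF SVM against a decision tree and Naïve–Bayes with 10-fold cross-validation, where X and y again stand for the preprocessed claims features and labels (placeholders, not the published dataset).

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def compare_models(X, y):
    """Mean 10-fold CV accuracy of an RBF SVM, a decision tree, and Naive Bayes."""
    models = {
        "RBF SVM": SVC(kernel="rbf", C=1.0, gamma="scale"),
        "Decision tree": DecisionTreeClassifier(random_state=0),
        "Naive Bayes": GaussianNB(),
    }
    return {name: cross_val_score(m, X, y, cv=10).mean() for name, m in models.items()}
```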

5. Conclusions and Recommendations

This work aimed at developing a novel fraud detection model for insurance claims processing based on genetic support vector machines, which hybridize and draw on the strengths of both genetic algorithms and support vector machines. The GSVM has been investigated and applied in the development of the HICFDS. This paper used the GSVM for the detection of anomalies and the classification of health insurance claims into legitimate and fraudulent claims. SVMs have been considered preferable to other classification techniques due to several advantages: they enable the separation (classification) of claims into legitimate and fraudulent using the soft margin, thus accommodating updates in the generalization performance of the HICFDS.

Figure 19: Algorithmic runs on the 750-claim dataset.


Among its other notable advantages, the classifier has a nonlinear dividing hyperplane, which prevails over the discrimination within the dataset, and its ability to generalize to newly arriving data was considered superior to that of other classification techniques.

Thus, the fraud detection system combines two computational intelligence schemes and achieves higher fraud detection accuracy. The average classification accuracies achieved by the SVCs are 80.67%, 81.22%, and 87.91%, which show the performance capability of the SVC models. These classification accuracies are obtained through the careful selection of the features for training and developing the model, as well as by fine-tuning the SVCs' parameters using the V-fold cross-validation approach. These results are much better than those obtained using decision trees and Naïve–Bayes.
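To make the hybrid of the two schemes concrete, the sketch below shows one plausible way a genetic algorithm can drive the SVM hyperparameter search (a simplified illustration under assumed GA settings such as tournament selection, uniform crossover, and Gaussian mutation, not the authors' exact MATLAB implementation): each chromosome encodes log2(C) and log2(γ), and the fitness is the 10-fold cross-validation accuracy.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def fitness(chrom, X, y):
    """Chromosome = [log2(C), log2(gamma)]; fitness = mean 10-fold CV accuracy."""
    clf = SVC(kernel="rbf", C=2.0 ** chrom[0], gamma=2.0 ** chrom[1])
    return cross_val_score(clf, X, y, cv=10).mean()

def ga_svm(X, y, pop_size=20, generations=10, seed=1):
    """Toy GA over (C, gamma); returns the best hyperparameters found."""
    rng = np.random.default_rng(seed)
    pop = np.column_stack([rng.uniform(-5, 15, pop_size),    # log2(C)
                           rng.uniform(-15, 3, pop_size)])   # log2(gamma)
    for _ in range(generations):
        scores = np.array([fitness(ind, X, y) for ind in pop])
        children = []
        for _ in range(pop_size):
            # tournament selection of two parents
            i, j = rng.choice(pop_size, size=2, replace=False)
            p1 = pop[i] if scores[i] >= scores[j] else pop[j]
            i, j = rng.choice(pop_size, size=2, replace=False)
            p2 = pop[i] if scores[i] >= scores[j] else pop[j]
            mask = rng.random(2) < 0.5                                  # uniform crossover
            child = np.where(mask, p1, p2) + rng.normal(0.0, 0.5, 2)    # Gaussian mutation
            children.append(child)
        pop = np.array(children)
    scores = np.array([fitness(ind, X, y) for ind in pop])
    best = pop[int(scores.argmax())]
    return 2.0 ** best[0], 2.0 ** best[1]   # best (C, gamma)
```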

The average sample dataset testing results for the proposed SVCs vary due to the nature of the claims dataset used. This is noted in the cluster of the claims dataset (MDC specialty). When the sample dataset is heavily skewed toward one MDC specialty (e.g., OPDC), the performance of the SVCs could tune to one classifier, especially the linear SVM, as compared with the others. Hence, the behaviour of the dataset has a significant impact on the classification results.

Based on this work, the developed GSVM model was tested and validated using HIC data. The study sought to obtain the best-performing classifier for analyzing health insurance claims datasets for fraud. The RBF kernel was adjudged the best, with an average accuracy rate of 87.91%, and is therefore recommended.

Figure 20: Algorithmic runs on the 1000-claim dataset.


Data Availability

The data used in this study are available upon request. The data can be uploaded when required.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors of this paper wish to acknowledge the Carnegie Corporation of New York, through the University of Ghana under the UG-Carnegie Next Generation of Academics in Africa project, for organizing the Write Shops that led to the timely completion of this paper.

Supplementary Materials

The material consists of an MS Excel file of data collected from some NHIS-approved hospitals in Ghana concerning insurance claims. It is the insurance claims dataset used for testing and implementation. (Supplementary Materials)

References

[1] Government of Ghana, National Health Insurance Act, Act 650, Ghana, 2003.
[2] Capitation, National Health Insurance Scheme, 2012, http://www.nhis.gov.gh/capitation.aspx.
[3] ICD-10 Version: 2016, http://apps.who.int/classifications/icd10/browse/2016/en.
[4] T. Olson, Examining the Transitional Impact of ICD-10 on Healthcare Fraud Detection, College of Saint Benedict/Saint John's University, Collegeville, MN, USA, 2015.
[5] News Ghana, NHIS Manager Arrested for Fraud, News Ghana, Accra, Ghana, 2014, https://www.newsghana.com.gh/nhis-manager-arrested-for-fraud.
[6] BioClaim, Fraud Files, http://www.bioclaim.com/Fraud-Files.
[7] Graphic Online, Dr. Ametewee Defrauds NHIA of GH¢415,000, Graphic Online, Accra, Ghana, 2015, http://www.graphic.com.gh/news/general-news/dr-ametewee-defrauds-nhia-of-gh-415-000.html.
[8] W.-S. Yang and S.-Y. Hwang, "A process-mining framework for the detection of healthcare fraud and abuse," Expert Systems with Applications, vol. 31, no. 1, pp. 56–68, 2006.
[9] G. C. van Capelleveen, Outlier Based Predictors for Health Insurance Fraud Detection within US Medicaid, University of Twente, Enschede, Netherlands, 2013.
[10] Y. Shan, D. W. Murray, and A. Sutinen, "Discovering inappropriate billings with local density-based outlier detection method," in Proceedings of the Eighth Australasian Data Mining Conference, vol. 101, pp. 93–98, Melbourne, Australia, December 2009.
[11] L. D. Weiss and M. K. Sparrow, "License to steal: how fraud bleeds America's health care system," Journal of Public Health Policy, vol. 22, no. 3, pp. 361–363, 2001.
[12] P. Travaille, R. M. Muller, D. Thornton, and J. Van Hillegersberg, "Electronic fraud detection in the US Medicaid healthcare program: lessons learned from other industries," in Proceedings of the 17th Americas Conference on Information Systems (AMCIS), pp. 1–11, Detroit, Michigan, August 2011.
[13] A. Abdallah, M. A. Maarof, and A. Zainal, "Fraud detection system: a survey," Journal of Network and Computer Applications, vol. 68, pp. 90–113, 2016.
[14] A. K. I. Hassan and A. Abraham, "Computational intelligence models for insurance fraud detection: a review of a decade of research," Journal of Network and Innovative Computing, vol. 1, pp. 341–347, 2013.
[15] E. Kirkos, C. Spathis, and Y. Manolopoulos, "Data mining techniques for the detection of fraudulent financial statements," Expert Systems with Applications, vol. 32, no. 4, pp. 995–1003, 2007.
[16] H. Joudaki, A. Rashidian, B. Minaei-Bidgoli et al., "Using data mining to detect health care fraud and abuse: a review of literature," Global Journal of Health Science, vol. 7, no. 1, pp. 194–202, 2015.
[17] V. Rawte and G. Anuradha, "Fraud detection in health insurance using data mining techniques," in Proceedings of the 2015 International Conference on Communication, Information & Computing Technology (ICCICT), pp. 1–5, Mumbai, India, January 2015.
[18] J. Li, K.-Y. Huang, J. Jin, and J. Shi, "A survey on statistical methods for health care fraud detection," Health Care Management Science, vol. 11, no. 3, pp. 275–287, 2008.
[19] Q. Liu and M. Vasarhelyi, "Healthcare fraud detection: a survey and a clustering model incorporating geo-location information," in Proceedings of the 29th World Continuous Auditing and Reporting Symposium, Brisbane, Australia, November 2013.
[20] T. Ekin, F. Leva, F. Ruggeri, and R. Soyer, "Application of Bayesian methods in detection of healthcare fraud," Chemical Engineering Transactions, vol. 33, pp. 151–156, 2013.
[21] Home: The NHCAA, https://www.nhcaa.org.
[22] S. Viaene, R. A. Derrig, and G. Dedene, "A case study of applying boosting naive Bayes to claim fraud diagnosis," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 5, pp. 612–620, 2004.
[23] Y. Singh and A. S. Chauhan, "Neural networks in data mining," Journal of Theoretical and Applied Information Technology, vol. 5, no. 1, pp. 37–42, 2009.
[24] D. Tomar and S. Agarwal, "A survey on data mining approaches for healthcare," International Journal of Bio-Science and Bio-Technology, vol. 5, no. 5, pp. 241–266, 2013.
[25] P. Vamplew, A. Stranieri, K.-L. Ong, P. Christen, and P. J. Kennedy, "Data mining and analytics 2011," in Proceedings of the Ninth Australasian Data Mining Conference (AusDM 2011), Australian Computer Society, Ballarat, Australia, December 2011.
[26] K. S. Ng, Y. Shan, D. W. Murray et al., "Detecting non-compliant consumers in spatio-temporal health data: a case study from Medicare Australia," in Proceedings of the 2010 IEEE International Conference on Data Mining Workshops, pp. 613–622, Sydney, Australia, December 2010.
[27] J. F. Roddick, J. Li, P. Christen, and P. J. Kennedy, "Data mining and analytics 2008," in Proceedings of the 7th Australasian Data Mining Conference (AusDM 2008), vol. 87, pp. 105–110, Glenelg, South Australia, November 2008.
[28] C. Watrin, R. Struffert, and R. Ullmann, "Benford's Law: an instrument for selecting tax audit targets," Review of Managerial Science, vol. 2, no. 3, pp. 219–237, 2008.
[29] F. Lu and J. E. Boritz, "Detecting fraud in health insurance data: learning to model incomplete Benford's Law distributions," in Machine Learning, J. Gama, R. Camacho, P. B. Brazdil, A. M. Jorge, and L. Torgo, Eds., pp. 633–640, Springer, Berlin, Heidelberg, 2005.
[30] P. Ortega, C. J. Figueroa, and G. A. Ruz, "A medical claim fraud/abuse detection system based on data mining: a case study in Chile," DMIN, vol. 6, pp. 26–29, 2006.
[31] T. Back, J. M. De Graaf, J. N. Kok, and W. A. Kosters, Theory of Genetic Algorithms, World Scientific Publishing, River Edge, NJ, USA, 2001.
[32] M. Melanie, An Introduction to Genetic Algorithms, The MIT Press, Cambridge, MA, USA, 1st edition, 1998.
[33] D. Goldberg, Genetic Algorithms in Optimization, Search and Machine Learning, Addison-Wesley, Reading, MA, USA, 1989.
[34] J. Wroblewski, "Theoretical foundations of order-based genetic algorithms," Fundamenta Informaticae, vol. 28, no. 3-4, pp. 423–430, 1996.
[35] J. H. Holland, Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence, MIT Press, Cambridge, MA, USA, 1st edition, 1992.
[36] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 2nd edition, 2000.
[37] J. Salomon, Support Vector Machines for Phoneme Classification, University of Edinburgh, Edinburgh, UK, 2001.
[38] J. Platt, Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines, Microsoft Research, Redmond, WA, USA, 1998.
[39] J. Platt, "Using analytic QP and sparseness to speed training of support vector machines," in Proceedings of the Advances in Neural Information Processing Systems, Cambridge, MA, USA, 1999.
[40] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, A Practical Guide to Support Vector Classification, Data Science Association, Taipei, Taiwan, 2003.
[41] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.
[42] D. Dvorski, Installing, Configuring, and Developing with XAMPP, Ski Canada Magazine, Toronto, Canada, 2007.
[43] MATLAB Classification Learner App, MATLAB Version 2019a, MathWorks, Natick, MA, USA, 2019, http://www.mathworks.com/help/stats/classification-learner-app.html.

Journal of Engineering 19

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 10: DecisionSupportSystem(DSS)forFraudDetectioninHealth ...downloads.hindawi.com/journals/je/2019/1432597.pdf · ResearchArticle DecisionSupportSystem(DSS)forFraudDetectioninHealth InsuranceClaimsUsingGeneticSupportVector

(e front end of the detection system was developedusing XAMPP a free and open-source cross-platform webserver solution stack package developed by Apache Friends[42] consisting mainly of the Apache HTTP ServerMariaDB database and interpreters for scripts written in the

PHP and Perl programming languages [42] XAMPP standsfor Cross-Platform (X) Apache (A) MariaDB (M) PHP (P)and Perl (P) (e Health Insurance Claims Fraud DetectionSystem (HICFDS) was developed using MATLAB technicalcomputing environment with the capability to connect to an

MySQL database

Autocreation of resultsdatabase

Data flowchart for NHIS claims fraud detectionsystem process

Upload in GUI

Model amp GSVMalgorithm

Detected results

DEVELOPED GUI

Engine

NHIS claims dataset

NHIS FRAUD DETECTION SYSTEM USING GSVM

Exploratorydata

analysis

Figure 8 System implementation architecture for HICFDS

Figure 9 Detection results control portal interface

10 Journal of Engineering

external database and a graphical user interface (GUI) forenhanced interactivity with users (e HICFDS consists ofseveral functional components namely (1) function forcomputing the descriptive statistics of raw and processeddata (2) preprocessing wrapper function for data handlingand processing and (3) MATLAB functions for GA Opti-mization and SVM Classification processes (e HICFDScomponents are depicted in Figure 8

(e results generated by the HICFDS are stored inMYSQL database (e results comprise three parts whichare legitimate claims report fraudulent claims and statisticsof the results (ese results are shown in Figure 9 (edeveloped GUI portal for the analysis of results obtainedfrom the classification of the submitted health insuranceclaims is displayed in Figure 9 By clicking on the fraudulentbutton in the GUI a pop-up menu generating the labelledFigure 10 is obtained for the claims dataset It shows thegrouping of detected fraudulent claim types in the datasets

For each classifier a 10-fold cross validation (CV) ofhyperparameters (C γ) from Patients Payment Data (PPD)was performed (e performance measured on GA optimi-zation tested several hyperparameters for the optimal SVM(e SVC training aims for the best SVC parameters (C c) inbuilding the HICFD classifier model (e developed classifieris evaluated using testing and validation data (e accuracy ofthe classifier is evaluated using cross validation (CV) to avoidoverfitting of SVC during training data (e random searchmethod was used for SVC parameter training where expo-nentially growing sequences of hyperparameters (C c) as apractical method to identify suitable parameters were used toidentify SVC parameters and obtain the best CV accuracy forthe classifier claims data samples Random search slightlyvaries from grid search Instead of searching over the entiregrid random search only evaluates a random sample of pointson the grid (is makes the random search a computational

method cheaper than a grid search Experimentally 10-foldCV was used as the measure of the training accuracy where70 of each sample was used for training and the remaining30 used for testing and validation

0Duplicated

claimsUncovered

serviceclaims

Grouping of detected frauds type in sample datasets

Overbilledclaims

Unbundledclaims

Fraud types

Sam

ple d

atas

et

Upcodedclaims

Impersonationclaims

50

100

150

200

250

300

350

400

100 dataset

300 dataset

500 dataset

750 dataset

1000 dataset

Figure 10 Fraud type distribution on the sample data sizes

Table 2 Sample data size and the corresponding fraud types

Fraud typesSample data size

100 300 500 750 1000Duplicate claims 2 4 4 4 0Uncovered service claims 4 56 65 109 406Overbilling claims 44 60 91 121 202Unbundled claims 0 18 10 54 0Upcoded claims 2 34 50 122 6Impersonation claims 0 2 10 23 34Total suspected claims 52 174 230 433 648

Table 3 Summary performance metrics of SVM classifiers onsamples sizes

Description

Kernelsused Data size

Averageaccuracyrate ()

Sensitivity() Specificity ()

Linear

100 7143 6000 7778300 7273 8421 000500 9180 9778 7500750 8442 9500 47061000 8295 8542 8000

Polynomial

100 7143 6667 7273300 7273 8824 2000500 9672 10000 8667750 8052 9636 40911000 8471 8367 8611

Radialbasisfunction

100 7143 5714 8571300 9545 9500 10000500 9918 10000 9630750 8256 9688 40911000 9091 10000 8298

00 20 40 60 80 100 120 140 160

10

20

30

40

50

60

70

80

90

100

Legal bills (training)Legal bills (classified)Fraudulent bills (training)

Fraudulent bills (classified)Support vectors

Figure 11 Linear SVM on a sample claims dataset

Journal of Engineering 11

43 Data Postprocessing Validation of Classification Results(e classification accuracy of the testing data is a gauge toevaluate the ability of the HICFDS to detect and identifyfraudulent claims (e testing data used to assess and

evaluate the efficiency of the proposed HICFDS (classifier)are taken exclusively from NHIS headquarters and coversdifferent hospitals within the Greater Accra Region ofGhana Sampled data with the corresponding fraud typesafter the analysis are shown in Table 2

In evaluating the classifiers obtained with the analyzedmethods the most widely employed performance measuresare used accuracy sensitivity and specificity with theirconcepts of True Legal (TP) False Fraudulent (FN) FalseLegal (FP) and True Fraudulent (TN) (is classification isshown in Table 3

(e figures below show the SVC plots on the variousclassifiers (linear polynomial and RBF) on the claimsdatasets (Figures 11ndash13)

From the performance metrics and overall statisticspresented in Table 4 it is observed that the support vectormachine performs better classification with an accuracy of8791 using the RBF kernel function followed by the

00 20 40 60 80 100 120 140 160

10

20

30

40

50

60

70

80

90

100

Legal bills (training)Legal bills (classified)Fraudulent bills (training)

Fraudulent bills (classified)Support vectors

Figure 12 Polynomial SVM on a sample claims dataset

00 20 40 60 80 100 120 140 160

10

20

30

40

50

60

70

80

90

100

Legal bills (training)Legal bills (classified)Fraudulent bills (training)

Fraudulent bills (classified)Support vectors

Figure 13 RBF SVM on a sample claims dataset

Table 4 Averages performance analysis of SVM classifiers

Description Accuracy Sensitivity SpecificityLinear 8067 8448 5597Polynomial 8122 8699 6128RBF 8791 8980 8118

Table 5 Confusion matrix for SVM classifiers

Description Data size TP TN FP FN Correct rate

Linear

100 3 7 2 2 714300 16 0 3 3 713500 88 24 8 2 918750 57 8 9 3 8441000 41 32 8 7 830

Polynomial

100 2 8 3 1 714300 15 1 4 2 723500 92 26 4 0 967750 53 91 13 2 8051000 41 31 5 8 852

Radial basis function

100 4 6 1 3 714300 19 2 0 1 955500 95 26 1 0 992750 62 9 13 2 9221000 41 39 8 0 919

12 Journal of Engineering

polynomial kernel with 8122 accuracy and hence linearSVM emerging as the least performance classifier with anaccuracy of 8067 (e confusion matrix for the SSVMclassifiers is given in Table 5 where i utilized in the com-putation of the performance metric of the SVM classifiersFor the purpose of statistical and machine learning classi-fication tasks a confusion matrix also known as an errormatrix is a specific table layout that allows visualization ofthe performance of a supervised learning algorithm

Besides classification the amount of time required inprocessing the sample dataset is also an important con-sideration in this research From the above the comparedcomputational time shows that increase in the size of thesample dataset also increases the computational time neededto execute the process regardless of the machine used whichis widely expected(is difference in time costs is merely dueto the cause of training the dataset (us as global datawarehouse increases more computational resources will beneeded in machine learning and data mining researchpertaining to the detection of insurance fraud as depicted inFigure 14 relating the average computational time andsample data

Figure 15 summarizes the fraudulent claims detectedduring the testing of the HICFD with the sample datasetused As the sample data size increases the number ofsuspected claims increases rapidly based on the variousfraudulent types detected

Benchmarking HICFD analysis ensures understandingof HIC outcomes From the chart above an increase in theclaims dataset has a corresponding increase in the number ofsuspected claims(e graph in Figure 16 shows a sudden risein the level of suspected claims on tested 100 datasets rep-resenting 52 of the sample dataset after which it continuesto increase slightly on the suspected numbers of claims by2 to make up 58 on the tested data size of 300 claims

Among these fraud types the most frequent fraudulentact is uncovered services rendered to insurance subscribersby service providers It accounts for 22 of the fraudulentclaims as to the most significant proportion of the totalhealth insurance fraud on the total tested dataset Conse-quently overbilling of submitted claims is recorded as thesecond fraudulent claims type representing 20 of the totalsample dataset used for this research (is is caused byservice providers billing for a service greater than the ex-pected tariff to the required diagnoses Listing and billing fora more complex or higher level of service by providers aredone to boost their financial income flow unfairly in thelegitimate claims

500Average computational time

400

300

200

Tim

e (s)

1000 200 400 600

Sample data size800 1000

Figure 14 Computational time on the tested sample dataset

800

Susp

ecte

d cl

aim

s

600

400

200

0

Sample data size0 200 400 600 800 1000

Figure 15 Detected fraud trend on the tested claims dataset

0100 dataset

2 4 4

5660

1834

2

65

91

50

10 104

44

0 2 0300 dataset 500 dataset 750 dataset 1000 dataset

50

100

150

200

250

300

350

400

Duplicated claimsUncovered serviceOverbilled claims

UnbundledUpcodedImpersonation

121122109

4

54

2300 6

34

202

406

Figure 16 Chart of types of fraudulent claims

Table 6 Cost analysis of tested claims dataset

Sampledatasize

Raw costof claims

(R)GHC

Validclaimscost (V)

Deviation(RndashV)

Percentagedifference

100 2079183 891172 1188011 13331300 3149605 156227 1587335 10160500 5821865 2748096 3073769 11185750 8839407 3109158 5730249 184301000 1174482 4794338 6950482 14497

Journal of Engineering 13

Figure 17 Classification Learner App showing the various algorithms and percentage accuracies in MATLAB

Table 7 Comparison of results of GSVM with decision trees and NaıvendashBayes

Description of the algorithm used Claims dataset Accuracy obtained with the correspondingdataset

Average value over differentdatasets

GSVM with radial basis function (RBF) kernel

100 7143

87906300 9545500 9918750 82561000 9091

Decision trees

100 62

7444300 78500 778750 8271000 717

NaıvendashBayes

100 50

591300 61500 568750 6071000 67

14 Journal of Engineering

Moreover some illicit service providers claim to haverendered service to insurance subscribers on costly servicesinstead of providing more affordable ones Claims preparedon expensive service rendered to insurance subscribersrepresent 8 of the fraudulent claims detected on the totalsample dataset Furthermore 31 of service procedure thatshould be considered an integral part of a single procedureknown as the unbundle claims contributed to the fraudulentclaims of the set of claims dataset used as the test data Due tothe insecure process for quality delivery of healthcare ser-vice insurance subscribers are also contributing to thefraudulent type of claims by loaning their ID cards to familymembers of the third party who pretend to be owners andrequest for the HIS benefits in the healthcare sector Du-plicated claims as part of the fraudulent act recorded theminimum rate of 05 of contribution to fraudulent claimsin the whole sample dataset

As observed in Table 6 the cost of the claims bill in-creases proportionally with an increase in the sample size ofthe claims bill (is is consistent with an increase infraudulent claims as sample size increases From Table 6 wecan see the various costs for each raw record (R) of sampleclaim dataset Valid claims bill after processing dataset thevariation in the claims bill (RndashV) and their percentagerepresentation as well are illustrated in Table 6 (ere is a27 financial loss of the total submitted claim bills to in-surance carriers(is loss is the highest rate of loss within the750 datasets of submitted claims

Summary of results and comparison with other machinelearning algorithms such as decision trees and NaıvendashBayesis presented in Table 7

(e MATLAB Classification Learner App [43] waschosen to validate the results obtained above It enables easeof comparison with the different methods of classification

Figure 18 Algorithmic runs on 500-claim dataset

Journal of Engineering 15

algorithms implemented (e data used for the GSVM weresubsequently used in the Classification Learner App asshown below

Figures 17 and 18 show the classification learner appwith the various implemented algorithms and corre-sponding accuracies in MATLAB technical computinglanguage environment and the results obtained using the500-claim dataset respectively Figures 19 and 20 depict thesubsequent results when the 750- and 1000-claim datasetswere utilized for the algorithmic runs and reproduciblecomparison respectively (e summarized results and ac-curacies are illustrated in Table 7 (e summarized results inTable 7 portray the effectiveness of our proposed approach ofusing the genetic support vector machines (GSVMs) forfraud detection of insurance claims From the result it isevident that GSVM achieves a higher level of accuracycompared to decision trees and NaıvendashBayes

5 Conclusions and Recommendations

(is work aimed at developing a novel fraud detectionmodel for insurance claims processing based on geneticsupport vector machines which hybridizes and draws onthe strengths of both genetic algorithms and supportvector machines (e GSVM has been investigated andapplied in the development of HICFDS (is paper usedGSVM for detection of anomalies and classification ofhealth insurance claims into legitimate and fraudulentclaims SVMs have been considered preferable to otherclassification techniques due to several advantages (eyenable separation (classification) of claims into legitimateand fraudulent using the soft margin thus accommodatingupdates in the generalization performance of HICFDSWith other notable advantages it has a nonlinear dividing

Figure 19 Algorithmic runs on 750-claim dataset

16 Journal of Engineering

hyperplane which prevails over the discrimination withinthe dataset (e generalization ability of any newly arriveddata for classification was considered over other classifi-cation techniques

(us the fraud detection system provides a combinationof two computational intelligence schemes and achieveshigher fraud detection accuracy (e average classificationaccuracies achieved by the SVCs are 8067 8122 and8791 which show the performance capability of the SVCsmodel(ese classification accuracies are obtained due to thecareful selection of the features for training and developingthe model as well as fine-tuning the SVCsrsquo parameters usingtheV-fold cross-validation approach(ese results are muchbetter than those obtained using decision trees andNaıvendashBayes

(e average sample dataset testing results for theproposed SVCs vary due to the nature of the claims dataset

used (is is noted in the cluster of the claims dataset(MDC specialty) When the sample dataset is muchskewed to one MDC specialty (eg OPDC) the perfor-mance of the SVCs could tune to one classifier especiallythe linear SVM as compared to others Hence the be-haviour of the dataset has a significant impact on clas-sification results

Based on this work the developed GSVM model wastested and validated using HIC data (e study sought toobtain the best performing classifier for analyzing the healthinsurance claims datasets for fraud (e RBF kernel wasadjudged the best with an average accuracy rate of 8791(e RBF kernel is therefore recommended

Figure 20 Algorithmic runs on the 1000-claim dataset

Journal of Engineering 17

Data Availability

(e data used in this study are available upon request (edata can be uploaded when required

Conflicts of Interest

(e authors declare that they have no conflicts of interest

Acknowledgments

(e authors of this paper wish to acknowledge the CarnegieCorporation of New York through the University of Ghanaunder the UG-Carnegie Next Generation of Academics inAfrica project for organizing Write Shops that led to thetimely completion of this paper

Supplementary Materials

(e material consists of MS Excel file data collected fromsome NHIS-approved hospitals in Ghana concerning in-surance claims Its insurance claims dataset used for testingand implementation (Supplementary Materials)

References

[1] G of Ghana National Health Insurance Act Act 650 2003Ghana 2003

[2] Capitation National Health Insurance Scheme 2012 httpwwwnhisgovghcapitationaspx

[3] ICD-10 Version2016 httpappswhointclassificationsicd10browse2016en

[4] T Olson Examining the Transitional Impact of ICD-10 onHealthcare Fraud Detection College of Saint BenedictSaintJohnrsquos University Collegeville MN USA 2015

[5] News Ghana NHIS Manager Arrested for Fraud | NewsGhana News Ghana Accra Ghana 2014 httpswwwnewsghanacomghnhis-manager-arrested-for-fraud

[6] BioClaim Files httpwwwbioclaimcomFraud-Files[7] Graphics Online Ghana news Dr Ametewee Defrauds NHIA

of GHcent415000mdashGraphic Online Graphics Online AccraGhana 2015 httpwwwgraphiccomghnewsgeneral-newsdr-ametewee-defrauds-nhia-of-gh-415-000html

[8] W-S Yang and S-Y Hwang ldquoA process-mining frameworkfor the detection of healthcare fraud and abuserdquo ExpertSystems with Applications vol 31 no 1 pp 56ndash68 2006

[9] G C van Capelleveen Outlier Based Predictors for HealthInsurance Fraud Detection within US Medicaid University ofTwente Enschede Netherlands 2013

[10] Y Shan D W Murray and A Sutinen ldquoDiscovering in-appropriate billings with local density-based outlier detectionmethodrdquo in Proceedings of the Eighth Australasian DataMining Conference vol 101 pp 93ndash98 Melbourne AustraliaDecember 2009

[11] L D Weiss and M K Sparrow ldquoLicense to steal how fraudbleeds Americarsquos health care systemrdquo Journal of Public HealthPolicy vol 22 no 3 pp 361ndash363 2001



external database and a graphical user interface (GUI) for enhanced interactivity with users. The HICFDS consists of several functional components, namely, (1) a function for computing the descriptive statistics of raw and processed data, (2) a preprocessing wrapper function for data handling and processing, and (3) MATLAB functions for the GA optimization and SVM classification processes. The HICFDS components are depicted in Figure 8.
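
As an illustration of how these components fit together, the following minimal MATLAB sketch chains the three functions. The wrapper names computeDescriptiveStats, preprocessClaims, and gaOptimizeSvmParams are hypothetical stand-ins for the HICFDS routines (only readtable, fitcsvm, and predict are standard MATLAB calls), and the file name claims.csv is assumed.

```matlab
% Hypothetical HICFDS processing chain (wrapper names and file name are assumed).
raw = readtable('claims.csv');            % raw NHIS claims export (assumed file)

stats  = computeDescriptiveStats(raw);    % (1) descriptive statistics of raw/processed data
[X, y] = preprocessClaims(raw);           % (2) preprocessing wrapper: clean, encode, label

% (3) GA optimization of the SVM hyperparameters, then SVM classification
[bestC, bestScale] = gaOptimizeSvmParams(X, y);   % e.g., a wrapper around ga()
model  = fitcsvm(X, y, 'KernelFunction', 'rbf', ...
                 'BoxConstraint', bestC, 'KernelScale', bestScale);
labels = predict(model, X);               % legitimate vs. fraudulent predictions
```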

The results generated by the HICFDS are stored in a MySQL database. The results comprise three parts: the legitimate claims report, the fraudulent claims, and the statistics of the results. The developed GUI portal for analyzing the results obtained from the classification of the submitted health insurance claims is displayed in Figure 9. Clicking the Fraudulent button in the GUI generates a pop-up for the claims dataset, labelled Figure 10, which shows the grouping of the detected fraudulent claim types in the datasets.

For each classifier, a 10-fold cross-validation (CV) of the hyperparameters (C, γ) from the Patients Payment Data (PPD) was performed. The GA optimization evaluated several hyperparameter settings to obtain the optimal SVM. The SVC training aims to find the best SVC parameters (C, γ) for building the HICFD classifier model. The developed classifier is evaluated using testing and validation data, and its accuracy is assessed using cross-validation to avoid overfitting of the SVC during training. The random search method was used for SVC parameter training: exponentially growing sequences of the hyperparameters (C, γ), a practical way to identify suitable parameters, were used to find the SVC parameters giving the best CV accuracy on the claims data samples. Random search differs slightly from grid search in that, instead of searching over the entire grid, it evaluates only a random sample of points on the grid, which makes it computationally cheaper than grid search. Experimentally, 10-fold CV was used as the measure of the training accuracy, where 70% of each sample was used for training and the remaining 30% was used for testing and validation.
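
The training procedure just described can be sketched in MATLAB as follows. The variables X and y (claims features and legitimate/fraudulent labels), the candidate budget of 20, and the exponential search ranges are assumptions for illustration rather than settings reported in the paper.

```matlab
% Minimal sketch: 70/30 holdout plus random search over exponentially growing
% SVM hyperparameters, each candidate scored by 10-fold cross-validation.
% Assumes X (features) and y (numeric or logical legitimate/fraudulent labels).
rng(1);                                      % reproducibility
holdout = cvpartition(y, 'HoldOut', 0.30);   % 70% training, 30% testing/validation
Xtr = X(training(holdout), :);  ytr = y(training(holdout));
Xte = X(test(holdout), :);      yte = y(test(holdout));

bestAcc = 0;
for k = 1:20                                 % random sample of candidate points
    C     = 2^randi([-5, 15]);               % exponentially growing values for C
    scale = 2^randi([-3, 10]);               % RBF kernel scale (width parameter)
    mdl   = fitcsvm(Xtr, ytr, 'KernelFunction', 'rbf', ...
                    'BoxConstraint', C, 'KernelScale', scale);
    acc = 1 - kfoldLoss(crossval(mdl, 'KFold', 10));   % 10-fold CV accuracy
    if acc > bestAcc
        bestAcc = acc;  best = mdl;
    end
end
holdoutAcc = mean(predict(best, Xte) == yte);  % check on the 30% held-out claims
```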

Figure 10: Fraud type distribution on the sample data sizes (duplicated, uncovered service, overbilled, unbundled, upcoded, and impersonation claims across the 100-, 300-, 500-, 750-, and 1000-claim datasets).

Table 2: Sample data size and the corresponding fraud types.

Fraud type                  100   300   500   750   1000
Duplicate claims              2     4     4     4      0
Uncovered service claims      4    56    65   109    406
Overbilling claims           44    60    91   121    202
Unbundled claims              0    18    10    54      0
Upcoded claims                2    34    50   122      6
Impersonation claims          0     2    10    23     34
Total suspected claims       52   174   230   433    648

Table 3: Summary performance metrics of the SVM classifiers on the sample sizes.

Kernel used   Data size   Average accuracy rate (%)   Sensitivity (%)   Specificity (%)
Linear        100         71.43                        60.00             77.78
              300         72.73                        84.21              0.00
              500         91.80                        97.78             75.00
              750         84.42                        95.00             47.06
              1000        82.95                        85.42             80.00
Polynomial    100         71.43                        66.67             72.73
              300         72.73                        88.24             20.00
              500         96.72                       100.00             86.67
              750         80.52                        96.36             40.91
              1000        84.71                        83.67             86.11
RBF           100         71.43                        57.14             85.71
              300         95.45                        95.00            100.00
              500         99.18                       100.00             96.30
              750         82.56                        96.88             40.91
              1000        90.91                       100.00             82.98

Figure 11: Linear SVM on a sample claims dataset (legal and fraudulent bills, training and classified, with support vectors highlighted).

4.3. Data Postprocessing: Validation of Classification Results
The classification accuracy on the testing data is a gauge for evaluating the ability of the HICFDS to detect and identify fraudulent claims. The testing data used to assess and evaluate the efficiency of the proposed HICFDS (classifier) were taken exclusively from NHIS headquarters and cover different hospitals within the Greater Accra Region of Ghana. The sampled data with the corresponding fraud types after the analysis are shown in Table 2.

In evaluating the classifiers obtained with the analyzed methods, the most widely employed performance measures are used: accuracy, sensitivity, and specificity, built on the concepts of True Legal (TP), False Fraudulent (FN), False Legal (FP), and True Fraudulent (TN). This classification is shown in Table 3.
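
For clarity, the link between these counts and the reported figures can be written out directly. The MATLAB check below uses the linear-kernel, 500-claim counts from Table 5 and reproduces the corresponding row of Table 3.

```matlab
% Accuracy, sensitivity, and specificity from confusion-matrix counts
% (linear kernel, 500-claim dataset: TP = 88, TN = 24, FP = 8, FN = 2).
TP = 88;  TN = 24;  FP = 8;  FN = 2;

accuracy    = (TP + TN) / (TP + TN + FP + FN);   % 0.9180 -> 91.80% in Table 3
sensitivity = TP / (TP + FN);                    % 0.9778 -> 97.78% (true legal rate)
specificity = TN / (TN + FP);                    % 0.7500 -> 75.00% (true fraudulent rate)
```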

The figures below show the SVC plots of the various classifiers (linear, polynomial, and RBF) on the claims datasets (Figures 11–13).

From the performance metrics and overall statistics presented in Table 4, it is observed that the support vector machine performs best with an accuracy of 87.91% using the RBF kernel function, followed by the polynomial kernel with 81.22% accuracy, with the linear SVM emerging as the weakest classifier at an accuracy of 80.67%.

Figure 12: Polynomial SVM on a sample claims dataset.

Figure 13: RBF SVM on a sample claims dataset.

Table 4: Average performance analysis of the SVM classifiers (%).

Kernel        Accuracy   Sensitivity   Specificity
Linear        80.67      84.48         55.97
Polynomial    81.22      86.99         61.28
RBF           87.91      89.80         81.18

Table 5: Confusion matrix for the SVM classifiers.

Kernel                  Data size   TP   TN   FP   FN   Correct rate
Linear                  100          3    7    2    2   0.714
                        300         16    0    3    3   0.713
                        500         88   24    8    2   0.918
                        750         57    8    9    3   0.844
                        1000        41   32    8    7   0.830
Polynomial              100          2    8    3    1   0.714
                        300         15    1    4    2   0.723
                        500         92   26    4    0   0.967
                        750         53   91   13    2   0.805
                        1000        41   31    5    8   0.852
Radial basis function   100          4    6    1    3   0.714
                        300         19    2    0    1   0.955
                        500         95   26    1    0   0.992
                        750         62    9   13    2   0.922
                        1000        41   39    8    0   0.919


The confusion matrix for the SVM classifiers is given in Table 5; it is utilized in the computation of the performance metrics of the SVM classifiers. For statistical and machine learning classification tasks, a confusion matrix, also known as an error matrix, is a table layout that allows visualization of the performance of a supervised learning algorithm.

Besides classification accuracy, the amount of time required to process the sample dataset is also an important consideration in this research. The comparison of computational times shows that an increase in the size of the sample dataset also increases the computational time needed to execute the process, regardless of the machine used, which is widely expected. This difference in time cost is mainly due to training on the larger dataset. Thus, as global data warehouses grow, more computational resources will be needed in machine learning and data mining research pertaining to the detection of insurance fraud, as depicted in Figure 14, which relates the average computational time to the sample data size.

Figure 15 summarizes the fraudulent claims detected during the testing of the HICFD with the sample datasets used. As the sample data size increases, the number of suspected claims increases rapidly across the various fraud types detected.

Benchmarking the HICFD analysis aids understanding of the health insurance claims outcomes. From the chart, an increase in the claims dataset size brings a corresponding increase in the number of suspected claims. The graph in Figure 16 shows a sharp rise in the level of suspected claims for the tested 100-claim dataset, representing 52% of that sample, after which the proportion of suspected claims continues to increase slightly, making up 58% of the tested data size of 300 claims.
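
Assuming these percentages are the suspected claims expressed as a share of each tested sample (the totals in Table 2), the figures can be reproduced directly, as in the short check below.

```matlab
% Suspected claims as a share of each tested sample, using the totals in Table 2.
suspected  = [52 174 230 433 648];           % total suspected claims per dataset
sampleSize = [100 300 500 750 1000];         % tested sample sizes
share = suspected ./ sampleSize * 100;       % = [52.0 58.0 46.0 57.7 64.8] percent
```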

Among these fraud types, the most frequent fraudulent act is uncovered services rendered to insurance subscribers by service providers. It accounts for 22% of the fraudulent claims, the most significant proportion of the total health insurance fraud in the tested dataset. Overbilling of submitted claims is recorded as the second most frequent fraud type, representing 20% of the total sample dataset used for this research. This is caused by service providers billing for a service at more than the expected tariff for the required diagnoses. Listing and billing for a more complex or higher level of service is done by providers to unfairly boost their financial income within otherwise legitimate claims.

Figure 14: Average computational time (s) against sample data size for the tested sample datasets.

Figure 15: Detected fraud trend (number of suspected claims against sample data size) on the tested claims datasets.

Figure 16: Chart of the types of fraudulent claims (duplicated, uncovered service, overbilled, unbundled, upcoded, and impersonation) detected in each sample dataset.

Table 6: Cost analysis of the tested claims datasets.

Sample data size   Raw cost of claims (R), GHC   Valid claims cost (V), GHC   Deviation (R–V), GHC   Percentage difference (%)
100                 20,791.83                      8,911.72                    11,880.11              133.31
300                 31,496.05                     15,622.70                    15,873.35              101.60
500                 58,218.65                     27,480.96                    30,737.69              111.85
750                 88,394.07                     31,091.58                    57,302.49              184.30
1000               117,448.20                     47,943.38                    69,504.82              144.97


Figure 17: Classification Learner App showing the various algorithms and percentage accuracies in MATLAB.

Table 7: Comparison of results of GSVM with decision trees and Naïve Bayes.

Algorithm used                                  Claims dataset   Accuracy (%)   Average over datasets (%)
GSVM with radial basis function (RBF) kernel    100              71.43          87.906
                                                300              95.45
                                                500              99.18
                                                750              82.56
                                                1000             90.91
Decision trees                                  100              62.0           74.44
                                                300              78.0
                                                500              77.8
                                                750              82.7
                                                1000             71.7
Naïve Bayes                                     100              50.0           59.1
                                                300              61.0
                                                500              56.8
                                                750              60.7
                                                1000             67.0


Moreover, some illicit service providers claim to have rendered costly services to insurance subscribers instead of the more affordable ones actually provided. Claims prepared for such expensive services represent 8% of the fraudulent claims detected in the total sample dataset. Furthermore, billing separately for service procedures that should be considered an integral part of a single procedure, known as unbundled claims, contributed 3.1% of the fraudulent claims in the claims dataset used as the test data. Due to the insecure process for quality delivery of healthcare services, insurance subscribers also contribute to fraudulent claims by loaning their ID cards to family members or third parties who pretend to be the owners and request the health insurance benefits. Duplicated claims recorded the minimum contribution of 0.5% of the fraudulent claims in the whole sample dataset.

As observed in Table 6, the cost of the claims bill increases proportionally with an increase in the sample size of the claims bill. This is consistent with the increase in fraudulent claims as sample size increases. Table 6 lists, for each sample claims dataset, the raw cost of the claims (R), the valid claims cost after processing (V), the deviation in the claims bill (R–V), and its percentage representation. There is a 27% financial loss on the total submitted claim bills to insurance carriers, and the highest rate of loss occurs within the 750-claim dataset of submitted claims.
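
A worked check of the 100-claim row of Table 6 is shown below, under the assumption that the amounts are in GHC with two decimal places and that the percentage column expresses the deviation relative to the valid claims cost; both assumptions are consistent with every row of the table.

```matlab
% Worked check of the 100-claim row of Table 6 (interpretation assumed as stated above).
R = 20791.83;                       % raw cost of submitted claims (GHC)
V = 8911.72;                        % cost of claims judged valid after processing (GHC)
deviation = R - V                   % = 11880.11 GHC, the "Deviation (R-V)" entry
pctDiff   = deviation / V * 100     % = 133.31, the "Percentage difference" entry
```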

A summary of the results and a comparison with other machine learning algorithms, namely decision trees and Naïve Bayes, are presented in Table 7.

The MATLAB Classification Learner App [43] was chosen to validate the results obtained above. It enables easy comparison among the different classification algorithms implemented. The data used for the GSVM were subsequently used in the Classification Learner App, as shown below.

Figure 18: Algorithmic runs on the 500-claim dataset.


Figures 17 and 18 show the Classification Learner App with the various implemented algorithms and corresponding accuracies in the MATLAB technical computing environment and the results obtained using the 500-claim dataset, respectively. Figures 19 and 20 depict the corresponding results when the 750- and 1000-claim datasets were used for the algorithmic runs and reproducible comparison. The summarized results and accuracies are presented in Table 7 and portray the effectiveness of our proposed approach of using genetic support vector machines (GSVMs) for fraud detection in insurance claims. From the results, it is evident that the GSVM achieves a higher level of accuracy compared to decision trees and Naïve Bayes.

5. Conclusions and Recommendations

This work aimed at developing a novel fraud detection model for insurance claims processing based on genetic support vector machines, which hybridizes and draws on the strengths of both genetic algorithms and support vector machines. The GSVM has been investigated and applied in the development of the HICFDS. This paper used the GSVM for the detection of anomalies and the classification of health insurance claims into legitimate and fraudulent claims. SVMs have been considered preferable to other classification techniques due to several advantages. They enable the separation (classification) of claims into legitimate and fraudulent using the soft margin, thus accommodating improvements in the generalization performance of the HICFDS. Among its other notable advantages, the SVM has a nonlinear dividing hyperplane, which overcomes the discrimination within the dataset, and its generalization ability on newly arrived data was considered superior to that of other classification techniques.

Figure 19: Algorithmic runs on the 750-claim dataset.


Thus, the fraud detection system combines two computational intelligence schemes and achieves higher fraud detection accuracy. The average classification accuracies achieved by the SVCs are 80.67%, 81.22%, and 87.91%, which show the performance capability of the SVC model. These classification accuracies are obtained through the careful selection of the features for training and developing the model, as well as fine-tuning of the SVCs' parameters using the V-fold cross-validation approach. These results are much better than those obtained using decision trees and Naïve Bayes.

The average sample dataset testing results for the proposed SVCs vary due to the nature of the claims dataset used. This is noted in the clustering of the claims dataset by MDC specialty. When the sample dataset is heavily skewed toward one MDC specialty (e.g., OPDC), the performance of the SVCs can favour one classifier, especially the linear SVM, over the others. Hence, the behaviour of the dataset has a significant impact on the classification results.

Based on this work, the developed GSVM model was tested and validated using HIC data. The study sought to obtain the best-performing classifier for analyzing the health insurance claims datasets for fraud. The RBF kernel was adjudged the best, with an average accuracy rate of 87.91%, and is therefore recommended.

Figure 20: Algorithmic runs on the 1000-claim dataset.


Data Availability

The data used in this study are available upon request. The data can be uploaded when required.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors of this paper wish to acknowledge the Carnegie Corporation of New York, through the University of Ghana under the UG-Carnegie Next Generation of Academics in Africa project, for organizing Write Shops that led to the timely completion of this paper.

Supplementary Materials

The material consists of an MS Excel file of data collected from some NHIS-approved hospitals in Ghana concerning insurance claims. It is the insurance claims dataset used for testing and implementation. (Supplementary Materials)

References

[1] Government of Ghana, National Health Insurance Act, Act 650, Ghana, 2003.
[2] Capitation, National Health Insurance Scheme, 2012, http://www.nhis.gov.gh/capitation.aspx.
[3] ICD-10 Version: 2016, http://apps.who.int/classifications/icd10/browse/2016/en.
[4] T. Olson, Examining the Transitional Impact of ICD-10 on Healthcare Fraud Detection, College of Saint Benedict/Saint John's University, Collegeville, MN, USA, 2015.
[5] News Ghana, NHIS Manager Arrested for Fraud, News Ghana, Accra, Ghana, 2014, https://www.newsghana.com.gh/nhis-manager-arrested-for-fraud.
[6] BioClaim, Fraud Files, http://www.bioclaim.com/Fraud-Files.
[7] Graphic Online, Ghana News: Dr Ametewee Defrauds NHIA of GH¢415,000, Graphic Online, Accra, Ghana, 2015, http://www.graphic.com.gh/news/general-news/dr-ametewee-defrauds-nhia-of-gh-415-000.html.
[8] W.-S. Yang and S.-Y. Hwang, "A process-mining framework for the detection of healthcare fraud and abuse," Expert Systems with Applications, vol. 31, no. 1, pp. 56–68, 2006.
[9] G. C. van Capelleveen, Outlier Based Predictors for Health Insurance Fraud Detection within US Medicaid, University of Twente, Enschede, Netherlands, 2013.
[10] Y. Shan, D. W. Murray, and A. Sutinen, "Discovering inappropriate billings with local density-based outlier detection method," in Proceedings of the Eighth Australasian Data Mining Conference, vol. 101, pp. 93–98, Melbourne, Australia, December 2009.
[11] L. D. Weiss and M. K. Sparrow, "License to steal: how fraud bleeds America's health care system," Journal of Public Health Policy, vol. 22, no. 3, pp. 361–363, 2001.
[12] P. Travaille, R. M. Muller, D. Thornton, and J. Van Hillegersberg, "Electronic fraud detection in the US Medicaid healthcare program: lessons learned from other industries," in Proceedings of the 17th Americas Conference on Information Systems (AMCIS), pp. 1–11, Detroit, Michigan, August 2011.
[13] A. Abdallah, M. A. Maarof, and A. Zainal, "Fraud detection system: a survey," Journal of Network and Computer Applications, vol. 68, pp. 90–113, 2016.
[14] A. K. I. Hassan and A. Abraham, "Computational intelligence models for insurance fraud detection: a review of a decade of research," Journal of Network and Innovative Computing, vol. 1, pp. 341–347, 2013.
[15] E. Kirkos, C. Spathis, and Y. Manolopoulos, "Data mining techniques for the detection of fraudulent financial statements," Expert Systems with Applications, vol. 32, no. 4, pp. 995–1003, 2007.
[16] H. Joudaki, A. Rashidian, B. Minaei-Bidgoli et al., "Using data mining to detect health care fraud and abuse: a review of literature," Global Journal of Health Science, vol. 7, no. 1, pp. 194–202, 2015.
[17] V. Rawte and G. Anuradha, "Fraud detection in health insurance using data mining techniques," in Proceedings of the 2015 International Conference on Communication, Information & Computing Technology (ICCICT), pp. 1–5, Mumbai, India, January 2015.
[18] J. Li, K.-Y. Huang, J. Jin, and J. Shi, "A survey on statistical methods for health care fraud detection," Health Care Management Science, vol. 11, no. 3, pp. 275–287, 2008.
[19] Q. Liu and M. Vasarhelyi, "Healthcare fraud detection: a survey and a clustering model incorporating geo-location information," in Proceedings of the 29th World Continuous Auditing and Reporting Symposium, Brisbane, Australia, November 2013.
[20] T. Ekin, F. Leva, F. Ruggeri, and R. Soyer, "Application of Bayesian methods in detection of healthcare fraud," Chemical Engineering Transactions, vol. 33, pp. 151–156, 2013.
[21] Home, The NHCAA, https://www.nhcaa.org/.
[22] S. Viaene, R. A. Derrig, and G. Dedene, "A case study of applying boosting naive Bayes to claim fraud diagnosis," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 5, pp. 612–620, 2004.
[23] Y. Singh and A. S. Chauhan, "Neural networks in data mining," Journal of Theoretical and Applied Information Technology, vol. 5, no. 1, pp. 37–42, 2009.
[24] D. Tomar and S. Agarwal, "A survey on data mining approaches for healthcare," International Journal of Bio-Science and Bio-Technology, vol. 5, no. 5, pp. 241–266, 2013.
[25] P. Vamplew, A. Stranieri, K.-L. Ong, P. Christen, and P. J. Kennedy, "Data mining and analytics 2011," in Proceedings of the Ninth Australasian Data Mining Conference (AusDM 2011), Australian Computer Society, Ballarat, Australia, December 2011.
[26] K. S. Ng, Y. Shan, D. W. Murray et al., "Detecting non-compliant consumers in spatio-temporal health data: a case study from Medicare Australia," in Proceedings of the 2010 IEEE International Conference on Data Mining Workshops, pp. 613–622, Sydney, Australia, December 2010.
[27] J. F. Roddick, J. Li, P. Christen, and P. J. Kennedy, "Data mining and analytics 2008," in Proceedings of the 7th Australasian Data Mining Conference (AusDM 2008), vol. 87, pp. 105–110, Glenelg, South Australia, November 2008.
[28] C. Watrin, R. Struffert, and R. Ullmann, "Benford's Law: an instrument for selecting tax audit targets," Review of Managerial Science, vol. 2, no. 3, pp. 219–237, 2008.
[29] F. Lu and J. E. Boritz, "Detecting fraud in health insurance data: learning to model incomplete Benford's Law distributions," in Machine Learning, J. Gama, R. Camacho, P. B. Brazdil, A. M. Jorge, and L. Torgo, Eds., pp. 633–640, Springer, Berlin, Heidelberg, 2005.
[30] P. Ortega, C. J. Figueroa, and G. A. Ruz, "A medical claim fraud/abuse detection system based on data mining: a case study in Chile," DMIN, vol. 6, pp. 26–29, 2006.
[31] T. Back, J. M. De Graaf, J. N. Kok, and W. A. Kosters, Theory of Genetic Algorithms, World Scientific Publishing, River Edge, NJ, USA, 2001.
[32] M. Melanie, An Introduction to Genetic Algorithms, The MIT Press, Cambridge, MA, USA, 1st edition, 1998.
[33] D. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, Reading, MA, USA, 1989.
[34] J. Wroblewski, "Theoretical foundations of order-based genetic algorithms," Fundamenta Informaticae, vol. 28, no. 3-4, pp. 423–430, 1996.
[35] J. H. Holland, Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence, MIT Press, Cambridge, MA, USA, 1st edition, 1992.
[36] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 2nd edition, 2000.
[37] J. Salomon, Support Vector Machines for Phoneme Classification, University of Edinburgh, Edinburgh, UK, 2001.
[38] J. Platt, Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines, Microsoft Research, Redmond, WA, USA, 1998.
[39] J. Platt, "Using analytic QP and sparseness to speed training of support vector machines," in Proceedings of the Advances in Neural Information Processing Systems, Cambridge, MA, USA, 1999.
[40] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, A Practical Guide to Support Vector Classification, Data Science Association, Taipei, Taiwan, 2003.
[41] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.
[42] D. Dvorski, Installing, Configuring, and Developing with XAMPP, Ski Canada Magazine, Toronto, Canada, 2007.
[43] MATLAB Classification Learner App, MATLAB Version 2019a, MathWorks, Natick, MA, USA, 2019, http://www.mathworks.com/help/stats/classification-learner-app.html.

Journal of Engineering 19

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 12: DecisionSupportSystem(DSS)forFraudDetectioninHealth ...downloads.hindawi.com/journals/je/2019/1432597.pdf · ResearchArticle DecisionSupportSystem(DSS)forFraudDetectioninHealth InsuranceClaimsUsingGeneticSupportVector

43 Data Postprocessing Validation of Classification Results(e classification accuracy of the testing data is a gauge toevaluate the ability of the HICFDS to detect and identifyfraudulent claims (e testing data used to assess and

evaluate the efficiency of the proposed HICFDS (classifier)are taken exclusively from NHIS headquarters and coversdifferent hospitals within the Greater Accra Region ofGhana Sampled data with the corresponding fraud typesafter the analysis are shown in Table 2

In evaluating the classifiers obtained with the analyzedmethods the most widely employed performance measuresare used accuracy sensitivity and specificity with theirconcepts of True Legal (TP) False Fraudulent (FN) FalseLegal (FP) and True Fraudulent (TN) (is classification isshown in Table 3

(e figures below show the SVC plots on the variousclassifiers (linear polynomial and RBF) on the claimsdatasets (Figures 11ndash13)

From the performance metrics and overall statisticspresented in Table 4 it is observed that the support vectormachine performs better classification with an accuracy of8791 using the RBF kernel function followed by the

00 20 40 60 80 100 120 140 160

10

20

30

40

50

60

70

80

90

100

Legal bills (training)Legal bills (classified)Fraudulent bills (training)

Fraudulent bills (classified)Support vectors

Figure 12 Polynomial SVM on a sample claims dataset

00 20 40 60 80 100 120 140 160

10

20

30

40

50

60

70

80

90

100

Legal bills (training)Legal bills (classified)Fraudulent bills (training)

Fraudulent bills (classified)Support vectors

Figure 13 RBF SVM on a sample claims dataset

Table 4 Averages performance analysis of SVM classifiers

Description Accuracy Sensitivity SpecificityLinear 8067 8448 5597Polynomial 8122 8699 6128RBF 8791 8980 8118

Table 5 Confusion matrix for SVM classifiers

Description Data size TP TN FP FN Correct rate

Linear

100 3 7 2 2 714300 16 0 3 3 713500 88 24 8 2 918750 57 8 9 3 8441000 41 32 8 7 830

Polynomial

100 2 8 3 1 714300 15 1 4 2 723500 92 26 4 0 967750 53 91 13 2 8051000 41 31 5 8 852

Radial basis function

100 4 6 1 3 714300 19 2 0 1 955500 95 26 1 0 992750 62 9 13 2 9221000 41 39 8 0 919

12 Journal of Engineering

polynomial kernel with 8122 accuracy and hence linearSVM emerging as the least performance classifier with anaccuracy of 8067 (e confusion matrix for the SSVMclassifiers is given in Table 5 where i utilized in the com-putation of the performance metric of the SVM classifiersFor the purpose of statistical and machine learning classi-fication tasks a confusion matrix also known as an errormatrix is a specific table layout that allows visualization ofthe performance of a supervised learning algorithm

Besides classification the amount of time required inprocessing the sample dataset is also an important con-sideration in this research From the above the comparedcomputational time shows that increase in the size of thesample dataset also increases the computational time neededto execute the process regardless of the machine used whichis widely expected(is difference in time costs is merely dueto the cause of training the dataset (us as global datawarehouse increases more computational resources will beneeded in machine learning and data mining researchpertaining to the detection of insurance fraud as depicted inFigure 14 relating the average computational time andsample data

Figure 15 summarizes the fraudulent claims detectedduring the testing of the HICFD with the sample datasetused As the sample data size increases the number ofsuspected claims increases rapidly based on the variousfraudulent types detected

Benchmarking HICFD analysis ensures understandingof HIC outcomes From the chart above an increase in theclaims dataset has a corresponding increase in the number ofsuspected claims(e graph in Figure 16 shows a sudden risein the level of suspected claims on tested 100 datasets rep-resenting 52 of the sample dataset after which it continuesto increase slightly on the suspected numbers of claims by2 to make up 58 on the tested data size of 300 claims

Among these fraud types the most frequent fraudulentact is uncovered services rendered to insurance subscribersby service providers It accounts for 22 of the fraudulentclaims as to the most significant proportion of the totalhealth insurance fraud on the total tested dataset Conse-quently overbilling of submitted claims is recorded as thesecond fraudulent claims type representing 20 of the totalsample dataset used for this research (is is caused byservice providers billing for a service greater than the ex-pected tariff to the required diagnoses Listing and billing fora more complex or higher level of service by providers aredone to boost their financial income flow unfairly in thelegitimate claims

500Average computational time

400

300

200

Tim

e (s)

1000 200 400 600

Sample data size800 1000

Figure 14 Computational time on the tested sample dataset

800

Susp

ecte

d cl

aim

s

600

400

200

0

Sample data size0 200 400 600 800 1000

Figure 15 Detected fraud trend on the tested claims dataset

0100 dataset

2 4 4

5660

1834

2

65

91

50

10 104

44

0 2 0300 dataset 500 dataset 750 dataset 1000 dataset

50

100

150

200

250

300

350

400

Duplicated claimsUncovered serviceOverbilled claims

UnbundledUpcodedImpersonation

121122109

4

54

2300 6

34

202

406

Figure 16 Chart of types of fraudulent claims

Table 6 Cost analysis of tested claims dataset

Sampledatasize

Raw costof claims

(R)GHC

Validclaimscost (V)

Deviation(RndashV)

Percentagedifference

100 2079183 891172 1188011 13331300 3149605 156227 1587335 10160500 5821865 2748096 3073769 11185750 8839407 3109158 5730249 184301000 1174482 4794338 6950482 14497

Journal of Engineering 13

Figure 17 Classification Learner App showing the various algorithms and percentage accuracies in MATLAB

Table 7 Comparison of results of GSVM with decision trees and NaıvendashBayes

Description of the algorithm used Claims dataset Accuracy obtained with the correspondingdataset

Average value over differentdatasets

GSVM with radial basis function (RBF) kernel

100 7143

87906300 9545500 9918750 82561000 9091

Decision trees

100 62

7444300 78500 778750 8271000 717

NaıvendashBayes

100 50

591300 61500 568750 6071000 67

14 Journal of Engineering

Moreover some illicit service providers claim to haverendered service to insurance subscribers on costly servicesinstead of providing more affordable ones Claims preparedon expensive service rendered to insurance subscribersrepresent 8 of the fraudulent claims detected on the totalsample dataset Furthermore 31 of service procedure thatshould be considered an integral part of a single procedureknown as the unbundle claims contributed to the fraudulentclaims of the set of claims dataset used as the test data Due tothe insecure process for quality delivery of healthcare ser-vice insurance subscribers are also contributing to thefraudulent type of claims by loaning their ID cards to familymembers of the third party who pretend to be owners andrequest for the HIS benefits in the healthcare sector Du-plicated claims as part of the fraudulent act recorded theminimum rate of 05 of contribution to fraudulent claimsin the whole sample dataset

As observed in Table 6 the cost of the claims bill in-creases proportionally with an increase in the sample size ofthe claims bill (is is consistent with an increase infraudulent claims as sample size increases From Table 6 wecan see the various costs for each raw record (R) of sampleclaim dataset Valid claims bill after processing dataset thevariation in the claims bill (RndashV) and their percentagerepresentation as well are illustrated in Table 6 (ere is a27 financial loss of the total submitted claim bills to in-surance carriers(is loss is the highest rate of loss within the750 datasets of submitted claims

Summary of results and comparison with other machinelearning algorithms such as decision trees and NaıvendashBayesis presented in Table 7

(e MATLAB Classification Learner App [43] waschosen to validate the results obtained above It enables easeof comparison with the different methods of classification

Figure 18 Algorithmic runs on 500-claim dataset

Journal of Engineering 15

algorithms implemented (e data used for the GSVM weresubsequently used in the Classification Learner App asshown below

Figures 17 and 18 show the classification learner appwith the various implemented algorithms and corre-sponding accuracies in MATLAB technical computinglanguage environment and the results obtained using the500-claim dataset respectively Figures 19 and 20 depict thesubsequent results when the 750- and 1000-claim datasetswere utilized for the algorithmic runs and reproduciblecomparison respectively (e summarized results and ac-curacies are illustrated in Table 7 (e summarized results inTable 7 portray the effectiveness of our proposed approach ofusing the genetic support vector machines (GSVMs) forfraud detection of insurance claims From the result it isevident that GSVM achieves a higher level of accuracycompared to decision trees and NaıvendashBayes

5 Conclusions and Recommendations

(is work aimed at developing a novel fraud detectionmodel for insurance claims processing based on geneticsupport vector machines which hybridizes and draws onthe strengths of both genetic algorithms and supportvector machines (e GSVM has been investigated andapplied in the development of HICFDS (is paper usedGSVM for detection of anomalies and classification ofhealth insurance claims into legitimate and fraudulentclaims SVMs have been considered preferable to otherclassification techniques due to several advantages (eyenable separation (classification) of claims into legitimateand fraudulent using the soft margin thus accommodatingupdates in the generalization performance of HICFDSWith other notable advantages it has a nonlinear dividing

Figure 19 Algorithmic runs on 750-claim dataset

16 Journal of Engineering

hyperplane which prevails over the discrimination withinthe dataset (e generalization ability of any newly arriveddata for classification was considered over other classifi-cation techniques

(us the fraud detection system provides a combinationof two computational intelligence schemes and achieveshigher fraud detection accuracy (e average classificationaccuracies achieved by the SVCs are 8067 8122 and8791 which show the performance capability of the SVCsmodel(ese classification accuracies are obtained due to thecareful selection of the features for training and developingthe model as well as fine-tuning the SVCsrsquo parameters usingtheV-fold cross-validation approach(ese results are muchbetter than those obtained using decision trees andNaıvendashBayes

(e average sample dataset testing results for theproposed SVCs vary due to the nature of the claims dataset

used (is is noted in the cluster of the claims dataset(MDC specialty) When the sample dataset is muchskewed to one MDC specialty (eg OPDC) the perfor-mance of the SVCs could tune to one classifier especiallythe linear SVM as compared to others Hence the be-haviour of the dataset has a significant impact on clas-sification results

Based on this work the developed GSVM model wastested and validated using HIC data (e study sought toobtain the best performing classifier for analyzing the healthinsurance claims datasets for fraud (e RBF kernel wasadjudged the best with an average accuracy rate of 8791(e RBF kernel is therefore recommended

Figure 20 Algorithmic runs on the 1000-claim dataset

Journal of Engineering 17

Data Availability

(e data used in this study are available upon request (edata can be uploaded when required

Conflicts of Interest

(e authors declare that they have no conflicts of interest

Acknowledgments

(e authors of this paper wish to acknowledge the CarnegieCorporation of New York through the University of Ghanaunder the UG-Carnegie Next Generation of Academics inAfrica project for organizing Write Shops that led to thetimely completion of this paper

Supplementary Materials

(e material consists of MS Excel file data collected fromsome NHIS-approved hospitals in Ghana concerning in-surance claims Its insurance claims dataset used for testingand implementation (Supplementary Materials)

References

[1] G of Ghana National Health Insurance Act Act 650 2003Ghana 2003

[2] Capitation National Health Insurance Scheme 2012 httpwwwnhisgovghcapitationaspx

[3] ICD-10 Version2016 httpappswhointclassificationsicd10browse2016en

[4] T Olson Examining the Transitional Impact of ICD-10 onHealthcare Fraud Detection College of Saint BenedictSaintJohnrsquos University Collegeville MN USA 2015

[5] News Ghana NHIS Manager Arrested for Fraud | NewsGhana News Ghana Accra Ghana 2014 httpswwwnewsghanacomghnhis-manager-arrested-for-fraud

[6] BioClaim Files httpwwwbioclaimcomFraud-Files[7] Graphics Online Ghana news Dr Ametewee Defrauds NHIA

of GHcent415000mdashGraphic Online Graphics Online AccraGhana 2015 httpwwwgraphiccomghnewsgeneral-newsdr-ametewee-defrauds-nhia-of-gh-415-000html

[8] W-S Yang and S-Y Hwang ldquoA process-mining frameworkfor the detection of healthcare fraud and abuserdquo ExpertSystems with Applications vol 31 no 1 pp 56ndash68 2006

[9] G C van Capelleveen Outlier Based Predictors for HealthInsurance Fraud Detection within US Medicaid University ofTwente Enschede Netherlands 2013

[10] Y Shan D W Murray and A Sutinen ldquoDiscovering in-appropriate billings with local density-based outlier detectionmethodrdquo in Proceedings of the Eighth Australasian DataMining Conference vol 101 pp 93ndash98 Melbourne AustraliaDecember 2009

[11] L D Weiss and M K Sparrow ldquoLicense to steal how fraudbleeds Americarsquos health care systemrdquo Journal of Public HealthPolicy vol 22 no 3 pp 361ndash363 2001

[12] P Travaille RMMuller D(ornton and J VanHillegersbergldquoElectronic fraud detection in the US Medicaid healthcareprogram lessons learned from other industriesrdquo in Proceedingsof the 17th Americas Conference on Information Systems(AMCIS) pp 1ndash11 Detroit Michigan August 2011

[13] A Abdallah M A Maarof and A Zainal ldquoFraud detectionsystem a surveyrdquo Journal of Network and Computer Appli-cations vol 68 pp 90ndash113 2016

[14] A K I Hassan and A Abraham ldquoComputational intelligencemodels for insurance fraud detection a review of a decade ofresearchrdquo Journal of Network and Innovative Computingvol 1 pp 341ndash347 2013

[15] E Kirkos C Spathis and Y Manolopoulos ldquoData Miningtechniques for the detection of fraudulent financial state-mentsrdquo Expert Systems with Applications vol 32 no 4pp 995ndash1003 2007

[16] H Joudaki A Rashidian B Minaei-Bidgoli et al ldquoUsing datamining to detect health care fraud and abuse a review ofliteraturerdquo Global Journal of Health Science vol 7 no 1pp 194ndash202 2015

[17] V Rawte and G Anuradha ldquoFraud detection in health in-surance using data mining techniquesrdquo in Proceedings of the2015 International Conference on Communication In-formation amp Computing Technology (ICCICT) pp 1ndash5Mumbai India January 2015

[18] J Li K-Y Huang J Jin and J Shi ldquoA survey on statisticalmethods for health care fraud detectionrdquo Health CareManagement Science vol 11 no 3 pp 275ndash287 2008

[19] Q Liu and M Vasarhelyi ldquoHealthcare fraud detection asurvey and a clustering model incorporating geo-locationinformationrdquo in Proceedings of the 29th World ContinuousAuditing and Reporting Symposium Brisbane AustraliaNovember 2013

[20] T Ekin F Leva F Ruggeri and R Soyer ldquoApplication ofBayesian methods in detection of healthcare fraudrdquo ChemicalEngineering Transactions vol 33 pp 151ndash156 2013

[21] Homemdash(e NHCAA httpswwwnhcaaorg[22] S Viaene R A Derrig and G Dedene ldquoA case study of

applying boosting naive Bayes to claim fraud diagnosisrdquo IEEETransactions on Knowledge and Data Engineering vol 16no 5 pp 612ndash620 2004

[23] Y Singh and A S Chauhan ldquoNeural networks in dataminingrdquo Journal of Feoretical and Applied InformationTechnology vol 5 no 1 pp 37ndash42 2009

[24] D Tomar and S Agarwal ldquoA survey on data mining ap-proaches for healthcarerdquo International Journal of Bio-Scienceand Bio-Technology vol 5 no 5 pp 241ndash266 2013

[25] P Vamplew A Stranieri K-L Ong P Christen andP J Kennedy ldquoData mining and analytics 2011rdquo in Pro-ceedings of the Ninth Australasian Data Mining Conference(AusDMrsquoA) Australian Computer Society Ballarat AustraliaDecember 2011

[26] K S Ng Y Shan D W Murray et al ldquoDetecting non-compliant consumers in spatio-temporal health data a casestudy from medicare Australiardquo in Proceedings of the 2010IEEE International Conference on Data Mining Workshopspp 613ndash622 Sydney Australia December 2010

[27] J F Roddick J Li P Christen and P J Kennedy ldquoDatamining and analytics 2008rdquo in Proceedings of the 7th Aus-tralasian Data Mining Conference (AusDM 2008) vol 87pp 105ndash110 Glenelg South Australia November 2008

[28] C Watrin R Struffert and R Ullmann ldquoBenfordrsquos Law aninstrument for selecting tax audit targetsrdquo Review of Man-agerial Science vol 2 no 3 pp 219ndash237 2008

[29] F Lu and J E Boritz ldquoDetecting fraud in health insurancedata learning to model incomplete Benfordrsquos Law distribu-tionsrdquo in Machine Learning J Gama R CamachoP B Brazdil A M Jorge and L Torgo Eds pp 633ndash640Springer Berlin Heidelberg 2005

18 Journal of Engineering

[30] P Ortega C J Figueroa and G A Ruz ldquoA medical claimfraudabuse detection system based on data mining a casestudy in Chilerdquo DMIN vol 6 pp 26ndash29 2006

[31] T Back J M De Graaf J N Kok and W A Kosters Feoryof Genetic Algorithms World Scientific Publishing RiverEdge NJ USA 2001

[32] M Melanie An Introduction to Genetic Algorithms (e MITPress Cambridge MA USA 1st edition 1998

[33] D Goldberg Genetic Algorithms in Optimization Search andMachine Learning Addison-Wesley Reading MA USA1989

[34] J Wroblewski ldquo(eoretical foundations of order-based ge-netic algorithmsrdquo Fundamental Informaticae vol 28 no 3-4pp 423ndash430 1996

[35] J H Holland Adaptation in Natural and Artificial SystemsAn Introductory Analysis with Applications to Biology Con-trol and Artificial Intelligence MIT Press Cambridge MAUSA 1st edition 1992

[36] V N Vapnik Fe Nature of Statistical Learning FeorySpringer New York NY USA 2nd edition 2000

[37] J Salomon Support Vector Machines for Phoneme Classifi-cation University of Edinburgh Edinburgh UK 2001

[38] J Platt Sequential Minimal Optimization A Fast Algorithmfor Training Support Vector Machines Microsoft ResearchRedmond WA USA 1998

[39] J Platt ldquoUsing analytic QP and sparseness to speed training ofsupport vector machinesrdquo in Proceedings of the Advances inNeural Information Processing Systems Cambridge MAUSA 1999

[40] C-W Hsu C-C Chang and C-J Lin A Practical Guide toSupport Vector Classification Data Science Association Tai-pei Taiwan 2003

[41] I Guyon and A Elisseeff ldquoAn introduction to variable andfeature selectionrdquo Journal of Machine Learning Researchvol 3 pp 1157ndash1182 2003

[42] D Dvorski Installing Configuring and Developing withXAMPP Ski Canada Magazine Toronto Canada 2007

[43] MATLAB Classification Learner App MATLAB Version 2019aMathworks Computer Software Company Natick MS USA2019 httpwwwmathworkscomhelpstatsclassification-learner-apphtml

Journal of Engineering 19

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 13: DecisionSupportSystem(DSS)forFraudDetectioninHealth ...downloads.hindawi.com/journals/je/2019/1432597.pdf · ResearchArticle DecisionSupportSystem(DSS)forFraudDetectioninHealth InsuranceClaimsUsingGeneticSupportVector

polynomial kernel with 8122 accuracy and hence linearSVM emerging as the least performance classifier with anaccuracy of 8067 (e confusion matrix for the SSVMclassifiers is given in Table 5 where i utilized in the com-putation of the performance metric of the SVM classifiersFor the purpose of statistical and machine learning classi-fication tasks a confusion matrix also known as an errormatrix is a specific table layout that allows visualization ofthe performance of a supervised learning algorithm

Besides classification the amount of time required inprocessing the sample dataset is also an important con-sideration in this research From the above the comparedcomputational time shows that increase in the size of thesample dataset also increases the computational time neededto execute the process regardless of the machine used whichis widely expected(is difference in time costs is merely dueto the cause of training the dataset (us as global datawarehouse increases more computational resources will beneeded in machine learning and data mining researchpertaining to the detection of insurance fraud as depicted inFigure 14 relating the average computational time andsample data

Figure 15 summarizes the fraudulent claims detectedduring the testing of the HICFD with the sample datasetused As the sample data size increases the number ofsuspected claims increases rapidly based on the variousfraudulent types detected

Benchmarking HICFD analysis ensures understandingof HIC outcomes From the chart above an increase in theclaims dataset has a corresponding increase in the number ofsuspected claims(e graph in Figure 16 shows a sudden risein the level of suspected claims on tested 100 datasets rep-resenting 52 of the sample dataset after which it continuesto increase slightly on the suspected numbers of claims by2 to make up 58 on the tested data size of 300 claims

Among these fraud types the most frequent fraudulentact is uncovered services rendered to insurance subscribersby service providers It accounts for 22 of the fraudulentclaims as to the most significant proportion of the totalhealth insurance fraud on the total tested dataset Conse-quently overbilling of submitted claims is recorded as thesecond fraudulent claims type representing 20 of the totalsample dataset used for this research (is is caused byservice providers billing for a service greater than the ex-pected tariff to the required diagnoses Listing and billing fora more complex or higher level of service by providers aredone to boost their financial income flow unfairly in thelegitimate claims

500Average computational time

400

300

200

Tim

e (s)

1000 200 400 600

Sample data size800 1000

Figure 14 Computational time on the tested sample dataset

800

Susp

ecte

d cl

aim

s

600

400

200

0

Sample data size0 200 400 600 800 1000

Figure 15 Detected fraud trend on the tested claims dataset

0100 dataset

2 4 4

5660

1834

2

65

91

50

10 104

44

0 2 0300 dataset 500 dataset 750 dataset 1000 dataset

50

100

150

200

250

300

350

400

Duplicated claimsUncovered serviceOverbilled claims

UnbundledUpcodedImpersonation

121122109

4

54

2300 6

34

202

406

Figure 16 Chart of types of fraudulent claims

Table 6 Cost analysis of tested claims dataset

Sampledatasize

Raw costof claims

(R)GHC

Validclaimscost (V)

Deviation(RndashV)

Percentagedifference

100 2079183 891172 1188011 13331300 3149605 156227 1587335 10160500 5821865 2748096 3073769 11185750 8839407 3109158 5730249 184301000 1174482 4794338 6950482 14497

Journal of Engineering 13

Figure 17 Classification Learner App showing the various algorithms and percentage accuracies in MATLAB

Table 7 Comparison of results of GSVM with decision trees and NaıvendashBayes

Description of the algorithm used Claims dataset Accuracy obtained with the correspondingdataset

Average value over differentdatasets

GSVM with radial basis function (RBF) kernel

100 7143

87906300 9545500 9918750 82561000 9091

Decision trees

100 62

7444300 78500 778750 8271000 717

NaıvendashBayes

100 50

591300 61500 568750 6071000 67

14 Journal of Engineering

Moreover some illicit service providers claim to haverendered service to insurance subscribers on costly servicesinstead of providing more affordable ones Claims preparedon expensive service rendered to insurance subscribersrepresent 8 of the fraudulent claims detected on the totalsample dataset Furthermore 31 of service procedure thatshould be considered an integral part of a single procedureknown as the unbundle claims contributed to the fraudulentclaims of the set of claims dataset used as the test data Due tothe insecure process for quality delivery of healthcare ser-vice insurance subscribers are also contributing to thefraudulent type of claims by loaning their ID cards to familymembers of the third party who pretend to be owners andrequest for the HIS benefits in the healthcare sector Du-plicated claims as part of the fraudulent act recorded theminimum rate of 05 of contribution to fraudulent claimsin the whole sample dataset

As observed in Table 6 the cost of the claims bill in-creases proportionally with an increase in the sample size ofthe claims bill (is is consistent with an increase infraudulent claims as sample size increases From Table 6 wecan see the various costs for each raw record (R) of sampleclaim dataset Valid claims bill after processing dataset thevariation in the claims bill (RndashV) and their percentagerepresentation as well are illustrated in Table 6 (ere is a27 financial loss of the total submitted claim bills to in-surance carriers(is loss is the highest rate of loss within the750 datasets of submitted claims

Summary of results and comparison with other machinelearning algorithms such as decision trees and NaıvendashBayesis presented in Table 7

(e MATLAB Classification Learner App [43] waschosen to validate the results obtained above It enables easeof comparison with the different methods of classification

Figure 18 Algorithmic runs on 500-claim dataset

Journal of Engineering 15

algorithms implemented (e data used for the GSVM weresubsequently used in the Classification Learner App asshown below

Figures 17 and 18 show the classification learner appwith the various implemented algorithms and corre-sponding accuracies in MATLAB technical computinglanguage environment and the results obtained using the500-claim dataset respectively Figures 19 and 20 depict thesubsequent results when the 750- and 1000-claim datasetswere utilized for the algorithmic runs and reproduciblecomparison respectively (e summarized results and ac-curacies are illustrated in Table 7 (e summarized results inTable 7 portray the effectiveness of our proposed approach ofusing the genetic support vector machines (GSVMs) forfraud detection of insurance claims From the result it isevident that GSVM achieves a higher level of accuracycompared to decision trees and NaıvendashBayes

5 Conclusions and Recommendations

This work aimed at developing a novel fraud detection model for insurance claims processing based on genetic support vector machines (GSVMs), which hybridize and draw on the strengths of both genetic algorithms and support vector machines. The GSVM was investigated and applied in the development of the HICFDS. This paper used the GSVM for the detection of anomalies and the classification of health insurance claims into legitimate and fraudulent claims. SVMs were considered preferable to other classification techniques because of several advantages. They enable the separation (classification) of claims into legitimate and fraudulent using a soft margin, thus accommodating updates in the generalization performance of the HICFDS. Among other notable advantages, the SVM has a nonlinear separating hyperplane, which handles the discrimination within the dataset, and its ability to generalize to newly arrived data was considered superior to that of other classification techniques.

Figure 19: Algorithmic runs on the 750-claim dataset.

Thus, the fraud detection system combines two computational intelligence schemes and achieves higher fraud detection accuracy. The average classification accuracies achieved by the SVCs are 80.67%, 81.22%, and 87.91%, which demonstrate the performance capability of the SVC models. These classification accuracies result from the careful selection of the features used for training and developing the model, as well as from fine-tuning the SVCs' parameters using the V-fold cross-validation approach. These results are much better than those obtained using decision trees and Naïve–Bayes.
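The combination described above, a genetic algorithm searching for good SVM settings with V-fold cross-validated accuracy as the fitness signal, can be sketched as follows. This is a minimal illustration under assumed parameter ranges and synthetic stand-in data, not the authors' GSVM implementation: a tiny genetic algorithm evolves the RBF-SVM parameters C and gamma.

import random
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

random.seed(0)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)  # stand-in claims data

def fitness(individual):
    """Cross-validated accuracy of an RBF SVM with the encoded (log2 C, log2 gamma)."""
    log_c, log_gamma = individual
    clf = SVC(kernel="rbf", C=2.0 ** log_c, gamma=2.0 ** log_gamma)
    return cross_val_score(clf, X, y, cv=5).mean()

def random_individual():
    return [random.uniform(-3, 10), random.uniform(-10, 2)]  # log2(C), log2(gamma)

def crossover(a, b):
    return [random.choice(pair) for pair in zip(a, b)]

def mutate(ind, rate=0.3):
    return [g + random.gauss(0, 1) if random.random() < rate else g for g in ind]

population = [random_individual() for _ in range(10)]
for generation in range(5):
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[:4]                      # elitist selection of the fittest candidates
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(len(population) - len(parents))]
    population = parents + children

best = max(population, key=fitness)
print(f"best C = {2.0 ** best[0]:.3f}, gamma = {2.0 ** best[1]:.5f}, CV accuracy = {fitness(best):.3f}")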

The average testing results for the proposed SVCs vary with the nature of the claims dataset used. This is noted in the clustering of the claims dataset by MDC specialty. When the sample dataset is heavily skewed toward one MDC specialty (e.g., OPDC), the performance of the SVCs can favour one classifier, especially the linear SVM, over the others. Hence, the behaviour of the dataset has a significant impact on the classification results.
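Because skew toward a single specialty can bias such comparisons, one practical safeguard is to inspect the specialty and class distributions and use stratified V-fold cross-validation so that every fold mirrors them. The sketch below assumes a hypothetical mdc_specialty tag per claim; it illustrates the precaution only and is not part of the published HICFDS.

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Stand-in claims data; the MDC tags are illustrative (most claims from one specialty).
X, y = make_classification(n_samples=500, n_features=8, weights=[0.85, 0.15], random_state=1)
mdc = pd.Series(["OPDC"] * 400 + ["MEDI"] * 60 + ["SURG"] * 40, name="mdc_specialty")

# 1) Quantify the skew before trusting any single-classifier result.
print(mdc.value_counts(normalize=True).mul(100).round(1))

# 2) Stratified V-fold CV keeps the legitimate/fraudulent ratio identical in every fold,
#    so a skewed sample does not flatter one classifier by accident.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=cv, scoring="accuracy")
print(f"linear SVM, stratified 10-fold accuracy = {100 * scores.mean():.2f}%")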

Based on this work, the developed GSVM model was tested and validated using HIC data. The study sought to obtain the best-performing classifier for analyzing health insurance claims datasets for fraud. The RBF kernel was adjudged the best, with an average accuracy of 87.91%, and is therefore recommended.
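The kernel recommendation can be reproduced in outline by scoring each candidate kernel with the same V-fold cross-validation and keeping the one with the highest mean accuracy. The snippet below is a generic sketch on stand-in data, with parameter values chosen only for illustration.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=750, n_features=10, random_state=2)  # stand-in claims data

results = {}
for kernel in ("linear", "poly", "rbf"):
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=10, gamma="scale", degree=3))
    results[kernel] = cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()

for kernel, acc in results.items():
    print(f"{kernel}: {100 * acc:.2f}%")
print(f"recommended kernel: {max(results, key=results.get)}")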

Figure 20: Algorithmic runs on the 1000-claim dataset.


Data Availability

The data used in this study are available upon request. The data can be uploaded when required.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors of this paper wish to acknowledge the Carnegie Corporation of New York, which, through the University of Ghana under the UG-Carnegie Next Generation of Academics in Africa project, organized Write Shops that led to the timely completion of this paper.

Supplementary Materials

The supplementary material consists of an MS Excel file of data collected from some NHIS-approved hospitals in Ghana concerning insurance claims. It is the insurance claims dataset used for testing and implementation. (Supplementary Materials)

References

[1] Government of Ghana, National Health Insurance Act, Act 650, 2003, Ghana, 2003.
[2] Capitation, National Health Insurance Scheme, 2012, http://www.nhis.gov.gh/capitation.aspx.
[3] ICD-10 Version: 2016, http://apps.who.int/classifications/icd10/browse/2016/en.
[4] T. Olson, Examining the Transitional Impact of ICD-10 on Healthcare Fraud Detection, College of Saint Benedict/Saint John's University, Collegeville, MN, USA, 2015.
[5] News Ghana, NHIS Manager Arrested for Fraud, News Ghana, Accra, Ghana, 2014, https://www.newsghana.com.gh/nhis-manager-arrested-for-fraud.
[6] BioClaim, Fraud Files, http://www.bioclaim.com/Fraud-Files.
[7] Graphic Online, Ghana News: Dr Ametewee Defrauds NHIA of GH¢415,000, Graphic Online, Accra, Ghana, 2015, http://www.graphic.com.gh/news/general-news/dr-ametewee-defrauds-nhia-of-gh-415-000.html.
[8] W.-S. Yang and S.-Y. Hwang, "A process-mining framework for the detection of healthcare fraud and abuse," Expert Systems with Applications, vol. 31, no. 1, pp. 56–68, 2006.
[9] G. C. van Capelleveen, Outlier Based Predictors for Health Insurance Fraud Detection within US Medicaid, University of Twente, Enschede, Netherlands, 2013.
[10] Y. Shan, D. W. Murray, and A. Sutinen, "Discovering inappropriate billings with local density-based outlier detection method," in Proceedings of the Eighth Australasian Data Mining Conference, vol. 101, pp. 93–98, Melbourne, Australia, December 2009.
[11] L. D. Weiss and M. K. Sparrow, "License to steal: how fraud bleeds America's health care system," Journal of Public Health Policy, vol. 22, no. 3, pp. 361–363, 2001.
[12] P. Travaille, R. M. Muller, D. Thornton, and J. Van Hillegersberg, "Electronic fraud detection in the US Medicaid healthcare program: lessons learned from other industries," in Proceedings of the 17th Americas Conference on Information Systems (AMCIS), pp. 1–11, Detroit, MI, USA, August 2011.
[13] A. Abdallah, M. A. Maarof, and A. Zainal, "Fraud detection system: a survey," Journal of Network and Computer Applications, vol. 68, pp. 90–113, 2016.
[14] A. K. I. Hassan and A. Abraham, "Computational intelligence models for insurance fraud detection: a review of a decade of research," Journal of Network and Innovative Computing, vol. 1, pp. 341–347, 2013.
[15] E. Kirkos, C. Spathis, and Y. Manolopoulos, "Data mining techniques for the detection of fraudulent financial statements," Expert Systems with Applications, vol. 32, no. 4, pp. 995–1003, 2007.
[16] H. Joudaki, A. Rashidian, B. Minaei-Bidgoli et al., "Using data mining to detect health care fraud and abuse: a review of literature," Global Journal of Health Science, vol. 7, no. 1, pp. 194–202, 2015.
[17] V. Rawte and G. Anuradha, "Fraud detection in health insurance using data mining techniques," in Proceedings of the 2015 International Conference on Communication, Information & Computing Technology (ICCICT), pp. 1–5, Mumbai, India, January 2015.
[18] J. Li, K.-Y. Huang, J. Jin, and J. Shi, "A survey on statistical methods for health care fraud detection," Health Care Management Science, vol. 11, no. 3, pp. 275–287, 2008.
[19] Q. Liu and M. Vasarhelyi, "Healthcare fraud detection: a survey and a clustering model incorporating geo-location information," in Proceedings of the 29th World Continuous Auditing and Reporting Symposium, Brisbane, Australia, November 2013.
[20] T. Ekin, F. Ieva, F. Ruggeri, and R. Soyer, "Application of Bayesian methods in detection of healthcare fraud," Chemical Engineering Transactions, vol. 33, pp. 151–156, 2013.
[21] Home - The NHCAA, https://www.nhcaa.org.
[22] S. Viaene, R. A. Derrig, and G. Dedene, "A case study of applying boosting naive Bayes to claim fraud diagnosis," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 5, pp. 612–620, 2004.
[23] Y. Singh and A. S. Chauhan, "Neural networks in data mining," Journal of Theoretical and Applied Information Technology, vol. 5, no. 1, pp. 37–42, 2009.
[24] D. Tomar and S. Agarwal, "A survey on data mining approaches for healthcare," International Journal of Bio-Science and Bio-Technology, vol. 5, no. 5, pp. 241–266, 2013.
[25] P. Vamplew, A. Stranieri, K.-L. Ong, P. Christen, and P. J. Kennedy, "Data mining and analytics 2011," in Proceedings of the Ninth Australasian Data Mining Conference (AusDM'11), Australian Computer Society, Ballarat, Australia, December 2011.
[26] K. S. Ng, Y. Shan, D. W. Murray et al., "Detecting non-compliant consumers in spatio-temporal health data: a case study from Medicare Australia," in Proceedings of the 2010 IEEE International Conference on Data Mining Workshops, pp. 613–622, Sydney, Australia, December 2010.
[27] J. F. Roddick, J. Li, P. Christen, and P. J. Kennedy, "Data mining and analytics 2008," in Proceedings of the 7th Australasian Data Mining Conference (AusDM 2008), vol. 87, pp. 105–110, Glenelg, South Australia, November 2008.
[28] C. Watrin, R. Struffert, and R. Ullmann, "Benford's Law: an instrument for selecting tax audit targets," Review of Managerial Science, vol. 2, no. 3, pp. 219–237, 2008.
[29] F. Lu and J. E. Boritz, "Detecting fraud in health insurance data: learning to model incomplete Benford's Law distributions," in Machine Learning, J. Gama, R. Camacho, P. B. Brazdil, A. M. Jorge, and L. Torgo, Eds., pp. 633–640, Springer, Berlin, Heidelberg, 2005.
[30] P. Ortega, C. J. Figueroa, and G. A. Ruz, "A medical claim fraud/abuse detection system based on data mining: a case study in Chile," DMIN, vol. 6, pp. 26–29, 2006.
[31] T. Bäck, J. M. De Graaf, J. N. Kok, and W. A. Kosters, Theory of Genetic Algorithms, World Scientific Publishing, River Edge, NJ, USA, 2001.
[32] M. Melanie, An Introduction to Genetic Algorithms, The MIT Press, Cambridge, MA, USA, 1st edition, 1998.
[33] D. Goldberg, Genetic Algorithms in Optimization, Search, and Machine Learning, Addison-Wesley, Reading, MA, USA, 1989.
[34] J. Wroblewski, "Theoretical foundations of order-based genetic algorithms," Fundamenta Informaticae, vol. 28, no. 3-4, pp. 423–430, 1996.
[35] J. H. Holland, Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence, MIT Press, Cambridge, MA, USA, 1st edition, 1992.
[36] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 2nd edition, 2000.
[37] J. Salomon, Support Vector Machines for Phoneme Classification, University of Edinburgh, Edinburgh, UK, 2001.
[38] J. Platt, Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines, Microsoft Research, Redmond, WA, USA, 1998.
[39] J. Platt, "Using analytic QP and sparseness to speed training of support vector machines," in Proceedings of the Advances in Neural Information Processing Systems, Cambridge, MA, USA, 1999.
[40] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, A Practical Guide to Support Vector Classification, Data Science Association, Taipei, Taiwan, 2003.
[41] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.
[42] D. Dvorski, Installing, Configuring, and Developing with XAMPP, Ski Canada Magazine, Toronto, Canada, 2007.
[43] MATLAB Classification Learner App, MATLAB Version 2019a, MathWorks, Natick, MA, USA, 2019, http://www.mathworks.com/help/stats/classification-learner-app.html.


Figure 17: Classification Learner App showing the various algorithms and percentage accuracies in MATLAB.

Table 7: Comparison of results of GSVM with decision trees and Naïve–Bayes.

Description of the algorithm used              Claims dataset   Accuracy with the dataset (%)   Average over datasets (%)
GSVM with radial basis function (RBF) kernel   100              71.43                           87.906
                                               300              95.45
                                               500              99.18
                                               750              82.56
                                               1000             90.91
Decision trees                                 100              62                              74.44
                                               300              78
                                               500              77.8
                                               750              82.7
                                               1000             71.7
Naïve–Bayes                                    100              50                              59.1
                                               300              61
                                               500              56.8
                                               750              60.7
                                               1000             67

14 Journal of Engineering

Moreover some illicit service providers claim to haverendered service to insurance subscribers on costly servicesinstead of providing more affordable ones Claims preparedon expensive service rendered to insurance subscribersrepresent 8 of the fraudulent claims detected on the totalsample dataset Furthermore 31 of service procedure thatshould be considered an integral part of a single procedureknown as the unbundle claims contributed to the fraudulentclaims of the set of claims dataset used as the test data Due tothe insecure process for quality delivery of healthcare ser-vice insurance subscribers are also contributing to thefraudulent type of claims by loaning their ID cards to familymembers of the third party who pretend to be owners andrequest for the HIS benefits in the healthcare sector Du-plicated claims as part of the fraudulent act recorded theminimum rate of 05 of contribution to fraudulent claimsin the whole sample dataset

As observed in Table 6 the cost of the claims bill in-creases proportionally with an increase in the sample size ofthe claims bill (is is consistent with an increase infraudulent claims as sample size increases From Table 6 wecan see the various costs for each raw record (R) of sampleclaim dataset Valid claims bill after processing dataset thevariation in the claims bill (RndashV) and their percentagerepresentation as well are illustrated in Table 6 (ere is a27 financial loss of the total submitted claim bills to in-surance carriers(is loss is the highest rate of loss within the750 datasets of submitted claims

Summary of results and comparison with other machinelearning algorithms such as decision trees and NaıvendashBayesis presented in Table 7

(e MATLAB Classification Learner App [43] waschosen to validate the results obtained above It enables easeof comparison with the different methods of classification

Figure 18 Algorithmic runs on 500-claim dataset

Journal of Engineering 15

algorithms implemented (e data used for the GSVM weresubsequently used in the Classification Learner App asshown below

Figures 17 and 18 show the classification learner appwith the various implemented algorithms and corre-sponding accuracies in MATLAB technical computinglanguage environment and the results obtained using the500-claim dataset respectively Figures 19 and 20 depict thesubsequent results when the 750- and 1000-claim datasetswere utilized for the algorithmic runs and reproduciblecomparison respectively (e summarized results and ac-curacies are illustrated in Table 7 (e summarized results inTable 7 portray the effectiveness of our proposed approach ofusing the genetic support vector machines (GSVMs) forfraud detection of insurance claims From the result it isevident that GSVM achieves a higher level of accuracycompared to decision trees and NaıvendashBayes

5 Conclusions and Recommendations

(is work aimed at developing a novel fraud detectionmodel for insurance claims processing based on geneticsupport vector machines which hybridizes and draws onthe strengths of both genetic algorithms and supportvector machines (e GSVM has been investigated andapplied in the development of HICFDS (is paper usedGSVM for detection of anomalies and classification ofhealth insurance claims into legitimate and fraudulentclaims SVMs have been considered preferable to otherclassification techniques due to several advantages (eyenable separation (classification) of claims into legitimateand fraudulent using the soft margin thus accommodatingupdates in the generalization performance of HICFDSWith other notable advantages it has a nonlinear dividing

Figure 19 Algorithmic runs on 750-claim dataset

16 Journal of Engineering

hyperplane which prevails over the discrimination withinthe dataset (e generalization ability of any newly arriveddata for classification was considered over other classifi-cation techniques

(us the fraud detection system provides a combinationof two computational intelligence schemes and achieveshigher fraud detection accuracy (e average classificationaccuracies achieved by the SVCs are 8067 8122 and8791 which show the performance capability of the SVCsmodel(ese classification accuracies are obtained due to thecareful selection of the features for training and developingthe model as well as fine-tuning the SVCsrsquo parameters usingtheV-fold cross-validation approach(ese results are muchbetter than those obtained using decision trees andNaıvendashBayes

(e average sample dataset testing results for theproposed SVCs vary due to the nature of the claims dataset

used (is is noted in the cluster of the claims dataset(MDC specialty) When the sample dataset is muchskewed to one MDC specialty (eg OPDC) the perfor-mance of the SVCs could tune to one classifier especiallythe linear SVM as compared to others Hence the be-haviour of the dataset has a significant impact on clas-sification results

Based on this work the developed GSVM model wastested and validated using HIC data (e study sought toobtain the best performing classifier for analyzing the healthinsurance claims datasets for fraud (e RBF kernel wasadjudged the best with an average accuracy rate of 8791(e RBF kernel is therefore recommended

Figure 20 Algorithmic runs on the 1000-claim dataset

Journal of Engineering 17

Data Availability

(e data used in this study are available upon request (edata can be uploaded when required

Conflicts of Interest

(e authors declare that they have no conflicts of interest

Acknowledgments

(e authors of this paper wish to acknowledge the CarnegieCorporation of New York through the University of Ghanaunder the UG-Carnegie Next Generation of Academics inAfrica project for organizing Write Shops that led to thetimely completion of this paper

Supplementary Materials

(e material consists of MS Excel file data collected fromsome NHIS-approved hospitals in Ghana concerning in-surance claims Its insurance claims dataset used for testingand implementation (Supplementary Materials)

References

[1] G of Ghana National Health Insurance Act Act 650 2003Ghana 2003

[2] Capitation National Health Insurance Scheme 2012 httpwwwnhisgovghcapitationaspx

[3] ICD-10 Version2016 httpappswhointclassificationsicd10browse2016en

[4] T Olson Examining the Transitional Impact of ICD-10 onHealthcare Fraud Detection College of Saint BenedictSaintJohnrsquos University Collegeville MN USA 2015

[5] News Ghana NHIS Manager Arrested for Fraud | NewsGhana News Ghana Accra Ghana 2014 httpswwwnewsghanacomghnhis-manager-arrested-for-fraud

[6] BioClaim Files httpwwwbioclaimcomFraud-Files[7] Graphics Online Ghana news Dr Ametewee Defrauds NHIA

of GHcent415000mdashGraphic Online Graphics Online AccraGhana 2015 httpwwwgraphiccomghnewsgeneral-newsdr-ametewee-defrauds-nhia-of-gh-415-000html

[8] W-S Yang and S-Y Hwang ldquoA process-mining frameworkfor the detection of healthcare fraud and abuserdquo ExpertSystems with Applications vol 31 no 1 pp 56ndash68 2006

[9] G C van Capelleveen Outlier Based Predictors for HealthInsurance Fraud Detection within US Medicaid University ofTwente Enschede Netherlands 2013

[10] Y Shan D W Murray and A Sutinen ldquoDiscovering in-appropriate billings with local density-based outlier detectionmethodrdquo in Proceedings of the Eighth Australasian DataMining Conference vol 101 pp 93ndash98 Melbourne AustraliaDecember 2009

[11] L D Weiss and M K Sparrow ldquoLicense to steal how fraudbleeds Americarsquos health care systemrdquo Journal of Public HealthPolicy vol 22 no 3 pp 361ndash363 2001

[12] P Travaille RMMuller D(ornton and J VanHillegersbergldquoElectronic fraud detection in the US Medicaid healthcareprogram lessons learned from other industriesrdquo in Proceedingsof the 17th Americas Conference on Information Systems(AMCIS) pp 1ndash11 Detroit Michigan August 2011

[13] A Abdallah M A Maarof and A Zainal ldquoFraud detectionsystem a surveyrdquo Journal of Network and Computer Appli-cations vol 68 pp 90ndash113 2016

[14] A K I Hassan and A Abraham ldquoComputational intelligencemodels for insurance fraud detection a review of a decade ofresearchrdquo Journal of Network and Innovative Computingvol 1 pp 341ndash347 2013

[15] E Kirkos C Spathis and Y Manolopoulos ldquoData Miningtechniques for the detection of fraudulent financial state-mentsrdquo Expert Systems with Applications vol 32 no 4pp 995ndash1003 2007

[16] H Joudaki A Rashidian B Minaei-Bidgoli et al ldquoUsing datamining to detect health care fraud and abuse a review ofliteraturerdquo Global Journal of Health Science vol 7 no 1pp 194ndash202 2015

[17] V Rawte and G Anuradha ldquoFraud detection in health in-surance using data mining techniquesrdquo in Proceedings of the2015 International Conference on Communication In-formation amp Computing Technology (ICCICT) pp 1ndash5Mumbai India January 2015

[18] J Li K-Y Huang J Jin and J Shi ldquoA survey on statisticalmethods for health care fraud detectionrdquo Health CareManagement Science vol 11 no 3 pp 275ndash287 2008

[19] Q Liu and M Vasarhelyi ldquoHealthcare fraud detection asurvey and a clustering model incorporating geo-locationinformationrdquo in Proceedings of the 29th World ContinuousAuditing and Reporting Symposium Brisbane AustraliaNovember 2013

[20] T Ekin F Leva F Ruggeri and R Soyer ldquoApplication ofBayesian methods in detection of healthcare fraudrdquo ChemicalEngineering Transactions vol 33 pp 151ndash156 2013

[21] Homemdash(e NHCAA httpswwwnhcaaorg[22] S Viaene R A Derrig and G Dedene ldquoA case study of

applying boosting naive Bayes to claim fraud diagnosisrdquo IEEETransactions on Knowledge and Data Engineering vol 16no 5 pp 612ndash620 2004

[23] Y Singh and A S Chauhan ldquoNeural networks in dataminingrdquo Journal of Feoretical and Applied InformationTechnology vol 5 no 1 pp 37ndash42 2009

[24] D Tomar and S Agarwal ldquoA survey on data mining ap-proaches for healthcarerdquo International Journal of Bio-Scienceand Bio-Technology vol 5 no 5 pp 241ndash266 2013

[25] P Vamplew A Stranieri K-L Ong P Christen andP J Kennedy ldquoData mining and analytics 2011rdquo in Pro-ceedings of the Ninth Australasian Data Mining Conference(AusDMrsquoA) Australian Computer Society Ballarat AustraliaDecember 2011

[26] K S Ng Y Shan D W Murray et al ldquoDetecting non-compliant consumers in spatio-temporal health data a casestudy from medicare Australiardquo in Proceedings of the 2010IEEE International Conference on Data Mining Workshopspp 613ndash622 Sydney Australia December 2010

[27] J F Roddick J Li P Christen and P J Kennedy ldquoDatamining and analytics 2008rdquo in Proceedings of the 7th Aus-tralasian Data Mining Conference (AusDM 2008) vol 87pp 105ndash110 Glenelg South Australia November 2008

[28] C Watrin R Struffert and R Ullmann ldquoBenfordrsquos Law aninstrument for selecting tax audit targetsrdquo Review of Man-agerial Science vol 2 no 3 pp 219ndash237 2008

[29] F Lu and J E Boritz ldquoDetecting fraud in health insurancedata learning to model incomplete Benfordrsquos Law distribu-tionsrdquo in Machine Learning J Gama R CamachoP B Brazdil A M Jorge and L Torgo Eds pp 633ndash640Springer Berlin Heidelberg 2005

18 Journal of Engineering

[30] P Ortega C J Figueroa and G A Ruz ldquoA medical claimfraudabuse detection system based on data mining a casestudy in Chilerdquo DMIN vol 6 pp 26ndash29 2006

[31] T Back J M De Graaf J N Kok and W A Kosters Feoryof Genetic Algorithms World Scientific Publishing RiverEdge NJ USA 2001

[32] M Melanie An Introduction to Genetic Algorithms (e MITPress Cambridge MA USA 1st edition 1998

[33] D Goldberg Genetic Algorithms in Optimization Search andMachine Learning Addison-Wesley Reading MA USA1989

[34] J Wroblewski ldquo(eoretical foundations of order-based ge-netic algorithmsrdquo Fundamental Informaticae vol 28 no 3-4pp 423ndash430 1996

[35] J H Holland Adaptation in Natural and Artificial SystemsAn Introductory Analysis with Applications to Biology Con-trol and Artificial Intelligence MIT Press Cambridge MAUSA 1st edition 1992

[36] V N Vapnik Fe Nature of Statistical Learning FeorySpringer New York NY USA 2nd edition 2000

[37] J Salomon Support Vector Machines for Phoneme Classifi-cation University of Edinburgh Edinburgh UK 2001

[38] J Platt Sequential Minimal Optimization A Fast Algorithmfor Training Support Vector Machines Microsoft ResearchRedmond WA USA 1998

[39] J Platt ldquoUsing analytic QP and sparseness to speed training ofsupport vector machinesrdquo in Proceedings of the Advances inNeural Information Processing Systems Cambridge MAUSA 1999

[40] C-W Hsu C-C Chang and C-J Lin A Practical Guide toSupport Vector Classification Data Science Association Tai-pei Taiwan 2003

[41] I Guyon and A Elisseeff ldquoAn introduction to variable andfeature selectionrdquo Journal of Machine Learning Researchvol 3 pp 1157ndash1182 2003

[42] D Dvorski Installing Configuring and Developing withXAMPP Ski Canada Magazine Toronto Canada 2007

[43] MATLAB Classification Learner App MATLAB Version 2019aMathworks Computer Software Company Natick MS USA2019 httpwwwmathworkscomhelpstatsclassification-learner-apphtml

Journal of Engineering 19

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 15: DecisionSupportSystem(DSS)forFraudDetectioninHealth ...downloads.hindawi.com/journals/je/2019/1432597.pdf · ResearchArticle DecisionSupportSystem(DSS)forFraudDetectioninHealth InsuranceClaimsUsingGeneticSupportVector

Moreover some illicit service providers claim to haverendered service to insurance subscribers on costly servicesinstead of providing more affordable ones Claims preparedon expensive service rendered to insurance subscribersrepresent 8 of the fraudulent claims detected on the totalsample dataset Furthermore 31 of service procedure thatshould be considered an integral part of a single procedureknown as the unbundle claims contributed to the fraudulentclaims of the set of claims dataset used as the test data Due tothe insecure process for quality delivery of healthcare ser-vice insurance subscribers are also contributing to thefraudulent type of claims by loaning their ID cards to familymembers of the third party who pretend to be owners andrequest for the HIS benefits in the healthcare sector Du-plicated claims as part of the fraudulent act recorded theminimum rate of 05 of contribution to fraudulent claimsin the whole sample dataset

As observed in Table 6 the cost of the claims bill in-creases proportionally with an increase in the sample size ofthe claims bill (is is consistent with an increase infraudulent claims as sample size increases From Table 6 wecan see the various costs for each raw record (R) of sampleclaim dataset Valid claims bill after processing dataset thevariation in the claims bill (RndashV) and their percentagerepresentation as well are illustrated in Table 6 (ere is a27 financial loss of the total submitted claim bills to in-surance carriers(is loss is the highest rate of loss within the750 datasets of submitted claims

Summary of results and comparison with other machinelearning algorithms such as decision trees and NaıvendashBayesis presented in Table 7

(e MATLAB Classification Learner App [43] waschosen to validate the results obtained above It enables easeof comparison with the different methods of classification

Figure 18 Algorithmic runs on 500-claim dataset

Journal of Engineering 15

algorithms implemented (e data used for the GSVM weresubsequently used in the Classification Learner App asshown below

Figures 17 and 18 show the classification learner appwith the various implemented algorithms and corre-sponding accuracies in MATLAB technical computinglanguage environment and the results obtained using the500-claim dataset respectively Figures 19 and 20 depict thesubsequent results when the 750- and 1000-claim datasetswere utilized for the algorithmic runs and reproduciblecomparison respectively (e summarized results and ac-curacies are illustrated in Table 7 (e summarized results inTable 7 portray the effectiveness of our proposed approach ofusing the genetic support vector machines (GSVMs) forfraud detection of insurance claims From the result it isevident that GSVM achieves a higher level of accuracycompared to decision trees and NaıvendashBayes

5 Conclusions and Recommendations

(is work aimed at developing a novel fraud detectionmodel for insurance claims processing based on geneticsupport vector machines which hybridizes and draws onthe strengths of both genetic algorithms and supportvector machines (e GSVM has been investigated andapplied in the development of HICFDS (is paper usedGSVM for detection of anomalies and classification ofhealth insurance claims into legitimate and fraudulentclaims SVMs have been considered preferable to otherclassification techniques due to several advantages (eyenable separation (classification) of claims into legitimateand fraudulent using the soft margin thus accommodatingupdates in the generalization performance of HICFDSWith other notable advantages it has a nonlinear dividing

Figure 19 Algorithmic runs on 750-claim dataset

16 Journal of Engineering

hyperplane which prevails over the discrimination withinthe dataset (e generalization ability of any newly arriveddata for classification was considered over other classifi-cation techniques

(us the fraud detection system provides a combinationof two computational intelligence schemes and achieveshigher fraud detection accuracy (e average classificationaccuracies achieved by the SVCs are 8067 8122 and8791 which show the performance capability of the SVCsmodel(ese classification accuracies are obtained due to thecareful selection of the features for training and developingthe model as well as fine-tuning the SVCsrsquo parameters usingtheV-fold cross-validation approach(ese results are muchbetter than those obtained using decision trees andNaıvendashBayes

(e average sample dataset testing results for theproposed SVCs vary due to the nature of the claims dataset

used (is is noted in the cluster of the claims dataset(MDC specialty) When the sample dataset is muchskewed to one MDC specialty (eg OPDC) the perfor-mance of the SVCs could tune to one classifier especiallythe linear SVM as compared to others Hence the be-haviour of the dataset has a significant impact on clas-sification results

Based on this work the developed GSVM model wastested and validated using HIC data (e study sought toobtain the best performing classifier for analyzing the healthinsurance claims datasets for fraud (e RBF kernel wasadjudged the best with an average accuracy rate of 8791(e RBF kernel is therefore recommended

Figure 20 Algorithmic runs on the 1000-claim dataset

Journal of Engineering 17

Data Availability

(e data used in this study are available upon request (edata can be uploaded when required

Conflicts of Interest

(e authors declare that they have no conflicts of interest

Acknowledgments

(e authors of this paper wish to acknowledge the CarnegieCorporation of New York through the University of Ghanaunder the UG-Carnegie Next Generation of Academics inAfrica project for organizing Write Shops that led to thetimely completion of this paper

Supplementary Materials

(e material consists of MS Excel file data collected fromsome NHIS-approved hospitals in Ghana concerning in-surance claims Its insurance claims dataset used for testingand implementation (Supplementary Materials)

References

[1] G of Ghana National Health Insurance Act Act 650 2003Ghana 2003

[2] Capitation National Health Insurance Scheme 2012 httpwwwnhisgovghcapitationaspx

[3] ICD-10 Version2016 httpappswhointclassificationsicd10browse2016en

[4] T Olson Examining the Transitional Impact of ICD-10 onHealthcare Fraud Detection College of Saint BenedictSaintJohnrsquos University Collegeville MN USA 2015

[5] News Ghana NHIS Manager Arrested for Fraud | NewsGhana News Ghana Accra Ghana 2014 httpswwwnewsghanacomghnhis-manager-arrested-for-fraud

[6] BioClaim Files httpwwwbioclaimcomFraud-Files[7] Graphics Online Ghana news Dr Ametewee Defrauds NHIA

of GHcent415000mdashGraphic Online Graphics Online AccraGhana 2015 httpwwwgraphiccomghnewsgeneral-newsdr-ametewee-defrauds-nhia-of-gh-415-000html

[8] W-S Yang and S-Y Hwang ldquoA process-mining frameworkfor the detection of healthcare fraud and abuserdquo ExpertSystems with Applications vol 31 no 1 pp 56ndash68 2006

[9] G C van Capelleveen Outlier Based Predictors for HealthInsurance Fraud Detection within US Medicaid University ofTwente Enschede Netherlands 2013

[10] Y Shan D W Murray and A Sutinen ldquoDiscovering in-appropriate billings with local density-based outlier detectionmethodrdquo in Proceedings of the Eighth Australasian DataMining Conference vol 101 pp 93ndash98 Melbourne AustraliaDecember 2009

[11] L D Weiss and M K Sparrow ldquoLicense to steal how fraudbleeds Americarsquos health care systemrdquo Journal of Public HealthPolicy vol 22 no 3 pp 361ndash363 2001

[12] P Travaille RMMuller D(ornton and J VanHillegersbergldquoElectronic fraud detection in the US Medicaid healthcareprogram lessons learned from other industriesrdquo in Proceedingsof the 17th Americas Conference on Information Systems(AMCIS) pp 1ndash11 Detroit Michigan August 2011

[13] A Abdallah M A Maarof and A Zainal ldquoFraud detectionsystem a surveyrdquo Journal of Network and Computer Appli-cations vol 68 pp 90ndash113 2016

[14] A K I Hassan and A Abraham ldquoComputational intelligencemodels for insurance fraud detection a review of a decade ofresearchrdquo Journal of Network and Innovative Computingvol 1 pp 341ndash347 2013

[15] E Kirkos C Spathis and Y Manolopoulos ldquoData Miningtechniques for the detection of fraudulent financial state-mentsrdquo Expert Systems with Applications vol 32 no 4pp 995ndash1003 2007

[16] H Joudaki A Rashidian B Minaei-Bidgoli et al ldquoUsing datamining to detect health care fraud and abuse a review ofliteraturerdquo Global Journal of Health Science vol 7 no 1pp 194ndash202 2015

[17] V Rawte and G Anuradha ldquoFraud detection in health in-surance using data mining techniquesrdquo in Proceedings of the2015 International Conference on Communication In-formation amp Computing Technology (ICCICT) pp 1ndash5Mumbai India January 2015

[18] J Li K-Y Huang J Jin and J Shi ldquoA survey on statisticalmethods for health care fraud detectionrdquo Health CareManagement Science vol 11 no 3 pp 275ndash287 2008

[19] Q Liu and M Vasarhelyi ldquoHealthcare fraud detection asurvey and a clustering model incorporating geo-locationinformationrdquo in Proceedings of the 29th World ContinuousAuditing and Reporting Symposium Brisbane AustraliaNovember 2013

[20] T Ekin F Leva F Ruggeri and R Soyer ldquoApplication ofBayesian methods in detection of healthcare fraudrdquo ChemicalEngineering Transactions vol 33 pp 151ndash156 2013

[21] Homemdash(e NHCAA httpswwwnhcaaorg[22] S Viaene R A Derrig and G Dedene ldquoA case study of

applying boosting naive Bayes to claim fraud diagnosisrdquo IEEETransactions on Knowledge and Data Engineering vol 16no 5 pp 612ndash620 2004

[23] Y Singh and A S Chauhan ldquoNeural networks in dataminingrdquo Journal of Feoretical and Applied InformationTechnology vol 5 no 1 pp 37ndash42 2009

[24] D Tomar and S Agarwal ldquoA survey on data mining ap-proaches for healthcarerdquo International Journal of Bio-Scienceand Bio-Technology vol 5 no 5 pp 241ndash266 2013

[25] P Vamplew A Stranieri K-L Ong P Christen andP J Kennedy ldquoData mining and analytics 2011rdquo in Pro-ceedings of the Ninth Australasian Data Mining Conference(AusDMrsquoA) Australian Computer Society Ballarat AustraliaDecember 2011

[26] K S Ng Y Shan D W Murray et al ldquoDetecting non-compliant consumers in spatio-temporal health data a casestudy from medicare Australiardquo in Proceedings of the 2010IEEE International Conference on Data Mining Workshopspp 613ndash622 Sydney Australia December 2010

[27] J F Roddick J Li P Christen and P J Kennedy ldquoDatamining and analytics 2008rdquo in Proceedings of the 7th Aus-tralasian Data Mining Conference (AusDM 2008) vol 87pp 105ndash110 Glenelg South Australia November 2008

[28] C Watrin R Struffert and R Ullmann ldquoBenfordrsquos Law aninstrument for selecting tax audit targetsrdquo Review of Man-agerial Science vol 2 no 3 pp 219ndash237 2008

[29] F Lu and J E Boritz ldquoDetecting fraud in health insurancedata learning to model incomplete Benfordrsquos Law distribu-tionsrdquo in Machine Learning J Gama R CamachoP B Brazdil A M Jorge and L Torgo Eds pp 633ndash640Springer Berlin Heidelberg 2005

18 Journal of Engineering

[30] P Ortega C J Figueroa and G A Ruz ldquoA medical claimfraudabuse detection system based on data mining a casestudy in Chilerdquo DMIN vol 6 pp 26ndash29 2006

[31] T Back J M De Graaf J N Kok and W A Kosters Feoryof Genetic Algorithms World Scientific Publishing RiverEdge NJ USA 2001

[32] M Melanie An Introduction to Genetic Algorithms (e MITPress Cambridge MA USA 1st edition 1998

[33] D Goldberg Genetic Algorithms in Optimization Search andMachine Learning Addison-Wesley Reading MA USA1989

[34] J Wroblewski ldquo(eoretical foundations of order-based ge-netic algorithmsrdquo Fundamental Informaticae vol 28 no 3-4pp 423ndash430 1996

[35] J H Holland Adaptation in Natural and Artificial SystemsAn Introductory Analysis with Applications to Biology Con-trol and Artificial Intelligence MIT Press Cambridge MAUSA 1st edition 1992

[36] V N Vapnik Fe Nature of Statistical Learning FeorySpringer New York NY USA 2nd edition 2000

[37] J Salomon Support Vector Machines for Phoneme Classifi-cation University of Edinburgh Edinburgh UK 2001

[38] J Platt Sequential Minimal Optimization A Fast Algorithmfor Training Support Vector Machines Microsoft ResearchRedmond WA USA 1998

[39] J Platt ldquoUsing analytic QP and sparseness to speed training ofsupport vector machinesrdquo in Proceedings of the Advances inNeural Information Processing Systems Cambridge MAUSA 1999

[40] C-W Hsu C-C Chang and C-J Lin A Practical Guide toSupport Vector Classification Data Science Association Tai-pei Taiwan 2003

[41] I Guyon and A Elisseeff ldquoAn introduction to variable andfeature selectionrdquo Journal of Machine Learning Researchvol 3 pp 1157ndash1182 2003

[42] D Dvorski Installing Configuring and Developing withXAMPP Ski Canada Magazine Toronto Canada 2007

[43] MATLAB Classification Learner App MATLAB Version 2019aMathworks Computer Software Company Natick MS USA2019 httpwwwmathworkscomhelpstatsclassification-learner-apphtml

Journal of Engineering 19

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 16: DecisionSupportSystem(DSS)forFraudDetectioninHealth ...downloads.hindawi.com/journals/je/2019/1432597.pdf · ResearchArticle DecisionSupportSystem(DSS)forFraudDetectioninHealth InsuranceClaimsUsingGeneticSupportVector

algorithms implemented (e data used for the GSVM weresubsequently used in the Classification Learner App asshown below

Figures 17 and 18 show the classification learner appwith the various implemented algorithms and corre-sponding accuracies in MATLAB technical computinglanguage environment and the results obtained using the500-claim dataset respectively Figures 19 and 20 depict thesubsequent results when the 750- and 1000-claim datasetswere utilized for the algorithmic runs and reproduciblecomparison respectively (e summarized results and ac-curacies are illustrated in Table 7 (e summarized results inTable 7 portray the effectiveness of our proposed approach ofusing the genetic support vector machines (GSVMs) forfraud detection of insurance claims From the result it isevident that GSVM achieves a higher level of accuracycompared to decision trees and NaıvendashBayes

5 Conclusions and Recommendations

(is work aimed at developing a novel fraud detectionmodel for insurance claims processing based on geneticsupport vector machines which hybridizes and draws onthe strengths of both genetic algorithms and supportvector machines (e GSVM has been investigated andapplied in the development of HICFDS (is paper usedGSVM for detection of anomalies and classification ofhealth insurance claims into legitimate and fraudulentclaims SVMs have been considered preferable to otherclassification techniques due to several advantages (eyenable separation (classification) of claims into legitimateand fraudulent using the soft margin thus accommodatingupdates in the generalization performance of HICFDSWith other notable advantages it has a nonlinear dividing

Figure 19 Algorithmic runs on 750-claim dataset

16 Journal of Engineering

hyperplane which prevails over the discrimination withinthe dataset (e generalization ability of any newly arriveddata for classification was considered over other classifi-cation techniques

(us the fraud detection system provides a combinationof two computational intelligence schemes and achieveshigher fraud detection accuracy (e average classificationaccuracies achieved by the SVCs are 8067 8122 and8791 which show the performance capability of the SVCsmodel(ese classification accuracies are obtained due to thecareful selection of the features for training and developingthe model as well as fine-tuning the SVCsrsquo parameters usingtheV-fold cross-validation approach(ese results are muchbetter than those obtained using decision trees andNaıvendashBayes

(e average sample dataset testing results for theproposed SVCs vary due to the nature of the claims dataset

used (is is noted in the cluster of the claims dataset(MDC specialty) When the sample dataset is muchskewed to one MDC specialty (eg OPDC) the perfor-mance of the SVCs could tune to one classifier especiallythe linear SVM as compared to others Hence the be-haviour of the dataset has a significant impact on clas-sification results

Based on this work the developed GSVM model wastested and validated using HIC data (e study sought toobtain the best performing classifier for analyzing the healthinsurance claims datasets for fraud (e RBF kernel wasadjudged the best with an average accuracy rate of 8791(e RBF kernel is therefore recommended

Figure 20 Algorithmic runs on the 1000-claim dataset

Journal of Engineering 17

Data Availability

(e data used in this study are available upon request (edata can be uploaded when required

Conflicts of Interest

(e authors declare that they have no conflicts of interest

Acknowledgments

(e authors of this paper wish to acknowledge the CarnegieCorporation of New York through the University of Ghanaunder the UG-Carnegie Next Generation of Academics inAfrica project for organizing Write Shops that led to thetimely completion of this paper

Supplementary Materials

(e material consists of MS Excel file data collected fromsome NHIS-approved hospitals in Ghana concerning in-surance claims Its insurance claims dataset used for testingand implementation (Supplementary Materials)

References

[1] G of Ghana National Health Insurance Act Act 650 2003Ghana 2003

[2] Capitation National Health Insurance Scheme 2012 httpwwwnhisgovghcapitationaspx

[3] ICD-10 Version2016 httpappswhointclassificationsicd10browse2016en

[4] T Olson Examining the Transitional Impact of ICD-10 onHealthcare Fraud Detection College of Saint BenedictSaintJohnrsquos University Collegeville MN USA 2015

[5] News Ghana NHIS Manager Arrested for Fraud | NewsGhana News Ghana Accra Ghana 2014 httpswwwnewsghanacomghnhis-manager-arrested-for-fraud

[6] BioClaim Files httpwwwbioclaimcomFraud-Files[7] Graphics Online Ghana news Dr Ametewee Defrauds NHIA

of GHcent415000mdashGraphic Online Graphics Online AccraGhana 2015 httpwwwgraphiccomghnewsgeneral-newsdr-ametewee-defrauds-nhia-of-gh-415-000html

[8] W-S Yang and S-Y Hwang ldquoA process-mining frameworkfor the detection of healthcare fraud and abuserdquo ExpertSystems with Applications vol 31 no 1 pp 56ndash68 2006

[9] G C van Capelleveen Outlier Based Predictors for HealthInsurance Fraud Detection within US Medicaid University ofTwente Enschede Netherlands 2013

[10] Y Shan D W Murray and A Sutinen ldquoDiscovering in-appropriate billings with local density-based outlier detectionmethodrdquo in Proceedings of the Eighth Australasian DataMining Conference vol 101 pp 93ndash98 Melbourne AustraliaDecember 2009

[11] L D Weiss and M K Sparrow ldquoLicense to steal how fraudbleeds Americarsquos health care systemrdquo Journal of Public HealthPolicy vol 22 no 3 pp 361ndash363 2001

[12] P Travaille RMMuller D(ornton and J VanHillegersbergldquoElectronic fraud detection in the US Medicaid healthcareprogram lessons learned from other industriesrdquo in Proceedingsof the 17th Americas Conference on Information Systems(AMCIS) pp 1ndash11 Detroit Michigan August 2011

[13] A Abdallah M A Maarof and A Zainal ldquoFraud detectionsystem a surveyrdquo Journal of Network and Computer Appli-cations vol 68 pp 90ndash113 2016

[14] A K I Hassan and A Abraham ldquoComputational intelligencemodels for insurance fraud detection a review of a decade ofresearchrdquo Journal of Network and Innovative Computingvol 1 pp 341ndash347 2013

[15] E Kirkos C Spathis and Y Manolopoulos ldquoData Miningtechniques for the detection of fraudulent financial state-mentsrdquo Expert Systems with Applications vol 32 no 4pp 995ndash1003 2007

[16] H Joudaki A Rashidian B Minaei-Bidgoli et al ldquoUsing datamining to detect health care fraud and abuse a review ofliteraturerdquo Global Journal of Health Science vol 7 no 1pp 194ndash202 2015

[17] V Rawte and G Anuradha ldquoFraud detection in health in-surance using data mining techniquesrdquo in Proceedings of the2015 International Conference on Communication In-formation amp Computing Technology (ICCICT) pp 1ndash5Mumbai India January 2015

[18] J Li K-Y Huang J Jin and J Shi ldquoA survey on statisticalmethods for health care fraud detectionrdquo Health CareManagement Science vol 11 no 3 pp 275ndash287 2008

[19] Q Liu and M Vasarhelyi ldquoHealthcare fraud detection asurvey and a clustering model incorporating geo-locationinformationrdquo in Proceedings of the 29th World ContinuousAuditing and Reporting Symposium Brisbane AustraliaNovember 2013

[20] T Ekin F Leva F Ruggeri and R Soyer ldquoApplication ofBayesian methods in detection of healthcare fraudrdquo ChemicalEngineering Transactions vol 33 pp 151ndash156 2013

[21] Homemdash(e NHCAA httpswwwnhcaaorg[22] S Viaene R A Derrig and G Dedene ldquoA case study of

applying boosting naive Bayes to claim fraud diagnosisrdquo IEEETransactions on Knowledge and Data Engineering vol 16no 5 pp 612ndash620 2004

[23] Y Singh and A S Chauhan ldquoNeural networks in dataminingrdquo Journal of Feoretical and Applied InformationTechnology vol 5 no 1 pp 37ndash42 2009

[24] D Tomar and S Agarwal ldquoA survey on data mining ap-proaches for healthcarerdquo International Journal of Bio-Scienceand Bio-Technology vol 5 no 5 pp 241ndash266 2013

[25] P Vamplew A Stranieri K-L Ong P Christen andP J Kennedy ldquoData mining and analytics 2011rdquo in Pro-ceedings of the Ninth Australasian Data Mining Conference(AusDMrsquoA) Australian Computer Society Ballarat AustraliaDecember 2011

[26] K S Ng Y Shan D W Murray et al ldquoDetecting non-compliant consumers in spatio-temporal health data a casestudy from medicare Australiardquo in Proceedings of the 2010IEEE International Conference on Data Mining Workshopspp 613ndash622 Sydney Australia December 2010

[27] J F Roddick J Li P Christen and P J Kennedy ldquoDatamining and analytics 2008rdquo in Proceedings of the 7th Aus-tralasian Data Mining Conference (AusDM 2008) vol 87pp 105ndash110 Glenelg South Australia November 2008

[28] C Watrin R Struffert and R Ullmann ldquoBenfordrsquos Law aninstrument for selecting tax audit targetsrdquo Review of Man-agerial Science vol 2 no 3 pp 219ndash237 2008

[29] F Lu and J E Boritz ldquoDetecting fraud in health insurancedata learning to model incomplete Benfordrsquos Law distribu-tionsrdquo in Machine Learning J Gama R CamachoP B Brazdil A M Jorge and L Torgo Eds pp 633ndash640Springer Berlin Heidelberg 2005

18 Journal of Engineering

[30] P Ortega C J Figueroa and G A Ruz ldquoA medical claimfraudabuse detection system based on data mining a casestudy in Chilerdquo DMIN vol 6 pp 26ndash29 2006

[31] T Back J M De Graaf J N Kok and W A Kosters Feoryof Genetic Algorithms World Scientific Publishing RiverEdge NJ USA 2001

[32] M Melanie An Introduction to Genetic Algorithms (e MITPress Cambridge MA USA 1st edition 1998

[33] D Goldberg Genetic Algorithms in Optimization Search andMachine Learning Addison-Wesley Reading MA USA1989

[34] J Wroblewski ldquo(eoretical foundations of order-based ge-netic algorithmsrdquo Fundamental Informaticae vol 28 no 3-4pp 423ndash430 1996

[35] J H Holland Adaptation in Natural and Artificial SystemsAn Introductory Analysis with Applications to Biology Con-trol and Artificial Intelligence MIT Press Cambridge MAUSA 1st edition 1992

[36] V N Vapnik Fe Nature of Statistical Learning FeorySpringer New York NY USA 2nd edition 2000

[37] J Salomon Support Vector Machines for Phoneme Classifi-cation University of Edinburgh Edinburgh UK 2001

[38] J Platt Sequential Minimal Optimization A Fast Algorithmfor Training Support Vector Machines Microsoft ResearchRedmond WA USA 1998

[39] J Platt ldquoUsing analytic QP and sparseness to speed training ofsupport vector machinesrdquo in Proceedings of the Advances inNeural Information Processing Systems Cambridge MAUSA 1999

[40] C-W Hsu C-C Chang and C-J Lin A Practical Guide toSupport Vector Classification Data Science Association Tai-pei Taiwan 2003

[41] I Guyon and A Elisseeff ldquoAn introduction to variable andfeature selectionrdquo Journal of Machine Learning Researchvol 3 pp 1157ndash1182 2003

[42] D Dvorski Installing Configuring and Developing withXAMPP Ski Canada Magazine Toronto Canada 2007

[43] MATLAB Classification Learner App MATLAB Version 2019aMathworks Computer Software Company Natick MS USA2019 httpwwwmathworkscomhelpstatsclassification-learner-apphtml

Journal of Engineering 19

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 17: DecisionSupportSystem(DSS)forFraudDetectioninHealth ...downloads.hindawi.com/journals/je/2019/1432597.pdf · ResearchArticle DecisionSupportSystem(DSS)forFraudDetectioninHealth InsuranceClaimsUsingGeneticSupportVector

hyperplane which prevails over the discrimination withinthe dataset (e generalization ability of any newly arriveddata for classification was considered over other classifi-cation techniques

(us the fraud detection system provides a combinationof two computational intelligence schemes and achieveshigher fraud detection accuracy (e average classificationaccuracies achieved by the SVCs are 8067 8122 and8791 which show the performance capability of the SVCsmodel(ese classification accuracies are obtained due to thecareful selection of the features for training and developingthe model as well as fine-tuning the SVCsrsquo parameters usingtheV-fold cross-validation approach(ese results are muchbetter than those obtained using decision trees andNaıvendashBayes

(e average sample dataset testing results for theproposed SVCs vary due to the nature of the claims dataset

used (is is noted in the cluster of the claims dataset(MDC specialty) When the sample dataset is muchskewed to one MDC specialty (eg OPDC) the perfor-mance of the SVCs could tune to one classifier especiallythe linear SVM as compared to others Hence the be-haviour of the dataset has a significant impact on clas-sification results

Based on this work the developed GSVM model wastested and validated using HIC data (e study sought toobtain the best performing classifier for analyzing the healthinsurance claims datasets for fraud (e RBF kernel wasadjudged the best with an average accuracy rate of 8791(e RBF kernel is therefore recommended

Figure 20 Algorithmic runs on the 1000-claim dataset

Journal of Engineering 17

Data Availability

(e data used in this study are available upon request (edata can be uploaded when required

Conflicts of Interest

(e authors declare that they have no conflicts of interest

Acknowledgments

(e authors of this paper wish to acknowledge the CarnegieCorporation of New York through the University of Ghanaunder the UG-Carnegie Next Generation of Academics inAfrica project for organizing Write Shops that led to thetimely completion of this paper

Supplementary Materials

(e material consists of MS Excel file data collected fromsome NHIS-approved hospitals in Ghana concerning in-surance claims Its insurance claims dataset used for testingand implementation (Supplementary Materials)

References

[1] Government of Ghana, National Health Insurance Act, Act 650, Ghana, 2003.
[2] Capitation, National Health Insurance Scheme, 2012, http://www.nhis.gov.gh/capitation.aspx.
[3] ICD-10 Version: 2016, http://apps.who.int/classifications/icd10/browse/2016/en.
[4] T. Olson, Examining the Transitional Impact of ICD-10 on Healthcare Fraud Detection, College of Saint Benedict/Saint John's University, Collegeville, MN, USA, 2015.
[5] News Ghana, "NHIS manager arrested for fraud," News Ghana, Accra, Ghana, 2014, https://www.newsghana.com.gh/nhis-manager-arrested-for-fraud.
[6] BioClaim, Fraud Files, http://www.bioclaim.com/Fraud-Files.
[7] Graphic Online, "Dr Ametewee defrauds NHIA of GH₵415,000," Graphic Online, Accra, Ghana, 2015, http://www.graphic.com.gh/news/general-news/dr-ametewee-defrauds-nhia-of-gh-415-000.html.
[8] W.-S. Yang and S.-Y. Hwang, "A process-mining framework for the detection of healthcare fraud and abuse," Expert Systems with Applications, vol. 31, no. 1, pp. 56–68, 2006.
[9] G. C. van Capelleveen, Outlier Based Predictors for Health Insurance Fraud Detection within US Medicaid, University of Twente, Enschede, Netherlands, 2013.
[10] Y. Shan, D. W. Murray, and A. Sutinen, "Discovering inappropriate billings with local density-based outlier detection method," in Proceedings of the Eighth Australasian Data Mining Conference, vol. 101, pp. 93–98, Melbourne, Australia, December 2009.
[11] L. D. Weiss and M. K. Sparrow, "License to steal: how fraud bleeds America's health care system," Journal of Public Health Policy, vol. 22, no. 3, pp. 361–363, 2001.
[12] P. Travaille, R. M. Muller, D. Thornton, and J. van Hillegersberg, "Electronic fraud detection in the US Medicaid healthcare program: lessons learned from other industries," in Proceedings of the 17th Americas Conference on Information Systems (AMCIS), pp. 1–11, Detroit, MI, USA, August 2011.
[13] A. Abdallah, M. A. Maarof, and A. Zainal, "Fraud detection system: a survey," Journal of Network and Computer Applications, vol. 68, pp. 90–113, 2016.
[14] A. K. I. Hassan and A. Abraham, "Computational intelligence models for insurance fraud detection: a review of a decade of research," Journal of Network and Innovative Computing, vol. 1, pp. 341–347, 2013.
[15] E. Kirkos, C. Spathis, and Y. Manolopoulos, "Data mining techniques for the detection of fraudulent financial statements," Expert Systems with Applications, vol. 32, no. 4, pp. 995–1003, 2007.
[16] H. Joudaki, A. Rashidian, B. Minaei-Bidgoli et al., "Using data mining to detect health care fraud and abuse: a review of literature," Global Journal of Health Science, vol. 7, no. 1, pp. 194–202, 2015.
[17] V. Rawte and G. Anuradha, "Fraud detection in health insurance using data mining techniques," in Proceedings of the 2015 International Conference on Communication, Information & Computing Technology (ICCICT), pp. 1–5, Mumbai, India, January 2015.
[18] J. Li, K.-Y. Huang, J. Jin, and J. Shi, "A survey on statistical methods for health care fraud detection," Health Care Management Science, vol. 11, no. 3, pp. 275–287, 2008.
[19] Q. Liu and M. Vasarhelyi, "Healthcare fraud detection: a survey and a clustering model incorporating geo-location information," in Proceedings of the 29th World Continuous Auditing and Reporting Symposium, Brisbane, Australia, November 2013.
[20] T. Ekin, F. Leva, F. Ruggeri, and R. Soyer, "Application of Bayesian methods in detection of healthcare fraud," Chemical Engineering Transactions, vol. 33, pp. 151–156, 2013.
[21] The NHCAA, https://www.nhcaa.org.
[22] S. Viaene, R. A. Derrig, and G. Dedene, "A case study of applying boosting naive Bayes to claim fraud diagnosis," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 5, pp. 612–620, 2004.
[23] Y. Singh and A. S. Chauhan, "Neural networks in data mining," Journal of Theoretical and Applied Information Technology, vol. 5, no. 1, pp. 37–42, 2009.
[24] D. Tomar and S. Agarwal, "A survey on data mining approaches for healthcare," International Journal of Bio-Science and Bio-Technology, vol. 5, no. 5, pp. 241–266, 2013.
[25] P. Vamplew, A. Stranieri, K.-L. Ong, P. Christen, and P. J. Kennedy, "Data mining and analytics 2011," in Proceedings of the Ninth Australasian Data Mining Conference (AusDM 2011), Australian Computer Society, Ballarat, Australia, December 2011.
[26] K. S. Ng, Y. Shan, D. W. Murray et al., "Detecting non-compliant consumers in spatio-temporal health data: a case study from Medicare Australia," in Proceedings of the 2010 IEEE International Conference on Data Mining Workshops, pp. 613–622, Sydney, Australia, December 2010.
[27] J. F. Roddick, J. Li, P. Christen, and P. J. Kennedy, "Data mining and analytics 2008," in Proceedings of the 7th Australasian Data Mining Conference (AusDM 2008), vol. 87, pp. 105–110, Glenelg, South Australia, November 2008.
[28] C. Watrin, R. Struffert, and R. Ullmann, "Benford's Law: an instrument for selecting tax audit targets," Review of Managerial Science, vol. 2, no. 3, pp. 219–237, 2008.
[29] F. Lu and J. E. Boritz, "Detecting fraud in health insurance data: learning to model incomplete Benford's Law distributions," in Machine Learning, J. Gama, R. Camacho, P. B. Brazdil, A. M. Jorge, and L. Torgo, Eds., pp. 633–640, Springer, Berlin, Heidelberg, 2005.


[30] P. Ortega, C. J. Figueroa, and G. A. Ruz, "A medical claim fraud/abuse detection system based on data mining: a case study in Chile," DMIN, vol. 6, pp. 26–29, 2006.
[31] T. Back, J. M. De Graaf, J. N. Kok, and W. A. Kosters, Theory of Genetic Algorithms, World Scientific Publishing, River Edge, NJ, USA, 2001.
[32] M. Mitchell, An Introduction to Genetic Algorithms, The MIT Press, Cambridge, MA, USA, 1st edition, 1998.
[33] D. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, Reading, MA, USA, 1989.
[34] J. Wroblewski, "Theoretical foundations of order-based genetic algorithms," Fundamenta Informaticae, vol. 28, no. 3-4, pp. 423–430, 1996.
[35] J. H. Holland, Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence, MIT Press, Cambridge, MA, USA, 1st edition, 1992.
[36] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 2nd edition, 2000.
[37] J. Salomon, Support Vector Machines for Phoneme Classification, University of Edinburgh, Edinburgh, UK, 2001.
[38] J. Platt, Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines, Microsoft Research, Redmond, WA, USA, 1998.
[39] J. Platt, "Using analytic QP and sparseness to speed training of support vector machines," in Proceedings of the Advances in Neural Information Processing Systems, Cambridge, MA, USA, 1999.
[40] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, A Practical Guide to Support Vector Classification, Data Science Association, Taipei, Taiwan, 2003.
[41] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.
[42] D. Dvorski, Installing, Configuring, and Developing with XAMPP, Ski Canada Magazine, Toronto, Canada, 2007.
[43] MATLAB Classification Learner App, MATLAB Version 2019a, MathWorks, Natick, MA, USA, 2019, http://www.mathworks.com/help/stats/classification-learner-app.html.
