
Engineering Applications of Artificial Intelligence 81 (2019) 193–199


An ensemble semi-supervised learning method for predicting defaults in social lending✩

Aleum Kim, Sung-Bae Cho ∗

Department of Computer Science, Yonsei University, Seoul, Republic of Korea

ARTICLE INFO

Keywords: Dempster–Shafer theory; Social lending; Default prediction; Label propagation; TSVM

ABSTRACT

Social lending takes place between peers, and since the investor bears direct losses when a borrower fails to repay, accurate default prediction for borrowers is important. The repayment outcome is known only after the end of the repayment period, so such labeled data are limited. However, social loans are matched online in real time, and large amounts of unlabeled data are being generated. In this paper, we propose a method that combines label propagation and the transductive support vector machine (TSVM) with Dempster–Shafer theory for accurate default prediction in social lending using unlabeled data. In order to train a large amount of data effectively, we ensemble semi-supervised learning methods with different characteristics. Label propagation assigns data with similar features to the same class, while TSVM pushes data with different features apart. The Dempster–Shafer fusion method allows accurate labeling by exploiting the merits of the two methods. Experiments are performed using the open dataset from Lending Club. The accuracy of the proposed method is improved by about 10% over that of a model using only labeled data, and more accurate labeling is achieved through the proposed ensemble method.

1. Introduction

Social lending is conducted online between peers (Zhao et al., 2017). The platform only connects borrowers with investors, and there is no arbitration institution, so the investor can easily bear the borrower's default risk (Li et al., 2016; Yan et al., 2015; Luo et al., 2011). Default prediction for social loans is therefore being actively studied. Previous studies used labeled data with statistical methods or supervised learning (Li et al., 2018; Wang et al., 2018; Zhou et al., 2018; Lin et al., 2017a; Ge et al., 2017; Jiang et al., 2017a; Polena and Regner, 2016a; Everett, 2015; Serrano-Cinca et al., 2015). However, whether a borrower repays is known only after the end of the repayment period, so such labeled data are limited. Meanwhile, a large amount of unlabeled data is being generated in social lending, where loans are matched online in real time (Kim and Cho, 2017). We therefore need to improve default prediction performance by utilizing the additional unlabeled data.

Social lending provides a variety of data, from personal information to credit records. Fig. 1 visualizes the first and second principal components obtained by applying principal component analysis (PCA) to the data from Lending Club Statistics. Since the data cluster regardless of class, data with similar characteristics may belong to different classes.

✩ No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.engappai.2019.02.014.
∗ Correspondence to: Department of Computer Science, Yonsei University, 50 Yonsei-ro, Seoul 120-749, Republic of Korea.

E-mail addresses: [email protected] (A. Kim), [email protected] (S.-B. Cho).

It is important to extract the features related to the class in order to improve the performance of default prediction.
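For reference, a view like Fig. 1 can be produced with a minimal PCA sketch along the following lines; the file name and the label column are placeholders and not the authors' exact preprocessing.

```python
# Minimal sketch: project Lending Club records onto the first two principal
# components and color them by repayment outcome (cf. Fig. 1).
# Assumptions: a CSV export "lending_club.csv" with a "loan_status" column;
# only numeric columns are used, and 200 samples are drawn at random.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("lending_club.csv").sample(200, random_state=0)
X = StandardScaler().fit_transform(df.select_dtypes("number").fillna(0))
pcs = PCA(n_components=2).fit_transform(X)

for status, marker in [("Fully Paid", "o"), ("Charged Off", "x")]:
    mask = (df["loan_status"] == status).to_numpy()
    plt.scatter(pcs[mask, 0], pcs[mask, 1], marker=marker, label=status)
plt.xlabel("PC1"); plt.ylabel("PC2"); plt.legend(); plt.show()
```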

In this paper, we propose an ensemble semi-supervised learning method that improves default prediction performance by considering the characteristics of social lending data. We apply semi-supervised learning, which strengthens the decision boundary by utilizing the large amount of unlabeled data in social lending. Label propagation learns that data with similar features share the same class, while the transductive support vector machine (TSVM) learns to separate data with different features across classes (Zhu, 2011). The Dempster–Shafer fusion method effectively determines whether unlabeled data correspond to a default based on the results of the two semi-supervised learning methods (Dempster, 1967; Sentz and Ferson, 2002; Shafer, 1976). By training a supervised classifier on the unlabeled data once classes are assigned, it becomes easier to extract features relevant to the class. To show the usefulness of the proposed method, we conduct experiments on social loan data in which more than two thirds of the data are unlabeled.

The remainder of the paper is organized as follows. Section 2 introduces related work on default prediction in social lending. Section 3 presents the core of the proposed method with a detailed description of its application. The experiments and result analysis using real social lending data are given in Section 4, and Section 5 concludes the paper.

https://doi.org/10.1016/j.engappai.2019.02.014
Received 30 December 2017; Received in revised form 19 February 2019; Accepted 21 February 2019
0952-1976/© 2019 Elsevier Ltd. All rights reserved.


Fig. 1. 2D plot of principal components for 200 random samples in the Lending Club data.

2. Related works

Recently, many researchers have studied default prediction in social lending. Table 1 summarizes work on predicting defaults in social lending in recent years.

Statistical methods have been widely used for predicting defaults at financial institutions. Most of the works in this category considered the causal relationship with the default state by identifying the main attributes through statistical analysis. Chen conducted default prediction using only the main attributes, taking into account the borrower's repayment process over time (Chen, 2017). Emekter et al. also analyzed the important features using logistic regression (Emekter et al., 2015). Li et al. analyzed companies' risk distributions (Li et al., 2016). Social lending, unlike financial institutions, offers a lot of data, so that additional information about borrowers can be collected.

In recent years, studies have been conducted to improve the performance of default prediction. Ge et al. proposed a method for predicting the number of obligations over a certain period using the borrower's social media account information (Ge et al., 2017). Bitvai and Cohn improved default prediction performance by including the frequency of specific words in the sentences of the loan application form (Bitvai and Cohn, 2015). Lin et al. proposed a credit risk assessment model using the Yooli dataset, from a P2P lending platform in China (Lin et al., 2017b). They used a nonparametric statistical method to identify the borrowers' demographic characteristics and extract the features that affect the default. Guo et al. proposed a data-driven investment decision framework for emerging markets (Guo et al., 2016). They designed an instance-based credit risk assessment model with logistic regression that can assess the return and risk of each individual loan. However, these works have the limitation of focusing on improving performance through empirical study rather than revealing the direct relationship between default prediction and the additional data.

Machine learning has been used to solve more complex problems such as debt behavior scoring and credit scoring as well as default prediction. Serrano-Cinca and Gutiérrez-Nieto developed a credit scoring system using a decision tree (Serrano-Cinca and Gutiérrez-Nieto, 2016). Wang et al. utilized the time-series characteristics of social lending and proposed an ensemble method based on random forests to predict defaults (Wang et al., 2018). Other methods also attempted to improve performance by using ensemble techniques. Xia et al. proposed XGBoost for predicting defaults based on the internal rate of return (IRR) (Xia et al., 2017). Huo et al. tried to solve the behavior scoring problem in social lending using neural networks combined with logistic regression (Huo et al., 2017).

Recently, some researchers have improved debt prediction performance using deep learning (Zhang et al., 2017). Fu predicted defaults by combining neural networks with random forests to capture non-linearity (Fu, 2017). Jiang et al. suggested a support vector machine and random forest prediction method for P2P loans combined with soft information extracted from text descriptions (Jiang et al., 2017b). Ma et al. used the LightGBM and XGBoost machine learning algorithms with 'multi-observation' and 'multi-dimensional' data cleaning methods (Ma et al., 2018). However, considering the complexity of these methods, the amount of experimental data is small and the performance improvements are limited. Empirical comparisons among machine learning methods have also been conducted (Malekipirbazari and Aksakalli, 2015). One study on semi-supervised learning in social lending applied a semi-supervised support vector machine (S3VM) to predict whether an applicant would be approved for a loan on the social lending site (Li et al., 2017).

3. The proposed method

To predict defaults in social lending, we propose a method that improves performance by using unlabeled data, as shown in Fig. 2. Since we cannot expect high default prediction performance from the limited labeled data alone, semi-supervised learning is applied to add labeled data drawn from the large pool of unlabeled data. We adopt label propagation and TSVM to obtain tentative labels for the unlabeled data, and the final label of each unlabeled data point is determined by Dempster–Shafer fusion.

3.1. Label propagation

Label propagation improves prediction performance by using unlabeled data based on the cluster assumption (Blum and Chawla, 2001). The cluster assumption treats closely located data as belonging to the same class, so label propagation uses the unlabeled data to cluster similar data. Specifically, as shown in Fig. 3, the degree of similarity between data points forms a transition matrix over which the class probabilities are propagated.
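To make the procedure concrete, the following is a minimal sketch of graph-based label propagation using scikit-learn, where unlabeled loans are marked with -1 and the RBF kernel plays the role of the similarity-based transition matrix; the data and parameter values are placeholders rather than the exact settings used in the paper.

```python
# Minimal sketch of label propagation for the default-prediction setting:
# loans whose repayment outcome is still unknown carry the label -1, and
# class probabilities are spread over a similarity (transition) matrix.
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# X: (n_samples, n_features) preprocessed loan attributes (placeholder data)
# y: 1 = fully paid, 0 = charged off, -1 = unlabeled (repayment in progress)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 69))
y = np.full(1000, -1)
y[:200] = rng.integers(0, 2, size=200)      # only 20% of loans are labeled

lp = LabelPropagation(kernel="rbf", gamma=0.25, max_iter=1000)
lp.fit(X, y)

tentative_labels = lp.transduction_[y == -1]    # labels assigned to unlabeled loans
class_probs = lp.label_distributions_[y == -1]  # per-class probabilities used later in the fusion
```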

3.2. Transductive SVM

TSVM is a typical semi-supervised learning method based on low-density separation (Joachims, 1999). Semi-supervised learning methods based on low-density separation train decision boundaries to pass through regions where the data are less dense (Chapelle and Zien, 2005). TSVM searches for the optimal decision boundary by assigning and changing the classes of the unlabeled data, so that data of different classes end up far away from each other. The specific learning process is shown in Fig. 4.
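Since common libraries do not provide a TSVM implementation, the sketch below approximates the procedure of Fig. 4 with a simplified self-labeling loop around a standard SVM: tentative classes are assigned to the unlabeled loans and their influence is gradually increased. It omits the pairwise label-switching and class-balance constraints of the full algorithm, so it should be read as an illustrative approximation only.

```python
# Simplified TSVM-style loop (after Joachims, 1999): train on labeled loans,
# tentatively label the unlabeled pool, then retrain while ramping up the
# weight of the unlabeled examples toward its final value.
import numpy as np
from sklearn.svm import SVC

def tsvm_sketch(X_lab, y_lab, X_unlab, c_unlab_final=1.0, steps=5):
    clf = SVC(kernel="linear", C=1.0)
    clf.fit(X_lab, y_lab)
    y_unlab = clf.predict(X_unlab)                 # tentative labels for unlabeled loans
    X_all = np.vstack([X_lab, X_unlab])
    for step in range(1, steps + 1):
        c_unlab = c_unlab_final * step / steps     # gradually increase unlabeled influence
        weights = np.concatenate([np.ones(len(X_lab)),
                                  np.full(len(X_unlab), c_unlab)])
        y_all = np.concatenate([y_lab, y_unlab])
        clf = SVC(kernel="linear", C=1.0)
        clf.fit(X_all, y_all, sample_weight=weights)
        y_unlab = clf.predict(X_unlab)             # re-assign classes of unlabeled loans
    return clf, y_unlab
```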

Fig. 2. An overview of the proposed method.


Table 1. Studies on default prediction in social lending.

Year | Author | Task | Method | #Data | Data characteristics
2018 | Wang et al. (2018) | Behavioral scoring | Ensemble mixture random forest | 46,494 | Time series
2018 | Ma et al. (2018) | Default prediction | LightGBM and XGBoost | 569,339 | P2P transaction
2017 | Lin et al. (2017b) | Default prediction | Logistic regression | 48,784 | P2P transaction
2017 | Fu (2017) | Default prediction | Combination of random forests and neural networks | 1,320,000 | P2P transaction
2017 | Jiang et al. (2017b) | Default prediction | Support vector machine, random forest | 39,538 | P2P transaction
2017 | Zhang et al. (2017) | Default prediction | Long short-term memory | 600 | Time series, small amount
2017 | Chen (2017) | Default prediction | Logistic regression | 3,650 | Time series
2017 | Huo et al. (2017) | Default prediction | Logistic regression, neural network | 4,518 | P2P transaction
2017 | Ge et al. (2017) | Default prediction | Logistic regression | 35,457 | Social media information
2017 | Xia et al. (2017) | Default prediction | XGBoost | 49,795 | P2P transaction
2016 | Guo et al. (2016) | Default prediction | Logistic regression | 2,016,128 | P2P transaction
2016 | Polena and Regner (2016b) | Default prediction | Regression | 70,673 | P2P transaction
2016 | Li et al. (2016) | Default prediction | Logistic regression | 48,784 | P2P transaction
2016 | Serrano-Cinca and Gutiérrez-Nieto (2016) | Internal rate of return | Decision tree | 40,901 | P2P transaction
2015 | Bitvai and Cohn (2015) | Default prediction | Bayesian non-linear regression | 3,500 | Sentences
2015 | Byanjankar et al. (2015) | Credit scoring | Neural network | 16,037 | P2P transaction

Fig. 3. Learning algorithm of label propagation.


3.3. Dempster–Shafer fusion

To predict the default in social lending, we determine the class of unlabeled data by combining the results of label propagation and TSVM with Dempster–Shafer theory. Unlike the commonly used Bayesian theory, this assigns probabilities even to sets of two or more classes (Dempster, 1967; Shafer, 1976; Bernardo and Smith, 2001). For example, Bayesian theory assigns probability to fully paid and charged off, while Dempster–Shafer theory assigns probability to fully paid, charged off, and the set {fully paid, charged off}. This allows precise labeling in cases where the two classes are mixed together (Kim and Cho, 2017). We use a combination rule that computes the joint probability while assigning a probability to each subdivided case (Sentz and Ferson, 2002). The following is an example of combining the results of label propagation and TSVM with Dempster–Shafer theory for predicting the default in social lending.

First, the degree of belief of each case is calculated from the class probabilities of the classifiers, as in Eq. (1):

$$m_M(c) = \frac{P_M(c)}{\sum_{c' \in C} P_M(c')} \tag{1}$$

where $P_M(c)$ is the class probability of class $c$ for classifier $M$, and $C$ is the set of classes including the unknown. To give an idea of how this method works, let us consider an example. Suppose label propagation predicts a data point as fully paid with probability 0.8; the converted masses are 0.4 for fully paid, 0.1 for charged off, and 0.5 for the unknown, which is the conjunction case. TSVM is processed in the same way: it predicts the data point as charged off with confidence 0.6, and the converted masses are 0.2, 0.3, and 0.5 for fully paid, charged off, and the unknown, respectively.

Fig. 4. Learning algorithm of TSVM.


Fig. 5. Performance in labeling test.

Fig. 6. Distribution of the combined probability.


Then, fusion based on Dempster's rule is performed. Eq. (2) shows the independent probability fusion rule, and Eq. (3) shows the result for the example above:

$$m_D^{pre}(c) = \sum_{A \cap B = c} m_1(A)\, m_2(B) \tag{2}$$

$$\begin{aligned}
m_D^{pre}(\text{fully paid}) &= 0.4 \times 0.2 + 0.4 \times 0.5 + 0.2 \times 0.5 = 0.38 \\
m_D^{pre}(\text{charged off}) &= 0.1 \times 0.3 + 0.1 \times 0.5 + 0.3 \times 0.5 = 0.23 \\
m_D^{pre}(\text{unknown}) &= 0.5 \times 0.5 = 0.25 \\
m_D^{pre}(\varnothing) &= 0.1 \times 0.2 + 0.4 \times 0.3 = 0.14
\end{aligned} \tag{3}$$


Fig. 7. Data distribution according to the classes.


Table 2. The attributes used in the Lending Club dataset.

Category | Description | # attributes
Loan | Information about the achieved loan | 2
Borrower | Borrower's personal and demographic information | 3
Credit history | Borrower's number of accounts, delinquencies, credit limit ratio, and balance for mortgage, revolving, and bank card accounts | 12

Table 3. Analysis of class probability.

Method | (a) | (b) | (c)
Label propagation | Fully paid 0.7602 | Charged off 0.6187 | Charged off 0.7071
TSVM | Fully paid 0.8007 | Fully paid 0.7544 | Fully paid 0.7568
Proposed method | Fully paid 0.6090 | Fully paid 0.5987 | REMOVE 0.3515
Actual class | Fully paid | Fully paid | Charged off

Then, we remove the mass assigned to $\varnothing$, which corresponds to the case where the two labeling results disagree, and normalize the remaining masses:

$$\begin{aligned}
m_D(\text{fully paid}) &= \frac{m_D^{pre}(\text{fully paid})}{1 - m_D^{pre}(\varnothing)} = 0.4419 \\
m_D(\text{charged off}) &= \frac{m_D^{pre}(\text{charged off})}{1 - m_D^{pre}(\varnothing)} = 0.2674 \\
m_D(\text{unknown}) &= \frac{m_D^{pre}(\text{unknown})}{1 - m_D^{pre}(\varnothing)} = 0.2907
\end{aligned} \tag{4}$$

This maximizes the similarity and minimizes the difference between the two results. Eq. (5) generalizes the fusion rule:

$$m_D(c \mid x) = \frac{\sum_{A \cap B = c} m_1(A \mid x)\, m_2(B \mid x)}{1 - \sum_{A \cap B = \varnothing} m_1(A \mid x)\, m_2(B \mid x)} \tag{5}$$

As a result, the three probabilities are calculated, and with these we can label the data. We use a simple rule to assign labels from the combined probabilities, and a data point can be removed by two-step filtering. In the first step, if the maximum probability in the combined set is "unknown", the corresponding data point is removed, because "unknown" means that the classes are congested and we cannot give it an accurate label. The second step plays a similar role: if all the probabilities are the same, we cannot identify which class is correct. In the above example, the probability of fully paid is the largest, so the data point is finally labeled as fully paid. We can see that TSVM induces a different label, but it is corrected by the fusion with label propagation. This characteristic makes the labeling more accurate.
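Putting the pieces together, the following minimal sketch reproduces Eqs. (1)–(5) and the two-step filtering for the worked example. It assumes, as the example suggests, that the unknown set receives a weight of 1 before normalization in Eq. (1); this is an inference from the numbers in the example rather than a statement from the original text.

```python
# Minimal sketch of the Dempster-Shafer fusion and two-step filtering described
# above. The unknown set {fully paid, charged off} is given weight 1 before
# normalization, which reproduces the masses of the worked example
# (0.4/0.1/0.5 for label propagation and 0.2/0.3/0.5 for TSVM).
FP, CO, UNK = "fully paid", "charged off", "unknown"

def to_masses(p_fully_paid, p_charged_off, unknown_weight=1.0):
    """Eq. (1): convert classifier probabilities into belief masses."""
    total = p_fully_paid + p_charged_off + unknown_weight
    return {FP: p_fully_paid / total, CO: p_charged_off / total, UNK: unknown_weight / total}

def ds_fuse(m1, m2):
    """Eqs. (2)-(5): Dempster's rule for two classifiers with an unknown set."""
    pre = {
        FP:  m1[FP] * m2[FP] + m1[FP] * m2[UNK] + m1[UNK] * m2[FP],
        CO:  m1[CO] * m2[CO] + m1[CO] * m2[UNK] + m1[UNK] * m2[CO],
        UNK: m1[UNK] * m2[UNK],
    }
    conflict = m1[FP] * m2[CO] + m1[CO] * m2[FP]       # mass of the empty set
    return {c: v / (1.0 - conflict) for c, v in pre.items()}

def label_or_remove(fused):
    """Two-step filtering: drop the point if 'unknown' wins or the classes tie."""
    best = max(fused, key=fused.get)
    if best == UNK or fused[FP] == fused[CO]:
        return None                                     # data point is removed
    return best

# Worked example: label propagation says fully paid (0.8), TSVM says charged off (0.6).
m_lp, m_tsvm = to_masses(0.8, 0.2), to_masses(0.4, 0.6)
fused = ds_fuse(m_lp, m_tsvm)   # ~{fully paid: 0.4419, charged off: 0.2674, unknown: 0.2907}
print(fused, label_or_remove(fused))   # -> labeled "fully paid"
```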

4. Experimental results

4.1. Lending Club dataset

In the experiments, we use the Lending Club dataset (Lending Club Statistics). The original dataset has 110 attributes, but we preprocess it with normalization and binary dummies. As a result, we obtain 69 attributes with binary classes; Table 2 shows the attributes used in the experiments. The experiments consist of a labeling test and a comparison test. The labeling test uses the labeled data only to measure labeling performance: the actual labels are removed, each method assigns a new label, and we compare the assigned label with the actual one. The comparison test, on the other hand, uses all the data except the test set and measures performance on new data. In the labeling test, the data are divided into 20% labeled and 80% unlabeled; in the comparison test, the labeled data are split into 80% training and 20% test, together with all the unlabeled data.
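A minimal sketch of this preprocessing and split protocol is shown below; the file name, column names, and the choice of min–max scaling are placeholders and assumptions, not the exact attribute list of Table 2 or the authors' exact pipeline.

```python
# Sketch of the experimental setup: normalize attributes, expand categorical
# attributes into binary dummies, and build the labeling-test split
# (20% kept labeled / 80% treated as unlabeled). Column names are placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("lending_club.csv")
labeled = df[df["loan_status"].isin(["Fully Paid", "Charged Off"])].copy()

y = (labeled["loan_status"] == "Charged Off").astype(int)     # 1 = default
X = pd.get_dummies(labeled.drop(columns=["loan_status"]))     # binary dummies
X = pd.DataFrame(MinMaxScaler().fit_transform(X),             # normalization
                 columns=X.columns, index=X.index)

# Labeling test: hide the labels of 80% of the labeled loans and try to recover them.
X_lab, X_unlab, y_lab, y_hidden = train_test_split(
    X, y, train_size=0.2, stratify=y, random_state=0)
```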


Fig. 8. Performance in comparison test.

Table 4. Prediction of label distribution.

Proposed method label | Label propagation: Charged off | Label propagation: Fully paid | TSVM: Charged off | TSVM: Fully paid
Charged off | 3,123 | 38 | 2,433 | 728
Fully paid | 699 | 273,082 | 33 | 273,748
Unknown | 23,289 | 10,511 | 10,511 | 23,287

4.2. Labeling test and analysis

Fig. 5 shows the results compared with the conventional methods. The proposed method achieves the best accuracy and F1 score. The number of data points it labels is smaller than that of the other methods, and it gives more accurate labels than the others. Table 3 shows the analysis of class probability: Table 3(a) is a case where all methods give correct labels, (b) is a case where label propagation gives an incorrect label but the proposed method corrects it, and (c) is a removed case. Table 3(b) illustrates a cause of the better performance over the conventional methods, and the small amount of labeled data comes from cases like Table 3(c): although Dempster–Shafer fusion combines the two probabilities, the probabilities of the two classes are the same, so the data point is removed. This means that unreliable data are removed, which improves the labeling performance.

We can also investigate the class distribution of the unlabeled data during fusion. Table 4 shows the distribution over the entire data. With the proposed method, 80% of the data are labeled as "fully paid", 1% as "charged off", and the remaining 10%, labeled as "unknown", are removed because of equal combined class probabilities. Slight differences between the individual predictions are also observed. Table 4 also shows the class distribution of each fusion result for the proposed method: for the "charged off" label, the proposed method is influenced more by label propagation than by TSVM. Most labels, however, are the same as TSVM's and differ slightly from label propagation's, which means that TSVM is more influential on the proposed method overall.

Fig. 6 shows the probability distributions of the two semi-supervised learning methods as boxplots per range of the combined probability on the x-axis. The y-axis runs from 0.0 ("charged off") to 1.0 ("fully paid"). In Figs. 6(a) and (b), the variation of TSVM is small while label propagation is widely spread. Fig. 6(a) shows the cases that the proposed method labels as "charged off". The two boxplots near x = 0.4 spread vertically from 0.0 to 1.0, meaning that these data are hard to label, but the lower probabilities from label propagation lead to these data being labeled as "charged off". Fig. 6(b) shows the "fully paid" case with the inverse result, where TSVM leads to labeling the data near x = 0.3 as "fully paid".

Fig. 7 shows the distribution of each attribute value per label for each method, where "actual" denotes the distribution of the labeled data. The y-axis runs from 0.0 (the minimum value of the attribute) to 1.0 (the maximum value). Fig. 7(a) shows the distribution of the number of active revolving trades, one of the credit-history attributes. It contains few outliers in the actual data, with some differences between label propagation and TSVM, so the proposed method lies in the middle between the two. Fig. 7(b) shows the number of total installment accounts, one measure of the total accounts, which has many outliers. The TSVM values near y = 0.6-0.7 are considered misclassified, but the proposed method is almost the same as the actual distribution because it reflects the distribution of label propagation. Fig. 7(c) shows the number of open accounts, another measure of the total accounts with a few outliers, and here too the proposed method is almost the same as the actual distribution. Overall, label propagation contributes to the "charged off" labels and TSVM leads to the "fully paid" labels. Some misclassifications of TSVM are corrected by the proposed method through the fusion with label propagation. This in-depth analysis of the probabilities and attribute values shows that the proposed method produces better performance by exploiting both label propagation and TSVM.

4.3. Comparison test

Fig. 8 shows the result of the comparison test. Supervised learning denotes a decision tree classifier trained with labeled data only, and its performance is the lowest. The semi-supervised classifiers, label propagation and TSVM, are trained with the unlabeled data, but they are 1-2% lower than the proposed method. Co-training is a semi-supervised learning method that uses two supervised classifiers such as decision trees and has shown good performance in various applications (Blum and Mitchell, 1998; Yu and Cho, 2016). As an ensemble baseline, we compare with simple averaging to show the superiority of the proposed method. The fact that even TSVM outperforms the simple averaging ensemble is explained by the two semi-supervised classifiers making different predictions, which degrades the averaging method.
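For clarity, "simple averaging" here is taken to mean averaging the two classifiers' class-probability vectors before taking the argmax; a minimal sketch of that baseline is given below with placeholder probability values (the exact averaging scheme used in the experiments is not spelled out in the text).

```python
# Sketch of the simple-averaging ensemble baseline used in the comparison test:
# the per-class probabilities of label propagation and TSVM are averaged and the
# larger one wins. When the two classifiers disagree strongly, the average sits
# near 0.5 and the decision becomes unreliable, unlike the DS fusion above.
import numpy as np

p_lp   = np.array([[0.8, 0.2]])   # [P(fully paid), P(charged off)] from label propagation
p_tsvm = np.array([[0.4, 0.6]])   # same data point as seen by TSVM

p_avg = (p_lp + p_tsvm) / 2.0               # -> [[0.6, 0.4]]
avg_label = p_avg.argmax(axis=1)            # 0 = fully paid, 1 = charged off
print(p_avg, avg_label)
```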

5. Concluding remarks

Default prediction in social lending is an important yet difficult problem. Most loans are still in progress, so only the completed loans are known to be repaid or defaulted. As a result, there are few labeled data and a lot of unlabeled data in social lending. In this paper, we have proposed and analyzed an ensemble semi-supervised learning method for predicting defaults in social lending.


Traditional classifiers utilize labeled data only, but unlabeled data are easy to obtain, whereas labeling them is expensive and time-consuming. Semi-supervised learning is trained with a few labeled data and a large amount of unlabeled data, and labeling the unlabeled data is an important issue because the overall performance depends on it. To address this, we exploit two assumptions by utilizing label propagation and TSVM, reducing the influence of incorrect labels and obtaining correct ones. The proposed method combines the two semi-supervised classifiers with Dempster–Shafer fusion. We have conducted experiments with almost 70% unlabeled data, and the proposed method gives more accurate labels than the conventional ensemble methods.

In the future, we will extend the proposed method to multi-label classification and soft-label classification. We will also combine the proposed method with state-of-the-art deep semi-supervised learning methods and theories (Rasmus et al., 2015; Dai and Van Gool, 2016; Mallapragada et al., 2009; Lee, 2013). Because most real-world problems involve large amounts of unlabeled data, we will apply the method to several applications to show its usefulness.

Acknowledgment

This work was supported by Defense Acquisition Program Administration and Agency for Defense Development under the contract (UD160066BD).

References

Bernardo, J.M., Smith, A.F., 2001. Bayesian Theory. IOP Publishing.
Bitvai, Z., Cohn, T., 2015. Predicting peer-to-peer loan rates using Bayesian non-linear regression. In: AAAI Conference on Artificial Intelligence. pp. 2203–2209.
Blum, A., Chawla, S., 2001. Learning from labeled and unlabeled data using graph mincuts.
Blum, A., Mitchell, T., 1998. Combining labeled and unlabeled data with co-training. In: Annual Conference on Computational Learning Theory. ACM, pp. 92–100.
Byanjankar, A., Heikkilä, M., Mezei, J., 2015. Predicting credit risk in peer-to-peer lending: A neural network approach. In: IEEE Symposium Series on Computational Intelligence. IEEE, pp. 719–725.
Chapelle, O., Zien, A., 2005. Semi-supervised classification by low density separation. In: AISTATS. Citeseer, pp. 57–64.
Chen, Y., 2017. Research on the credit risk assessment of Chinese online peer-to-peer lending borrower on logistic regression model. In: DEStech Transactions on Environment, Energy and Earth Sciences.
Dai, D., Van Gool, L., 2016. Unsupervised high-level feature learning by ensemble projection for semi-supervised image classification and image clustering. arXiv preprint arXiv:1602.00955.
Dempster, A.P., 1967. Upper and lower probabilities induced by a multivalued mapping. Ann. Math. Stat. 325–339.
Emekter, R., Tu, Y., Jirasakuldech, B., Lu, M., 2015. Evaluating credit risk and loan performance in online Peer-to-Peer (P2P) lending. Appl. Econ. 47, 54–70.
Everett, C.R., 2015. Group membership, relationship banking and loan default risk: The case of online social lending. Bank. Financ. Rev. 7, 1–31.
Fu, Y., 2017. Combination of random forests and neural networks in social lending. J. Financ. Risk Manag. 4 (6), 418–426.
Ge, R., Feng, J., Gu, B., Zhang, P., 2017. Predicting and deterring default with social media information in peer-to-peer lending. J. Manage. Inf. Syst. 34, 401–424.
Guo, Y., Zhou, W., Luo, C., Liu, C., Xiong, H., 2016. Instance-based credit risk assessment for investment decisions in P2P lending. European J. Oper. Res. 249 (2), 417–426.
Huo, Y., Chen, H., Chen, J., 2017. Research on personal credit assessment based on neural network-logistic regression combination model. Open J. Bus. Manag. 5, 244.
Jiang, C., Wang, Z., Wang, R., Ding, Y., 2017a. Loan default prediction by combining soft information extracted from descriptive text in online peer-to-peer lending. Ann. Oper. Res.
Jiang, C., Wang, Z., Wang, R., Ding, Y., 2017b. Loan default prediction by combining soft information extracted from descriptive text in online peer-to-peer lending. Ann. Oper. Res. 1–19.
Joachims, T., 1999. Transductive inference for text classification using support vector machines. In: International Conference on Machine Learning. pp. 200–209.
Kim, A., Cho, S.-B., 2017. Dempster-Shafer fusion of semi-supervised learning methods for predicting defaults in social lending. In: International Conference on Neural Information Processing. Springer, pp. 854–862.
Lee, D.-H., 2013. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In: International Conference on Machine Learning Workshop on Challenges in Representation Learning. pp. 1–6.
Lending Club Statistics, https://www.lendingclub.com/info/statistics.action.
Li, W., Ding, S., Chen, Y., Yang, S., 2018. Heterogeneous ensemble for default prediction of peer-to-peer lending in China. IEEE Access.
Li, J., Hsu, S., Chen, Z., Chen, Y., 2016. Risks of P2P lending platforms in China: Modeling failure using a COX hazard model. Chin. Econ. 49, 161–172.
Li, Z., Tian, Y., Li, K., Zhou, F., Yang, W., 2017. Reject inference in credit scoring using semi-supervised support vector machines. Expert Syst. Appl. 74, 105–114.
Lin, X., Li, X., Zheng, Z., 2017a. Evaluating borrower's default risk in peer-to-peer lending: Evidence from a lending platform in China. Appl. Econ. 49, 3538–3545.
Lin, X., Li, X., Zheng, Z., 2017b. Evaluating borrower's default risk in peer-to-peer lending: Evidence from a lending platform in China. Appl. Econ. 49 (35), 3538–3545.
Luo, C., Xiong, H., Zhou, W., Guo, Y., Deng, G., 2011. Enhancing investment decisions in P2P lending: An investor composition perspective. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 292–300.
Ma, X., Sha, J., Wang, D., Yu, Y., Yang, Q., Niu, X., 2018. Study on a prediction of P2P network loan default based on the machine learning LightGBM and XGboost algorithms according to different high dimensional data cleaning. Electron. Commer. Res. Appl. 31, 24–39.
Malekipirbazari, M., Aksakalli, V., 2015. Risk assessment in social lending via random forests. Expert Syst. Appl. 42, 4621–4631.
Mallapragada, P.K., Jin, R., Jain, A.K., Liu, Y., 2009. SemiBoost: Boosting for semi-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 31, 2000–2014.
Polena, M., Regner, T., 2016a. Determinants of borrowers' default in P2P lending under consideration of the loan risk class. Jena Econ. Res. Pap.
Polena, M., Regner, T., 2016b. Determinants of borrowers' default in P2P lending under consideration of the loan risk class. Jena Econ. Res. Pap. 23, 1–30.
Rasmus, A., Berglund, M., Honkala, M., Valpola, H., Raiko, T., 2015. Semi-supervised learning with ladder networks. Adv. Neural Inf. Process. Syst. 3546–3554.
Sentz, K., Ferson, S., 2002. Combination of Evidence in Dempster-Shafer Theory. Citeseer.
Serrano-Cinca, C., Gutiérrez-Nieto, B., 2016. The use of profit scoring as an alternative to credit scoring systems in Peer-to-Peer (P2P) lending. Decis. Support Syst. 89, 113–122.
Serrano-Cinca, C., Gutiérrez-Nieto, B., López-Palacios, L., 2015. Determinants of default in P2P lending. PLoS One 10, 1–22.
Shafer, G., 1976. A Mathematical Theory of Evidence. Princeton University Press.
Wang, Z., Jiang, C., Ding, Y., Lyu, X., Liu, Y., 2018. A novel behavioral scoring model for estimating probability of default over time in peer-to-peer lending. Electron. Commer. Res. Appl. 27, 74–82.
Xia, Y., Liu, C., Liu, N., 2017. Cost-sensitive boosted tree for loan evaluation in peer-to-peer lending. Electron. Commer. Res. Appl. 24, 30–49.
Yan, J., Yu, W., Zhao, J.L., 2015. How signaling and search costs affect information asymmetry in P2P lending: The economics of big data. Financ. Innov. 1, 19.
Yu, J.-M., Cho, S.-B., 2016. Prediction of bank telemarketing with co-training of mixture-of-experts and MLP. In: International Conference on Neural Information Processing. Springer, pp. 52–59.
Zhang, Y., Wang, D., Chen, Y., Shang, H., Tian, Q., 2017. Credit risk assessment based on long short-term memory model. In: International Conference on Intelligent Computing. Springer, pp. 700–712.
Zhao, H., Ge, Y., Liu, Q., Wang, G., Chen, E., Zhang, H., 2017. P2P lending survey: Platforms, recent advances and prospects. ACM Trans. Intell. Syst. Technol. 8, 72.
Zhou, G., Zhang, Y., Luo, S., 2018. P2P network lending, loss given default and credit risks. Sustainability 10, 1010.
Zhu, X., 2011. Semi-supervised learning. In: Encyclopedia of Machine Learning. Springer, pp. 892–897.
