Asian Review of Financial Research

Asian Review of Financial Research Vol.34 No.4 pp.199-234 https://www.doi.org/10.37197/ARFR.2021.34.4.6

Predicting Loan Delinquency by Analyzing Sample DB with Machine Learning

Minchan Song Manager, NICE Dun&Bradstreet Co., Ltd.
Doojin Ryu* Professor, Department of Economics, Sungkyunkwan University

Key Words: Corporate Loan Data, Distress Risk, Machine Learning, Predicting Loan Delinquency, Sample DB Remote Analysis System

Abstract

This paper investigates the predictability of corporate default using loan-sample data from the Korea Credit Information Service's financial big data open system (CreDB). Corporate lending increases a financial institution's credit exposure. Because the measured impact on credit risk is used to determine the pricing model and structure of loan products, it is an essential factor that affects the institution's profit structure. From a risk-management perspective, predicting delinquency from loan data is necessary for 5,000 Korean financial institutions. Several previous studies forecast bankruptcy for listed companies that disclose financial and stock price information. This study increases practical utility by extending the analysis target to individual entrepreneurs and small and medium-sized enterprises (SMEs). In addition, it presents representative big data analysis results by using the loan, delinquency, and technology credit information of approximately 1.1 million corporations, which is 20% of the almost 5.6 million domestic sole proprietors and non-listed corporations. The loan data include ten monthly loan type codes and eleven overdue reason codes. Prediction targets are separated into individual and corporate entrepreneurs, and the analyses are further divided according to whether the processed dataset is used. For efficient analysis, the data dimension was reduced by restructuring the table through nested iterative operations, collapsing a table consisting of N rows into one while expanding the variable composition. To reflect the characteristics of the data as much as possible, exploratory data analysis and feature engineering were performed. Nine classification models were trained and divided into four groups according to whether the underlying method is parametric.
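The restructuring step described above can be illustrated with a minimal sketch. It assumes one reading of the abstract: many monthly loan rows per borrower are collapsed into a single row whose columns enumerate every (loan type, month) pair. The field names (`borrower_id`, `month`, `loan_type`, `amount`), the `amt_` column prefix, and the toy records are hypothetical and do not come from the CreDB schema.

```python
from collections import defaultdict


def widen(records, months, loan_types):
    """Collapse N long-format rows per borrower into one wide row.

    Hypothetical sketch: field names and column naming are assumptions,
    not the actual CreDB layout.
    """
    # First pass: sum amounts per (borrower, month, loan type).
    totals = defaultdict(float)
    for rec in records:
        key = (rec["borrower_id"], rec["month"], rec["loan_type"])
        totals[key] += rec["amount"]

    # Preserve first-seen borrower order.
    borrowers = list(dict.fromkeys(rec["borrower_id"] for rec in records))

    # Second pass: nested iteration builds one column per (loan type, month).
    wide = []
    for borrower in borrowers:
        row = {"borrower_id": borrower}
        for month in months:
            for loan_type in loan_types:
                column = f"amt_{loan_type}_{month}"
                row[column] = totals.get((borrower, month, loan_type), 0.0)
        wide.append(row)
    return wide


# Toy data for illustration only.
records = [
    {"borrower_id": "B1", "month": "2020-01", "loan_type": "T01", "amount": 100.0},
    {"borrower_id": "B1", "month": "2020-01", "loan_type": "T01", "amount": 50.0},
    {"borrower_id": "B1", "month": "2020-02", "loan_type": "T02", "amount": 30.0},
    {"borrower_id": "B2", "month": "2020-02", "loan_type": "T01", "amount": 70.0},
]
wide = widen(records, months=["2020-01", "2020-02"], loan_types=["T01", "T02"])
```

With this layout each borrower becomes a single training example, at the cost of a column count that grows with the number of months and loan type codes.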
Group 1 consists of logistic regression and linear discriminant analysis, which are parametric methods. Group 2 consists of several algorithms that compute distances for model learning. Group 3 consists of tree-based algorithms, which are also non-parametric methods. Group 4 consists of a deep neural network, which is a semi-parametric method. Out of the total 438,697 corporations, however, only 810 defaulted, or about 0.2%, so the target distribution is severely imbalanced. For this reason, the imbalanced data were undersampled before model fitting. The bias of the sampled training and validation data is minimized by performing K-fold cross-validation with K=5. The results suggest that using the processed data has a significant effect on classification performance, whereas including the loan owner's characteristics has no significant effect on performance. Moreover, tech-credit rating (TCB) information gives any meaningful effect regarding the type of corporation. Classification with the deep neural network (DNN), which is based on the semi-parametric method, achieves the best binary classification performance. Non-parametric, non-tree-based models are not appropriate for analyzing loan data. The DNN showed the highest classification performance for all analyses and entrepreneur classifications performed in this study. The neural network used in this study consists of 14 hidden layers. According to the neural network baseline design, the sigmoid function was applied to the activation function's initial value, the ReLU function was applied to the hidden layers, and optimization was performed with the Adam optimizer.
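The undersampling and 5-fold cross-validation steps described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the abstract does not give the sampling ratio, so a 1:1 ratio of non-defaults to defaults is assumed; the function names `undersample` and `kfold` are hypothetical; and the toy labels are synthetic rather than drawn from the study's data.

```python
import random


def undersample(labels, ratio=1.0, seed=0):
    """Keep every default (label 1) and a random subset of non-defaults.

    `ratio` is the assumed number of non-defaults kept per default.
    Returns shuffled indices into `labels`.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), int(len(pos) * ratio))
    kept = pos + rng.sample(neg, n_keep)
    rng.shuffle(kept)
    return kept


def kfold(indices, k=5):
    """Yield (train, validation) index lists for plain K-fold CV."""
    folds = [indices[j::k] for j in range(k)]
    for f in range(k):
        validation = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, validation


# Synthetic, heavily imbalanced labels: 10 defaults among 1,000 borrowers.
labels = [1] * 10 + [0] * 990
kept = undersample(labels)
splits = list(kfold(kept, k=5))
```

Each sampled observation appears in exactly one validation fold, so every model is scored on data it was not fitted on in that round.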
In particular, the analysis was conducted on credit transaction information drawn from the credit information of all financial institutions in Korea, which may alleviate the information asymmetry that individual credit institutions face regarding their risk management targets. In addition, the parametric methodologies used in classical studies and most widely used in practice showed average classification performance for major segments that was inferior to that of semi-parametric methodologies, with a difference of up to 16 percent. This paper suggests a direction for using loan-sample data and serves as foundational research for financial institutions that use loan data for credit risk management. Research on semi-parametric methodologies for corporate credit information analysis should be expanded.
