Asian Review of Financial Research Vol.38 No.3 pp.1-36
https://www.doi.org/10.37197/ARFR.2025.38.3.1
Analysis on Corporate Credit Scoring Models and Key Financial Variables Using Machine Learning
Key Words: Credit Scoring, Random Forest, XGBoost, CatBoost, SHAP
Abstract
Credit scoring is central to assessing financial soundness and serves as a fundamental tool for loan screening, capital allocation, and risk management in financial institutions. The accuracy and reliability of credit scoring models are directly linked to financial system stability, making their continuous improvement essential. Traditional models primarily rely on Generalized Linear Models (GLM), particularly Logistic Regression. While these models provide interpretable relationships between financial variables and default risk, they are constrained by their linear functional form and reliance on a limited set of features. These constraints limit their adaptability to evolving financial markets and the increasing availability of unstructured data sources.
Advancements in machine learning (ML) and artificial intelligence (AI) have introduced various models to enhance predictive accuracy and address the limitations of conventional credit scoring models. ML-based approaches such as Random Forest, Support Vector Machines (SVM), XGBoost, and LightGBM, along with deep learning techniques, have been widely applied to credit risk modeling. These methods process large volumes of financial and transactional data, capturing complex patterns in credit risk assessment. However, their adoption requires further validation regarding interpretability and regulatory compliance.
This study makes four key contributions to credit scoring research. First, unlike previous studies that relied on subjectively selected financial variables, we incorporate all financial features collected by credit agencies and adopt a data-driven selection approach, minimizing researcher bias and ensuring greater objectivity. This enables us to identify the most relevant predictors based on empirical evidence rather than predetermined assumptions.
Second, we address the class imbalance issue, a common challenge in credit risk modeling. Since default cases are rare, traditional logistic regression models often yield biased estimates that underweight defaulting firms. To mitigate this, we apply the Synthetic Minority Oversampling Technique (SMOTE) to balance the dataset before applying ML techniques.
Third, we integrate multiple ML techniques to derive a comprehensive interpretation of feature importance. Specifically, we compare classification performance across Random Forest, Extreme Gradient Boosting (XGBoost), and Category Boosting (CatBoost). Unlike prior studies that analyze a single ML model independently, our approach integrates feature importance rankings across multiple models, providing a more robust estimate of the importance of financial variables in credit risk.
Fourth, while ML models enhance predictive accuracy, their complexity can hinder interpretability, making adoption challenging for financial institutions. This study emphasizes the importance of explainable AI (XAI) in credit scoring. By applying Shapley Additive Explanations (SHAP), we provide insights into how key financial variables influence credit risk and default probabilities, offering practical guidance on the appropriateness of financial variables and threshold settings used in credit scoring.
This study analyzes credit scoring data of manufacturing firms evaluated by Korea Enterprise Assessment from 2010 to 2024. By applying multiple ML techniques, we identify key financial variables influencing credit risk and integrate results for a comprehensive interpretation. Our analysis highlights differences between realized credit risk, which reflects actual defaults and missed payments, and implied credit risk, which is assessed by the current credit risk model. Realized credit risk is primarily driven by short-term liquidity and profitability indicators, such as inventory turnover period, current ratio, return on equity, and return on capital employed. In contrast, implied credit risk is largely influenced by firm size and long-term financial stability, with key variables including EBITDA, cost-to-sales ratio, pre-tax income from continuing operations, total sales, and total liabilities. These findings suggest that while current credit scoring models emphasize long-term financial health, actual credit events are more influenced by short-term financial constraints. This discrepancy underscores the need to supplement credit scoring models with financial variables related to short-term liquidity, especially for high-risk firms.
Further analysis reveals that the importance of financial variables varies across rating levels. For A-level firms, short-term financial stability and debt repayment capacity are critical, emphasizing the importance of liquidity management. In contrast, B-level firms are more affected by structural financial indicators such as the debt-to-equity ratio and capital adequacy ratio, highlighting the significance of long-term solvency and debt management. These differences underscore the need to tailor credit scoring criteria to risk levels. SHAP results indicate that while higher debt-to-equity and capital growth ratios generally reduce the likelihood of default, their impact on credit risk is nonlinear. This suggests that simple threshold-based classification may be insufficient for credit scoring. Instead, a more nuanced approach is needed, one that accounts for interactions between financial indicators and their varying effects across credit risk levels.
Beyond feature importance analysis, we examine credit transitions. Credit scores evolve with firms' financial conditions. Our findings show that while most firms maintain stable credit scores, downgrades occur more frequently than upgrades, particularly within the B-level category between 2022 and 2023. While some A-level firms experienced rating upgrades between 2019 and 2022, the trend shifted toward downgrades from 2022 to 2023. These patterns highlight the need for dynamic credit transition models that account for temporal changes in creditworthiness.
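The year-over-year transition patterns described above can be tabulated with a simple rating transition matrix. The following is a minimal sketch under stated assumptions, not the study's implementation: the function names, the three-grade scale in RATING_ORDER, and the toy firm histories are all hypothetical and chosen only to illustrate how upgrades, downgrades, and stable ratings might be counted between two years.

```python
from collections import Counter, defaultdict

# Hypothetical three-grade scale, ordered best to worst (illustration only).
RATING_ORDER = ["A", "B", "C"]


def transition_matrix(ratings_by_firm, year_from, year_to):
    """Return {start_rating: {end_rating: share}}; each row sums to 1."""
    counts = defaultdict(Counter)
    for history in ratings_by_firm.values():
        # Only firms rated in both years contribute to the matrix.
        if year_from in history and year_to in history:
            counts[history[year_from]][history[year_to]] += 1
    return {
        start: {end: n / sum(row.values()) for end, n in row.items()}
        for start, row in counts.items()
    }


def migration_counts(ratings_by_firm, year_from, year_to):
    """Count upgrades, stable ratings, and downgrades between two years."""
    rank = {rating: i for i, rating in enumerate(RATING_ORDER)}
    result = {"upgrade": 0, "stable": 0, "downgrade": 0}
    for history in ratings_by_firm.values():
        if year_from in history and year_to in history:
            change = rank[history[year_to]] - rank[history[year_from]]
            if change > 0:
                result["downgrade"] += 1  # moved to a worse (higher-index) grade
            elif change < 0:
                result["upgrade"] += 1
            else:
                result["stable"] += 1
    return result


# Toy data invented for illustration; not drawn from the study's sample.
toy_ratings = {
    "firm_1": {2022: "A", 2023: "A"},
    "firm_2": {2022: "A", 2023: "B"},
    "firm_3": {2022: "B", 2023: "B"},
    "firm_4": {2022: "B", 2023: "C"},
    "firm_5": {2022: "B", 2023: "A"},
    "firm_6": {2022: "B", 2023: "C"},
}

matrix = transition_matrix(toy_ratings, 2022, 2023)
counts = migration_counts(toy_ratings, 2022, 2023)
print(matrix["B"])
print(counts)
```

Row-normalizing the counts makes transition rates comparable across rating levels with different numbers of firms. Computing one such matrix per pair of consecutive years is one way to see whether the balance between upgrades and downgrades shifts over time.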