Validating and Developing Scoring Models
An Introduction to Credit Scoring [1]
Credit scoring is now widely employed in banks. In 1996, the Federal Reserve's November Senior Loan Officer Opinion Survey of Bank Lending Practices [2] showed that 97% of U.S. banks were using internal scoring models for credit card applications. For small business lending, the figure was 70%. The use of internal quantitative credit scoring models has increased over the past five years and is expected to rise further with the focus of Basel II on probabilities of default. The use of external credit scoring models has also increased since the ZETA model, an application of Altman's Z score, was introduced in 1977. MKMV's CreditMonitor, based on the Merton model linking stock price changes to the probability of default, was introduced in 1991 and is widely used in larger banks today as an early warning system. Today, there are several providers of credit models and credit information services in the market.
While Altman's Z scores are now 35 years old, and the Merton methodology behind MKMV's model is 30 years old, research in this area has not stood still. Data mining methodologies that may be applied to the credit scoring problem, statistical pattern recognition in particular, are a flourishing research area. In 2000, for example, Jain, Duin, and Mao carried out a short survey [3] that included most of the significant recent contributions in statistical pattern recognition, listing more than 150 working papers and published articles. Since then, a number of new fields such as support vector machines have been investigated in depth. From a practical standpoint, this means that most of the statistical models currently applied to credit risk lag far behind state-of-the-art methods. As a consequence, many banks may be expected to seek a competitive advantage by catching up through the integration of nonparametric techniques and machine learning methods over the next few years.
In addition to developing and implementing these state-of-the-art modeling techniques, it will be important to accurately assess the performance of the models being developed or considered for purchase. Friedman and Sandow (2003) have shown how longstanding measures such as accuracy ratios or receiver operating characteristic (ROC) curve ratios can provide misleading results. That is, while those measures inform the user as to whether the model is ordering firms correctly, they fail to inform the user as to whether the probabilities are correct. To assess a model accurately, performance must be judged by the model's ability to provide more accurate information (probabilities of default) to the investor (or lender) who uses the model to make investment (or lending) decisions. As this implies, a model's performance must be considered relative to a next-best alternative. For a fair comparison to be made, the same data set should be used for each competing model.
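The difference between ordering firms correctly and getting the probabilities right can be illustrated with a small sketch. The data and both sets of model outputs below are invented for illustration, and the average log-likelihood is used here only as one simple check on probability accuracy. Model B applies a monotone transform to Model A's probabilities, so the ordering (and hence the area under the ROC curve) is unchanged while the probabilities themselves are inflated.

```python
import math

def auc(scores, defaults):
    """Chance that a random defaulter is scored above a random non-defaulter
    (ties count one half). This is the area under the ROC curve."""
    pos = [s for s, d in zip(scores, defaults) if d == 1]
    neg = [s for s, d in zip(scores, defaults) if d == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_log_likelihood(probs, defaults):
    """Average log-probability the model assigned to what actually happened."""
    return sum(math.log(p if d == 1 else 1.0 - p)
               for p, d in zip(probs, defaults)) / len(defaults)

# Hypothetical holdout set: 1 = default, 0 = no default.
defaults = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
model_a = [0.02, 0.03, 0.05, 0.08, 0.10, 0.12, 0.15, 0.30, 0.35, 0.60]
# Same ranking as Model A, but every probability is pushed upward.
model_b = [math.sqrt(p) for p in model_a]

auc_a, auc_b = auc(model_a, defaults), auc(model_b, defaults)
ll_a = mean_log_likelihood(model_a, defaults)
ll_b = mean_log_likelihood(model_b, defaults)

print(f"AUC            A={auc_a:.4f}  B={auc_b:.4f}")   # identical ranking power
print(f"log-likelihood A={ll_a:.4f}  B={ll_b:.4f}")     # B's probabilities fit worse
```

A ranking statistic cannot tell the two models apart, whereas a measure that looks at the probabilities themselves can.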
Other factors to consider include:
- A model's ability to handle poor quality data that may have missing values or many outlying observations
- A vendor's willingness to disclose the underlying modeling methodology and drivers of model performance
- The time period of data used to calibrate the model, and whether variables are used to capture macroeconomic effects such as growth or recession
- The universe of data and variables used to build the model
Model Validation Using the Utility-Based Wealth Growth Pickup Performance Measure
Introduction to Performance Measure
Modern financial theory rests on two pillars, utility theory and arbitrage pricing theory, where utility theory describes the investment decisions of a rational economic agent under certain well-specified, plausible assumptions. The approach to performance measures for probabilistic models described here is firmly grounded in utility theory and may be viewed as a natural application of its ideas to the model performance measurement setting.
The utility-based approach describes model performance in economic terms that can easily be communicated to those on the business side of the firm. In addition, the utility-based approach ensures that the resulting model performance measures are appropriate for a variety of financial modeling problems (for example, probability of default models, recovery models, and default correlation models).
In essence, the utility-based approach results in a single relative performance measure that may be interpreted as the estimated wealth growth pickup for a certain type of investor who uses one model versus an alternative model.
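For a concrete case, consider an investor with logarithmic utility. Under that assumption, and as we read Friedman and Sandow (2003), the relative performance measure reduces to the average log-likelihood ratio of the two models on the test data, which can be interpreted as a difference in expected wealth growth rates. The sketch below assumes this special case only; the function name, the clipping constant, and the toy data are illustrative and do not come from the source.

```python
import math

def wealth_growth_pickup(candidate_probs, benchmark_probs, defaults, eps=1e-12):
    """Average log-likelihood ratio of the candidate over the benchmark.

    Positive values favor the candidate model, negative values the benchmark.
    Assumes a log-utility investor; other utility functions need other formulas.
    """
    total = 0.0
    for pc, pb, d in zip(candidate_probs, benchmark_probs, defaults):
        # Clip so that a stated probability of exactly 0 or 1 cannot produce log(0).
        pc = min(max(pc, eps), 1.0 - eps)
        pb = min(max(pb, eps), 1.0 - eps)
        lik_c = pc if d == 1 else 1.0 - pc
        lik_b = pb if d == 1 else 1.0 - pb
        total += math.log(lik_c / lik_b)
    return total / len(defaults)

# Hypothetical holdout set scored by both models.
defaults = [0, 0, 0, 1, 0]
benchmark = [0.2, 0.2, 0.2, 0.2, 0.2]      # same PD for every firm
candidate = [0.1, 0.1, 0.1, 0.6, 0.1]      # discriminates between firms

pickup = wealth_growth_pickup(candidate, benchmark, defaults)
print(f"wealth growth pickup: {pickup:.4f}")
```

Because the measure is relative, swapping the two models flips its sign, and comparing a model with itself gives zero.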
Validation Processes
The fact that a utility maximizing investor will choose a model consistent with the above performance measure implies that the investor is choosing between competing models. Thus, validating a model requires a direct and fair comparison between the model that is under consideration and an alternative model (the benchmark model). While it is unfortunate that a direct and fair comparison between commercially available models is often difficult, if not impossible (see below), it is possible to make a direct and fair comparison between a model that is being developed (the candidate model) and a benchmark model.
In order to have a direct and fair comparison between models it is necessary to hold certain aspects of the modeling process constant between the candidate model and the benchmark model; in particular it is necessary to hold the training set and holdout set constant, which is why a direct and fair comparison between commercially available models is often difficult.
Often, however, two commercially available models will need to be compared. In this instance, the best that one can do is to score an identical data set using the two models. (Note that it would be preferable if the data set used for this were internal and not provided by one of the model vendors.) The user then computes the wealth growth pickup of using one model versus the alternative model. See Cangemi and Van de Castle (2002) for a more complete description of evaluating credit risk models.
Several aspects of the modeling process may vary, including:
- Data pre-processing techniques
- How the training set is utilized (e.g. what percentage of the training set is actually used to build the model)
- Whether and how the model output is processed
While the above aspects of the modeling process may vary, they are also subject to scrutiny and validation. For example, it is necessary to ensure that data pre-processing techniques are not arbitrary, as when a variable is capped at an arbitrarily chosen level. Some model building techniques involve using only a small fraction of the training set to ultimately build the model. This procedure may result in building a model on a particular subset of the state space, which, under repeated sampling, may manifest itself as an unstable performance measure. Finally, if the model output is processed, one must ask not only how, but also why the output is being processed.
Repeated sampling is necessary in the validation process in order to ensure that model performance is not a function of a particularly fortuitous holdout sample. This will allow for the examination of stability in the performance measure.
The preceding discussion assumes that the population used to build the model, that is, the training and holdout sets, is unbiased. This is closely related to the idea that the population used to build the model should be representative of the population to be evaluated by the model. If, for example, small firms are eliminated from the population used to build the model because of missing data, a large firm bias will be introduced to the model. This in and of itself is not an issue unless the model will be used to evaluate small firms at some point in the future. For probability of default models, a bias will be introduced if the model is trained on a sample of 50% defaulters when the population default rate is 2%. (This bias could also necessitate post-processing of the model output.) Finally, a model that is valid for Southeast Asian telecommunications firms might not be valid for North American telecommunications firms, and probably will not be valid for European furniture manufacturers.
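One commonly used form of such post-processing is a prior correction that reweights the model's odds from the sample default rate to the population default rate. The sketch below is an assumption on our part about what the post-processing might look like, not a method prescribed by the source. It is only appropriate when the training sample was over-sampled on the default flag alone, and the function name is illustrative.

```python
def rescale_pd(p_sample, sample_rate, population_rate):
    """Map a PD estimated on an over-sampled training set back to the
    population scale by reweighting default and non-default outcomes."""
    w_default = population_rate / sample_rate
    w_survive = (1.0 - population_rate) / (1.0 - sample_rate)
    numerator = p_sample * w_default
    return numerator / (numerator + (1.0 - p_sample) * w_survive)

# Model trained on 50% defaulters, population default rate of 2%.
average_firm = rescale_pd(0.50, sample_rate=0.50, population_rate=0.02)
risky_firm = rescale_pd(0.90, sample_rate=0.50, population_rate=0.02)

print(f"raw 0.50 -> {average_firm:.4f}")   # about 0.02, the population rate
print(f"raw 0.90 -> {risky_firm:.4f}")     # about 0.155
```

The transform is monotone, so the ranking of firms is unchanged; only the level of the probabilities moves, which is exactly what ranking-based measures would fail to detect.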
Model Development Considerations
It is important to use multiple firm year observations when possible and appropriate. For probability of default models, for example, Shumway (1999) shows that "single-period models give biased and inconsistent probability estimates."
Ideally, the modeling methodology will be flexible enough to utilize a variety of variables including "qualitative," accounting, market, and macroeconomic inputs. With this myriad of variable types to choose from, variable selection will be important as well. While an "everything but the kitchen sink" approach might work in model development if a principal components analysis or other data handling technique is used, the computational intensity and production time increase with the number of variables used in the model. A parsimonious selection of variables that are not highly correlated, are statistically significant, and are intuitively appealing will make the model more appealing to the ultimate user. Fewer variables that are easily computed and rely on easily accessed data will facilitate data entry and ultimately make the model more user friendly.
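As one possible illustration of screening out highly correlated variables, the sketch below walks through candidate variables in a priority order supplied by the modeler and keeps a variable only if it is not too correlated with any variable already kept. The greedy rule, the 0.8 threshold, the variable names, and the data are all assumptions made for the example. The source does not prescribe a particular selection procedure, and tests of statistical significance and intuitive appeal would still be needed.

```python
import math

def pearson(a, b):
    """Plain Pearson correlation between two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def drop_correlated(columns, ranked_names, max_abs_corr=0.8):
    """Keep each variable, in priority order, unless it is highly
    correlated with a variable that has already been kept."""
    kept = []
    for name in ranked_names:
        if all(abs(pearson(columns[name], columns[k])) <= max_abs_corr for k in kept):
            kept.append(name)
    return kept

# Hypothetical ratios for five firms.
columns = {
    "leverage":          [0.20, 0.50, 0.70, 0.90, 0.40],
    "debt_to_assets":    [0.21, 0.52, 0.69, 0.88, 0.41],   # nearly a copy of leverage
    "interest_coverage": [5.0, 1.2, 3.5, 2.8, 4.1],
}
selected = drop_correlated(columns, ["leverage", "debt_to_assets", "interest_coverage"])
print(selected)
```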
Finally, it would also be ideal if the modeling methodology extended to related problems that may be encountered in the future (for example, if the methodology lends itself to computing the probability of default within one year, within two years, and between the first and second year).
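The relationship between these horizons is simple arithmetic once cumulative default probabilities are available. The sketch below assumes the model supplies cumulative one-year and two-year probabilities of default. The function name and the example figures are illustrative.

```python
def second_year_pd(pd_1y, pd_2y_cumulative):
    """Split a cumulative two-year PD into its second-year components.

    Returns (marginal, conditional):
      marginal    = probability of defaulting between year 1 and year 2
      conditional = the same probability given survival through year 1
    """
    if not 0.0 <= pd_1y <= pd_2y_cumulative <= 1.0 or pd_1y == 1.0:
        raise ValueError("need 0 <= pd_1y <= pd_2y_cumulative <= 1 and pd_1y < 1")
    marginal = pd_2y_cumulative - pd_1y
    conditional = marginal / (1.0 - pd_1y)
    return marginal, conditional

# Example: 2% default within one year, 5% within two years.
marginal, conditional = second_year_pd(0.02, 0.05)
print(f"default between year 1 and year 2: {marginal:.4f}")
print(f"same, given survival of year 1:    {conditional:.4f}")
```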
Other Commonly Considered Measures
While the performance measure outlined above is that which a utility maximizing investor would use to choose between models, market participants may be more familiar with other measures. While we do not recommend these measures be relied upon to choose a model, they should be provided as a courtesy to those unfamiliar with the above performance measure. These auxiliary performance statistics include:
- Ability of the model to "order" firm or firm-year observations correctly (e.g., ROC Curves, Gini Curves, or Power Curves), provided one is already using a measure that informs as to whether or not the model is predicting the default probabilities correctly (e.g., the wealth growth pickup performance measure)
- Stability in model output across repeated random samples
- Stability in feature weights across repeated random samples
- Information regarding the "signs" of the features
- Ability of the model to get the probabilities (or other model output) "right" on average and in risk buckets
- Information regarding the most influential feature
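Checking whether the probabilities are "right" on average and in risk buckets, as listed above, can be sketched as follows. The equal-count bucketing, the number of buckets, and the simulated data are assumptions made for the example. In practice the buckets might instead follow internal rating grades.

```python
import random

def calibration_by_bucket(probs, defaults, n_buckets=5):
    """Sort observations by predicted PD, cut them into equal-count buckets,
    and compare the mean predicted PD with the realized default rate in each."""
    pairs = sorted(zip(probs, defaults))
    n = len(pairs)
    rows = []
    for b in range(n_buckets):
        chunk = pairs[b * n // n_buckets:(b + 1) * n // n_buckets]
        if not chunk:
            continue
        mean_pd = sum(p for p, _ in chunk) / len(chunk)
        realized = sum(d for _, d in chunk) / len(chunk)
        rows.append((b + 1, len(chunk), mean_pd, realized))
    return rows

# Simulated holdout set in which defaults really do occur at the stated PDs.
rng = random.Random(7)
probs = [rng.uniform(0.01, 0.30) for _ in range(1000)]
defaults = [1 if rng.random() < p else 0 for p in probs]

table = calibration_by_bucket(probs, defaults)
overall_predicted = sum(probs) / len(probs)
overall_realized = sum(defaults) / len(defaults)

print(f"overall: predicted {overall_predicted:.3f}, realized {overall_realized:.3f}")
for bucket, count, mean_pd, realized in table:
    print(f"bucket {bucket}: n={count:4d}  predicted {mean_pd:.3f}  realized {realized:.3f}")
```

Large, persistent gaps between the predicted and realized columns would indicate that the model orders firms sensibly but states the wrong probability levels.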
Summary of Validation Protocol
Probabilistic models should do more than order firms correctly; they should get the probabilities right as well. The utility-based performance measure captures a model's ability to get the probabilities right and expresses that ability in an economically intuitive way: as a wealth growth pickup. Five steps, summarized below, form the foundation of a validation protocol:
1. Construct one or more benchmark models
2. Train and test on the same samples
3. Compute relative performance measures (wealth growth pickup)
4. Repeat steps 2 and 3 multiple times
5. Select the best model based on the relative performance measures; when these are close, use the stability measure as a tie breaker
Following these five steps, in addition to consideration of other aspects of the model development process mentioned above, will help to ensure that the best model possible is chosen, and that utility is maximized.
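The protocol can be sketched end to end on simulated data. Everything specific in the example is an assumption made for illustration: the single risk driver, the two toy models, the 30 repetitions, the 30% holdout fraction, and the use of the average log-likelihood difference (the log-utility case) as the wealth growth pickup.

```python
import math
import random
import statistics

def log_lik(probs, defaults):
    """Average log-likelihood of the observed outcomes."""
    return sum(math.log(p if d else 1.0 - p) for p, d in zip(probs, defaults)) / len(defaults)

def smoothed_rate(rows):
    """Default rate with light smoothing so it is never exactly 0 or 1."""
    return (sum(d for _, d in rows) + 1) / (len(rows) + 2)

def fit_base_rate(train):
    """Benchmark model: the same PD for every firm."""
    rate = smoothed_rate(train)
    return lambda x: rate

def fit_two_bucket(train):
    """Candidate model: separate PDs below and above the median risk driver."""
    cut = statistics.median(x for x, _ in train)
    low = smoothed_rate([r for r in train if r[0] <= cut])
    high = smoothed_rate([r for r in train if r[0] > cut])
    return lambda x: low if x <= cut else high

def validate(data, n_repeats=30, holdout_frac=0.3, seed=0):
    """Steps 2 to 4: repeated random splits, both models on identical samples."""
    rng = random.Random(seed)
    pickups = []
    for _ in range(n_repeats):
        rows = data[:]
        rng.shuffle(rows)
        n_hold = int(len(rows) * holdout_frac)
        holdout, train = rows[:n_hold], rows[n_hold:]
        bench, cand = fit_base_rate(train), fit_two_bucket(train)
        y = [d for _, d in holdout]
        pickups.append(log_lik([cand(x) for x, _ in holdout], y)
                       - log_lik([bench(x) for x, _ in holdout], y))
    # Mean pickup drives the choice; its spread is the stability measure.
    return statistics.mean(pickups), statistics.stdev(pickups)

# Simulated population: one risk driver x, true PD rising with x.
sim = random.Random(42)
data = []
for _ in range(5000):
    x = sim.random()
    data.append((x, 1 if sim.random() < 0.02 + 0.3 * x else 0))

mean_pickup, spread = validate(data)
print(f"mean wealth growth pickup: {mean_pickup:.4f}  (std across splits: {spread:.4f})")
```

With several candidate models, the same loop would be run for each against the common benchmark, and the spread across splits would serve as the tie breaker in step 5.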
References
- R. Cangemi and K. Van de Castle, "Evaluating Credit Models: The Need for a Rigorous Approach," Credit (February, 2002).
- C. Friedman and S. Sandow, "Model Performance Measures for Expected Utility Maximizing Investors," International Journal of Theoretical and Applied Finance (June, 2003).
- T. Shumway, "Forecasting Bankruptcy More Accurately: A Simple Hazard Model," Working Paper, University of Michigan (1999).
Footnotes
1. Extracted in part from the book "Measuring and Managing Credit Risk," A. de Servigny and O. Renault, 2003.
2. The Federal Reserve's January 1997 "Senior Loan Officer Opinion Survey of Bank Lending Practices."
3. A. K. Jain, R. P. W. Duin, and J. Mao, "Statistical Pattern Recognition: A Review," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 1 (2000).
Source: Standard & Poor's
