Kolmogorov–Smirnov Statistic: Measuring model power in credit risk modeling

Machine Learning


These days, people are taking out more loans than ever before. There are home loans for those who want to build their own home, real estate loans against property you already own, agricultural loans, educational loans, business loans, and gold loans.

In addition to these, there is also the EMI option to buy items such as TVs, refrigerators, furniture, and mobile phones.

But does everyone who applies get their loan approved?

The bank does not grant loans to everyone who applies. There is a process to follow before a loan is approved.

We know that machine learning and data science are currently applied across the industry, and banks are using them too.

If a customer applies for a loan, the bank needs to know the likelihood that the customer will pay it off on time.

For this purpose, banks use predictive models primarily based on logistic regression or other machine learning methods.

We already know that applying these methods assigns a probability to each applicant.

This is a classification problem, and it requires separating defaulters from non-defaulters.

Defaulter: a customer who has failed to repay their loan (missing payments or stopping payment entirely).

Non-defaulter: a customer who pays off their loan on time.

We have already discussed accuracy and ROC-AUC to evaluate classification models.

In this article, I will explain the Kolmogorov-Smirnov statistic (KS statistic), which is used to evaluate classification models, particularly in the banking sector.

We use the German credit dataset to understand the KS statistic.

This dataset contains information about 1,000 loan applicants, described by 20 features including account status, loan duration, credit amount, employment, housing, and personal status.

The target variable indicates whether the applicant is a non-defaulter (represented by 1) or a defaulter (represented by 2).

Information about the dataset can be found here.

Next, you need to build a classification model to classify applicants. Since this is a binary classification problem, we apply logistic regression to this dataset.

code:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load dataset
file_path = "C:/german.data"
data = pd.read_csv(file_path, sep=" ", header=None)

# Rename columns
columns = [f"col_{i}" for i in range(1, 21)] + ["target"]
data.columns = columns

# Features and target
X = pd.get_dummies(data.drop(columns=["target"]), drop_first=True)
y = data["target"]   # keep as 1 and 2

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Train logistic regression
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

# Predicted probabilities
y_pred_proba = model.predict_proba(X_test)

# Results DataFrame
results = pd.DataFrame({
    "Actual": y_test.values,
    "Pred_Prob_Class2": y_pred_proba[:, 1]
})

print(results.head())

We already know that applying logistic regression gives us predicted probability.

Images by the author

Next, let's consider a 10-point sample from this output to understand how the KS statistic is calculated.

Images by the author

Here, the highest predicted probability is 0.92. This means that there is a 92% chance that this applicant will default.

Next, let's proceed with the KS statistic calculation.

First, we sort applicants in descending order of predicted probability, so that higher-risk applicants are at the top.

Images by the author

We already know that “1” represents a non-defaulter and “2” represents a defaulter.

The next step is to calculate the cumulative counts of non-defaulters and defaulters at each row.

Images by the author

The next step is to convert the cumulative counts of defaulters and non-defaulters into cumulative rates.

The cumulative defaulter count is divided by the total number of defaulters, and the cumulative non-defaulter count is divided by the total number of non-defaulters.

Images by the author

Next, calculate the absolute difference between the cumulative defaulter rate and the cumulative non-defaulter rate.

Images by the author

The maximum difference between the cumulative defaulter rate and the cumulative non-defaulter rate is 0.83, which is the KS statistic for this sample.

Here, the KS statistic is 0.83, occurring at a predicted probability of 0.29.

This means that, at this threshold, the model's cumulative capture rate of defaulters exceeds its rate of wrongly flagged non-defaulters by 83 percentage points.
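The manual calculation above can be sketched in a few lines of pandas. The ten probabilities and labels below are reconstructed from the threshold walkthrough in this article (1 = non-defaulter, 2 = defaulter):

```python
import pandas as pd

# Ten-applicant sample, reconstructed from the walkthrough
# (1 = non-defaulter, 2 = defaulter).
sample = pd.DataFrame({
    "prob":   [0.92, 0.63, 0.51, 0.39, 0.29, 0.20, 0.13, 0.10, 0.05, 0.01],
    "actual": [2,    2,    2,    1,    2,    1,    1,    1,    1,    1],
})

# Sort by predicted probability, highest risk first.
sample = sample.sort_values("prob", ascending=False).reset_index(drop=True)

# Cumulative rates of defaulters and non-defaulters at each row.
is_def = (sample["actual"] == 2).astype(int)
cum_def_rate = is_def.cumsum() / is_def.sum()
cum_nondef_rate = (1 - is_def).cumsum() / (1 - is_def).sum()

# KS statistic: maximum absolute gap between the two cumulative rates.
ks = (cum_def_rate - cum_nondef_rate).abs().max()
print(f"KS = {ks:.2f}")  # KS = 0.83
```

The maximum gap occurs at the fifth row, matching the 0.29 threshold found by hand.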


Here we can observe that:

Cumulative defaulter rate = true positive rate (the proportion of actual defaulters captured so far).

Cumulative non-defaulter rate = false positive rate (the proportion of non-defaulters mistakenly flagged as defaulters).

But since we haven't fixed a threshold here, how can we talk about true positive and false positive rates?

Let's take a look at how the cumulative rates are equal to the TPR and FPR.

First, we treat every probability as a threshold and calculate the TPR and FPR.

\[
\begin{aligned}
\mathbf{At\ threshold\ 0.92:} & \\[4pt]
tp&=1,\quad fn=3,\quad fp=0,\quad tn=6\\[6pt]
tpr&=\tfrac{1}{4}=0.25\\[6pt]
fpr&=\tfrac{0}{6}=0\\[6pt]
\Rightarrow(\mathrm{fpr},\,\mathrm{tpr})&=(0,\,0.25)
\end{aligned}
\]

\[
\begin{aligned}
\mathbf{At\ threshold\ 0.63:} & \\[4pt]
tp&=2,\quad fn=2,\quad fp=0,\quad tn=6\\[6pt]
tpr&=\tfrac{2}{4}=0.50\\[6pt]
fpr&=\tfrac{0}{6}=0\\[6pt]
\Rightarrow(\mathrm{fpr},\,\mathrm{tpr})&=(0,\,0.50)
\end{aligned}
\]

\[
\begin{aligned}
\mathbf{At\ threshold\ 0.51:} & \\[4pt]
tp&=3,\quad fn=1,\quad fp=0,\quad tn=6\\[6pt]
tpr&=\tfrac{3}{4}=0.75\\[6pt]
fpr&=\tfrac{0}{6}=0\\[6pt]
\Rightarrow(\mathrm{fpr},\,\mathrm{tpr})&=(0,\,0.75)
\end{aligned}
\]

\[
\begin{aligned}
\mathbf{At\ threshold\ 0.39:} & \\[4pt]
tp&=3,\quad fn=1,\quad fp=1,\quad tn=5\\[6pt]
tpr&=\tfrac{3}{4}=0.75\\[6pt]
fpr&=\tfrac{1}{6}\approx 0.17\\[6pt]
\Rightarrow(\mathrm{fpr},\,\mathrm{tpr})&=(0.17,\,0.75)
\end{aligned}
\]

\[
\begin{aligned}
\mathbf{At\ threshold\ 0.29:} & \\[4pt]
tp&=4,\quad fn=0,\quad fp=1,\quad tn=5\\[6pt]
tpr&=\tfrac{4}{4}=1.00\\[6pt]
fpr&=\tfrac{1}{6}\approx 0.17\\[6pt]
\Rightarrow(\mathrm{fpr},\,\mathrm{tpr})&=(0.17,\,1.00)
\end{aligned}
\]

\[
\begin{aligned}
\mathbf{At\ threshold\ 0.20:} & \\[4pt]
tp&=4,\quad fn=0,\quad fp=2,\quad tn=4\\[6pt]
tpr&=\tfrac{4}{4}=1.00\\[6pt]
fpr&=\tfrac{2}{6}\approx 0.33\\[6pt]
\Rightarrow(\mathrm{fpr},\,\mathrm{tpr})&=(0.33,\,1.00)
\end{aligned}
\]

\[
\begin{aligned}
\mathbf{At\ threshold\ 0.13:} & \\[4pt]
tp&=4,\quad fn=0,\quad fp=3,\quad tn=3\\[6pt]
tpr&=\tfrac{4}{4}=1.00\\[6pt]
fpr&=\tfrac{3}{6}=0.50\\[6pt]
\Rightarrow(\mathrm{fpr},\,\mathrm{tpr})&=(0.50,\,1.00)
\end{aligned}
\]

\[
\begin{aligned}
\mathbf{At\ threshold\ 0.10:} & \\[4pt]
tp&=4,\quad fn=0,\quad fp=4,\quad tn=2\\[6pt]
tpr&=\tfrac{4}{4}=1.00\\[6pt]
fpr&=\tfrac{4}{6}\approx 0.67\\[6pt]
\Rightarrow(\mathrm{fpr},\,\mathrm{tpr})&=(0.67,\,1.00)
\end{aligned}
\]

\[
\begin{aligned}
\mathbf{At\ threshold\ 0.05:} & \\[4pt]
tp&=4,\quad fn=0,\quad fp=5,\quad tn=1\\[6pt]
tpr&=\tfrac{4}{4}=1.00\\[6pt]
fpr&=\tfrac{5}{6}\approx 0.83\\[6pt]
\Rightarrow(\mathrm{fpr},\,\mathrm{tpr})&=(0.83,\,1.00)
\end{aligned}
\]

\[
\begin{aligned}
\mathbf{At\ threshold\ 0.01:} & \\[4pt]
tp&=4,\quad fn=0,\quad fp=6,\quad tn=0\\[6pt]
tpr&=\tfrac{4}{4}=1.00\\[6pt]
fpr&=\tfrac{6}{6}=1.00\\[6pt]
\Rightarrow(\mathrm{fpr},\,\mathrm{tpr})&=(1.00,\,1.00)
\end{aligned}
\]

From the above calculations, we can see that the cumulative defaulter rate corresponds to the true positive rate (TPR), and the cumulative non-defaulter rate corresponds to the false positive rate (FPR).

When calculating the cumulative default rate and the cumulative non-default rate, each row acts as a threshold, and the rates are calculated up to that row.

Here we can observe that the KS statistic = max(|TPR − FPR|).
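This identity can be checked directly with scikit-learn's `roc_curve`, which sweeps every predicted probability as a threshold. The sketch below uses the same ten-point sample, with labels reconstructed from the walkthrough above:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Ten-point sample from the walkthrough (1 = non-defaulter, 2 = defaulter).
probs = np.array([0.92, 0.63, 0.51, 0.39, 0.29, 0.20, 0.13, 0.10, 0.05, 0.01])
actual = np.array([2, 2, 2, 1, 2, 1, 1, 1, 1, 1])

# roc_curve sweeps each probability as a threshold;
# pos_label=2 marks defaulters as the positive class.
fpr, tpr, thresholds = roc_curve(actual, probs, pos_label=2)

# KS statistic = max(TPR - FPR) across all thresholds.
ks = np.max(tpr - fpr)
print(f"KS = {ks:.2f}")  # KS = 0.83
```

The result matches the cumulative-rate calculation, confirming that the KS statistic is simply the largest vertical distance between the ROC curve's TPR and FPR.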


Next, let's calculate the KS statistic for the complete dataset.

code:

import matplotlib.pyplot as plt

# Create DataFrame with actual test labels and predicted probabilities
results = pd.DataFrame({
    "Actual": y_test.values,
    "Pred_Prob_Class2": y_pred_proba[:, 1]
})

# Mark defaulters (2) and non-defaulters (1)
results["is_defaulter"] = (results["Actual"] == 2).astype(int)
results["is_nondefaulter"] = 1 - results["is_defaulter"]

# Sort by predicted probability
results = results.sort_values("Pred_Prob_Class2", ascending=False).reset_index(drop=True)

# Totals
total_defaulters = results["is_defaulter"].sum()
total_nondefaulters = results["is_nondefaulter"].sum()

# Cumulative counts and rates
results["cum_defaulters"] = results["is_defaulter"].cumsum()
results["cum_nondefaulters"] = results["is_nondefaulter"].cumsum()
results["cum_def_rate"] = results["cum_defaulters"] / total_defaulters
results["cum_nondef_rate"] = results["cum_nondefaulters"] / total_nondefaulters

# KS statistic
results["KS"] = (results["cum_def_rate"] - results["cum_nondef_rate"]).abs()
ks_value = results["KS"].max()
ks_index = results["KS"].idxmax()

print(f"KS Statistic = {ks_value:.3f} at probability {results.loc[ks_index, 'Pred_Prob_Class2']:.4f}")

# Plot KS curve
plt.figure(figsize=(8,6))
plt.plot(results.index, results["cum_def_rate"], label="Cumulative Defaulter Rate (TPR)", color="red")
plt.plot(results.index, results["cum_nondef_rate"], label="Cumulative Non-Defaulter Rate (FPR)", color="blue")

# Highlight KS point
plt.vlines(x=ks_index,
           ymin=results.loc[ks_index, "cum_nondef_rate"],
           ymax=results.loc[ks_index, "cum_def_rate"],
           colors="green", linestyles="--", label=f"KS = {ks_value:.3f}")

plt.xlabel("Applicants (sorted by predicted probability)")
plt.ylabel("Cumulative Rate")
plt.title("Kolmogorov–Smirnov (KS) Curve")
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

plot:

Images by the author

The maximum gap is 0.2928, at a predicted probability of 0.530.


Now that you know how to calculate the KS statistic, let's discuss its importance.

Here, we created a classification model and evaluated it using the KS statistic, but there are also other classification metrics, such as accuracy and ROC-AUC.

We already know that accuracy is specific to one threshold and varies as the threshold changes.
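This threshold dependence is easy to demonstrate on the ten-point sample. The cutoffs 0.25 and 0.70 below are illustrative choices, not values from the article:

```python
import numpy as np

# Ten-point sample from the walkthrough (1 = non-defaulter, 2 = defaulter).
probs = np.array([0.92, 0.63, 0.51, 0.39, 0.29, 0.20, 0.13, 0.10, 0.05, 0.01])
actual = np.array([2, 2, 2, 1, 2, 1, 1, 1, 1, 1])

def accuracy_at(threshold):
    # Flag an applicant as a defaulter (2) when probability >= threshold.
    predicted = np.where(probs >= threshold, 2, 1)
    return (predicted == actual).mean()

print(accuracy_at(0.25))  # 0.9 -- catches all four defaulters, one false alarm
print(accuracy_at(0.70))  # 0.7 -- misses three of the four defaulters
```

The same model scores 0.9 or 0.7 depending purely on where the cutoff is placed, which is why a threshold-free measure like KS is preferred for judging model strength.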

ROC-AUC provides a number indicating the overall ranking ability of the model.

But why is the KS statistic used by banks?

The KS statistic provides a single number representing the maximum gap between the cumulative distributions of defaulters and non-defaulters.

Let's go back to the sample data.

We got a KS statistic of 0.83 at a predicted probability of 0.29.

We have already explained that each row acts as a threshold.

So, what happened at 0.29?

A threshold of 0.29 means that every applicant with a predicted probability greater than or equal to 0.29 is flagged as a defaulter.

At 0.29, the top five rows are flagged as defaulters. Of these five, four are actual defaulters, and one is a non-defaulter falsely predicted as a defaulter.

So true positives = 4 and false positives = 1.

The remaining five rows are predicted as non-defaulters.

At this point, the model has captured all four defaulters while incorrectly flagging one non-defaulter as a defaulter.

Here, the TPR reaches its maximum of 1 and the FPR is 0.17.

Therefore, KS statistic = 1 − 0.17 = 0.83.

If we move to lower thresholds, as we did earlier, the TPR stays at 1 but the FPR increases.

This reduces the gap between the two groups.

Here, at 0.29, the model captures 100% of defaulters while wrongly flagging only 17% of non-defaulters (in this sample data), giving the maximum separation of 83 percentage points between the two groups.


Does the bank determine the threshold based on the KS statistic?

The KS statistic shows the maximum gap between the two groups, but banks do not set the approval threshold based on this statistic alone.

The KS statistic is used to verify the strength of the model, while the actual threshold is determined by considering risk, profitability, and regulatory guidelines.

If the KS is below 20, the model is considered weak.
If it is between 20 and 40, it is considered acceptable.
If the KS is in the 50-70 range, it is considered a good model.
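As a rough sketch, those bands could be encoded as follows. The cut-offs are the ones quoted above, expressed as percentages; the bands leave a 40-50 gap unclassified, and different institutions use slightly different conventions:

```python
def ks_band(ks_percent):
    """Rough interpretation bands for a KS value given in percent,
    using the cut-offs quoted in this article."""
    if ks_percent < 20:
        return "weak"
    if ks_percent <= 40:
        return "acceptable"
    if 50 <= ks_percent <= 70:
        return "good"
    return "unclassified"

# The full-dataset KS of 0.2928 found above, as a percentage:
print(ks_band(29.28))  # acceptable
```

By this reading, the logistic regression model on the German credit dataset lands in the acceptable band.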


Dataset

The dataset used in this blog is the German credit dataset published in the UCI Machine Learning Repository. It is available under a Creative Commons Attribution 4.0 International (CC BY 4.0) license, which means it can be freely used and shared with appropriate attribution.


I hope this blog post gives you a basic understanding of the Kolmogorov-Smirnov statistic. If you enjoyed reading it, consider sharing it with your network. And feel free to share your thoughts.

If you haven't read my blog on ROC-AUC yet, you can check it out here.

Thank you for reading!


