Kolmogorov–Smirnov Statistic: Measuring model power in credit risk modeling

Machine Learning


These days, people are taking out more loans than ever before. There are home loans for those who want to build their own home, real estate loans against property you already own, agricultural loans, educational loans, business loans, and gold loans.

In addition to these, there is also the EMI option to buy items such as TVs, refrigerators, furniture, and mobile phones.

But does everyone who applies get their loan approved?

The bank does not grant loans to everyone who applies. There is a process to follow before a loan is approved.

We know that machine learning and data science are currently applied across the industry, and banks are using them too.

If a customer applies for a loan, the bank needs to know the likelihood that the customer will pay it off on time.

For this purpose, banks use predictive models primarily based on logistic regression or other machine learning methods.

We already know that applying these methods assigns a probability to each applicant.

This is a classification problem, and it requires separating defaulters from non-defaulters.

Defaulter: a customer who has failed to repay their loan (missing payments or stopping payment entirely).

Non-defaulter: a customer who pays off their loan on time.

We have already discussed accuracy and ROC-AUC to evaluate classification models.

In this article, I will explain the Kolmogorov-Smirnov statistic (KS statistic), which is used to evaluate classification models, particularly in the banking sector.

We use the German credit dataset to understand the KS statistic.

This dataset contains information about 1,000 loan applicants, described by 20 features including account status, loan duration, credit amount, employment, housing, and personal status.

The target variable indicates whether the applicant is a non-defaulter (represented by 1) or a defaulter (represented by 2).

Information about the dataset can be found here.

Next, you need to build a classification model to classify applicants. Since this is a binary classification problem, we apply logistic regression to this dataset.

code:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load dataset
file_path = "C:/german.data"
data = pd.read_csv(file_path, sep=" ", header=None)

# Rename columns
columns = [f"col_{i}" for i in range(1, 21)] + ["target"]
data.columns = columns

# Features and target
X = pd.get_dummies(data.drop(columns=["target"]), drop_first=True)
y = data["target"]   # keep as 1 and 2

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Train logistic regression
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

# Predicted probabilities
y_pred_proba = model.predict_proba(X_test)

# Results DataFrame
results = pd.DataFrame({
    "Actual": y_test.values,
    "Pred_Prob_Class2": y_pred_proba[:, 1]
})

print(results.head())

We already know that applying logistic regression gives us predicted probability.

Images by the author

Next, let's consider a 10-point sample from this output to understand how the KS statistic is calculated.

Images by the author

Here, the highest predicted probability is 0.92. This means that there is a 92% chance that this applicant will default.

Next, let's proceed with the KS statistic calculation.

First, we sort applicants in descending order of predicted probability, so that higher-risk applicants are at the top.

Images by the author

We already know that “1” represents a non-defaulter and “2” represents a defaulter.

The next step is to calculate the cumulative counts of non-defaulters and defaulters at each row.

Images by the author

The next step is to convert the cumulative counts of defaulters and non-defaulters into cumulative rates.

The cumulative defaulter count is divided by the total number of defaulters, and the cumulative non-defaulter count is divided by the total number of non-defaulters.

Images by the author

Next, calculate the absolute difference between the cumulative defaulter rate and the cumulative non-defaulter rate.

Images by the author

The maximum difference between the cumulative defaulter rate and the cumulative non-defaulter rate is 0.83, which is the KS statistic for this sample.

Here, the KS statistic is 0.83, occurring at a predicted probability of 0.29.

This means that, at this threshold, the model's cumulative capture rate of defaulters exceeds its rate of wrongly flagged non-defaulters by 83 percentage points.
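The manual calculation above can be sketched in a few lines of pandas. The ten probabilities and labels below are reconstructed from the threshold walkthrough in this article (1 = non-defaulter, 2 = defaulter):

```python
import pandas as pd

# Ten-applicant sample, reconstructed from the walkthrough
# (1 = non-defaulter, 2 = defaulter).
sample = pd.DataFrame({
    "prob":   [0.92, 0.63, 0.51, 0.39, 0.29, 0.20, 0.13, 0.10, 0.05, 0.01],
    "actual": [2,    2,    2,    1,    2,    1,    1,    1,    1,    1],
})

# Sort by predicted probability, highest risk first.
sample = sample.sort_values("prob", ascending=False).reset_index(drop=True)

# Cumulative rates of defaulters and non-defaulters at each row.
is_def = (sample["actual"] == 2).astype(int)
cum_def_rate = is_def.cumsum() / is_def.sum()
cum_nondef_rate = (1 - is_def).cumsum() / (1 - is_def).sum()

# KS statistic: maximum absolute gap between the two cumulative rates.
ks = (cum_def_rate - cum_nondef_rate).abs().max()
print(f"KS = {ks:.2f}")  # KS = 0.83
```

The maximum gap occurs at the fifth row, matching the 0.29 threshold found by hand.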


Here we can observe that:

Cumulative defaulter rate = true positive rate (the proportion of actual defaulters captured so far).

Cumulative non-defaulter rate = false positive rate (the proportion of non-defaulters mistakenly flagged as defaulters).

But since we haven't fixed a threshold here, how can we talk about true positive and false positive rates?

Let's take a look at how the cumulative rates are equal to the TPR and FPR.

First, we treat every probability as a threshold and calculate the TPR and FPR.

\[
\begin{aligned}
\mathbf{At\ threshold\ 0.92:} & \\[4pt]
tp&=1,\quad fn=3,\quad fp=0,\quad tn=6\\[6pt]
tpr&=\tfrac{1}{4}=0.25\\[6pt]
fpr&=\tfrac{0}{6}=0\\[6pt]
\Rightarrow(\mathrm{fpr},\,\mathrm{tpr})&=(0,\,0.25)
\end{aligned}
\]

\[
\begin{aligned}
\mathbf{At\ threshold\ 0.63:} & \\[4pt]
tp&=2,\quad fn=2,\quad fp=0,\quad tn=6\\[6pt]
tpr&=\tfrac{2}{4}=0.50\\[6pt]
fpr&=\tfrac{0}{6}=0\\[6pt]
\Rightarrow(\mathrm{fpr},\,\mathrm{tpr})&=(0,\,0.50)
\end{aligned}
\]

\[
\begin{aligned}
\mathbf{At\ threshold\ 0.51:} & \\[4pt]
tp&=3,\quad fn=1,\quad fp=0,\quad tn=6\\[6pt]
tpr&=\tfrac{3}{4}=0.75\\[6pt]
fpr&=\tfrac{0}{6}=0\\[6pt]
\Rightarrow(\mathrm{fpr},\,\mathrm{tpr})&=(0,\,0.75)
\end{aligned}
\]

\[
\begin{aligned}
\mathbf{At\ threshold\ 0.39:} & \\[4pt]
tp&=3,\quad fn=1,\quad fp=1,\quad tn=5\\[6pt]
tpr&=\tfrac{3}{4}=0.75\\[6pt]
fpr&=\tfrac{1}{6}\approx 0.17\\[6pt]
\Rightarrow(\mathrm{fpr},\,\mathrm{tpr})&=(0.17,\,0.75)
\end{aligned}
\]

\[
\begin{aligned}
\mathbf{At\ threshold\ 0.29:} & \\[4pt]
tp&=4,\quad fn=0,\quad fp=1,\quad tn=5\\[6pt]
tpr&=\tfrac{4}{4}=1.00\\[6pt]
fpr&=\tfrac{1}{6}\approx 0.17\\[6pt]
\Rightarrow(\mathrm{fpr},\,\mathrm{tpr})&=(0.17,\,1.00)
\end{aligned}
\]

\[
\begin{aligned}
\mathbf{At\ threshold\ 0.20:} & \\[4pt]
tp&=4,\quad fn=0,\quad fp=2,\quad tn=4\\[6pt]
tpr&=\tfrac{4}{4}=1.00\\[6pt]
fpr&=\tfrac{2}{6}\approx 0.33\\[6pt]
\Rightarrow(\mathrm{fpr},\,\mathrm{tpr})&=(0.33,\,1.00)
\end{aligned}
\]

\[
\begin{aligned}
\mathbf{At\ threshold\ 0.13:} & \\[4pt]
tp&=4,\quad fn=0,\quad fp=3,\quad tn=3\\[6pt]
tpr&=\tfrac{4}{4}=1.00\\[6pt]
fpr&=\tfrac{3}{6}=0.50\\[6pt]
\Rightarrow(\mathrm{fpr},\,\mathrm{tpr})&=(0.50,\,1.00)
\end{aligned}
\]

\[
\begin{aligned}
\mathbf{At\ threshold\ 0.10:} & \\[4pt]
tp&=4,\quad fn=0,\quad fp=4,\quad tn=2\\[6pt]
tpr&=\tfrac{4}{4}=1.00\\[6pt]
fpr&=\tfrac{4}{6}\approx 0.67\\[6pt]
\Rightarrow(\mathrm{fpr},\,\mathrm{tpr})&=(0.67,\,1.00)
\end{aligned}
\]

\[
\begin{aligned}
\mathbf{At\ threshold\ 0.05:} & \\[4pt]
tp&=4,\quad fn=0,\quad fp=5,\quad tn=1\\[6pt]
tpr&=\tfrac{4}{4}=1.00\\[6pt]
fpr&=\tfrac{5}{6}\approx 0.83\\[6pt]
\Rightarrow(\mathrm{fpr},\,\mathrm{tpr})&=(0.83,\,1.00)
\end{aligned}
\]

\[
\begin{aligned}
\mathbf{At\ threshold\ 0.01:} & \\[4pt]
tp&=4,\quad fn=0,\quad fp=6,\quad tn=0\\[6pt]
tpr&=\tfrac{4}{4}=1.00\\[6pt]
fpr&=\tfrac{6}{6}=1.00\\[6pt]
\Rightarrow(\mathrm{fpr},\,\mathrm{tpr})&=(1.00,\,1.00)
\end{aligned}
\]

From the above calculations, we can see that the cumulative defaulter rate corresponds to the true positive rate (TPR), and the cumulative non-defaulter rate corresponds to the false positive rate (FPR).

When calculating the cumulative default rate and the cumulative non-default rate, each row acts as a threshold, and the rates are calculated up to that row.

Here we can observe that the KS statistic = max(|TPR − FPR|).
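This identity can be checked directly with scikit-learn's `roc_curve`, which sweeps every predicted probability as a threshold. The sketch below uses the same ten-point sample, with labels reconstructed from the walkthrough above:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Ten-point sample from the walkthrough (1 = non-defaulter, 2 = defaulter).
probs = np.array([0.92, 0.63, 0.51, 0.39, 0.29, 0.20, 0.13, 0.10, 0.05, 0.01])
actual = np.array([2, 2, 2, 1, 2, 1, 1, 1, 1, 1])

# roc_curve sweeps each probability as a threshold;
# pos_label=2 marks defaulters as the positive class.
fpr, tpr, thresholds = roc_curve(actual, probs, pos_label=2)

# KS statistic = max(TPR - FPR) across all thresholds.
ks = np.max(tpr - fpr)
print(f"KS = {ks:.2f}")  # KS = 0.83
```

The result matches the cumulative-rate calculation, confirming that the KS statistic is simply the largest vertical distance between the ROC curve's TPR and FPR.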


Next, let's calculate the KS statistic for the complete dataset.

code:

import matplotlib.pyplot as plt

# Create DataFrame with actual test labels and predicted probabilities
results = pd.DataFrame({
    "Actual": y_test.values,
    "Pred_Prob_Class2": y_pred_proba[:, 1]
})

# Mark defaulters (2) and non-defaulters (1)
results["is_defaulter"] = (results["Actual"] == 2).astype(int)
results["is_nondefaulter"] = 1 - results["is_defaulter"]

# Sort by predicted probability
results = results.sort_values("Pred_Prob_Class2", ascending=False).reset_index(drop=True)

# Totals
total_defaulters = results["is_defaulter"].sum()
total_nondefaulters = results["is_nondefaulter"].sum()

# Cumulative counts and rates
results["cum_defaulters"] = results["is_defaulter"].cumsum()
results["cum_nondefaulters"] = results["is_nondefaulter"].cumsum()
results["cum_def_rate"] = results["cum_defaulters"] / total_defaulters
results["cum_nondef_rate"] = results["cum_nondefaulters"] / total_nondefaulters

# KS statistic
results["KS"] = (results["cum_def_rate"] - results["cum_nondef_rate"]).abs()
ks_value = results["KS"].max()
ks_index = results["KS"].idxmax()

print(f"KS Statistic = {ks_value:.3f} at probability {results.loc[ks_index, 'Pred_Prob_Class2']:.4f}")

# Plot KS curve
plt.figure(figsize=(8,6))
plt.plot(results.index, results["cum_def_rate"], label="Cumulative Defaulter Rate (TPR)", color="red")
plt.plot(results.index, results["cum_nondef_rate"], label="Cumulative Non-Defaulter Rate (FPR)", color="blue")

# Highlight KS point
plt.vlines(x=ks_index,
           ymin=results.loc[ks_index, "cum_nondef_rate"],
           ymax=results.loc[ks_index, "cum_def_rate"],
           colors="green", linestyles="--", label=f"KS = {ks_value:.3f}")

plt.xlabel("Applicants (sorted by predicted probability)")
plt.ylabel("Cumulative Rate")
plt.title("Kolmogorov–Smirnov (KS) Curve")
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

plot:

Images by the author

The maximum gap is 0.2928, at a predicted probability of 0.530.


Now that you know how to calculate the KS statistic, let's discuss its importance.

Here, we created a classification model and evaluated it using the KS statistic, but there are also other classification metrics, such as accuracy and ROC-AUC.

We already know that accuracy is specific to one threshold and varies as the threshold changes.
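This threshold dependence is easy to demonstrate on the ten-point sample. The cutoffs 0.25 and 0.70 below are illustrative choices, not values from the article:

```python
import numpy as np

# Ten-point sample from the walkthrough (1 = non-defaulter, 2 = defaulter).
probs = np.array([0.92, 0.63, 0.51, 0.39, 0.29, 0.20, 0.13, 0.10, 0.05, 0.01])
actual = np.array([2, 2, 2, 1, 2, 1, 1, 1, 1, 1])

def accuracy_at(threshold):
    # Flag an applicant as a defaulter (2) when probability >= threshold.
    predicted = np.where(probs >= threshold, 2, 1)
    return (predicted == actual).mean()

print(accuracy_at(0.25))  # 0.9 -- catches all four defaulters, one false alarm
print(accuracy_at(0.70))  # 0.7 -- misses three of the four defaulters
```

The same model scores 0.9 or 0.7 depending purely on where the cutoff is placed, which is why a threshold-free measure like KS is preferred for judging model strength.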

ROC-AUC provides a number indicating the overall ranking ability of the model.

But why is the KS statistic used by banks?

The KS statistic provides a single number representing the maximum gap between the cumulative distributions of defaulters and non-defaulters.

Let's go back to the sample data.

We got a KS statistic of 0.83 at a predicted probability of 0.29.

We have already explained that each row acts as a threshold.

So, what happened at 0.29?

A threshold of 0.29 means that every applicant with a predicted probability greater than or equal to 0.29 is flagged as a defaulter.

At 0.29, the top five rows are flagged as defaulters. Of these five, four are actual defaulters, and one is a non-defaulter falsely predicted as a defaulter.

So true positives = 4 and false positives = 1.

The remaining five rows are predicted as non-defaulters.

At this point, the model has captured all four defaulters while incorrectly flagging one non-defaulter as a defaulter.

Here, the TPR reaches its maximum of 1 and the FPR is 0.17.

Therefore, KS statistic = 1 − 0.17 = 0.83.

If we move to lower thresholds, as we did earlier, the TPR stays at 1 but the FPR increases.

This reduces the gap between the two groups.

Here, at 0.29, the model captures 100% of defaulters while wrongly flagging only 17% of non-defaulters (in this sample data), giving the maximum separation of 83 percentage points between the two groups.


Does the bank determine the threshold based on the KS statistic?

The KS statistic shows the maximum gap between the two groups, but banks do not set the approval threshold based on this statistic alone.

The KS statistic is used to verify the strength of the model, while the actual threshold is determined by considering risk, profitability, and regulatory guidelines.

If the KS is below 20, the model is considered weak.
If it is between 20 and 40, it is considered acceptable.
If the KS is in the 50-70 range, it is considered a good model.
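As a rough sketch, those bands could be encoded as follows. The cut-offs are the ones quoted above, expressed as percentages; the bands leave a 40-50 gap unclassified, and different institutions use slightly different conventions:

```python
def ks_band(ks_percent):
    """Rough interpretation bands for a KS value given in percent,
    using the cut-offs quoted in this article."""
    if ks_percent < 20:
        return "weak"
    if ks_percent <= 40:
        return "acceptable"
    if 50 <= ks_percent <= 70:
        return "good"
    return "unclassified"

# The full-dataset KS of 0.2928 found above, as a percentage:
print(ks_band(29.28))  # acceptable
```

By this reading, the logistic regression model on the German credit dataset lands in the acceptable band.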


Dataset

The dataset used in this blog is the German credit dataset published in the UCI Machine Learning Repository. It is available under a Creative Commons Attribution 4.0 International (CC BY 4.0) license, which means it can be freely used and shared with appropriate attribution.


I hope this blog post gives you a basic understanding of the Kolmogorov-Smirnov statistic. If you enjoyed reading it, consider sharing it with your network. And feel free to share your thoughts.

If you haven't read my blog on ROC-AUC yet, you can check it out here.

Thank you for reading!


