I won $10,000 in a machine learning competition – this is my complete strategy



This was my first ML competition, and honestly, I'm still a bit shocked.

I have worked as a data scientist in fintech for six years. When I saw Spectral Finance running a Web3 wallet credit scoring challenge, I decided to give it a try despite having zero blockchain experience.

These were my limitations:

  • I used a computer without a GPU
  • I only had the weekend (~10 hours) to work on it
  • I had never been exposed to Web3 or blockchain data before
  • I had never built a neural network for credit scoring

The competition goal was simple: use transaction history to predict which Web3 wallets are likely to default on their loans. Essentially, it's traditional credit scoring, but with wallet data instead of bank statements.

To my surprise, I finished 2nd and won $10,000 in USDC! Unfortunately, Spectral Finance has since shut down its competition site and leaderboard, but here is a screenshot from when I won.

My username was DS-Clau, and I placed second with a score of 83.66 (image by the author)

This experience taught me how important it is to understand the business problem. In this post, I'll explain in detail exactly how I did it, with Python code snippets, so you can replicate the approach in your next machine learning project or competition.

Start: No expensive hardware required

Let me be clear: you don't need an expensive cloud computing setup to win an ML competition (unless the dataset is too large to fit locally).

This competition's dataset included 77 features and ~443k rows, which is by no means small. The data came as .parquet files that I downloaded and queried using DuckDB.
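If you haven't used DuckDB before, it can query parquet files directly with SQL. Here's a minimal sketch of loading the data into pandas; the file name training_data.parquet is a placeholder, not the competition's actual path:

```python
# Minimal sketch: query the competition parquet file with DuckDB.
# "training_data.parquet" is a placeholder file name.
import duckdb

con = duckdb.connect()  # in-memory database
df = con.execute(
    "SELECT * FROM read_parquet('training_data.parquet')"
).df()  # materialize the result as a pandas DataFrame

print(df.shape)  # roughly (443_000, 78): 77 features plus the label
```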

I used my personal laptop, a MacBook Pro with 16GB RAM and no GPU. The entire dataset fit locally on the laptop, but I have to admit the training process was a bit slow.

Insight: A clever sampling strategy gets you 90% of the insights without the high computational cost. Many people are intimidated by large datasets and assume they need large cloud instances. You can start a project locally by sampling a portion of the dataset and examining that sample first.
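For instance, you can pull a 10% sample and profile it before committing to the full dataset (continuing from the snippet above):

```python
# Explore a 10% random sample before touching the full dataset
sample = df.sample(frac=0.10, random_state=42)
print(sample.describe())     # quick distribution summary per feature
print(sample.isna().mean())  # fraction of missing values per column
```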

EDA: Getting to know the data

This is where my fintech background became my superpower: I approached this the same way I would any other credit risk problem.

First Question for Credit Scoring: What is the distribution of classes?
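Checking this takes one line. In the sketch below, the label column name target is an assumption; the competition's actual column name may differ:

```python
# Class balance of the label column ("target" is an assumed name)
print(df["target"].value_counts(normalize=True))
```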

I winced when I saw the 62/38 split… a 38% default rate is very high from a business standpoint, but fortunately, pricing the product wasn't part of this competition.

Next, I wanted to see which features actually mattered.
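To do that, I ranked features by their Pearson correlation with the target. A minimal sketch, reusing df and the assumed target column from above:

```python
# Rank features by absolute Pearson correlation with the target
correlations = (
    df.corr(numeric_only=True)["target"]
    .drop("target")
    .sort_values(key=abs, ascending=False)
)
print(correlations.head(10))  # strongest linear relationships first
```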

This is what got me excited. The patterns were exactly what I'd expect from credit data:

  • risk_factor was the strongest predictor, showing a correlation with the target variable > 0.4 (a higher risk factor = more likely to default)
  • time_since_last_liquidated showed a strong negative correlation, meaning wallets that were liquidated recently were riskier. Recent liquidations are usually a high-risk signal, so this lined up as expected (recent liquidation = higher risk)
  • liquidation_count_sum_eth suggested that borrowers with a higher count of ETH liquidations were risk flags (more liquidations = higher-risk behavior)

Insight: Pearson correlation is a simple, intuitive way to understand the linear relationship between features and the target variable. It's a great way to build intuition about which features matter, but it shouldn't be the only criterion for deciding which features make it into the final model.

Feature selection: less is more

Here's something that always baffles executives when I explain it to them:

More features don't necessarily improve performance.

In fact, too many features mean worse performance and slower training, because the extra features add noise. Every irrelevant feature makes it a little harder for the model to find the actual patterns.

So feature selection is an important step I never skip. I used recursive feature elimination (RFE) to find the optimal number of features. Let me walk you through my process:
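In sketch form, it looks like the snippet below; treat the base estimator (logistic regression, which also fits the competition's model constraints), the 5-fold CV, and the AUC scoring as illustrative choices rather than my exact settings:

```python
# Recursive feature elimination with cross-validation (RFECV).
# The base estimator, CV folds, and scoring are illustrative choices.
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X = df.drop(columns=["target"])
y = df["target"]

selector = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,              # eliminate one feature per iteration
    cv=5,
    scoring="roc_auc",   # matches the competition's AUC-style metric
    n_jobs=-1,
)
selector.fit(X, y)

print(f"Optimal number of features: {selector.n_features_}")
selected_features = X.columns[selector.support_].tolist()
```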

The sweet spot turned out to be 34 features. Beyond that point, model performance as measured by AUC did not improve with additional features. So I trained the model on fewer than half of the given features, going from 77 down to 34.

Insight: This feature reduction eliminates noise while keeping the signal from the critical features, leading to a faster, more predictive model.

Building the neural network: Simple yet powerful architecture

Before defining the model architecture, the dataset had to be properly prepared:

  1. Split the data into training and validation sets (to check results after model training)
  2. Scale the features, because neural networks are very sensitive to outliers
  3. Convert the datasets to PyTorch tensors for efficient computation

Here is my data preprocessing pipeline:
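The sketch below covers those three steps; the 80/20 split ratio, the StandardScaler, and the column names are assumptions:

```python
# Sketch of the preprocessing pipeline: split, scale, tensorize.
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Split into training and validation sets (stratified on the label)
X_train, X_val, y_train, y_val = train_test_split(
    df[selected_features], df["target"],
    test_size=0.2, random_state=42, stratify=df["target"],
)

# 2. Scale features: fit on the training set only to avoid leakage
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)

# 3. Convert everything to PyTorch tensors
X_train_t = torch.tensor(X_train, dtype=torch.float32)
y_train_t = torch.tensor(y_train.values, dtype=torch.float32).unsqueeze(1)
X_val_t = torch.tensor(X_val, dtype=torch.float32)
y_val_t = torch.tensor(y_val.values, dtype=torch.float32).unsqueeze(1)
```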

Now comes the fun part: building the actual neural network model.

Important context: Spectral Finance (the competition organizer) restricted deployed models to neural networks and logistic regression only, because of its zero-knowledge proof system.

ZK proofs require mathematical circuits that can verify computations without revealing the underlying data, and neural networks and logistic regression can be converted into ZK circuits efficiently.

It was my first time building a neural network for credit scoring, so I wanted to keep things simple but effective. Here is my model architecture:
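In sketch form it looks like this; the layer count, width, dropout, and activations follow the choices listed below, while everything else (such as the class name) is illustrative:

```python
import torch.nn as nn

class CreditScoringNet(nn.Module):
    """5 hidden layers of 64 neurons, ReLU, dropout 0.2, sigmoid output."""

    def __init__(self, n_features: int, hidden: int = 64, n_layers: int = 5):
        super().__init__()
        layers = []
        in_dim = n_features
        for _ in range(n_layers):  # 5 hidden layers
            layers += [nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(0.2)]
            in_dim = hidden
        layers += [nn.Linear(hidden, 1), nn.Sigmoid()]  # default probability
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = CreditScoringNet(n_features=X_train_t.shape[1])
```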

Let's take a closer look at my architecture choices:

  • 5 hidden layers: deep enough to capture complex patterns, shallow enough to avoid overfitting
  • 64 neurons per layer: a good balance between capacity and computational efficiency
  • ReLU activation: the standard choice for hidden layers; helps prevent vanishing gradients
  • Dropout (0.2): prevents overfitting by randomly zeroing 20% of neurons during training
  • Sigmoid output: ideal for binary classification, outputting a probability between 0 and 1

Model Training: Where the Magic Happens

Now for the training loop that kicks off the model's learning process:
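Here's a sketch of the loop, using full-batch updates for brevity; the learning rate, patience, and epoch count are assumed hyperparameters, not my exact values:

```python
# Training loop sketch: SGD with momentum, BCE loss, early stopping.
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

best_val_loss, patience, bad_epochs = float("inf"), 10, 0
for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(X_train_t), y_train_t)
    loss.backward()
    optimizer.step()

    # Track validation loss, not just training loss
    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(X_val_t), y_val_t).item()

    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best_model.pt")  # keep best weights
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # early stopping: no recent improvement
            print(f"Early stopping at epoch {epoch}")
            break
```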

A few details about the model training process:

  • Early stopping: prevents overfitting by halting training when validation performance stops improving
  • SGD with momentum: a simple but effective optimizer choice
  • Validation tracking: essential for monitoring not only training loss but actual generalization performance

The training curves showed steady improvement without overfitting, which is exactly what I wanted to see.

Model training loss curve (image by the author)

Secret Weapon: Threshold Optimization

Here's where I probably beat competitors with more complicated models: I bet most people submitted their predictions at the default 0.5 threshold.

But I knew the default threshold wouldn't be optimal given the class imbalance (~38% of the loans defaulted), so I used precision-recall analysis to select a better cutoff.
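A sketch of that analysis, reusing the validation tensors from the training step (the 1e-12 term just guards against division by zero):

```python
# Pick the probability threshold that maximizes F1 on validation data
import numpy as np
from sklearn.metrics import precision_recall_curve

model.eval()
with torch.no_grad():
    val_probs = model(X_val_t).squeeze(1).numpy()

precision, recall, thresholds = precision_recall_curve(y_val, val_probs)
f1 = 2 * precision * recall / (precision + recall + 1e-12)

best = np.argmax(f1[:-1])  # the last precision/recall pair has no threshold
print(f"Best threshold: {thresholds[best]:.2f}, F1 there: {f1[best]:.3f}")
```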

I ultimately chose to maximize the F1 score, which is the harmonic mean of precision and recall. The optimal threshold based on the highest F1 score was 0.35 instead of 0.5. This single change improved my competition score by a few percentage points, which can be the difference between placing and winning.

Insight: In the real world, different types of errors have different costs. Miss a default and you lose money; reject a good customer and you lose potential profit. The threshold should reflect this reality, not be arbitrarily set to 0.5.

Conclusion

This competition reinforced something I've known for a while:

Success in machine learning is not about having the fanciest tools or the most complex algorithms.

It's about understanding your problem, applying a solid foundation, and focusing on what actually moves the needle.

You don't need a PhD to become a data scientist or win an ML competition.

You don't need to implement the latest research papers.

And you don't need expensive cloud resources.

All you need is domain knowledge, a solid foundation, and attention to the details that others overlook (such as threshold optimization).


Want to build AI skills?

I run AI Weekender, featuring fun weekend AI projects and quick, practical tips to help you build with AI.



