Building Financial Machine Learning with Scikit-Learn: A Comprehensive Guide | Nutdanai Wangpratham | April 2023


Machine learning has changed how financial institutions operate and make decisions. From stock price prediction and portfolio optimization to credit risk assessment and fraud detection, machine learning models have become essential tools in the financial industry. Scikit-learn, a popular Python library for machine learning, provides a robust and easy-to-use toolkit for building and training models. This article takes you from ‘zero to one’ in financial machine learning by walking through the process of developing a model with scikit-learn. Whether you’re a finance professional or a data scientist, it offers practical insights and best practices for applying scikit-learn to finance-related tasks. Let’s take a closer look at how you can put machine learning to work in the financial industry.

A machine learning model was trained using Kaggle’s comprehensive 200+ US Stock Financial Indicators (2014–2018) dataset. The model is designed to identify patterns and differences between stocks that perform strongly and those that do not. By leveraging the knowledge gained from this analysis, the model can predict which stocks are likely to be lucrative investments. However, please note that this article is only a starting point. It may not be perfect, but we plan to continually update and enhance it based on feedback and audience preferences.

I wrote the code in a notebook, and you can follow along. The first step is to import the data. I downloaded the files and saved them to Google Drive.

!gdown --id 10asORoUJ1Sj07bNNbVVWqPAxQhqsVEXw #import Data from Google Drive
!gdown --id 1nJAE7FJeV9cZZiqjpXEawhK41xdkFybb
!gdown --id 1WvpO8OmgFPG-aLM-iDZJ5A6O73u2Lo-b
!gdown --id 1AiRax6xFAq3zTm625xY5Kx91bbRvx-kO
!gdown --id 1we3vPt1ldoR5CksuaiIvEhf4_xdnTkKh

import pandas as pd

df_2014 = pd.read_csv("2014_Financial_Data.csv")
df_2015 = pd.read_csv("2015_Financial_Data.csv")
df_2016 = pd.read_csv("2016_Financial_Data.csv")
df_2017 = pd.read_csv("2017_Financial_Data.csv")
df_2018 = pd.read_csv("2018_Financial_Data.csv")

Understanding your data is a key step in the machine learning process. It is not advisable to simply feed data into a model without understanding the data’s characteristics, quality, and relevance to the problem at hand. Thorough data analysis and a discerning approach are essential.

As responsible practitioners of machine learning, we must pay close attention to data analysis. This includes preprocessing data to handle missing values, normalize data, and deal with outliers. Exploratory data analysis (EDA) techniques can also provide valuable insights and help uncover patterns in your data.

Describe the data using the describe() function.
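As a minimal sketch of what describe() reports, the snippet below uses a small made-up DataFrame (the column names and values are illustrative, not taken from the Kaggle files). On the real data you would call it on df_2014 or any of the other yearly frames.

```python
import pandas as pd

# Toy data standing in for one of the yearly files (values are made up)
toy = pd.DataFrame({
    "Revenue": [100.0, 250.0, 400.0, None],
    "EPS": [1.2, -0.5, 3.1, 0.8],
})

# describe() summarizes each numeric column: count, mean, std, min, quartiles, max
summary = toy.describe()
print(summary)

# count excludes missing values, which makes gaps in the data easy to spot
print(summary.loc["count"])
```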


There are a lot of variables, and the data may not be easy to understand. Additionally, the variables are on different scales, and using all of them is not a good idea. More on this later. Next, look at the target variable (PRICE VAR [%]).

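import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10, 7))
# Creating plot (the plotting call is missing from the original; a box plot of the target is assumed)
plt.boxplot(df["PRICE VAR [%]"].dropna())
plt.show()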
You can see that there are outliers in the data. We recommend removing them, here by keeping only the values between the 5% and 95% quantiles.

target = df["PRICE VAR [%]"]  # combine both conditions in a single boolean mask
df_no_outliers = df[(target > target.quantile(.05)) & (target < target.quantile(.95))]

In addition to this method, you can also replace values that exceed the limits with the limit values themselves.
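The replacement approach described above is often called winsorizing. A minimal sketch with pandas' clip(), assuming the same 5% and 95% quantile limits and a made-up series:

```python
import pandas as pd

# Made-up yearly price changes in percent, with two extreme values
returns = pd.Series([-95.0, -10.0, -2.0, 0.0, 3.0, 5.0, 8.0, 12.0, 20.0, 900.0])

# Limits taken from the 5% and 95% quantiles (same thresholds as the filter above)
lower = returns.quantile(0.05)
upper = returns.quantile(0.95)

# clip() keeps every row but replaces values outside the limits with the limits
clipped = returns.clip(lower=lower, upper=upper)
print(clipped)
```

Unlike filtering, this keeps the number of rows unchanged, which can matter when the dataset is small.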

In the next step, I would like to combine the multi-year data, removing the market's impact and using excess returns as the target. The market return must be subtracted first. Usually an index return is used, but since the dataset contains none, the cross-sectional mean (or the median or mode) can be used instead.

df_2014["Alpha"] =  df_2014["2015 PRICE VAR [%]"] - df_2014["2015 PRICE VAR [%]"].mean()
df_2015["Alpha"] = df_2015["2016 PRICE VAR [%]"] - df_2015["2016 PRICE VAR [%]"].mean()
df_2016["Alpha"] = df_2016["2017 PRICE VAR [%]"] - df_2016["2017 PRICE VAR [%]"].mean()
df_2017["Alpha"] = df_2017["2018 PRICE VAR [%]"] - df_2017["2018 PRICE VAR [%]"].mean()
df_2018["Alpha"] = df_2018["2019 PRICE VAR [%]"] - df_2018["2019 PRICE VAR [%]"].mean()
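The five assignments follow one pattern: Alpha is next year's price change minus its cross-sectional mean. A sketch of the same calculation as a loop, using two tiny made-up frames in place of the real files:

```python
import pandas as pd

# Tiny stand-ins for the yearly files; only the target column is included
frames = {
    2014: pd.DataFrame({"2015 PRICE VAR [%]": [10.0, -5.0, 25.0]}),
    2015: pd.DataFrame({"2016 PRICE VAR [%]": [0.0, 40.0, -10.0]}),
}

for year, frame in frames.items():
    target = f"{year + 1} PRICE VAR [%]"  # each file stores next year's price change
    # Subtract the cross-sectional mean as a stand-in for the market return
    frame["Alpha"] = frame[target] - frame[target].mean()

print(frames[2014])
```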

The next step is to concatenate the data.

df_all =  pd.concat([df_2014, df_2015, df_2016, df_2017, df_2018], axis=0)

There are 230 columns in total. I had to cut some variables first, because going through each one individually would drive me crazy. The next step is to inspect the missing (NaN) values and drop the columns that have too many.

pd.set_option('display.max_columns', None)  # show every column when displaying
# One row each for the column type, the count of nulls, and the percentage of nulls
data_info = pd.concat([
    df_all.dtypes.to_frame('column type').T,
    df_all.isnull().sum().to_frame('null values (nb)').T,
    (df_all.isnull().mean() * 100).to_frame('null values (%)').T,
])
data_info

Remove variables with more than 2,500 missing values.

nas_by_feature = df_all.isnull().sum(axis=0)
features_to_drop = nas_by_feature[nas_by_feature>2500].index
df_all.drop(features_to_drop, axis=1, inplace=True)
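To see what this threshold filter does, here is the same pattern on a toy frame, with a threshold of 1 instead of 2500 (all names and values are made up):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, 1.0, np.nan],
    "one_gap": [1.0, np.nan, 3.0, 4.0],
    "complete": [1.0, 2.0, 3.0, 4.0],
})

# Count missing values per column, then drop columns above the threshold
nas_by_feature = toy.isnull().sum(axis=0)
features_to_drop = nas_by_feature[nas_by_feature > 1].index
toy_reduced = toy.drop(columns=features_to_drop)

print(list(toy_reduced.columns))
```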

At this point, you could create additional variables, such as the rankings commonly used in factor investing, but I won’t cover that here.

As in the previous article, there are several ways to find factor importance, but today we will only discuss the basic ones.

First, correlation. Correlation is a statistical measure of the association between two variables. It shows the direction and strength of their relationship.

correlation = df_all.corr(numeric_only=True)  # skip non-numeric columns
correlation["Alpha"].sort_values(ascending=False).head(10)

These are the variables with the strongest positive correlation with Alpha. You can also find the variables with negative correlations.
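To list the most negatively correlated variables, sort in ascending order instead. A sketch on made-up data, where one column moves with Alpha and one moves against it:

```python
import pandas as pd

toy = pd.DataFrame({
    "Alpha": [1.0, 2.0, 3.0, 4.0, 5.0],
    "moves_with": [2.0, 4.0, 6.0, 8.0, 10.0],
    "moves_against": [5.0, 4.0, 3.0, 2.0, 1.0],
})

# Pearson correlation of every column with Alpha, excluding Alpha itself
corr_with_alpha = toy.corr()["Alpha"].drop("Alpha")

# Ascending order puts the most negative correlations first
print(corr_with_alpha.sort_values(ascending=True))
```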

# Fill the remaining missing values with each numeric column's median
df_all_fillna = df_all.fillna(df_all.median(numeric_only=True))

Feature scaling is an important step because some ML algorithms do not work well when the input numeric attributes have very different scales. There are several ways to scale; today I use standardization (zero mean, unit standard deviation).

from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()

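df_ = df_all_fillna.select_dtypes("number").drop(columns=["Alpha"])  # numeric features only, missing values filled, target excluded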
df_all_std_scaled = std_scaler.fit_transform(df_)
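Standardization rescales each column to zero mean and unit standard deviation. A minimal check of that behavior on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (made-up values)
features = np.array([[1.0, 1000.0],
                     [2.0, 2000.0],
                     [3.0, 6000.0]])

scaled = StandardScaler().fit_transform(features)

# After scaling, each column has mean ~0 and standard deviation ~1
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```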

Feature selection is an important step in machine learning and data analysis. The goal is to select a subset of relevant features from a larger set to improve model performance, reduce complexity, and increase interpretability. It matters because not all features are equally informative for a particular problem, and including irrelevant or redundant features can lead to overfitting, higher computational cost, and reduced interpretability.

There are many feature selection methods, but today we will only discuss the basic ones.

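from sklearn.feature_selection import SelectKBest, f_regression

# X and Y are not defined in the original listing; the scaled features and the Alpha target are assumed
X = df_all_std_scaled
Y = df_all_fillna["Alpha"]
bestfeatures = SelectKBest(k=10, score_func=f_regression)
fit = bestfeatures.fit(X, Y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(df_.columns)
#concat two dataframes for better visualization
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score'] #naming the dataframe columns
featureScores.nlargest(10,'Score').set_index('Specs') #print 10 best features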
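A self-contained sketch of the same kind of selection on synthetic data, where only the first two of five features drive the target, so they should receive the highest scores (the data and feature names are made up):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(200, 5)),
                     columns=["f0", "f1", "f2", "f3", "f4"])
# Target depends on f0 and f1 only, plus a little noise
y_toy = 3.0 * X_toy["f0"] - 2.0 * X_toy["f1"] + rng.normal(scale=0.1, size=200)

selector = SelectKBest(score_func=f_regression, k=2).fit(X_toy, y_toy)

# get_support() marks the columns that were kept
selected = list(X_toy.columns[selector.get_support()])
print(selected)
```

Note that f_regression scores each feature on its own, so it can miss features that only matter in combination with others.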

Then keep only the top 10 features as input data.

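top_features = featureScores.nlargest(10, 'Score')['Specs'].tolist()  # selection step assumed; it is missing from the original
df_all_std_scaled = std_scaler.fit_transform(df_all_fillna[top_features])
X = df_all_std_scaled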

Split the data into training and test sets.

validation_size = 0.2

Y = Y.to_numpy()
#In case the data is dependent on the time series, then train and test split should be done based on sequential sample
#This can be done by selecting an arbitrary split point in the ordered list of observations and creating two new datasets.
train_size = int(len(X) * (1-validation_size))
X_train, X_test = X[0:train_size], X[train_size:len(X)]
Y_train, Y_test = Y[0:train_size], Y[train_size:len(X)]
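The same sequential split can be written with scikit-learn's train_test_split by turning shuffling off. A small sketch on made-up arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# shuffle=False keeps the original order, so the last 20% becomes the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=False
)
print(y_tr, y_te)
```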

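from sklearn.linear_model import LinearRegression, Lasso, ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor, RandomForestRegressor, ExtraTreesRegressor

models = []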
models.append(('LR', LinearRegression()))
models.append(('LASSO', Lasso()))
models.append(('EN', ElasticNet()))
models.append(('KNN', KNeighborsRegressor()))
models.append(('CART', DecisionTreeRegressor()))
models.append(('SVR', SVR()))

models.append(('MLP', MLPRegressor()))
# Boosting methods
models.append(('ABR', AdaBoostRegressor()))
models.append(('GBR', GradientBoostingRegressor()))
# Bagging methods
models.append(('RFR', RandomForestRegressor()))
models.append(('ETR', ExtraTreesRegressor()))

Parameter settings:

num_folds = 10
seed = 7
scoring = 'neg_mean_squared_error'

This code defines the number of folds for cross-validation (num_folds), a random seed for reproducibility (seed), and a scoring metric to evaluate the model (scoring). In this case, the negative mean squared error is used as the scoring metric. Scikit-learn scorers treat higher values as better, so the mean squared error is reported with a negative sign; once the sign is inverted, lower values represent better performance.
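A short sketch of the sign convention described above, using synthetic data and a plain linear regression (both are stand-ins, not the article's data or chosen model):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(7)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

kfold = KFold(n_splits=10)
neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo,
                          cv=kfold, scoring="neg_mean_squared_error")

# scikit-learn scorers are "higher is better", so MSE comes back negated
mse = -1 * neg_mse
print(mse.mean(), mse.std())
```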

names = []
kfold_results = []
test_results = []
train_results = []

This creates the empty lists names, kfold_results, test_results, and train_results, which store the model names, cross-validation results, test results, and training results, respectively.

The loop below iterates over each (name, model) pair in models, creates a KFold instance with the specified number of folds (num_folds), and performs k-fold cross-validation on the training data (X_train and Y_train) using the cross_val_score() function. The negative mean squared error scores are inverted by multiplying by -1 and appended to the kfold_results list.

Finally, each model is fitted on the full training set, and its training and test mean squared errors are printed.

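from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import mean_squared_error

for name, model in models:
    names.append(name)
    # k-fold cross-validation on the training data; flip the sign to get MSE
    kfold = KFold(n_splits=num_folds)
    cv_results = -1 * cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    kfold_results.append(cv_results)
    # Fit on the full training set, then measure training and test error
    res = model.fit(X_train, Y_train)
    train_result = mean_squared_error(Y_train, res.predict(X_train))
    train_results.append(train_result)
    test_result = mean_squared_error(Y_test, res.predict(X_test))
    test_results.append(test_result)
    print("%s: %f (%f) %f %f" % (name, cv_results.mean(), cv_results.std(), train_result, test_result))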

Plot the results.

from matplotlib import pyplot
import numpy as np

# compare algorithms
fig = pyplot.figure()
ind = np.arange(len(names)) # the x locations for the groups
width = 0.35 # the width of the bars

fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.bar(ind - width/2, train_results, width=width, label='Train Error')
pyplot.bar(ind + width/2, test_results, width=width, label='Test Error')
ax.set_xticks(ind)
ax.set_xticklabels(names)
pyplot.legend()
pyplot.show()

At this point, the results may still be sub-optimal, but we now have an end-to-end pipeline for financial ML development. Many steps were covered only briefly. If this article gets a good response, I promise to fill in the missing pieces.
