Apply software design principles to machine learning model development
Software design principles are general guides for developing clean, readable, and maintainable code. Design principles are important because they provide best practices that make your code easier to understand, reuse, extend, and test. Writing code without incorporating at least some of the known best practices can result in code that is difficult to understand. If your code is hard to read, it’s even harder to explain, change, and maintain. Many design principles are used in software development. In general, these principles can be grouped into three buckets: clarity, maintainability, and collaboration.
Key concepts under the clarity bucket include readability, KISS (Keep It Simple, Stupid), DRY (Don’t Repeat Yourself), and modularity. Under the maintainability bucket are the Single Responsibility Principle (SRP), testability, and error handling. Under the collaboration bucket are version control and documentation. Clearly, these categories overlap significantly. For example, modular code is usually SRP compliant. Nevertheless, it is helpful to consider each of these concepts independently.
Clarity
There are many ways to improve code readability, especially in Python. These include descriptive naming, consistent indentation, dividing complex tasks into smaller ones, avoiding long lines of code, and grouping related code. Modularity also helps with readability and clarity. Modularization means grouping similar code logic using modules, classes, methods, and functions. This also makes code easier to maintain, reuse, and share with other developers. Without modularity, functionality can become tightly coupled and overly complex, which can make debugging or adding new features difficult.
Simplicity, also known as “Keep It Simple, Stupid” (KISS), means that code should contain clear and concise logic that highlights the task being performed. In practice, this means preferring simpler algorithms and using built-in functions instead of writing algorithms from scratch. It often also means writing fewer lines of code to accomplish a task. Finally, DRY is fairly self-explanatory. Duplicated logic can be confusing and makes code more difficult to maintain, so it is good practice to avoid it.
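As a small illustration of the point about built-in functions, the sketch below contrasts a hand-rolled average with the standard library’s `statistics.mean`. The function names and the sample amounts are hypothetical and are not taken from the dataset used later in this post.

```python
from statistics import mean

# Hypothetical transaction amounts, used only for illustration.
amounts = [12.5, 40.0, 7.5]


def average_by_hand(values):
    """Verbose version: re-implements logic the standard library already provides."""
    total = 0.0
    count = 0
    for value in values:
        total += value
        count += 1
    return total / count


def average(values):
    """Simpler version: delegate to the built-in implementation."""
    return mean(values)


print(average_by_hand(amounts))
print(average(amounts))
```

Both functions return the same value, but the second has less code to read, test, and maintain.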
Maintainability
The Single Responsibility Principle (SRP) states that every class, function, or module should have a single task associated with it. This further enhances code clarity and makes code much easier to maintain. For example, changing a function that does one thing is much easier than changing a function that does 10 things. SRP also helps with testability, which means writing code that is easy to debug, update, and maintain. Testable code is modular and loosely coupled; that is, the modular parts are independent of each other. Testable code also makes it easier to collaborate with other developers. Finally, error handling, which means writing code that responds sensibly when something goes wrong, also makes your code easier to debug and maintain.
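To make the single-responsibility idea concrete, here is a minimal sketch in which parsing, filtering, and summarizing are kept in separate functions. The records, field names, and function names are invented for illustration.

```python
# Hypothetical records; the values are invented for illustration.
records = [
    {"amount": "25.00", "fraud_flag": "0"},
    {"amount": "310.40", "fraud_flag": "1"},
    {"amount": "12.10", "fraud_flag": "0"},
]


def parse_record(record):
    """Single task: convert raw strings to typed values."""
    return {"amount": float(record["amount"]), "is_fraud": record["fraud_flag"] == "1"}


def select_fraud(parsed_records):
    """Single task: keep only the records flagged as fraud."""
    return [r for r in parsed_records if r["is_fraud"]]


def total_amount(parsed_records):
    """Single task: sum the transaction amounts."""
    return sum(r["amount"] for r in parsed_records)


parsed = [parse_record(r) for r in records]
fraud = select_fraud(parsed)
print(len(fraud), total_amount(fraud))
```

Because each function does one thing, each can be changed or tested without touching the other two.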
Collaboration
Version control involves managing and tracking changes made to code over time. This is important for documenting code changes, collaborating with other developers who are making changes to the same code, and reverting to previous versions if necessary. Finally, documentation is a very important part of code collaboration. Good code documentation makes the logic used in your code easier to understand, which in turn makes the code easier to maintain, read, and share.
This post explains how to implement these best practices when writing code for machine learning model development. For our purposes, we use synthetic credit card transaction data available in DataFabrica. Data includes synthetic credit card transaction amounts, credit card information, transaction IDs, and more. The free tier is free to download, modify and share under the Apache 2.0 license.
First, let’s read the data into a pandas dataframe and display the first 5 rows of the data.
import pandas as pd
df = pd.read_csv("synthetic_transaction_data_Dining.csv")
print(df.head())
Clarity of machine learning code
Readability
Let’s build a CatBoost fraud classifier using the credit card transaction data. Here’s an example of poorly written code that reads the data, splits it for training and testing, trains a model, and evaluates model performance.
Let’s import the packages first.
import pandas as p
import catboost as cb
from sklearn.model_selection import train_test_split as tts
from sklearn.metrics import classification_report as cr
Now let’s load the dataset d and define the features ‘f’ and the target ‘t’.
d=p.read_csv('dining/synthetic_transaction_data_Dining.csv')
f=['cardholder_name', 'card_number', 'card_type', 'merchant_name', 'merchant_category',
'merchant_state', 'merchant_city', 'transaction_amount', 'merchant_category_code']
t='fraud_flag'
Then split the data for training and testing using the training test split ‘tts’.
X_t,X_ts,y_t,y_ts=tts(d[f],d[t],test_size=0.2,random_state=42)
Next, define the Catboost model, generate predictions, and evaluate its performance.
cf = ['cardholder_name', 'card_number', 'card_type', 'merchant_name', 'merchant_category',
'merchant_state', 'merchant_city', 'merchant_category_code']
m=cb.CatBoostClassifier(iterations=10, cat_features=cf)
m.fit(X_t,y_t)
p= m.predict(X_ts)
print(cr(y_ts,p))
Here is the full code:
import pandas as p
import catboost as cb
from sklearn.model_selection import train_test_split as tts
from sklearn.metrics import classification_report as cr
d=p.read_csv('dining/synthetic_transaction_data_Dining.csv')
f=['cardholder_name', 'card_number', 'card_type', 'merchant_name', 'merchant_category',
'merchant_state', 'merchant_city', 'transaction_amount', 'merchant_category_code']
t='fraud_flag'
cf = ['cardholder_name', 'card_number', 'card_type', 'merchant_name', 'merchant_category',
'merchant_state', 'merchant_city', 'merchant_category_code']
X_t,X_ts,y_t,y_ts=tts(d[f],d[t],test_size=0.2,random_state=42)
m=cb.CatBoostClassifier(iterations=10, cat_features=cf)
m.fit(X_t,y_t)
p= m.predict(X_ts)
print(cr(y_ts,p))
This code is less readable due to obscure renaming of imported packages and poor naming of variables. Let’s improve this. Import the package and leave the package name as is.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from catboost import CatBoostClassifier
Now let’s load the data, define the inputs/outputs and split them for training and testing using clear variable names.
data = pd.read_csv('dining/synthetic_transaction_data_Dining.csv')
features = ['cardholder_name', 'card_number', 'card_type', 'merchant_name', 'merchant_category',
'merchant_state', 'merchant_city', 'transaction_amount', 'merchant_category_code']
target = 'fraud_flag'
X_train, X_test, y_train, y_test = train_test_split(data[features], data[target], test_size=0.2, random_state=42)
Next, train the model, generate predictions, and report performance, again naming each variable appropriately.
categories = ['cardholder_name', 'card_number', 'card_type', 'merchant_name', 'merchant_category',
'merchant_state', 'merchant_city', 'merchant_category_code']
model = CatBoostClassifier(iterations=10, cat_features=categories)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))
Here is the full code:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from catboost import CatBoostClassifier
data = pd.read_csv('dining/synthetic_transaction_data_Dining.csv')
features = ['cardholder_name', 'card_number', 'card_type', 'merchant_name', 'merchant_category',
'merchant_state', 'merchant_city', 'transaction_amount', 'merchant_category_code']
target = 'fraud_flag'
X_train, X_test, y_train, y_test = train_test_split(data[features], data[target], test_size=0.2, random_state=42)
categories = ['cardholder_name', 'card_number', 'card_type', 'merchant_name', 'merchant_category',
'merchant_state', 'merchant_city', 'merchant_category_code']
model = CatBoostClassifier(iterations=10, cat_features=categories)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))
In the code above, the package and variable names are clear and easy to understand. Descriptive variable names make your code easier to read because they tell the reader what each variable stores.
Keep It Simple, Stupid (KISS)
KISS emphasizes simplicity in code development: the aim is to make the solution to the problem as simple as possible. In the context of machine learning model development, this means:
- Choosing a simple algorithm: Prefer a relatively simple algorithm where it is sufficient. For example, choose tree-based models instead of deep neural networks.
- Selecting a subset of features: Try to limit the number of features you use based on your EDA and domain expertise. Don’t try to use all available columns as features. A smaller feature set makes the model more explainable, helps prevent overfitting, and often leads to better performance.
- Avoiding over-parameterization: Limit computationally intensive hyperparameter tuning. The default parameters are often sufficient for model performance. Additionally, if you choose to tune hyperparameters, make sure you understand the implications of tuning each hyperparameter. Use your domain expertise and experimentation to select the most important subset of hyperparameters you want to tune.
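One simple EDA check that can inform feature selection is to look at how many distinct values each column holds: a column in which every row is unique behaves like an identifier and is a candidate to drop. The sketch below is only one possible heuristic. The rows are invented, the threshold is an assumption, and domain expertise should have the final say.

```python
# Hypothetical rows; the column names mirror this post's dataset but the values are invented.
rows = [
    {"card_number": "4000-0001", "card_type": "Visa", "merchant_state": "NY"},
    {"card_number": "4000-0002", "card_type": "Visa", "merchant_state": "CA"},
    {"card_number": "4000-0003", "card_type": "Mastercard", "merchant_state": "NY"},
    {"card_number": "4000-0004", "card_type": "Visa", "merchant_state": "NY"},
]


def unique_ratio(rows, column):
    """Fraction of rows that carry a distinct value in the given column."""
    values = [row[column] for row in rows]
    return len(set(values)) / len(values)


# Assumed threshold: treat a column as identifier-like when every value is unique.
ID_LIKE_THRESHOLD = 1.0

ratios = {column: unique_ratio(rows, column) for column in rows[0]}
candidate_features = [c for c, r in ratios.items() if r < ID_LIKE_THRESHOLD]
print(ratios)
print(candidate_features)
```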
Let’s extend this with code that violates KISS. Construct a neural network credit card fraud classifier. Let’s start by importing the required packages.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix
import tensorflow as tf
import seaborn as sns
import matplotlib.pyplot as plt
Then we load the data, define the inputs/outputs, encode the categorical features, and split the data for training and testing.
data = pd.read_csv('dining/synthetic_transaction_data_Dining.csv')
input_features = ['cardholder_name', 'card_number', 'card_type', 'merchant_name', 'merchant_category',
'merchant_state', 'merchant_city', 'transaction_amount', 'merchant_category_code']
output_variable = 'fraud_flag'
# Encode string-valued columns as integer category codes
for feature in input_features:
    if data[feature].dtype == 'object':
        data[feature] = data[feature].astype('category').cat.codes
X = data[input_features]
y = data[output_variable]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Then perform some data transformations.
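# Work on copies so the assignments below do not modify slices of the original frame
X_train = X_train.copy()
X_test = X_test.copy()
X_train['transaction_amount'] = np.log1p(X_train['transaction_amount'])
X_test['transaction_amount'] = np.log1p(X_test['transaction_amount'])
scaler = MinMaxScaler()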
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Define a function that specifies the neural network architecture. I’m intentionally overcomplicating things here.
def build_model():
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Dense(5, activation='relu', input_shape=(len(input_features),)))
    model.add(tf.keras.layers.Dense(5, activation='relu'))
    model.add(tf.keras.layers.Dense(5, activation='relu'))
    model.add(tf.keras.layers.Dense(5, activation='relu'))
    model.add(tf.keras.layers.Dense(5, activation='relu'))
    model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model
Do a grid search to find the best fit model.
# Create the model
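# Note: tf.keras.wrappers.scikit_learn is not available in recent TensorFlow releases;
# the SciKeras package offers a replacement KerasClassifier.
model = tf.keras.wrappers.scikit_learn.KerasClassifier(build_model)
# Define the hyperparameter grid for grid search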
param_grid = {
'epochs': [10, 20],
'batch_size': [32, 64],
}
# Perform grid search for hyperparameter tuning
grid_search = GridSearchCV(model, param_grid=param_grid, cv=3)
grid_search.fit(X_train_scaled, y_train)
# Get the best model
best_model = grid_search.best_estimator_
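Even this small grid is not free. As a rough sketch of the cost, the snippet below counts how many times the network is trained for the grid and fold count used above, assuming scikit-learn’s default `refit=True`, which retrains the best combination once on the full training set.

```python
from itertools import product

# Same grid and fold count as the grid search above.
param_grid = {"epochs": [10, 20], "batch_size": [32, 64]}
cv_folds = 3

# Every combination of hyperparameter values is tried once per fold.
combinations = list(product(*param_grid.values()))
cross_validation_fits = len(combinations) * cv_folds

# With refit=True (the default), the best combination is trained one more time.
total_fits = cross_validation_fits + 1

print(len(combinations), cross_validation_fits, total_fits)
```

Adding a third hyperparameter with only two values would double the number of cross-validation fits, which is why the choice of what to tune deserves some thought.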
Then you can generate predictions, compute accuracy and precision, and generate a confusion matrix.
y_pred = best_model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
confusion_mat = confusion_matrix(y_test, y_pred)
And visualize the confusion matrix.
# Define labels
labels = ['Not Fraud', 'Fraud']
# Plot the confusion matrix heatmap
sns.heatmap(confusion_mat, annot=True, fmt="d", cmap="YlGnBu", xticklabels=labels, yticklabels=labels)
# Set axis labels and title
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
# Display the plot
plt.show()
There are many ways to simplify this code using the KISS principle. Let’s choose a simpler algorithm first. Catboost models (or any tree-based model) can be used instead of neural networks. Catboost is also nice in that the categorical column can be passed directly without encoding, which makes the code simpler. Let’s import the required packages.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
from catboost import CatBoostClassifier, Pool
Prepare data for training.
data = pd.read_csv('dining/synthetic_transaction_data_Dining.csv')
input_features = ['cardholder_name', 'card_number', 'card_type', 'merchant_name',
'merchant_category', 'merchant_state', 'merchant_city',
'transaction_amount', 'merchant_category_code']
categories = ['cardholder_name', 'card_number', 'card_type', 'merchant_name',
'merchant_category', 'merchant_state', 'merchant_city',
'merchant_category_code']
output_variable = 'fraud_flag'
X_train, X_test, y_train, y_test = train_test_split(data[input_features], data[output_variable], test_size=0.2, random_state=42)
We chose the Catboost model, which simplifies the code by eliminating the need to encode categories or scale the data.
Now let’s perform feature selection using the training data. Catboost makes it easy to use training data for feature selection, but doing this with a neural network is much more complicated.
model = CatBoostClassifier(iterations=60, depth=6, learning_rate=0.1, random_state=42, verbose=0)
train_pool = Pool(X_train, y_train, cat_features=categories)
model.fit(train_pool)
importance = model.get_feature_importance(train_pool)
feature_importance_df = pd.DataFrame({'Feature': input_features, 'Importance': importance})
selected_features = feature_importance_df[feature_importance_df['Importance'] > 0].sort_values(by='Importance')['Feature'].tolist()
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]
A model can then be trained on the selected features, and feature importance can be computed for the retrained model.
selected_categories = [x for x in selected_features if x != 'transaction_amount']
model = CatBoostClassifier(iterations=60, depth=6, learning_rate=0.1, cat_features=selected_categories, random_state=42, verbose=0)
model.fit(X_train_selected, y_train)
y_pred = model.predict(X_test_selected)
# Importance of the retrained model; train_pool still holds the full feature set, so it is not passed here
importance = model.get_feature_importance()
feature_importance_df = pd.DataFrame({'Feature': selected_features, 'Importance': importance})
Then you can calculate performance, visualize confusion matrices, and visualize feature importance.
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
confusion_mat = confusion_matrix(y_test, y_pred)
sns.heatmap(confusion_mat, annot=True, fmt="d", cmap="YlGnBu")
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
sns.barplot(x='Importance', y='Feature', data=feature_importance_df)
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importance')
plt.show()
We find that our model captures true fraud cases better, despite being a simpler algorithm that uses fewer data transformations and functions. This is what KISS is all about!
Don’t Repeat Yourself (DRY) and Modularity
Both DRY and code modularity focus on limiting code duplication. DRY focuses on extracting common functionality to define reusable components. Similarly, modularity focuses on organizing code so that complex tasks are broken down into simple, independent components. Both can be used to guide code refactoring. Consider the following code.
import pandas as pd
import catboost
data = pd.read_csv('dining/synthetic_transaction_data_Dining.csv')
input_features1 = ['merchant_category_code', 'merchant_name', 'transaction_amount', 'card_type', 'cardholder_name', 'merchant_state', 'merchant_city']
input_features2 = ['merchant_category_code', 'merchant_name', 'transaction_amount', ]
output_variable = 'fraud_flag'
categories1 = [x for x in input_features1 if x!='transaction_amount']
categories2 = [x for x in input_features2 if x!='transaction_amount']
X_train = data.sample(frac=0.8, random_state=42)
y_train = X_train[output_variable]
X_train = X_train[input_features1]
X_test = data.drop(X_train.index)
y_test = X_test[output_variable]
X_test = X_test[input_features1]
model1 = catboost.CatBoostClassifier(iterations = 5, cat_features=categories1, random_state=42)
model1.fit(X_train, y_train)
y_pred1 = model1.predict(X_test)
X_train = data.sample(frac=0.8, random_state=42)
y_train = X_train[output_variable]
X_train = X_train[input_features2]
X_test = data.drop(X_train.index)
y_test = X_test[output_variable]
X_test = X_test[input_features2]
model2 = catboost.CatBoostClassifier(iterations = 100, cat_features=categories2, random_state=42)
model2.fit(X_train, y_train)
y_pred2 = model2.predict(X_test)
accuracy1 = (y_pred1 == y_test).mean()
print("Accuracy1:", accuracy1)
accuracy2 = (y_pred2 == y_test).mean()
print("Accuracy2:", accuracy2)
This code violates DRY in several ways. Input definition, model definition and training, and model accuracy calculations are all duplicated. This code can be refactored with a function that removes duplicate logic as follows:
import pandas as pd
import catboost

data = pd.read_csv('dining/synthetic_transaction_data_Dining.csv')

def train_model(X_train, y_train, categories, iterations):
    model = catboost.CatBoostClassifier(iterations=iterations, cat_features=categories, random_state=42)
    model.fit(X_train, y_train)
    return model

def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    accuracy = (y_pred == y_test).mean()
    return accuracy

def split_data(data, input_features, output_variable):
    X_train = data.sample(frac=0.8, random_state=42)
    y_train = X_train[output_variable]
    X_train = X_train[input_features]
    X_test = data.drop(X_train.index)
    y_test = X_test[output_variable]
    X_test = X_test[input_features]
    return X_train, X_test, y_train, y_test

input_features1 = ['merchant_category_code', 'merchant_name', 'transaction_amount', 'card_type', 'cardholder_name', 'merchant_state', 'merchant_city']
input_features2 = ['merchant_category_code', 'merchant_name', 'transaction_amount']
output_variable = 'fraud_flag'
categories1 = [x for x in input_features1 if x != 'transaction_amount']
categories2 = [x for x in input_features2 if x != 'transaction_amount']
X_train, X_test, y_train, y_test = split_data(data, input_features1, output_variable)
model1 = train_model(X_train, y_train, categories1, iterations=5)
accuracy1 = evaluate_model(model1, X_test, y_test)
print("Accuracy1:", accuracy1)
X_train, X_test, y_train, y_test = split_data(data, input_features2, output_variable)
model2 = train_model(X_train, y_train, categories2, iterations=100)
accuracy2 = evaluate_model(model2, X_test, y_test)
print("Accuracy2:", accuracy2)
Both versions perform the same task, but the refactored one is easier to read, debug, and maintain. You can increase modularity further by grouping related functions into separate files.
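Some duplication remains in the refactored script, because the split, train, and evaluate calls are still written out once per model. One possible next step, sketched here under assumptions, is to describe each experiment as data and loop over the descriptions. `run_experiment` and `categorical_features` are hypothetical helpers; `run_experiment` is only a stand-in for the `split_data`, `train_model`, and `evaluate_model` calls and does not train anything.

```python
# Each experiment is described once; adding a third model means adding one entry.
experiments = [
    {"name": "model1",
     "features": ["merchant_category_code", "merchant_name", "transaction_amount",
                  "card_type", "cardholder_name", "merchant_state", "merchant_city"],
     "iterations": 5},
    {"name": "model2",
     "features": ["merchant_category_code", "merchant_name", "transaction_amount"],
     "iterations": 100},
]


def categorical_features(features, numeric=("transaction_amount",)):
    """Derive the categorical columns instead of listing them a second time."""
    return [f for f in features if f not in numeric]


def run_experiment(config):
    """Hypothetical stand-in: a real version would split, train, and evaluate here."""
    return {
        "name": config["name"],
        "categories": categorical_features(config["features"]),
        "iterations": config["iterations"],
    }


results = [run_experiment(config) for config in experiments]
for result in results:
    print(result["name"], result["iterations"], result["categories"])
```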
Maintainability
The Single Responsibility Principle (SRP) states that each module, class, method, or function should have a single responsibility. The modular version of the code above adheres well to this principle: each function either splits the data, trains a model, or evaluates a model.
Testability
Testability is the extent to which code can be tested and verified to ensure performance and correctness. You can make the code more testable by adding a main function to the modular version of the previous code.
import pandas as pd
import catboost

def train_model(X_train, y_train, categories, iterations):
    model = catboost.CatBoostClassifier(iterations=iterations, cat_features=categories, random_state=42)
    model.fit(X_train, y_train)
    return model

def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    accuracy = (y_pred == y_test).mean()
    return accuracy

def split_data(data, input_features, output_variable):
    X_train = data.sample(frac=0.8, random_state=42)
    y_train = X_train[output_variable]
    X_train = X_train[input_features]
    X_test = data.drop(X_train.index)
    y_test = X_test[output_variable]
    X_test = X_test[input_features]
    return X_train, X_test, y_train, y_test

def main():
    data = pd.read_csv('dining/synthetic_transaction_data_Dining.csv')
    input_features1 = ['merchant_category_code', 'merchant_name', 'transaction_amount', 'card_type', 'cardholder_name', 'merchant_state', 'merchant_city']
    input_features2 = ['merchant_category_code', 'merchant_name', 'transaction_amount']
    output_variable = 'fraud_flag'
    categories1 = [x for x in input_features1 if x != 'transaction_amount']
    categories2 = [x for x in input_features2 if x != 'transaction_amount']
    X_train, X_test, y_train, y_test = split_data(data, input_features1, output_variable)
    model1 = train_model(X_train, y_train, categories1, iterations=5)
    accuracy1 = evaluate_model(model1, X_test, y_test)
    print("Model 1 Accuracy:", accuracy1)
    X_train, X_test, y_train, y_test = split_data(data, input_features2, output_variable)
    model2 = train_model(X_train, y_train, categories2, iterations=100)
    accuracy2 = evaluate_model(model2, X_test, y_test)
    print("Model 2 Accuracy:", accuracy2)

if __name__ == "__main__":
    main()
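Small, independent functions are what make unit tests practical. As a minimal sketch, the accuracy calculation inside `evaluate_model` could be pulled out into a pure function and checked with the standard library’s `unittest`. The `accuracy` function and the test class below are hypothetical and work on plain lists rather than pandas objects.

```python
import unittest


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


class AccuracyTests(unittest.TestCase):
    def test_all_correct(self):
        self.assertEqual(accuracy([0, 1, 1], [0, 1, 1]), 1.0)

    def test_partially_correct(self):
        self.assertEqual(accuracy([0, 1, 1, 0], [0, 1, 0, 0]), 0.75)

    def test_length_mismatch(self):
        with self.assertRaises(ValueError):
            accuracy([0, 1], [0])


# Run the tests programmatically so the snippet also works inside a notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(AccuracyTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```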
Error handling
Error handling is also important for maintainability. Catching and handling errors prevents unexpected program crashes. We can add error handling to our code as follows.
import pandas as pd
import catboost

def train_model(X_train, y_train, categories, iterations):
    try:
        model = catboost.CatBoostClassifier(iterations=iterations, cat_features=categories, random_state=42)
        model.fit(X_train, y_train)
        return model
    except Exception as e:
        print(f"An error occurred while training the model: {e}")

def evaluate_model(model, X_test, y_test):
    try:
        y_pred = model.predict(X_test)
        accuracy = (y_pred == y_test).mean()
        return accuracy
    except Exception as e:
        print(f"An error occurred while evaluating the model: {e}")

def split_data(data, input_features, output_variable):
    try:
        X_train = data.sample(frac=0.8, random_state=42)
        y_train = X_train[output_variable]
        X_train = X_train[input_features]
        X_test = data.drop(X_train.index)
        y_test = X_test[output_variable]
        X_test = X_test[input_features]
        return X_train, X_test, y_train, y_test
    except Exception as e:
        print(f"An error occurred while splitting the data: {e}")

def main():
    try:
        data = pd.read_csv('dining/synthetic_transaction_data_Dining.csv')
    except Exception as e:
        print(f"An error occurred while reading the data: {e}")
        return
    input_features1 = ['merchant_category_code', 'merchant_name', 'transaction_amount', 'card_type', 'cardholder_name', 'merchant_state', 'merchant_city']
    input_features2 = ['merchant_category_code', 'merchant_name', 'transaction_amount']
    output_variable = 'fraud_flag'
    categories1 = [x for x in input_features1 if x != 'transaction_amount']
    categories2 = [x for x in input_features2 if x != 'transaction_amount']
    try:
        X_train, X_test, y_train, y_test = split_data(data, input_features1, output_variable)
    except Exception as e:
        print(f"An error occurred while splitting data for Model 1: {e}")
        return
    try:
        model1 = train_model(X_train, y_train, categories1, iterations=5)
        accuracy1 = evaluate_model(model1, X_test, y_test)
        print("Model 1 Accuracy:", accuracy1)
    except Exception as e:
        print(f"An error occurred while training or evaluating Model 1: {e}")
        return
    try:
        X_train, X_test, y_train, y_test = split_data(data, input_features2, output_variable)
    except Exception as e:
        print(f"An error occurred while splitting data for Model 2: {e}")
        return
    try:
        model2 = train_model(X_train, y_train, categories2, iterations=100)
        accuracy2 = evaluate_model(model2, X_test, y_test)
        print("Model 2 Accuracy:", accuracy2)
    except Exception as e:
        print(f"An error occurred while training or evaluating Model 2: {e}")
        return

if __name__ == "__main__":
    main()
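One thing to keep in mind with the pattern above is that a helper which prints an error and then returns nothing hands `None` back to its caller. A possible refinement, sketched here with the standard library’s `csv` module rather than pandas, is to catch only the exceptions you expect and re-raise them with context. `DataLoadError` and `load_rows` are hypothetical names introduced for this illustration.

```python
import csv
import os
import tempfile


class DataLoadError(Exception):
    """Hypothetical error type raised when the input file cannot be used."""


def load_rows(path, required_columns):
    """Read a CSV file and verify that the required columns are present."""
    try:
        with open(path, newline="") as handle:
            reader = csv.DictReader(handle)
            rows = list(reader)
            columns = reader.fieldnames or []
    except FileNotFoundError as error:
        # Re-raise with context instead of printing and continuing with None.
        raise DataLoadError(f"input file not found: {path}") from error
    missing = [c for c in required_columns if c not in columns]
    if missing:
        raise DataLoadError(f"missing columns: {missing}")
    return rows


# Usage with a temporary file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "transactions.csv")
    with open(path, "w", newline="") as handle:
        handle.write("transaction_amount,fraud_flag\n10.5,0\n99.9,1\n")

    rows = load_rows(path, ["transaction_amount", "fraud_flag"])
    print(len(rows))

    try:
        load_rows(path, ["card_type"])
    except DataLoadError as error:
        message = str(error)
        print(message)

    try:
        load_rows(os.path.join(folder, "absent.csv"), [])
    except DataLoadError as error:
        missing_file_error = error
        print(error)
```

The caller can then decide in one place whether to log the problem, retry, or stop, and unexpected errors are not hidden.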
Collaboration: version control and documentation
Version control is one of the most important aspects of software development. Code changes are tracked continuously, giving you a history of bug fixes, enhancements, feature additions, and other changes. Consider the DRY/modularity example from earlier. A data scientist or engineer might be assigned the task of refactoring the original code with duplicated logic into a more modular version. It’s important to track these kinds of changes and have access to each code version as changes are made. Common version control platforms include Git, Subversion, and Mercurial.
Besides version control, documentation is also very important. Documentation is the practice of providing information that helps explain the code logic. Examples include comments and documentation strings (docstrings). A code comment is text, placed on its own line or at the end of a line of code, that describes a line or block of code.
A docstring is a Python string literal that documents a module, class, method, or function. It usually describes what the code does, what inputs it needs, and what it outputs. Reading, debugging, and collaborating on code is much easier with docstrings.
Let’s add comments and docstrings to our code.
import pandas as pd
import catboost

def train_model(X_train, y_train, categories, iterations):
    """
    Trains a CatBoostClassifier model.
    Args:
        X_train (DataFrame): The input features of the training data.
        y_train (Series): The target variable of the training data.
        categories (list): List of categorical feature names.
        iterations (int): The number of iterations for training the model.
    Returns:
        CatBoostClassifier: The trained CatBoost model.
    """
    try:
        model = catboost.CatBoostClassifier(iterations=iterations, cat_features=categories, random_state=42)
        model.fit(X_train, y_train)
        return model
    except Exception as e:
        print(f"An error occurred while training the model: {e}")

def evaluate_model(model, X_test, y_test):
    """
    Evaluates a trained model on test data.
    Args:
        model (CatBoostClassifier): The trained model to evaluate.
        X_test (DataFrame): The input features of the test data.
        y_test (Series): The target variable of the test data.
    Returns:
        float: The accuracy of the model on the test data.
    """
    try:
        y_pred = model.predict(X_test)
        accuracy = (y_pred == y_test).mean()
        return accuracy
    except Exception as e:
        print(f"An error occurred while evaluating the model: {e}")

def split_data(data, input_features, output_variable):
    """
    Splits the data into train and test sets.
    Args:
        data (DataFrame): The input data.
        input_features (list): List of input feature names.
        output_variable (str): The name of the target variable.
    Returns:
        DataFrame: X_train - The input features of the training data.
        DataFrame: X_test - The input features of the test data.
        Series: y_train - The target variable of the training data.
        Series: y_test - The target variable of the test data.
    """
    try:
        X_train = data.sample(frac=0.8, random_state=42)
        y_train = X_train[output_variable]
        X_train = X_train[input_features]
        X_test = data.drop(X_train.index)
        y_test = X_test[output_variable]
        X_test = X_test[input_features]
        return X_train, X_test, y_train, y_test
    except Exception as e:
        print(f"An error occurred while splitting the data: {e}")

def main():
    """
    The main function that orchestrates the model training and evaluation.
    """
    # read in data
    try:
        data = pd.read_csv('dining/synthetic_transaction_data_Dining.csv')
    except Exception as e:
        print(f"An error occurred while reading the data: {e}")
        return
    # define input features
    input_features1 = ['merchant_category_code', 'merchant_name', 'transaction_amount', 'card_type', 'cardholder_name', 'merchant_state', 'merchant_city']
    input_features2 = ['merchant_category_code', 'merchant_name', 'transaction_amount']
    output_variable = 'fraud_flag'
    # define categories
    categories1 = [x for x in input_features1 if x != 'transaction_amount']
    categories2 = [x for x in input_features2 if x != 'transaction_amount']
    # try splitting data
    try:
        X_train, X_test, y_train, y_test = split_data(data, input_features1, output_variable)
    except Exception as e:
        print(f"An error occurred while splitting data for Model 1: {e}")
        return
    # try training first model
    try:
        model1 = train_model(X_train, y_train, categories1, iterations=5)
        accuracy1 = evaluate_model(model1, X_test, y_test)
        print("Model 1 Accuracy:", accuracy1)
    except Exception as e:
        print(f"An error occurred while training or evaluating Model 1: {e}")
        return
    # try splitting second set of features
    try:
        X_train, X_test, y_train, y_test = split_data(data, input_features2, output_variable)
    except Exception as e:
        print(f"An error occurred while splitting data for Model 2: {e}")
        return
    # try training second model
    try:
        model2 = train_model(X_train, y_train, categories2, iterations=100)
        accuracy2 = evaluate_model(model2, X_test, y_test)
        print("Model 2 Accuracy:", accuracy2)
    except Exception as e:
        print(f"An error occurred while training or evaluating Model 2: {e}")
        return

if __name__ == "__main__":
    main()
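Docstrings are more than comments: Python stores them on the object, so tools and teammates can read them without opening the source file. The short sketch below uses a hypothetical `evaluate_accuracy` function to show how the text can be retrieved with `inspect.getdoc`.

```python
import inspect


def evaluate_accuracy(y_true, y_pred):
    """
    Compute the fraction of correct predictions.

    Args:
        y_true (list): The true labels.
        y_pred (list): The predicted labels.

    Returns:
        float: The accuracy on the given labels.
    """
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# inspect.getdoc returns the docstring with its indentation cleaned up;
# help(evaluate_accuracy) shows the same text together with the signature.
doc = inspect.getdoc(evaluate_accuracy)
print(doc.splitlines()[0])
```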
The code used in this post is available on GitHub.
Conclusion
In this post, we discussed software design principles that help with code clarity, maintainability, and collaboration when developing machine learning models. For clarity, we covered clear variable naming, KISS, DRY, and modularity, all of which help make code clearer to engineers and data scientists. For maintainability, we covered SRP, testability, and error handling, which make code much easier to maintain. Finally, we discussed how version control and documentation support collaboration. By applying these principles, data scientists and engineers can work together more effectively.
A free sample of the data used in this article is available here. The full dataset can be found here.
