

# Introduction
Want to start your first manageable machine learning project using popular Python libraries like pandas and scikit-learn, but don’t know where to begin? Look no further.
This article walks through a gentle machine learning project for beginners, in which we build a regression model that predicts employee income based on socio-economic attributes. Along the way, you’ll learn some important machine learning concepts and key tricks.
# From raw dataset to clean DataFrame
As with any Python-based project, we recommend starting by importing the necessary libraries, modules, and components that you will use throughout the process.
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
import joblib
```
The next step is to load the public dataset from this repository into a pandas DataFrame: a neat data structure for loading, analyzing, and managing fully structured, or tabular, data. Once loaded, examine the attributes’ basic properties and data types.
```python
url = "https://raw.githubusercontent.com/gakudo-ai/open-datasets/main/employees_dataset_with_missing.csv"
df = pd.read_csv(url)
print(df.head())
print(df.info())
```
Although the dataset contains 1,000 entries or instances (that is, data representing 1,000 employees), you can see that most attributes, such as age and income, have fewer than 1,000 actual values. Why? Because this dataset has missing values — a common problem with real-world data that needs to be addressed.
In our project, the goal is to predict an employee’s income based on the remaining attributes. Therefore, we discard rows (employees) that have a missing value for this attribute. For predictor attributes, it may be fine to estimate or impute missing values, but for the target variable you need perfectly known labels to train the machine learning model: machine learning models learn by being exposed to examples with known outputs.
You can also check explicitly which attributes contain missing values and how many.
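For instance, `DataFrame.isna().sum()` counts the missing values per column. A minimal sketch on a hypothetical mini-dataset (the column names here are illustrative, not the exact ones from the repository):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-DataFrame standing in for the employees dataset
df_demo = pd.DataFrame({
    "age": [34, np.nan, 29, 41],
    "department": ["sales", "hr", None, "it"],
    "income": [62000.0, 71000.0, np.nan, 80000.0],
})

# Count missing values per column
missing_counts = df_demo.isna().sum()
print(missing_counts)
```

Each entry in `missing_counts` tells you how many rows lack a value for that column, which helps decide between dropping rows and imputing.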
So let’s clean up our DataFrame by removing entries with missing values for the target variable, income.
```python
target = "income"
train_df = df.dropna(subset=[target])
X = train_df.drop(columns=[target])
y = train_df[target]
```
So what happens to the missing values in the remaining attributes? We’ll get to that in a moment, but first we need to split our dataset into two subsets: a training set to train the model, and a test set to evaluate the model’s performance after training on examples it never saw during training. Scikit-learn provides a single function to perform this split randomly.
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
The next step is to build a preprocessing pipeline that transforms the data into a format suitable for training machine learning models. This typically requires distinguishing between numerical and categorical features, so that each type undergoes different preprocessing steps along the pipeline. For example, numeric features are typically scaled or imputed, while categorical features are encoded as numbers so that machine learning models can digest them. The code below shows the complete process of building a preprocessing pipeline, including automatic identification of numeric and categorical features so each type can be handled correctly.
```python
numeric_features = X.select_dtypes(include=["int64", "float64"]).columns
categorical_features = X.select_dtypes(exclude=["int64", "float64"]).columns

numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median"))
])

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])
```
To learn more about data preprocessing pipelines, see this article.
This pipeline takes the DataFrame and yields a clean, ready-to-use version for machine learning. In the next step, however, we will go further and encapsulate both data preprocessing and machine learning model training into one comprehensive pipeline.
# From a clean DataFrame to a ready-to-deploy model
Next, define a comprehensive pipeline that:
- Applies the previously defined preprocessor to all input variables, both numeric and categorical attributes.
- Trains a regression model, a random forest regressor, to predict income using the preprocessed training data.
```python
model = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", RandomForestRegressor(random_state=42))
])

model.fit(X_train, y_train)
```
Importantly, the training stage only receives the training subset previously created during the split, rather than the entire dataset.
Now we’ll take another subset of the data, the test set, and use it to evaluate the model’s performance on these example employees. We use the mean absolute error (MAE) as the evaluation metric.
```python
preds = model.predict(X_test)
mae = mean_absolute_error(y_test, preds)
print(f"\nModel MAE: {mae:.2f}")
```
Considering that most incomes fall in the 60,000–90,000 range, an MAE of around 13,000 is acceptable but not great. Not bad for a first machine learning model, anyway.
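One way to put the MAE in context is to express it as a fraction of the average income. The sketch below mocks `y_test` and `preds` with toy arrays purely to illustrate the calculation:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Toy stand-ins for y_test and preds, just to illustrate the arithmetic
y_test = np.array([60000.0, 75000.0, 90000.0])
preds = np.array([65000.0, 70000.0, 95000.0])

mae = mean_absolute_error(y_test, preds)
relative_error = mae / y_test.mean()  # MAE as a fraction of mean income
print(f"MAE: {mae:.0f} ({relative_error:.1%} of mean income)")
```

A relative error gives a scale-free sense of model quality: an MAE of 13,000 against a mean income of roughly 75,000 corresponds to an error of about 17%.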
Finally, we’ll show you how to save the trained model to a file for future deployment.
```python
joblib.dump(model, "employee_income_model.joblib")
print("Model saved as employee_income_model.joblib")
```
The trained model saved as a .joblib file is useful for future deployment, since it can be reloaded and reused immediately without having to be trained from scratch again. Think of this as “freezing” the entire preprocessing pipeline and trained model into a portable object. Quick options for future use and deployment include plugging it into a simple Python script or notebook, or building a lightweight web app using tools such as Streamlit, Gradio, or Flask.
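Reloading works with `joblib.load`. The sketch below uses a small stand-in model instead of the full pipeline, just to show the save-and-reload round trip:

```python
import joblib
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in model; in the article this would be the full pipeline
model = LinearRegression().fit([[0.0], [1.0]], [0.0, 2.0])
joblib.dump(model, "employee_income_model.joblib")

# Later, in a script or web app, reload and predict without retraining
loaded = joblib.load("employee_income_model.joblib")
print(loaded.predict([[2.0]]))
```

Because the saved object in the article is a full Pipeline, the reloaded model accepts raw, unpreprocessed input rows and applies imputation and encoding automatically before predicting.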
# Summary
In this article, we built an introductory regression model for predicting employee income, outlining the steps needed to go from a raw dataset to a clean, preprocessed DataFrame, and from that DataFrame to a ready-to-deploy model.
Ivan Palomares Carrascosa is a leader, writer, speaker, and advisor in AI, machine learning, deep learning, and LLMs. He trains and coaches others to leverage AI in the real world.
