Machine Learning on GCP: From Notebooks to Pipelines | Written by Benjamin Etienne

Notebooks are not enough for large-scale ML

All images are by the author unless otherwise noted

There are misconceptions (not to mention fantasies) that are constantly repeated within companies when it comes to AI and machine learning. People struggle with the complexity and skills required to bring machine learning projects into production because they don't understand the job, or (even worse) think they do but don't. We often make incorrect decisions about

When you discover AI, your first reaction might be: “AI is actually very simple. You just need a Jupyter Notebook, copy-paste code from here and there, or ask Copilot, and boom. No need to hire a data scientist after all…” And the story always ends on a note of bitterness, disappointment, and the feeling that AI is a fraud: difficulty moving to production, data drift, bugs, and undesired behavior.

So let's write this down once and for all. AI, machine learning, and other data-related jobs are real jobs, not hobbies. They require skill, craftsmanship, and tools. If you think you can use notebooks to run ML in production, you're wrong.

The purpose of this article is to demonstrate, with a simple example, all the effort, skills, and tools required to move from a notebook to a real pipeline in production. ML in production is primarily about being able to run code on a regular schedule, with automation and monitoring.

Also, for those looking for an end-to-end “notebook to Vertex pipelines” tutorial, this might be useful.

Let's imagine you are a data scientist working for an e-commerce company. Your company sells clothing online and your marketing team wants your help. They are preparing a special offer for a particular product and want to target customers efficiently by customizing the content of the emails pushed to maximize conversions. Therefore, your task is easy. You need to assign each customer a score that represents the probability that they will purchase the product from the special offer.

The special offer targets the brands listed below. In other words, the marketing team wants to know which customers will buy their next product from one of these brands:

Allegra K, Calvin Klein, Carhartt, Hanes, Volcom, Nautica, Quiksilver, Diesel, Dockers, Hurley

This article uses the `thelook_ecommerce` dataset, a publicly available dataset from Google. It contains fake data about transactions, customers, products, and everything else you would have at your disposal when working at an online fashion retailer.

This notebook requires access to Google Cloud Platform, but the logic can be replicated on other cloud providers or with third-party tools such as Neptune or MLflow.

As a good data scientist, you start by creating a notebook to help you explore your data.

First, import the libraries used in this article.

import catboost as cb
import pandas as pd
import sklearn as sk
import numpy as np
import datetime as dt
from dataclasses import dataclass
from sklearn.model_selection import train_test_split
from google.cloud import bigquery
%load_ext watermark
%watermark --packages catboost,pandas,sklearn,numpy,google.cloud.bigquery

catboost             : 1.0.4
pandas               : 1.4.2
numpy                : 1.22.4
google.cloud.bigquery: 3.2.0

Data acquisition and preparation

Next, use a Python client to load data from BigQuery. Be sure to use your own project ID.

query = """
SELECT 
transactions.user_id,
products.brand,
products.category,
products.department,
products.retail_price,
users.gender,
users.age,
users.created_at,
users.country,
users.city,
transactions.created_at
FROM `bigquery-public-data.thelook_ecommerce.order_items` as transactions
LEFT JOIN `bigquery-public-data.thelook_ecommerce.users` as users
ON transactions.user_id = users.id
LEFT JOIN `bigquery-public-data.thelook_ecommerce.products` as products
ON transactions.product_id = products.id
WHERE status <> 'Cancelled'
"""
client = bigquery.Client()
df = client.query(query).to_dataframe()

When you look at the data frame you should see something like this:

These represent transactions/purchases made by customers and are rich in customer and product information.

Our objective is to predict which brand a customer will buy the next time they make a purchase, so we proceed as follows.

Group purchases in chronological order for each customer
If a customer has purchased N times, use the Nth purchase as the target and the first N-1 purchases as features.
Therefore, customers with only one purchase are excluded.

Let's put it into code:

# Compute recurrent customers
recurrent_customers = df.groupby('user_id')['created_at'].count().to_frame("n_purchases")
# Merge with dataset and filter those with more than 1 purchase
df = df.merge(recurrent_customers, left_on='user_id', right_index=True, how='inner')
df = df.query('n_purchases > 1')
# Fill missing values
df.fillna('NA', inplace=True)
target_brands = [
'Allegra K', 
'Calvin Klein', 
'Carhartt', 
'Hanes', 
'Volcom', 
'Nautica', 
'Quiksilver', 
'Diesel',
'Dockers', 
'Hurley'
]
aggregation_columns = ['brand', 'department', 'category']
# Group purchases by user chronologically
df_agg = (df.sort_values('created_at')
    .groupby(['user_id', 'gender', 'country', 'city', 'age'], as_index=False)[['brand', 'department', 'category']]
    .agg({k: ";".join for k in ['brand', 'department', 'category']})
)
# Create the target
df_agg['last_purchase_brand'] = df_agg['brand'].apply(lambda x: x.split(";")[-1])
df_agg['target'] = df_agg['last_purchase_brand'].isin(target_brands)*1
df_agg['age'] = df_agg['age'].astype(float)
# Remove last item of sequence features to avoid target leakage:
for col in aggregation_columns:
    df_agg[col] = df_agg[col].apply(lambda x: ";".join(x.split(";")[:-1]))

Notice how we removed the last item from the sequence features. This is very important. Otherwise you get so-called “data leakage”: the target is part of the features, and the model is handed the answer during training.
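
To make the leakage point concrete, here is a minimal sketch on a toy purchase history. The helper name `split_history` and the sample brands are assumptions made for illustration; they are not part of the article's code.

```python
def split_history(purchases, target_brands):
    """Split a chronologically ordered list of brands into (features, target)."""
    if len(purchases) < 2:
        raise ValueError("need at least two purchases")
    *history, last = purchases
    # The features only contain the first N-1 purchases; the Nth one is the label.
    features = ";".join(history)
    target = int(last in target_brands)
    return features, target


target_brands = {"Hanes", "Diesel"}  # toy subset, for illustration only
features, target = split_history(["IZOD", "Orvis", "Hanes"], target_brands)
print(features, target)  # IZOD;Orvis 1
```

If the last purchase stayed in `features`, the string would contain "Hanes" and the model could simply read the answer from its input.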

We now have this new df_agg dataframe:

Comparing to the original dataframe, we can see that user_id 2 actually purchased IZOD, Parke & Ronen, and finally Orvis, which is not among the target brands.

Split into training, validation, and testing

As an experienced data scientist, you know that you need all three sets to perform rigorous machine learning, so you split your data accordingly. (Cross-validation is out of scope for today. Let's keep it simple.)

One important thing when splitting data is to use the lesser-known stratify parameter of scikit-learn's train_test_split() method. The reason is class imbalance. If the target distribution (in this case, the ratio of 0s to 1s) differs between training and testing, you may get frustratingly bad results when deploying the model. ML 101, kids: keep the data distributions of your training and test data as similar as possible.
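
To illustrate what stratification does, here is a rough pure-Python sketch that splits each class separately so both parts keep the same class ratio. It is not scikit-learn's implementation; the function name `stratified_split` and the toy labels are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) so that each class keeps the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle and cut each class on its own, then merge the pieces.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


labels = [1] * 100 + [0] * 900  # toy data: 10% positives
train_idx, test_idx = stratified_split(labels)
train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(train_rate, test_rate)  # 0.1 0.1
```

With a purely random split, the positive rate in a small test set can drift away from 10%, which is exactly what the stratify parameter is meant to prevent.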

# Remove unnecessary features
# (only last_purchase_brand was created above; there is no last_purchase_category column to drop)
df_agg.drop('last_purchase_brand', axis=1, inplace=True)
df_agg.drop('user_id', axis=1, inplace=True)
# Split the data into train and eval
df_train, df_val = train_test_split(df_agg, stratify=df_agg['target'], test_size=0.2)
print(f"{len(df_train)} samples in train")
# 30950 samples in train
df_val, df_test = train_test_split(df_val, stratify=df_val['target'], test_size=0.5)
print(f"{len(df_val)} samples in val")
print(f"{len(df_test)} samples in test")
# 3869 samples in val
# 3869 samples in test

Next, separate the features from the target in each set.

X_train, y_train = df_train.iloc[:, :-1], df_train['target']
X_val, y_val = df_val.iloc[:, :-1], df_val['target']
X_test, y_test = df_test.iloc[:, :-1], df_test['target']

There are different types of features. We typically divide them into:

Numerical features: They are continuous and reflect measurable or ordered quantities.
Categorical features: Usually discrete, often represented as strings (e.g. country, color, etc.).
Text features: Usually a sequence of words.

Of course, there could be more, such as images, video, audio, etc.

Model: Introducing CatBoost

For classification problems (you already knew that it belongs to the classification framework, right?) we use CatBoost, a simple yet very powerful library. It is built and maintained by Yandex and provides a high-level API for easily working with boosted trees. Although it is similar to XGBoost, it does not work exactly the same way internally.

CatBoost provides nice wrappers to handle different types of features. In this example, some features can be considered “text” because they are concatenations of words, such as “Calvin Klein;BCBGeneration;Hanes”. Dealing with this type of feature can be tedious, as it requires text splitters, tokenizers, lemmatizers, etc. Fortunately, CatBoost can manage all of this for us.

# Define features
features = {
    'numerical': ['retail_price', 'age'],
    'static': ['gender', 'country', 'city'],
    'dynamic': ['brand', 'department', 'category']
}
# Build CatBoost "pools", which are datasets
train_pool = cb.Pool(
    X_train,
    y_train,
    cat_features=features.get("static"),
    text_features=features.get("dynamic"),
)
validation_pool = cb.Pool(
    X_val,
    y_val,
    cat_features=features.get("static"),
    text_features=features.get("dynamic"),
)
# Specify text processing options to handle our text features
text_processing_options = {
    "tokenizers": [
        {"tokenizer_id": "SemiColon", "delimiter": ";", "lowercasing": "false"}
    ],
    "dictionaries": [{"dictionary_id": "Word", "gram_order": "1"}],
    "feature_processing": {
        "default": [
            {
                "dictionaries_names": ["Word"],
                "feature_calcers": ["BoW"],
                "tokenizers_names": ["SemiColon"],
            }
        ],
    },
}
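
To get an intuition for what a semicolon tokenizer combined with a bag-of-words calcer produces, here is a rough pure-Python sketch. It is not CatBoost's implementation: the function name `semicolon_bow` and the fixed toy vocabulary are assumptions, whereas CatBoost builds its dictionary from the training data.

```python
def semicolon_bow(text, vocabulary):
    """Split on ';' (no lowercasing) and emit one 0/1 flag per vocabulary entry."""
    tokens = {tok for tok in text.split(";") if tok}
    return [int(word in tokens) for word in vocabulary]


vocabulary = ["Calvin Klein", "Hanes", "Diesel"]  # toy dictionary, for illustration
vector = semicolon_bow("Calvin Klein;BCBGeneration;Hanes", vocabulary)
print(vector)  # [1, 1, 0]
```

Because the delimiter is ";" rather than whitespace, a multi-word brand such as "Calvin Klein" stays a single token.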

Now you are ready to define and train your model. Due to the large number of parameters, reviewing them all is beyond the scope of today, but feel free to explore the API yourself.

For the sake of brevity, we won't perform hyperparameter tuning today, but this is obviously a big part of a data scientist's job.

# Train the model
model = cb.CatBoostClassifier(
    iterations=200,
    loss_function="Logloss",
    random_state=42,
    verbose=1,
    auto_class_weights="SqrtBalanced",
    use_best_model=True,
    text_processing=text_processing_options,
    eval_metric='AUC'
)
model.fit(
    train_pool,
    eval_set=validation_pool,
    verbose=10
)

The model is now trained. Is it over already?

No. You need to ensure that your model's performance is consistent between training and testing. A large gap between training and testing means the model is overfitting (i.e. it memorizes the training data and is bad at predicting unseen data).

We use the ROC-AUC score to evaluate the model. I won't go into detail about this either, but from my own experience, this is generally a very robust metric, much better than accuracy.

A quick side note on accuracy: we generally do not recommend using it as an evaluation metric. Consider an imbalanced dataset that contains 1% positive and 99% negative samples. How accurate is a very stupid model that always predicts 0? 99%. So accuracy is of no use here.
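
The accuracy argument can be checked with a few lines of plain Python. The toy labels below are made up to match the 1% / 99% example.

```python
y_true = [1] * 10 + [0] * 990   # 1% positives, 99% negatives
y_pred = [0] * len(y_true)      # "model" that always predicts 0

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(accuracy, recall)  # 0.99 0.0
```

The model scores 99% accuracy while finding none of the positive cases, which is why a ranking metric such as ROC-AUC is more informative here.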

from sklearn.metrics import roc_auc_score
print(f"ROC-AUC for train set      : {roc_auc_score(y_true=y_train, y_score=model.predict(X_train)):.3f}")
print(f"ROC-AUC for validation set : {roc_auc_score(y_true=y_val, y_score=model.predict(X_val)):.3f}")
print(f"ROC-AUC for test set       : {roc_auc_score(y_true=y_test, y_score=model.predict(X_test)):.3f}")

ROC-AUC for train set      : 0.612
ROC-AUC for validation set : 0.586
ROC-AUC for test set       : 0.622
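
A hedged side note on the calls above: they pass model.predict(...), which for a CatBoostClassifier returns class labels by default, whereas ROC-AUC is a ranking metric that is normally computed on scores such as predict_proba(...)[:, 1]. The pure-Python sketch below shows why that matters. The `roc_auc` helper and the toy numbers are assumptions for illustration; this is not scikit-learn's implementation.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1, 1]
probabilities = [0.1, 0.3, 0.2, 0.4, 0.9]               # toy scores
hard_labels = [int(p >= 0.5) for p in probabilities]    # [0, 0, 0, 0, 1]

auc_scores = roc_auc(y_true, probabilities)
auc_labels = roc_auc(y_true, hard_labels)
print(round(auc_scores, 3), round(auc_labels, 3))  # 0.833 0.667
```

Thresholding collapses most scores into ties, so the AUC computed on hard labels usually understates how well the model ranks customers.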

To be honest, 0.62 AUC is not great at all and is a bit disappointing for professional data scientists. Our model certainly needs a bit of parameter tuning here, and we'll probably need to do more serious feature engineering as well.

But it's already better than random prediction (phew):

# random predictions
print(f"ROC-AUC for train set      : {roc_auc_score(y_true=y_train, y_score=np.random.rand(len(y_train))):.3f}")
print(f"ROC-AUC for validation set : {roc_auc_score(y_true=y_val, y_score=np.random.rand(len(y_val))):.3f}")
print(f"ROC-AUC for test set       : {roc_auc_score(y_true=y_test, y_score=np.random.rand(len(y_test))):.3f}")

ROC-AUC for train set      : 0.501
ROC-AUC for validation set : 0.499
ROC-AUC for test set       : 0.501

Let's assume for now that you are happy with your model and notebook. This is where amateur data scientists stop. So how can you take the next step and prepare for production?

Introducing Docker

Docker is a set of platform-as-a-service products that uses OS-level virtualization to deliver software in packages called containers. That being said, think of Docker as code that can run anywhere, avoiding situations where “it works on your machine, but not on mine”.

Why use Docker? Besides great things like being able to share your code, keep versions of it, and deploy it easily anywhere, you can also use it to build pipelines. Bear with me, you'll understand as we go.

The first step in building a containerized application is to refactor and clean up your messy notebook. For this very simple example, define two files, preprocess.py and train.py, and put them in a src directory. Also add a requirements.txt file listing all the dependencies.

# src/preprocess.py
from sklearn.model_selection import train_test_split
from google.cloud import bigquery
def create_dataset_from_bq():
    query = """
    SELECT 
    transactions.user_id,
    products.brand,
    products.category,
    products.department,
    products.retail_price,
    users.gender,
    users.age,
    users.created_at,
    users.country,
    users.city,
    transactions.created_at
    FROM `bigquery-public-data.thelook_ecommerce.order_items` as transactions
    LEFT JOIN `bigquery-public-data.thelook_ecommerce.users` as users
    ON transactions.user_id = users.id
    LEFT JOIN `bigquery-public-data.thelook_ecommerce.products` as products
    ON transactions.product_id = products.id
    WHERE status <> 'Cancelled'
    """
    client = bigquery.Client(project='<replace_with_your_project_id>')
    df = client.query(query).to_dataframe()
    print(f"{len(df)} rows loaded.")
    # Compute recurrent customers
    recurrent_customers = df.groupby('user_id')['created_at'].count().to_frame("n_purchases")
    # Merge with dataset and filter those with more than 1 purchase
    df = df.merge(recurrent_customers, left_on='user_id', right_index=True, how='inner')
    df = df.query('n_purchases > 1')
    # Fill missing values
    df.fillna('NA', inplace=True)
    target_brands = [
        'Allegra K',
        'Calvin Klein',
        'Carhartt',
        'Hanes',
        'Volcom',
        'Nautica',
        'Quiksilver',
        'Diesel',
        'Dockers',
        'Hurley'
    ]
    aggregation_columns = ['brand', 'department', 'category']
    # Group purchases by user chronologically
    df_agg = (df.sort_values('created_at')
        .groupby(['user_id', 'gender', 'country', 'city', 'age'], as_index=False)[['brand', 'department', 'category']]
        .agg({k: ";".join for k in ['brand', 'department', 'category']})
    )
    # Create the target
    df_agg['last_purchase_brand'] = df_agg['brand'].apply(lambda x: x.split(";")[-1])
    df_agg['target'] = df_agg['last_purchase_brand'].isin(target_brands)*1
    df_agg['age'] = df_agg['age'].astype(float)
    # Remove last item of sequence features to avoid target leakage:
    for col in aggregation_columns:
        df_agg[col] = df_agg[col].apply(lambda x: ";".join(x.split(";")[:-1]))
    # Drop helper columns (no last_purchase_category column is created above, so it is not dropped)
    df_agg.drop('last_purchase_brand', axis=1, inplace=True)
    df_agg.drop('user_id', axis=1, inplace=True)
    return df_agg
def make_data_splits(df_agg):
    df_train, df_val = train_test_split(df_agg, stratify=df_agg['target'], test_size=0.2)
    print(f"{len(df_train)} samples in train")
    df_val, df_test = train_test_split(df_val, stratify=df_val['target'], test_size=0.5)
    print(f"{len(df_val)} samples in val")
    print(f"{len(df_test)} samples in test")
    return df_train, df_val, df_test

# src/train.py
import catboost as cb
import pandas as pd
import sklearn as sk
import numpy as np
import argparse
from sklearn.metrics import roc_auc_score
def train_and_evaluate(
    train_path: str,
    validation_path: str,
    test_path: str
):
    df_train = pd.read_csv(train_path)
    df_val = pd.read_csv(validation_path)
    df_test = pd.read_csv(test_path)
    # read_csv parses the 'NA' strings back into missing values, so fill them again
    df_train.fillna('NA', inplace=True)
    df_val.fillna('NA', inplace=True)
    df_test.fillna('NA', inplace=True)
    X_train, y_train = df_train.iloc[:, :-1], df_train['target']
    X_val, y_val = df_val.iloc[:, :-1], df_val['target']
    X_test, y_test = df_test.iloc[:, :-1], df_test['target']
    features = {
        'numerical': ['retail_price', 'age'],
        'static': ['gender', 'country', 'city'],
        'dynamic': ['brand', 'department', 'category']
    }
    train_pool = cb.Pool(
        X_train,
        y_train,
        cat_features=features.get("static"),
        text_features=features.get("dynamic"),
    )
    validation_pool = cb.Pool(
        X_val,
        y_val,
        cat_features=features.get("static"),
        text_features=features.get("dynamic"),
    )
    test_pool = cb.Pool(
        X_test,
        y_test,
        cat_features=features.get("static"),
        text_features=features.get("dynamic"),
    )
    text_processing_options = {
        "tokenizers": [
            {"tokenizer_id": "SemiColon", "delimiter": ";", "lowercasing": "false"}
        ],
        "dictionaries": [{"dictionary_id": "Word", "gram_order": "1"}],
        "feature_processing": {
            "default": [
                {
                    "dictionaries_names": ["Word"],
                    "feature_calcers": ["BoW"],
                    "tokenizers_names": ["SemiColon"],
                }
            ],
        },
    }
    # Train the model
    model = cb.CatBoostClassifier(
        iterations=200,
        loss_function="Logloss",
        random_state=42,
        verbose=1,
        auto_class_weights="SqrtBalanced",
        use_best_model=True,
        text_processing=text_processing_options,
        eval_metric='AUC'
    )
    model.fit(
        train_pool,
        eval_set=validation_pool,
        verbose=10
    )
    roc_train = roc_auc_score(y_true=y_train, y_score=model.predict(X_train))
    roc_eval  = roc_auc_score(y_true=y_val, y_score=model.predict(X_val))
    roc_test  = roc_auc_score(y_true=y_test, y_score=model.predict(X_test))
    print(f"ROC-AUC for train set      : {roc_train:.2f}")
    print(f"ROC-AUC for validation set : {roc_eval:.2f}")
    print(f"ROC-AUC for test set       : {roc_test:.2f}")
    return {"model": model, "scores": {"train": roc_train, "eval": roc_eval, "test": roc_test}}
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-path", type=str)
    parser.add_argument("--validation-path", type=str)
    parser.add_argument("--test-path", type=str)
    parser.add_argument("--output-dir", type=str)
    args, _ = parser.parse_known_args()
    _ = train_and_evaluate(
        args.train_path,
        args.validation_path,
        args.test_path)

It's much cleaner now. You can now actually launch scripts from the command line.

$ python train.py --train-path xxx --validation-path yyy etc.
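
As a small illustration of how these command-line flags reach the code, the sketch below parses an explicit argument list with the same parser definition as train.py. The `parse_args` wrapper, the file names, and the `--extra` flag are assumptions for illustration.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-path", type=str)
    parser.add_argument("--validation-path", type=str)
    parser.add_argument("--test-path", type=str)
    parser.add_argument("--output-dir", type=str)
    # parse_known_args returns unknown flags instead of exiting with an error
    args, unknown = parser.parse_known_args(argv)
    return args, unknown


args, unknown = parse_args(
    ["--train-path", "train.csv", "--validation-path", "val.csv", "--extra", "1"]
)
print(args.train_path, args.test_path, unknown)  # train.csv None ['--extra', '1']
```

Dashes in flag names become underscores in attribute names, and flags that are not passed default to None, so a production script may want to mark the paths as required.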

Now you're ready to build your Docker image. To do this, you need to write a Dockerfile in the root of your project.

# Dockerfile
FROM python:3.8-slim
WORKDIR /
COPY requirements.txt /requirements.txt
COPY src /src
RUN pip install --upgrade pip && pip install -r requirements.txt
ENTRYPOINT [ "bash" ]

This copies our requirements file and the src folder with its contents into the image, and installs the requirements with pip when the image is built.

To build this image and push it to a container registry, you can use the gcloud command from the Google Cloud SDK:

PROJECT_ID = ...
IMAGE_NAME = 'thelook_training_demo'
IMAGE_TAG = 'latest'
IMAGE_URI = 'eu.gcr.io/{}/{}:{}'.format(PROJECT_ID, IMAGE_NAME, IMAGE_TAG)
!gcloud builds submit --tag $IMAGE_URI .

If everything goes well, you should see something like this:

Vertex Pipelines, moving to production

Docker images are the first step to doing serious machine learning in production. The next step is to build a so-called “pipeline”. A pipeline is a series of operations orchestrated by a framework called Kubeflow Pipelines. Pipelines written with it can run on Vertex AI on Google Cloud.
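
Conceptually, a pipeline is a chain of steps in which each step consumes the outputs of earlier steps and every intermediate result is kept as an artifact. The plain-Python sketch below only illustrates that idea. It is not how Kubeflow works internally, and all function names and data are made up.

```python
def load_data():
    return [3, 1, 2, 5, 4]


def split(dataset):
    cut = int(len(dataset) * 0.8)
    return {"train": dataset[:cut], "test": dataset[cut:]}


def train(train_set):
    # Stand-in "model": just the mean of the training data
    return sum(train_set) / len(train_set)


def run_pipeline():
    # Each step stores its output so later steps (and humans) can inspect it
    artifacts = {}
    artifacts["dataset"] = load_data()
    artifacts["splits"] = split(artifacts["dataset"])
    artifacts["model"] = train(artifacts["splits"]["train"])
    return artifacts


artifacts = run_pipeline()
print(artifacts["model"])  # 2.75
```

A real orchestrator adds what this sketch lacks: each step runs in its own container, artifacts are persisted to storage, and runs are logged and can be compared.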

The reasons for preferring pipelines over notebooks in production are debatable, but here are three reasons based on my experience.

Monitoring and reproducibility: Each pipeline run is saved with its artifacts (datasets, models, metrics), so runs can be compared, rerun, and audited. Every time you rerun a notebook, you lose the history (or you have to manage the artifacts and the logs yourself. Good luck).
Cost: Running a notebook means having a machine to run it on.
- This machine costs money, and large models or huge datasets require heavy-duty virtual machines.
- You have to remember to switch it off when it is not in use.
- Or, if you skip the virtual machine and run locally alongside other applications, you may simply crash your local machine.
- Vertex AI Pipelines is serverless, which means you don't have to manage the underlying infrastructure and you only pay for what you use, i.e. the execution time.
Scalability: Good luck running dozens of experiments simultaneously on your local laptop. You will end up going back to a VM, scaling that VM, and rereading the bullet points above.

The last reason to prefer pipelines over notebooks is subjective and highly debatable, but in my opinion notebooks are simply not designed to run workloads on a schedule. They are great for exploration, though.

At least use a cron job that includes a Docker image, or if you want to do it the right way, use a pipeline. However, never run notebooks in a production environment.

Let's start writing the pipeline components.

# IMPORT REQUIRED LIBRARIES
from kfp.v2 import dsl
from kfp.v2.dsl import (Artifact,
Dataset,
Input,
Model,
Output,
Metrics,
Markdown,
HTML,
component, 
OutputPath, 
InputPath)
from kfp.v2 import compiler
from google.cloud.aiplatform import pipeline_jobs
%watermark --packages kfp,google.cloud.aiplatform

kfp                    : 2.7.0
google.cloud.aiplatform: 1.50.0

The first component downloads data from BigQuery and saves it as a CSV file.

The BASE_IMAGE used is the image you built earlier. You can use it to import the modules and functions defined in the src folder of your Docker image:

@component(
    base_image=BASE_IMAGE,
    output_component_file="get_data.yaml"
)
def create_dataset_from_bq(
    output_dir: Output[Dataset],
):
    from src.preprocess import create_dataset_from_bq
    df = create_dataset_from_bq()
    df.to_csv(output_dir.path, index=False)

Next step: Split your data

@component(
    base_image=BASE_IMAGE,
    output_component_file="train_test_split.yaml",
)
def make_data_splits(
    dataset_full: Input[Dataset],
    dataset_train: Output[Dataset],
    dataset_val: Output[Dataset],
    dataset_test: Output[Dataset]):
    import pandas as pd
    from src.preprocess import make_data_splits
    df_agg = pd.read_csv(dataset_full.path)
    df_agg.fillna('NA', inplace=True)
    df_train, df_val, df_test = make_data_splits(df_agg)
    print(f"{len(df_train)} samples in train")
    print(f"{len(df_val)} samples in val")
    print(f"{len(df_test)} samples in test")
    df_train.to_csv(dataset_train.path, index=False)
    df_val.to_csv(dataset_val.path, index=False)
    df_test.to_csv(dataset_test.path, index=False)

Next step: Train the model. Save the model score for display in the next step.

@component(
    base_image=BASE_IMAGE,
    output_component_file="train_model.yaml",
)
def train_model(
    dataset_train: Input[Dataset],
    dataset_val: Input[Dataset],
    dataset_test: Input[Dataset],
    model: Output[Model]
):
    import json
    from src.train import train_and_evaluate
    outputs = train_and_evaluate(
        dataset_train.path,
        dataset_val.path,
        dataset_test.path
    )
    cb_model = outputs['model']
    scores = outputs['scores']
    model.metadata["framework"] = "catboost"
    # Save the scores as JSON at the artifact path (the trained model itself is not written here)
    with open(model.path, 'w') as f:
        json.dump(scores, f)

The final step is to compute the metrics (which are actually computed during model training). This step is not strictly necessary, but it helps demonstrate how easy it is to build lightweight components. Note that in this case the component is not built from BASE_IMAGE (which can be quite large) but from a lightweight image containing only what it needs.

@component(
    base_image="python:3.9",
    output_component_file="compute_metrics.yaml",
)
def compute_metrics(
    model: Input[Model],
    train_metric: Output[Metrics],
    val_metric: Output[Metrics],
    test_metric: Output[Metrics]
):
    import json
    file_name = model.path
    with open(file_name, 'r') as file:
        model_metrics = json.load(file)
    train_metric.log_metric('train_auc', model_metrics['train'])
    val_metric.log_metric('val_auc', model_metrics['eval'])
    test_metric.log_metric('test_auc', model_metrics['test'])
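
The hand-off between the training component and the metrics component boils down to one step writing JSON to a path and the next step reading it back. Here is a minimal local sketch of that round trip. The helper names, the temporary directory standing in for model.path, and the scores are all hypothetical.

```python
import json
import os
import tempfile


def write_scores(path, scores):
    with open(path, "w") as f:
        json.dump(scores, f)


def read_scores(path):
    with open(path, "r") as f:
        return json.load(f)


# Hypothetical scores, for illustration only
scores = {"train": 0.61, "eval": 0.59, "test": 0.62}
with tempfile.TemporaryDirectory() as tmp_dir:
    artifact_path = os.path.join(tmp_dir, "model")  # stands in for model.path
    write_scores(artifact_path, scores)
    loaded = read_scores(artifact_path)

print(loaded == scores)  # True
```

In the pipeline, the orchestrator provides the path and persists the file, which is why the downstream component only needs the standard library and a small base image.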

There are usually other steps you can include, such as deploying the model as an API endpoint, but this is at a more advanced level and requires creating a separate Docker image to serve the model. I'll cover it next time.

Let's glue the components together.

# USE TIMESTAMP TO DEFINE UNIQUE PIPELINE NAMES
TIMESTAMP = dt.datetime.now().strftime("%Y%m%d%H%M%S")
DISPLAY_NAME = 'pipeline-thelook-demo-{}'.format(TIMESTAMP)
PIPELINE_ROOT = f"{BUCKET_NAME}/pipeline_root/"
# Define the pipeline. Notice how steps reuse outputs from previous steps
@dsl.pipeline(
    pipeline_root=PIPELINE_ROOT,
    # A name for the pipeline. Use to determine the pipeline Context.
    name="pipeline-demo"
)
def pipeline(
    project: str = PROJECT_ID,
    region: str = REGION,
    display_name: str = DISPLAY_NAME
):
    load_data_op = create_dataset_from_bq()
    train_test_split_op = make_data_splits(
        dataset_full=load_data_op.outputs["output_dir"]
    )
    train_model_op = train_model(
        dataset_train=train_test_split_op.outputs["dataset_train"],
        dataset_val=train_test_split_op.outputs["dataset_val"],
        dataset_test=train_test_split_op.outputs["dataset_test"],
    )
    model_evaluation_op = compute_metrics(
        model=train_model_op.outputs["model"]
    )
# Compile the pipeline as JSON
compiler.Compiler().compile(
    pipeline_func=pipeline,
    package_path='thelook_pipeline.json'
)
# Start the pipeline
start_pipeline = pipeline_jobs.PipelineJob(
    display_name="thelook-demo-pipeline",
    template_path="thelook_pipeline.json",
    enable_caching=False,
    location=REGION,
    project=PROJECT_ID
)
# Run the pipeline (replace the placeholder with your own service account)
start_pipeline.run(service_account="<your_service_account_here>")

If everything works well, you should see your pipeline in the Vertex UI.

Click on it and you will see different steps.

Despite no-code/low-code enthusiasts saying you don't need to be a developer to do machine learning, data science is a real job. Like all jobs, it requires skills, concepts, and tools that go beyond a notebook.

For aspiring data scientists, the reality of the job is:

Have fun coding!
