Fine-tuning a visual language model to generate fashion product descriptions using SageMaker and Amazon Bedrock

Machine Learning


In the world of online retail, creating high-quality product descriptions for millions of products is a crucial yet time-consuming task. Automating product description generation using machine learning (ML) and natural language processing (NLP) has the potential to reduce manual effort and transform the way e-commerce platforms operate. One of the main benefits of high-quality product descriptions is improved searchability: customers can more easily find products with the right descriptions because search engines can identify products that match not only the general category but also the specific attributes listed in the product description. For example, if a consumer is looking for a “long-sleeved cotton shirt,” products whose descriptions contain terms like “long sleeve” and “cotton” will be returned. Additionally, factual product descriptions allow for a more personalized buying experience and improve the algorithms that recommend relevant products to users, increasing the likelihood that users will purchase and thus increasing customer satisfaction.

Advances in generative AI make it possible to predict product attributes directly from images using vision language models (VLMs). Pre-trained image captioning or visual question answering (VQA) models work well for describing everyday images, but they fail to capture the domain-specific nuances of e-commerce products required to achieve satisfactory performance across all product categories. To solve this problem, this post shows how to predict domain-specific product attributes from product images by fine-tuning a VLM on a fashion dataset using Amazon SageMaker, and then generating product descriptions with Amazon Bedrock using the predicted attributes as input. The code is available in the GitHub repository so you can follow along.

Amazon Bedrock is a fully managed service that provides a selection of high-performing foundation models (FMs) from leading AI companies, including AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon, through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

As explained in Automating Product Description Generation with Amazon Bedrock, you can use managed services such as Amazon Rekognition to predict product attributes. However, if you want to extract domain- or industry-specific product details and characteristics, you need to fine-tune a VLM on Amazon SageMaker.

Visual language models

Since 2021, there has been increased interest in visual language models (VLMs), with the release of solutions such as Contrastive Language-Image Pre-training (CLIP) and Bootstrapping Language-Image Pre-training (BLIP). For tasks such as image captioning, text-guided image generation, and visual question answering, VLMs have demonstrated state-of-the-art performance.

This post uses BLIP-2, which was introduced in BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. BLIP-2 consists of three models: a CLIP-like image encoder, a Querying Transformer (Q-Former), and a large language model (LLM). We use a version of BLIP-2 that contains Flan-T5-XL as the LLM.

The following diagram shows an overview of BLIP-2.

Blip-2 architecture

Figure 1: BLIP-2 overview

Pre-trained versions of the BLIP-2 model have been demonstrated in Build generative image-to-text AI applications using multimodality models with Amazon SageMaker and Build generative AI-based content moderation solutions with Amazon SageMaker JumpStart. This post shows how to fine-tune BLIP-2 for a domain-specific use case.

Solution overview

The following diagram shows the solution architecture:

Solution Architecture

Figure 2: High-level solution architecture

Here's an overview of the solution:

  • ML scientists use SageMaker notebooks to process the data and split it into training and validation sets.
  • The datasets are uploaded to Amazon Simple Storage Service (Amazon S3) using the S3 client (a wrapper around HTTP calls).
  • The SageMaker Training job is then launched using the SageMaker client, which is also a wrapper around HTTP calls.
  • The training job copies the dataset from Amazon S3 to the training container, trains the model, and saves the model artifacts to Amazon S3.
  • An endpoint is then created through another invocation of the SageMaker client, and the model artifacts are copied to the endpoint hosting container.
  • The inference workflow is invoked through an AWS Lambda request that first calls the SageMaker endpoint to predict the product attributes and then calls Amazon Bedrock with the predicted attributes (a minimal handler sketch follows this list).
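To make the last step more concrete, the following is a minimal Lambda handler sketch, not the exact implementation from the repository; the environment variable names and event fields (ENDPOINT_NAME, MODEL_ID, question, image_base64) are assumptions.

import json
import os

import boto3

sagemaker_runtime = boto3.client("sagemaker-runtime")
bedrock_runtime = boto3.client("bedrock-runtime")

def handler(event, context):
    # Step 1: ask the fine-tuned BLIP-2 endpoint for the product attributes.
    payload = {"prompt": event["question"], "image": event["image_base64"]}
    sm_response = sagemaker_runtime.invoke_endpoint(
        EndpointName=os.environ["ENDPOINT_NAME"],  # hypothetical environment variable
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    attributes = json.loads(sm_response["Body"].read())

    # Step 2: pass the predicted attributes to Amazon Bedrock to generate the description.
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 400,
        "messages": [{"role": "user", "content": json.dumps(attributes)}],
    })
    bedrock_response = bedrock_runtime.invoke_model(
        modelId=os.environ["MODEL_ID"],  # hypothetical environment variable
        body=body,
    )
    return json.loads(bedrock_response["body"].read())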

The following sections explain how to:

  • Set up your development environment
  • Load and prepare the dataset
  • Fine-tune the BLIP-2 model and learn product attributes using SageMaker
  • Deploy the fine-tuned BLIP-2 model to predict product attributes using SageMaker
  • Generate product descriptions from predicted product attributes using Amazon Bedrock

Set up your development environment

Your AWS account must have an AWS Identity and Access Management (IAM) role with permissions to manage the resources created as part of your solution. For more information, see Create an AWS Account.

This post uses an Amazon SageMaker Studio notebook with an ml.t3.medium instance and the Data Science 3.0 image. However, you can also use an Amazon SageMaker notebook instance or any integrated development environment (IDE).

Note: Be sure to set up your AWS Command Line Interface (AWS CLI) credentials correctly. For more information, see Configuring the AWS CLI.

An ml.g5.2xlarge instance is used for the SageMaker Training job, and an ml.g5.2xlarge instance is used for the SageMaker endpoint. If necessary, request a quota increase to make sure you have enough capacity for this instance type in your AWS account. Also review the On-Demand Instance pricing.

To replicate the solution described in this post, clone the GitHub repository. First, open the notebook main.ipynb in SageMaker Studio, select the Data Science image and the Python 3 kernel, and install all required libraries listed in requirements.txt.
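The notebook also needs a SageMaker session, an execution role, and an S3 output location, which the later code refers to as sagemaker_session, role, and output_path. The following is a minimal setup sketch, assuming the default SageMaker bucket and an arbitrary prefix:

import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()
role = get_execution_role()  # IAM role with SageMaker and S3 permissions

# S3 location for training artifacts and code (bucket and prefix are assumptions)
bucket = sagemaker_session.default_bucket()
output_path = f"s3://{bucket}/blip2-fashion"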

Load and prepare the dataset

In this post, we use the Kaggle Fashion Image Dataset, which contains 44,000 products with multiple category labels, descriptions, and high-resolution images. Using images and questions as input, we demonstrate how to fine-tune a model to learn attributes such as a shirt's fabric, fit, collar, pattern, and sleeve length.

Each product is identified by an ID, such as 38642, and there is a mapping to all products in styles.csv. The image for this product is located at images/38642.jpg, and the complete product metadata is in styles/38642.json. To fine-tune the model, we need to convert the structured examples into a collection of question-answer pairs. After processing each attribute, the final dataset is in the following format:

Id | Question | Answer
38642 | What is the fabric of the clothing in this picture? | Fabric: Cotton

After processing the dataset, split it into training and validation sets, create a CSV file, and upload the dataset to Amazon S3.
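The exact preprocessing code is in the repository; the following sketch only illustrates the idea, assuming the relevant attributes can be read directly from each styles/<id>.json file (the field access and file layout here are simplified assumptions):

import json

import pandas as pd

ATTRIBUTES = ["Fabric", "Fit", "Collar", "Pattern", "Sleeve Length"]

def to_qa_pairs(product_id, metadata):
    # Turn one product's metadata into question-answer rows (field names are assumptions).
    rows = []
    for attribute in ATTRIBUTES:
        value = metadata.get(attribute)  # e.g. "Cotton" for "Fabric"
        if value:
            rows.append({
                "id": product_id,
                "question": f"What is the {attribute.lower()} of the clothing in this picture?",
                "answer": f"{attribute}: {value}",
            })
    return rows

rows = []
for product_id in product_ids:  # IDs read from styles.csv
    with open(f"styles/{product_id}.json") as f:
        rows.extend(to_qa_pairs(product_id, json.load(f)))

df = pd.DataFrame(rows)
train_df = df.sample(frac=0.9, random_state=42)
val_df = df.drop(train_df.index)
train_df.to_csv("vqa_train.csv", index=False)
val_df.to_csv("vqa_val.csv", index=False)

# Upload the processed CSV file to Amazon S3
input_file = sagemaker_session.upload_data("vqa_train.csv", bucket=bucket, key_prefix="blip2-fashion/data")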

Fine-tune the BLIP-2 model to learn product attributes using SageMaker

We use the Hugging Face Estimator to launch a SageMaker Training job. SageMaker starts and manages all the required Amazon Elastic Compute Cloud (Amazon EC2) instances, provides the appropriate Hugging Face container, uploads the specified scripts, and downloads the data from the S3 bucket into the container at /opt/ml/input/data.

We fine-tune BLIP-2 using the Low-Rank Adaptation (LoRA) technique, which adds trainable rank decomposition matrices to the Transformer layers while keeping the pre-trained model weights frozen. This technique improves training throughput, reduces the required GPU RAM by a factor of 3, and reduces the number of trainable parameters by a factor of 10,000. Despite using fewer trainable parameters, LoRA has been shown to perform as well as or better than full fine-tuning.

We prepared entrypoint_vqa_finetuning.py, which implements the fine-tuning of BLIP-2 with LoRA using Hugging Face Transformers, Accelerate, and Parameter-Efficient Fine-Tuning (PEFT). The script also merges the LoRA weights into the model weights after training, so the model can be deployed as a regular model without any additional code.

from peft import LoraConfig, get_peft_model
from transformers import Blip2ForConditionalGeneration
 
model = Blip2ForConditionalGeneration.from_pretrained(
        "Salesforce/blip2-flan-t5-xl",
        device_map="auto",
        cache_dir="/tmp",
        load_in_8bit=True,
    )

config = LoraConfig(
    r=8, # Lora attention dimension.
    lora_alpha=32, # the alpha parameter for Lora scaling.
    lora_dropout=0.05, # the dropout probability for Lora layers.
    bias="none", # the bias type for Lora.
    target_modules=["q", "v"],
)

model = get_peft_model(model, config)
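After training, entrypoint_vqa_finetuning.py merges the LoRA weights into the base model so the artifact can be served like a regular model. The following is a sketch of that step using PEFT's merge_and_unload; the processor variable and output directory are assumptions, and if the base model was loaded in 8-bit it may need to be reloaded in full precision before merging.

# Fold the LoRA adapters back into the base weights and save them to the
# directory that SageMaker packages into model.tar.gz.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("/opt/ml/model")
processor.save_pretrained("/opt/ml/model")  # assumes a Blip2Processor was loaded alongside the model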

We reference entrypoint_vqa_finetuning.py as the entry_point in the Hugging Face estimator.

from sagemaker.huggingface import HuggingFace

hyperparameters = {
    'epochs': 10,
    'file-name': "vqa_train.csv",
}

estimator = HuggingFace(
    entry_point="entrypoint_vqa_finetuning.py",
    source_dir="../src",
    role=role,
    instance_count=1,
    instance_type="ml.g5.2xlarge", 
    transformers_version='4.26',
    pytorch_version='1.13',
    py_version='py39',
    hyperparameters = hyperparameters,
    base_job_name="VQA",
    sagemaker_session=sagemaker_session,
    output_path=f"{output_path}/models",
    code_location=f"{output_path}/code",
    volume_size=60,
    metric_definitions=[
        {'Name': 'batch_loss', 'Regex': 'Loss: ([0-9\\.]+)'},
        {'Name': 'epoch_loss', 'Regex': 'Epoch Loss: ([0-9\\.]+)'}
    ],
)

You can start the training job by running the .fit() method and passing the Amazon S3 paths of the images and the input file:

estimator.fit({"images": images_input, "input_file": input_file})
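Here, images_input and input_file are the Amazon S3 URIs of the image folder and the processed CSV file (for example, the value returned by upload_data earlier); SageMaker copies each channel into the training container under /opt/ml/input/data/<channel_name>. A sketch with an assumed prefix:

images_input = f"s3://{bucket}/blip2-fashion/images/"  # folder containing <id>.jpg product images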

Deploy the fine-tuned BLIP-2 model to predict product attributes using SageMaker

We use the Hugging Face Inference Container to deploy the fine-tuned BLIP-2 model to a SageMaker real-time endpoint. You can also use the Large Model Inference (LMI) container, detailed in Build generative AI-based content moderation solutions with Amazon SageMaker JumpStart, to deploy a pre-trained BLIP-2 model. Here, we reference the fine-tuned model in Amazon S3 instead of the pre-trained model available on the Hugging Face Hub. First, create the model and deploy the endpoint:

from sagemaker.huggingface import HuggingFaceModel

model = HuggingFaceModel(
   model_data=estimator.model_data,
   role=role,
   transformers_version="4.28",
   pytorch_version="2.0",
   py_version="py310",
   model_server_workers=1,
   sagemaker_session=sagemaker_session
)

endpoint_name = "endpoint-finetuned-blip2"
model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge", endpoint_name=endpoint_name )

After the endpoint is in service, you can invoke it for the instructed vision-to-language generation task, using an input image and a question as the prompt:

inputs = {
    "prompt": "What is the sleeve length of the shirt in this picture?",
    "image": image # image encoded in Base64
}
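One way to send this payload is through the SageMaker runtime client; the following sketch assumes the inference script accepts a JSON body in the shape shown above:

import json

import boto3

runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(inputs),
)
predicted_attributes = json.loads(response["Body"].read())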

The output response will look like this:

{"Sleeve Length": "Long Sleeves"}

Generate product descriptions from predicted product attributes using Amazon Bedrock

To start using Amazon Bedrock, request access to the foundation models (they aren't enabled by default). Follow the steps in the documentation to enable model access. This post uses Anthropic's Claude in Amazon Bedrock to generate product descriptions. Specifically, we use the model anthropic.claude-3-sonnet-20240229-v1:0 because it provides a good balance of performance and speed.

After you create a boto3 client for Amazon Bedrock, create a prompt string that specifies that you want to generate a product description using product attributes.

You are an expert in writing product descriptions for shirts. Use the data below to create a product description for a website. The product description should contain all given attributes.
Provide some inspirational sentences, for example, how the fabric moves. Think about what a potential customer wants to know about the shirts. Here are the facts you need to create the product descriptions:
[Here we insert the predicted attributes by the BLIP-2 model]
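The system prompt above and the predicted attributes are then assembled into the prompt and attributes_content variables used in the request below. A minimal sketch, assuming the attributes are available as a dictionary like the endpoint response shown earlier:

prompt = (
    "You are an expert in writing product descriptions for shirts. "
    "Use the data below to create a product description for a website. "
    "The product description should contain all given attributes.\n"
    "Provide some inspirational sentences, for example, how the fabric moves. "
    "Think about what a potential customer wants to know about the shirts. "
    "Here are the facts you need to create the product descriptions:"
)

# predicted_attributes, e.g. {"Sleeve Length": "Long Sleeves", "Fabric": "Cotton"}
attributes_content = [
    {"role": "user", "content": json.dumps(predicted_attributes)}
]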

The prompt and the model parameters, such as the maximum number of tokens in the response and the temperature, are passed in the request body. The JSON response must be parsed before the generated text can be printed.

bedrock = boto3.client(service_name="bedrock-runtime", region_name="us-west-2")

model_id = "anthropic.claude-3-sonnet-20240229-v1"

body = json.dumps(
    {"system": prompt, "messages": attributes_content, "max_tokens": 400, "temperature": 0.1, "anthropic_version": "bedrock-2023-05-31"}
)

response = bedrock.invoke_model(
    body=body,
    modelId=model_id,
    accept="application/json",
    contentType="application/json"
)
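The response body is a JSON document; with the Anthropic Messages API on Amazon Bedrock, the generated text is returned in the content list, so extracting it looks roughly like this:

response_body = json.loads(response["body"].read())
generated_description = response_body["content"][0]["text"]
print(generated_description)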

The generated product description response looks like this:

"Classic Striped Shirt Relax into comfortable casual style with this classic collared striped shirt. With a regular fit that is neither too slim nor too loose, this versatile top layers perfectly under sweaters or jackets."

Conclusion

We have demonstrated how combining a VLM fine-tuned on SageMaker with an LLM on Amazon Bedrock provides a powerful solution for automating fashion product description generation. By fine-tuning the BLIP-2 model on a fashion dataset using Amazon SageMaker, you can predict nuanced, domain-specific product attributes directly from images. You can then use Amazon Bedrock to generate product descriptions from the predicted product attributes, enhancing searchability and personalization for e-commerce platforms. As we continue to explore the potential of generative AI, LLMs and VLMs emerge as promising avenues for revolutionizing content generation in the ever-evolving online retail landscape. As a next step, you can fine-tune this model on your own dataset using the code provided in the GitHub repository to test and benchmark the results for your use case.


About the Author

Antonia Weiberer is a Data Scientist at the AWS Generative AI Innovation Center, where she enjoys building proofs of concept for customers. Her passion is exploring how generative AI can solve real-world problems and create value for customers. When she's not coding, she enjoys running and competing in triathlons.

Daniel Zagiva is a Data Scientist with AWS Professional Services, specializing in developing scalable, production-grade machine learning solutions for AWS customers, with experience across a variety of domains, including natural language processing, generative AI, and machine learning operationalization.

Run Ye is a Machine Learning Engineer with AWS Professional Services, specializing in NLP, predictive modeling, MLOps, and generative AI and helping customers bring machine learning into their businesses. Run holds a degree in Data Science and Technology from Delft University of Technology.

Fotinos Kyriakides is an AI/ML Consultant with AWS Professional Services, specializing in developing production-ready ML solutions and platforms for AWS customers. In his free time, Fotinos enjoys running and exploring.


