An amazing experiment that shows the devil is in the details


With the increasing number of embedding models available, choosing the right embedding model for your machine learning application can be difficult. Fortunately, the MTEB leaderboard provides comprehensive ranking metrics for a variety of natural language processing tasks.
If you visit the site, you'll see that the top five embedding models are generative pretrained transformers (GPTs). For this reason, one might assume that a GPT model is the best choice for embeddings. But is this really true? Let's run an experiment to find out.
An embedding is a tensor representation of text, obtained by transforming text token IDs and projecting them into a tensor space.
You can obtain the embedding vector by inputting text into a neural network model and performing a forward pass. However, the actual process is a little more complicated. Let's look at it step by step.
- Convert the text to token IDs
- Pass the token IDs to the neural network
- Return the output of the neural network
The first step uses a tokenizer to achieve that: model_inputs below is the tensor representation of the text content "some questions."
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
messages = [
    {
        "role": "user",
        "content": "some questions.",
    },
]
encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
model_inputs = encodeds.to("cuda")
The second step is simple: pass model_inputs to the neural network and perform a forward pass. The logits of the generated tokens can be accessed via .logits. Note the torch.no_grad() context manager: since the model is in inference mode, we don't need gradients or weight updates.
import torch

with torch.no_grad():
    logits = model(model_inputs).logits
The third step is a little trickier. The GPT model is decoder-only, and its token generation is autoregressive. Simply put, the last token of a sentence attends to all the preceding tokens, so its output aggregates information (attention scores) from the entire sequence.
Bingo! Because of the Transformer attention mechanism, it is the last token we are most interested in.
The output of GPT models as implemented in Hugging Face has dimensions (batch size, sequence length, vocabulary size). Tensor slicing retrieves the last token's output for every example in the batch.
import torch

with torch.no_grad():
    last_token_logits = model(model_inputs).logits[:, -1, :]
To measure the quality of these GPT embeddings, we can use cosine similarity. The higher the cosine similarity, the closer the semantic meanings of the sentences are.
import torch

def compute_cosine_similarity(vec1, vec2):
    cos = torch.nn.CosineSimilarity(dim=1, eps=1e-6)
    return cos(vec1, vec2)
Let's create some utility functions that loop through a list of question-and-answer pairs and check the results. A great open source model, Mistral 7B Instruct v0.1, will be used for this experiment.
import torch
from termcolor import colored
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1"
).to("cuda")  # move the model to the same device ("cuda") as the inputs below
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
def generate_last_token_embeddings(question, max_new_tokens=30):
    messages = [
        {
            "role": "user",
            "content": question,
        },
    ]
    encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
    model_inputs = encodeds.to("cuda")
    with torch.no_grad():
        return model(model_inputs).logits[:, -1, :]
def get_similarities(questions, answers):
    for question in questions:
        for answer in answers:
            q_embedding, a_embedding = (
                generate_last_token_embeddings(question),
                generate_last_token_embeddings(answer),
            )
            similarity = compute_cosine_similarity(q_embedding, a_embedding)
            print(colored(f"question: {question} and ans: {answer}", "green"))
            print(colored(f"result: {similarity}", "blue"))
questions = ["Where is the headquarter of OpenAI?", "What is GPU?"]
answers = [
"OpenAI is based at San Francisco.",
"A graphics processing unit (GPU) is an electronic circuit that can perform mathematical calculations quickly",
]
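With the model, helper functions, and data in place, the comparison is a single call (a minimal usage sketch):
get_similarities(questions, answers)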
For the first question-and-answer pair, the result looks like this:
- Question: “Where is OpenAI's headquarters?”
- Answer: “OpenAI is based in San Francisco.”
- Cosine similarity: 0.96
For the second question/answer pair:
- Question: “What is a GPU?”
- Answer: “A graphics processing unit (GPU) is an electronic circuit that can quickly perform mathematical calculations.”
- Cosine similarity: 0.94
For unrelated pairs:
- Question: “Where is OpenAI's headquarters?”
- Answer: “A graphics processing unit (GPU) is an electronic circuit that can quickly perform mathematical calculations.”
- Cosine similarity: 0.90
For the worst pair:
- Question: “What is a GPU?”
- Answer: “OpenAI is based in San Francisco.”
- Cosine similarity: 0.93
These results suggest that using a GPT model directly as the embedding model (in this case, Mistral 7B Instruct v0.1) does not do a good job of distinguishing related from unrelated pairs. So why are GPT models still among the top five embedding models on the leaderboard?
tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-mistral-7b-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "intfloat/e5-mistral-7b-instruct"
).to("cuda")  # keep the model on the same device as the inputs
e5-mistral-7b-instruct (Image by the author)
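Because the helper functions above read the tokenizer and model globals, the same evaluation can be rerun unchanged against the new checkpoint (a usage sketch under that assumption):
get_similarities(questions, answers)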
Repeating the same evaluation procedure with a different model, e5-mistral-7b-instruct, one of the top open source models on the MTEB leaderboard and fine-tuned from Mistral 7B Instruct, we find cosine similarities of 0.88 and 0.84 for the related OpenAI and GPU question-answer pairs, respectively. For the unrelated question-answer pairs, the similarity drops to 0.56 and 0.67. These results suggest that e5-mistral-7b-instruct is a much better model for embeddings. What brings about such an improvement?
Examining the paper behind e5-mistral-7b-instruct, the key is that the Mistral model is further fine-tuned with a contrastive loss.
Unlike GPT, which is trained or fine-tuned with a cross-entropy loss over predicted and labeled tokens, contrastive loss aims to maximize the distance between negative pairs and minimize the distance between positive pairs.
This blog post explains the concept in detail. The sim function calculates the cosine distance between two vectors. In the contrastive loss, the denominator sums over both the positive and the negative examples. The rationale is that log(1) = 0 represents the optimal loss, so we want the ratio inside the log to be as close to 1 as possible: the similarity of the positive pair should dominate the similarities of the negative pairs.
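To make the idea concrete, here is a minimal PyTorch sketch of an InfoNCE-style contrastive loss (my own illustration, assuming a single positive document and a batch of negative documents per query, not the exact implementation from the paper):
import torch
import torch.nn.functional as F

def info_nce_loss(query, positive, negatives, temperature=0.05):
    # query: (dim,), positive: (dim,), negatives: (num_negatives, dim)
    sim_pos = F.cosine_similarity(query, positive, dim=0) / temperature
    sim_neg = F.cosine_similarity(query.unsqueeze(0), negatives, dim=1) / temperature
    # Numerator: the positive pair; denominator: the positive plus all negative pairs,
    # so the loss is -log(exp(sim_pos) / sum(exp(all_sims))), i.e. cross-entropy
    # with the positive pair at index 0.
    all_sims = torch.cat([sim_pos.unsqueeze(0), sim_neg]).unsqueeze(0)
    return F.cross_entropy(all_sims, torch.tensor([0]))
Minimizing this loss pulls the query embedding toward its positive document and pushes it away from the negatives, which is exactly the discriminative behavior missing from the vanilla GPT embeddings above.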
In this post, we highlighted a common pitfall of using a GPT model as an embedding model without fine-tuning. Our evaluation suggests that fine-tuning GPT with a contrastive loss can make the embeddings more meaningful and discriminative. By understanding the strengths and limitations of GPT models and leveraging customized objectives such as contrastive loss, you can make more informed decisions when selecting and using embedding models for your machine learning projects. We hope this post helps you choose GPT-based embedding models wisely for your application, and we welcome your feedback. 🙂