Hugging Face Transformers in Action: Learn How to Leverage AI for NLP

Natural Language Processing (NLP) has revolutionized the way we interact with technology.

Remember when chatbots first came out and sounded like robots? Thankfully, that's a thing of the past.

Transformer models have waved their magic wand and reshaped NLP tasks. But before you stop reading this post and think, "Hey, Transformers are too dense to learn," please bear with me. This won't be another technical article explaining the math behind this amazing technology. Instead, we'll learn in practice what it can do for us.

Hugging Face's Transformers Pipeline makes NLP tasks easier than ever.

Let's explore!

The only explanation of what a transformer is

Think of a transformer model as the elite of the NLP world.

Transformers are great because they can focus on different parts of an input sequence through a mechanism called “self-attention.”

Self-attention is what makes them so powerful: it lets the model decide, at any given moment, which parts of a sentence are the most important to focus on.

Have you heard of BERT, GPT, or RoBERTa? That's them! BERT (Bidirectional Encoder Representations from Transformers) is a language model released by Google AI in 2018 that understands the context of a text by reading words in both directions, left-to-right and right-to-left, at the same time.

Enough talk, let's get down to business with the transformers package [1].

Transformers pipeline overview

The Transformers library provides a complete toolkit for training and running state-of-the-art pre-trained models. Our main subject, the Pipeline class, provides an easy-to-use interface for a variety of tasks, including:

  • Text generation
  • Image segmentation
  • Speech recognition
  • Document question answering (QA)

Preparation

Before you begin, let's cover the basics and gather your tools. You'll need Python, the transformers library, and probably either PyTorch or TensorFlow. Installation is business as usual: pip install transformers.

Distributions like Anaconda and platforms like Google Colab already include these as standard installations. No problem.

The Pipeline class allows you to perform many machine learning tasks using the models available in Hugging Face Hub. It's as easy as plug and play.

All tasks come with a preconfigured default model and preprocessor, but you can easily customize this using the model parameter and switch to a different model of your choice.

Code

Before we go deeper, let's start with Transformers 101 and see how it works. The first task we perform is a simple sentiment analysis of a particular news headline.

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("Instagram wants to limit hashtag spam.")

The response is as follows.

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

Device set to use cpu
[{'label': 'NEGATIVE', 'score': 0.988932728767395}]

Because we did not supply a model parameter, the pipeline fell back to the default option. As for the classification, the model rates the sentiment of this headline as negative with about 99% confidence. You can also classify a list of sentences instead of a single one.
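The warning in the output above advises against relying on the default model. As a minimal sketch, the call can name the model and revision explicitly (both copied from that warning message) and take a list of sentences. The helper names `classify_headlines` and `format_predictions` are hypothetical and exist only for this illustration.

```python
# Model name and revision copied from the warning printed by the default pipeline.
MODEL_NAME = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
MODEL_REVISION = "714eb0f"


def classify_headlines(texts, model=MODEL_NAME, revision=MODEL_REVISION):
    """Run sentiment analysis on a list of strings with a pinned model (hypothetical helper)."""
    from transformers import pipeline  # imported here so the formatter below works without it

    classifier = pipeline("sentiment-analysis", model=model, revision=revision)
    return classifier(texts)  # one {'label': ..., 'score': ...} dict per input text


def format_predictions(texts, predictions):
    """Pair each text with its label and a percentage score (hypothetical helper)."""
    return [
        f"{text} -> {pred['label']} ({100 * pred['score']:.1f}%)"
        for text, pred in zip(texts, predictions)
    ]
```

Calling `format_predictions(headlines, classify_headlines(headlines))` would then give one readable line per headline.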

Very easy, right? But that's not all. We will continue to explore other great features.

Zero-shot classification

Zero-shot classification means labeling text that has not been labeled yet, using classes the model was never explicitly trained on. All we have to do is pass a few candidate classes and let the model choose among them. This is very useful when creating training datasets for machine learning.

This time, we pass the method the model argument and a list of sentences to classify.

classifier = pipeline("zero-shot-classification", model = 'facebook/bart-large-mnli')
classifier(
    ["Inter Miami wins the MLS", "Match tonight betwee Chiefs vs. Patriots", "Michael Jordan plans to sell Charlotte Hornets"],
    candidate_labels=["soccer", "football", "basketball"]
    )
[{'sequence': 'Inter Miami wins the MLS',
  'labels': ['soccer', 'football', 'basketball'],
  'scores': [0.9162040948867798, 0.07244189083576202, 0.011354007758200169]},
 {'sequence': 'Match tonight betwee Chiefs vs. Patriots',
  'labels': ['football', 'basketball', 'soccer'],
  'scores': [0.9281435608863831, 0.0391676239669323, 0.032688744366168976]},
 {'sequence': 'Michael Jordan plans to sell Charlotte Hornets',
  'labels': ['basketball', 'football', 'soccer'],
  'scores': [0.9859175682067871, 0.009983371943235397, 0.004099058918654919]}]

It looks like the model did a great job labeling these sentences.
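Since the pipeline returns the labels sorted from most to least likely, the winning class for each sentence sits at index 0. A small sketch of pulling it out, using a hypothetical helper `top_labels` and the first result from the output above as sample data:

```python
def top_labels(results):
    """Return (sequence, best_label, best_score) for each zero-shot result (hypothetical helper)."""
    # 'labels' and 'scores' are aligned and sorted in descending order of score,
    # so the first entry of each is the winning class.
    return [(r["sequence"], r["labels"][0], r["scores"][0]) for r in results]


# Sample data: the first result from the output shown above.
sample = [
    {
        "sequence": "Inter Miami wins the MLS",
        "labels": ["soccer", "football", "basketball"],
        "scores": [0.9162040948867798, 0.07244189083576202, 0.011354007758200169],
    }
]

for sequence, label, score in top_labels(sample):
    print(f"{sequence}: {label} ({score:.2f})")
```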

Text generation

The package can also generate text. This is a good way to build a fun little story generator for your kids' bedtime. We set the temperature parameter, where higher values make the model more creative.

generator = pipeline("text-generation", temperature=0.8)
generator("Once upon a time, in a land where the King Pineapple was")
[{'generated_text': 
"Once upon a time, in a land where the King Pineapple was a common
 crop, the Queen of the North had lived in a small village. The Queen had always 
lived in a small village, and her daughter, who was also the daughter of the Queen,
 had lived in a larger village. The royal family would come to the Queen's village,
 and then the Queen would return to her castle and live there with her daughters. 
In the middle of the night, she would lay down on the royal bed and kiss the princess
 at least once, and then she would return to her castle to live there with her men. 
In the daytime, however, the Queen would be gone forever, and her mother would be alone.
The reason for this disappearance, in the form of the Great Northern Passage 
and the Great Northern Passage, was the royal family had always wanted to take 
the place of the Queen. In the end, they took the place of the Queen, and went 
with their daughter to meet the King. At that time, the King was the only person 
on the island who had ever heard of the Great Northern Passage, and his return was
 in the past.
After Queen Elizabeth's death, the royal family went to the 
Great Northern Passage, to seek out the Princess of England and put her there. 
The Princess of England had been in"}]
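To get a feel for what temperature does, here is a minimal sketch in plain Python of temperature-scaled softmax, the usual way a sampling temperature is applied to next-token scores. The scores are made up for illustration, and this is not the library's internal code.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, dividing by the temperature first."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next words.
logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=1.5)

print([round(p, 3) for p in sharp])  # low temperature: the top word dominates
print([round(p, 3) for p in flat])   # high temperature: probabilities are more even
```

A lower temperature concentrates probability on the most likely word, while a higher one spreads it out, which is why sampled text becomes more varied.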

Named entity recognition

This task can recognize persons (PER), locations (LOC), or organizations (ORG) in a given text. That is great for quickly building a marketing list of lead names, for example.

ner = pipeline("ner", aggregation_strategy="simple")  # replaces the deprecated grouped_entities=True
ner("The man landed on the moon in 1969. Neil Armstrong was the first man to step on the Moon's surface. He was a NASA Astronaut.")
[{'entity_group': 'PER', 'score': np.float32(0.99960065),'word': 'Neil Armstrong',
  'start': 36,  'end': 50},

 {'entity_group': 'LOC',  'score': np.float32(0.82190216),  'word': 'Moon',
  'start': 84,  'end': 88},

 {'entity_group': 'ORG',  'score': np.float32(0.9842771),  'word': 'NASA',
  'start': 109,  'end': 113},

 {'entity_group': 'MISC',  'score': np.float32(0.8394754),  'word': 'As',
  'start': 114,  'end': 116}]
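The output above also contains a stray "As" fragment tagged as MISC. For something like a list of names, a small filter on entity type and score helps. This is a sketch with a hypothetical helper `keep_confident_entities`; the threshold of 0.8 is an arbitrary choice, and the sample data copies the output above with plain floats.

```python
def keep_confident_entities(entities, min_score=0.8, allowed=("PER", "LOC", "ORG")):
    """Keep (word, entity_group) pairs of allowed types whose score clears min_score."""
    return [
        (e["word"], e["entity_group"])
        for e in entities
        if e["entity_group"] in allowed and float(e["score"]) >= min_score
    ]


# Sample data: the NER output shown above, with scores as plain floats.
sample = [
    {"entity_group": "PER", "score": 0.99960065, "word": "Neil Armstrong", "start": 36, "end": 50},
    {"entity_group": "LOC", "score": 0.82190216, "word": "Moon", "start": 84, "end": 88},
    {"entity_group": "ORG", "score": 0.9842771, "word": "NASA", "start": 109, "end": 113},
    {"entity_group": "MISC", "score": 0.8394754, "word": "As", "start": 114, "end": 116},
]

print(keep_confident_entities(sample))
print(keep_confident_entities(sample, allowed=("PER",)))  # people only
```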

Summarization

Perhaps one of the most commonly used tasks, summarization shortens a text while preserving its essence and most important parts. Let's summarize this Wikipedia page about Transformers.

summarizer = pipeline("summarization")
summarizer("""
In deep learning, the transformer is an artificial neural network architecture based
on the multi-head attention mechanism, in which text is converted to numerical
 representations called tokens, and each token is converted into a vector via lookup
 from a word embedding table.[1] At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished.

Transformers have the advantage of having no recurrent units, therefore requiring 
less training time than earlier recurrent neural architectures (RNNs) such as long 
short-term memory (LSTM).[2] Later variations have been widely adopted for training
 large language models (LLMs) on large (language) datasets.[3]
""")
[{'summary_text': 
' In deep learning, the transformer is an artificial neural network architecture 
based on the multi-head attention mechanism . Transformerers have the advantage of
 having no recurrent units, therefore requiring less training time than earlier 
recurrent neural architectures (RNNs) such as long short-term memory (LSTM)'}]

Wonderful!
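The length of the summary can be steered as well. The sketch below assumes the `max_length` and `min_length` generation arguments, counted in tokens, and adds a hypothetical helper `word_reduction` to check how much shorter the summary came out. Both function names are made up for this example.

```python
def summarize(text, max_length=60, min_length=20):
    """Summarize text with explicit length limits in tokens (hypothetical helper)."""
    from transformers import pipeline  # imported lazily so the helper below runs without it

    summarizer = pipeline("summarization")
    # max_length and min_length bound the length of the generated summary
    result = summarizer(text, max_length=max_length, min_length=min_length)
    return result[0]["summary_text"]


def word_reduction(original, summary):
    """Percentage of words removed by the summary (hypothetical helper)."""
    original_words = len(original.split())
    if original_words == 0:
        return 0.0
    return 100 * (1 - len(summary.split()) / original_words)
```

For instance, `word_reduction(article, summarize(article))` would report the reduction as a percentage.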

Image recognition

There are also more complex tasks, such as image recognition. It's just as easy to use as the others.

image_classifier = pipeline(
    task="image-classification", model="google/vit-base-patch16-224"
)
result = image_classifier(
    "https://images.unsplash.com/photo-1689009480504-6420452a7e8e?q=80&w=687&auto=format&fit=crop&ixlib=rb-4.1.0&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
)
print(result)

Photo by Vitalii Khodzinskyi on Unsplash

[{'label': 'Yorkshire terrier', 'score': 0.9792122840881348}, 
{'label': 'Australian terrier', 'score': 0.00648861238732934}, 
{'label': 'silky terrier, Sydney silky', 'score': 0.00571345305070281}, 
{'label': 'Norfolk terrier', 'score': 0.0013639888493344188}, 
{'label': 'Norwich terrier', 'score': 0.0010306559270247817}]
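In an application, it can be handy to accept the top prediction only when the model is reasonably sure. Here is a sketch with a hypothetical helper `confident_label` and an arbitrary threshold of 0.5, using the first two predictions from the output above as sample data:

```python
def confident_label(results, min_score=0.5):
    """Return the best label if its score clears min_score, otherwise None."""
    if not results:
        return None
    best = max(results, key=lambda r: r["score"])
    return best["label"] if best["score"] >= min_score else None


# Sample data: the first two predictions from the output shown above.
sample = [
    {"label": "Yorkshire terrier", "score": 0.9792122840881348},
    {"label": "Australian terrier", "score": 0.00648861238732934},
]

print(confident_label(sample))                  # Yorkshire terrier
print(confident_label(sample, min_score=0.99))  # None
```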

So, looking at these examples, it's easy to see how the Transformers library lets you perform a variety of tasks with very little code.

Wrapping up

What if we wrap up by applying what we've learned in a small, practical project?

Let's create a simple Streamlit app that reads a resume, returns its sentiment analysis, and classifies the tone of the text as one of: ["Senior", "Junior", "Trainee", "Blue-collar", "White-collar", "Self-employed"]

The code below does the following:

  • Import the package
  • Create a page title and subtitle
  • Add a text input area
  • Tokenize the text and split it into chunks for the transformer tasks. See the list of models [4].

import streamlit as st
from transformers import pipeline
from transformers import AutoTokenizer
from langchain.text_splitter import RecursiveCharacterTextSplitter

st.title("Resumé Sentiment Analysis")
st.caption("Checking the sentiment and language tone of your resume")

# Add input text area
text = st.text_area("Enter your resume text here")

# 1. Load your desired tokenizer
model_checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# 2. Build a text splitter that measures length in tokens of that tokenizer:
#    chunks of at most 500 tokens with a 100-token overlap so that context is not lost
text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=500, chunk_overlap=100
)

# 3. Split the raw string into chunks. split_text() takes a str and returns a list of str
#    (split_documents() expects Document objects, not a plain string)
decoded_chunks = text_splitter.split_text(text)

# The pipelines add special tokens such as [CLS] and [SEP] themselves,
# so the chunks are passed on as plain strings.
st.write(f"Created {len(decoded_chunks)} chunks.")
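The chunking idea itself can be pictured as a sliding window over the token list. Below is a minimal sketch in plain Python, assuming a hypothetical helper `sliding_chunks` and toy integers standing in for token IDs. It illustrates the overlap logic only and is not how the text splitter is implemented internally.

```python
def sliding_chunks(items, chunk_size, overlap):
    """Split a list into windows of chunk_size that share `overlap` items with the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(items), step):
        chunks.append(items[start:start + chunk_size])
        if start + chunk_size >= len(items):
            break  # the window already reached the end of the list
    return chunks


# Toy example: ten "tokens", windows of four with an overlap of two.
toy_tokens = list(range(10))
for chunk in sliding_chunks(toy_tokens, chunk_size=4, overlap=2):
    print(chunk)
```

With the values from the text, the call would be `sliding_chunks(token_ids, chunk_size=500, overlap=100)`.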

Next, we start the transformer pipelines to do the following:

  • Perform sentiment analysis and return the confidence level (%).
  • Classify the tone of the text and return the confidence level (%).

# Initialize sentiment analysis pipeline
sentiment_pipeline = pipeline("sentiment-analysis")

# Candidate labels for the tone classification
tone_labels = ["Senior", "Junior", "Trainee", "Blue-collar", "White-collar", "Self-employed"]

# Perform the analysis when the button is clicked
if st.button("Analyze"):
    if not decoded_chunks:
        # Nothing to analyze if the text area is empty
        st.warning("Please enter some text first.")
    else:
        col1, col2 = st.columns(2)

        with col1:
            # Sentiment analysis (reports the result for the first chunk only)
            sentiment = sentiment_pipeline(decoded_chunks)[0]
            st.write(f"Sentiment: {sentiment['label']}")
            st.write(f"Confidence: {100*sentiment['score']:.1f}%")

        with col2:
            # Categorize tone (again, the first chunk only)
            tone_pipeline = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
            tone = tone_pipeline(decoded_chunks, candidate_labels=tone_labels)[0]

            st.write(f"Tone: {tone['labels'][0]}")
            st.write(f"Confidence: {100*tone['scores'][0]:.1f}%")
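Note that the app reads only the first element of each result list, so a long resume is judged by its first chunk. One possible way to combine all chunks is sketched below. The helper `aggregate_sentiment` is hypothetical, and summing the scores per label is just one simple choice among many.

```python
from collections import defaultdict


def aggregate_sentiment(predictions):
    """Pick the label with the largest summed score and return it with its share per chunk."""
    if not predictions:
        raise ValueError("no predictions to aggregate")
    totals = defaultdict(float)
    for pred in predictions:
        totals[pred["label"]] += pred["score"]
    label = max(totals, key=totals.get)
    # Dividing by the number of chunks penalizes labels that only some chunks agree on.
    return label, totals[label] / len(predictions)


# Made-up per-chunk results, in the format returned by the sentiment pipeline.
chunk_predictions = [
    {"label": "POSITIVE", "score": 0.9},
    {"label": "NEGATIVE", "score": 0.6},
    {"label": "POSITIVE", "score": 0.8},
]
label, share = aggregate_sentiment(chunk_predictions)
print(f"{label}: {100 * share:.1f}%")
```

In the app, this would be called as `aggregate_sentiment(sentiment_pipeline(decoded_chunks))` instead of indexing with `[0]`.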

Here's a screenshot.

Analysis of emotions and language tone. Image by author.

Before you go

Hugging Face (HF) Transformers pipelines are a true game-changer for data practitioners. They provide an incredibly streamlined way to tackle complex machine learning tasks like text generation and image segmentation using just a few lines of code.

HF already does the heavy lifting by wrapping sophisticated model logic into simple and intuitive methods.

This shifts your attention away from low-level coding and toward what really matters: using your creativity to build impactful, real-world applications.

If you liked this content, find out more about me on my website.

https://gustavorsantos.me

GitHub repository

https://github.com/gurezende/Resume-sentiment-analysis

References

[1. Transformers package] https://huggingface.co/docs/transformers/index

[2. Transformers Pipelines] https://huggingface.co/docs/transformers/pipeline_tutorial

[3. Pipelines Examples] https://huggingface.co/learn/llm-course/chapter1/3#summarization

[4. HF Models] https://huggingface.co/models


