The way we extract information from text data has changed dramatically over the past decade. Natural language processing has replaced text mining as the name of the field, and the methodology has changed tremendously. One of the main drivers of this change is the emergence of language models as the basis for many applications that aim to extract valuable insights from raw text.
Language model definition
A language model uses machine learning to learn a probability distribution over words, which it uses to predict the most likely next word in a sentence based on the words that came before. Language models learn from text and can be applied to tasks such as generating original text, predicting the next word in a text, speech recognition, optical character recognition, and handwriting recognition.
While learning about natural language processing, I have been fascinated by the evolution of language models over the past few years. You may have heard about GPT-3 and the potential threats it poses, but how did we get here? How can a machine write articles that mimic a journalist's?
What is a language model?
A language model is a probability distribution over words or word sequences. In practice, it gives the probability that a particular word sequence is "valid". Validity in this context does not refer to grammatical validity; instead, it means the sequence resembles the way people write, which is what the language model learns. This is an important point: there is no magic in a language model. Like other machine learning models, especially deep neural networks, it is simply a tool for condensing rich information in a concise, reusable way that generalizes beyond the training sample.
What language models can do
The abstract understanding of natural language required to infer word probabilities from context can be used for many tasks. Lemmatization and stemming aim to reduce words to their most basic form, dramatically reducing the number of distinct tokens. These algorithms work better when the part-of-speech role of a word is known, which is the rationale for part-of-speech tagging (POS tagging), a common task for language models.
With a good language model, you can perform extractive or abstractive summarization of a text. With models in different languages, you can build a machine translation system. Less trivial use cases include question answering (with or without context; see the example at the end of the article). Language models can also be used for speech recognition, OCR, handwriting recognition, and more. There are many opportunities.
Types of language models
There are two types of language models:
- Probabilistic methods
- State-of-the-art language models based on neural networks
It’s important to note the difference between them.
Probabilistic language models
A simple probabilistic language model is constructed by computing n-gram probabilities. An n-gram is a sequence of n words, where n is an integer greater than zero. The probability of an n-gram is the conditional probability that its last word follows the preceding n-1 gram; in other words, it is the proportion of occurrences of that n-1 gram that are followed by the last word. This rests on the Markov assumption: given the n-1 gram (the present), the n-gram probability (the future) does not depend on the words that came earlier (the past).
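As a toy illustration, a bigram model (n = 2) can be estimated from counts alone. The miniature corpus and the function name below are invented for this sketch:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate the fish".split()

# Count each bigram, and count each word's occurrences as a "previous word".
bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus[:-1])

def bigram_prob(prev_word, word):
    """P(word | prev_word) = count(prev_word, word) / count(prev_word)."""
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("the", "cat"))  # → 0.5 ("the" is followed by "cat" 2 times out of 4)
```

Conditioning on more than one previous word works the same way, just with longer count keys.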
This approach has obvious drawbacks. Most importantly, only the preceding n words affect the probability distribution of the next word. Complex texts have deep context that can decisively influence the choice of the next word, so even if n is 20 or 50, the next word may not be predictable from the previous n words alone. For example, the word "United" makes "States" far more likely somewhere downstream, but such dependencies can span distances far greater than n. Let's call this the context problem.
Moreover, this approach clearly does not scale. As n increases, the number of possible word combinations explodes, even though most of them never occur in any text, and all occurrence probabilities (or all n-gram counts) must be calculated and stored. n-grams that never occur in the training text cause a sparsity problem: the granularity of the probability distribution can be very low, since word probabilities take few distinct values and most words end up with the same probability.
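A quick way to see the sparsity problem is to compare the trigrams that actually occur in a small text against the number of trigrams that are possible over its vocabulary (the corpus is a toy example invented for this sketch):

```python
tokens = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(tokens))
n = 3

# Trigrams that actually occur in the text...
observed = set(zip(tokens, tokens[1:], tokens[2:]))
# ...versus every trigram that is possible over this vocabulary.
possible = len(vocab) ** n

print(len(observed), "of", possible, "possible trigrams occur")  # 9 of 343
```

With a realistic vocabulary of tens of thousands of words, the gap becomes astronomically larger: almost every possible n-gram has a count of zero.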
Neural-network-based language models
Neural-network-based language models mitigate the sparsity problem through the way the input is encoded. A word embedding layer creates a vector of arbitrary size for each word that also incorporates semantic relationships. These continuous vectors give the next-word probability distribution the desired granularity. Furthermore, since a language model is a function, and all neural networks are essentially functions built from matrix computations, we don't need to store all the n-gram counts to produce the probability distribution for the next word.
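The pipeline described above — embed a word, apply the network, output a probability for every vocabulary word — can be sketched with untrained random weights. The tiny vocabulary, the embedding size, and the single-word context are arbitrary choices for illustration, not any particular model:

```python
import math
import random

random.seed(0)
vocab = ["the", "cat", "sat", "on", "mat"]
dim = 8  # embedding size, chosen arbitrarily

# Untrained random parameters: one embedding vector per word, plus an
# output matrix mapping an embedding to one logit per vocabulary word.
embeddings = {w: [random.gauss(0, 1) for _ in range(dim)] for w in vocab}
output = [[random.gauss(0, 1) for _ in range(len(vocab))] for _ in range(dim)]

def next_word_distribution(word):
    """Embed the word, project to logits, softmax to probabilities."""
    v = embeddings[word]
    logits = [sum(v[i] * output[i][j] for i in range(dim))
              for j in range(len(vocab))]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # numerically stable softmax
    total = sum(exps)
    return {w: e / total for w, e in zip(vocab, exps)}

dist = next_word_distribution("cat")
```

Every word gets a nonzero probability, and the values vary continuously with the weights — exactly the granularity the n-gram counts lacked.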
Evolution of language models
Neural networks solved the sparsity problem, but the context problem remained. Language models were first developed to handle context ever more efficiently, letting more and more of the surrounding words influence the probability distribution. A secondary goal was an architecture that allows the model to learn which context words are more important than others.
The first model, outlined earlier, is essentially a dense (hidden) layer and an output layer stacked on top of a continuous bag-of-words (CBOW) Word2Vec model. A CBOW Word2Vec model is trained to guess a word from its context; the Skip-Gram Word2Vec model does the opposite, inferring the context from a word. In practice, training a CBOW Word2Vec model requires many examples of the following structure: the input is the n words before and/or after the target word, and the output is the target word itself. As it turns out, the context problem still remained.
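Building CBOW-style training examples from raw text can be sketched as follows; the helper name and window size are illustrative choices, not part of any library:

```python
def cbow_pairs(tokens, window=2):
    """Yield (context_words, target_word) training examples: the context is
    up to `window` words on each side of the target."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

tokens = "the cat sat on the mat".split()
examples = cbow_pairs(tokens, window=2)
# e.g. the example for "sat" is (['the', 'cat', 'on', 'the'], 'sat')
```

A Skip-Gram dataset would simply swap the roles: the target word becomes the input and each context word becomes an output.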
Recurrent neural networks (RNNs)
Recurrent neural networks (RNNs) were an improvement on this point. Because RNNs can be built from long short-term memory (LSTM) or gated recurrent unit (GRU) cells, they can take all previous words into account when choosing the next word. AllenNLP's ELMo takes this concept a step further, using a bidirectional LSTM that considers the context both before and after the word.
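The recurrence that lets an RNN carry information from all previous words can be sketched with a single vanilla cell (untrained random weights; LSTM and GRU cells add gating on top of this basic idea, and all dimensions here are arbitrary):

```python
import math
import random

random.seed(1)
dim_in, dim_h = 3, 4

# Random (untrained) parameters for one vanilla RNN cell.
W = [[random.uniform(-1, 1) for _ in range(dim_in)] for _ in range(dim_h)]
U = [[random.uniform(-1, 1) for _ in range(dim_h)] for _ in range(dim_h)]

def rnn_step(x, h):
    """One recurrence: the new hidden state mixes the current input with the
    previous state, so information from all earlier words can persist."""
    return [math.tanh(sum(W[i][j] * x[j] for j in range(dim_in)) +
                      sum(U[i][j] * h[j] for j in range(dim_h)))
            for i in range(dim_h)]

h = [0.0] * dim_h
for x in [[1, 0, 0], [0, 1, 0], [0, 0, 1]]:  # a toy sequence of word vectors
    h = rnn_step(x, h)  # the final h summarizes the whole sequence
```

Note that each step depends on the previous one — the sequential bottleneck discussed next.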
Transformers
A major drawback of RNN-based architectures stems from their sequential nature: training times for long sequences skyrocket because there is little potential for parallelization. The Transformer architecture solves this problem.
OpenAI's GPT models and Google's BERT make use of the Transformer architecture. These models also employ a mechanism called attention, which allows them to learn which inputs deserve more focus than others in a given context.
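A minimal pure-Python sketch of the scaled dot-product attention used in Transformers follows; the toy query, key, and value vectors are chosen only for illustration:

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.
    Each query attends to every key; the softmaxed scores weight the values."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]  # how much focus each position gets
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# One query attending over three key/value positions:
result = attention([[1.0, 0.0]],
                   [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                   [[1.0], [2.0], [3.0]])
```

Because every query attends to every position independently, all positions can be processed in parallel — the property RNNs lack.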
In terms of model architecture, the main quantum leaps were, first, RNNs (specifically LSTM and GRU cells), which addressed the sparsity problem and cut the storage a language model requires; and second, the Transformer architecture, which enables parallelization and introduced the attention mechanism. But architecture is not the only area in which language models have advanced.
Compared to the GPT-1 architecture, GPT-3 has few novelties, but it is huge: it has 175 billion parameters and was trained on the largest corpus any model had been trained on at the time, Common Crawl. This is possible partly because of the semi-supervised training strategy of language models: a text with some words omitted can serve as a training example. GPT-3's incredible power stems from the fact that it has read more or less every text that has appeared on the internet over the past few years, and it reflects most of the complexity natural language contains.
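The semi-supervised idea — any text becomes a training example once some words are hidden — can be sketched like this; the `[MASK]` placeholder, the helper name, and the masking rate are illustrative choices:

```python
import random

def masked_examples(text, mask_rate=0.3, seed=0):
    """Turn raw text into a (masked_tokens, targets) self-supervised example
    by hiding a fraction of the words — no human labels needed."""
    rng = random.Random(seed)
    tokens = text.split()
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets[i] = tok  # the model must recover this word
        else:
            masked.append(tok)
    return masked, targets

masked, targets = masked_examples("language models learn from raw text alone")
```

Since the labels come from the text itself, the only limit on training data is how much text you can collect.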
Trained for multiple purposes
Finally, I would like to review Google's T5 model. Previously, language models were used for standard NLP tasks, such as part-of-speech (POS) tagging or machine translation, after minor modifications. With a little retraining, BERT can become a POS tagger, thanks to its abstract ability to understand the underlying structure of natural language.
T5 does not require any modification for specific NLP tasks: a single model can perform several of them, inferring what to do from the input itself.
The future of language models
Personally, I think this is the field closest to creating real AI. There is a lot of talk about AI, and many simple decision-making systems, and nearly all neural networks, are called AI, but that is mostly marketing. By definition, artificial intelligence involves human-like intelligence performed by a machine. While transfer learning shines in computer vision, and the concept is essential to AI systems, the very fact that the same model can perform a wide range of NLP tasks and infer what to do from its input alone is spectacular. It brings us one step closer to actually building a human-like intelligent system.
