Machine learning (ML), a subfield of artificial intelligence, teaches computers to solve tasks involving structured data, language, speech, or images by providing examples of inputs and desired outputs. This differs from traditional computer programming, where programmers write a series of explicit instructions. Instead, an ML model learns by tweaking many knobs (often millions) until it produces the desired output.
ML has a history of developing methods with hand-crafted features that work only for specific, narrow problems. Consider a few examples. In text, classifying documents as scientific or literary can sometimes be solved by counting how often particular words occur. In audio, spoken words can be recognised by converting speech into a time-frequency representation. In images, a car can be found by checking for the presence of specific car-like edge patterns.
Such hand-crafted features are combined with simple or shallow classifiers, typically with up to tens of thousands of knobs. In technical terms, these knobs are called parameters.
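The word-count idea above can be sketched in a few lines. This is an illustrative toy, not any real system: the keyword list and threshold are invented for the example.

```python
# A hand-crafted "word count" feature for classifying documents as
# scientific or literary. Keyword list and threshold are made up.
from collections import Counter

SCIENCE_WORDS = {"experiment", "data", "hypothesis", "measurement"}

def science_score(document: str) -> int:
    """Count occurrences of science-flavoured keywords."""
    words = Counter(document.lower().split())
    return sum(words[w] for w in SCIENCE_WORDS)

def classify(document: str, threshold: int = 2) -> str:
    """A 'shallow classifier': compare the feature against a threshold."""
    return "scientific" if science_score(document) >= threshold else "literary"

print(classify("The experiment produced data supporting the hypothesis"))
print(classify("She wandered through the quiet garden at dusk"))
```

The entire "model" here has a handful of knobs (the keyword list and the threshold); a deep network replaces this whole pipeline with millions of learned knobs.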
Deep neural networks
In the early 2010s, deep neural networks (DNNs) took ML by storm, replacing the classic pipeline of hand-crafted features and simple classifiers. A DNN takes a complete document or image and produces a final output without specifying a particular method for extracting features.
Such deep and large models existed in the past, but their large size, reaching millions of parameters, hindered their use. The resurgence of DNNs in the 2010s was due to the availability of large-scale data and fast parallel computing chips called graphics processing units (GPUs).
Additionally, the models used for text and images were still different. Recurrent neural networks were popular for language understanding, and convolutional neural networks (CNNs) were popular for computer vision, or machine understanding of the visual world.
‘Attention is all you need’
In the pioneering paper “Attention is all you need”, published in 2017, a team at Google proposed the transformer, a DNN architecture that has since gained popularity across modalities such as images, speech, and language. The original paper proposed the transformer for the task of translating sentences from one language to another, similar to what Google Translate does when converting English to Hindi.
A transformer is a neural network with two parts. The first is an “encoder” that takes the input sentence in the source language (e.g. English). The second is a “decoder” that produces the translated sentence in the target language (e.g. Hindi).
The encoder converts each word in the source sentence into an abstract numeric form that captures the word’s meaning within the context of the sentence, and stores it in a memory bank. Just as a human writes or speaks, the decoder generates one word at a time, looking at what has been generated so far and consulting the memory bank to find the appropriate next word. Both of these processes use a mechanism called ‘attention’, hence the paper’s title.
An important improvement over previous methods is the transformer’s ability to translate long sentences and paragraphs correctly.
Since then, the adoption of transformers has exploded. For example, the capital ‘T’ in ChatGPT stands for ‘transformer’.
Transformers are also popular in computer vision. A transformer simply cuts the image into small square patches and lines them up like words in a sentence. Trained on a large amount of data, it can then outperform CNNs. Transformer models are currently the best approach for image classification, object detection and segmentation, action recognition, and many other tasks.
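The patch-cutting step can be shown concretely. This is a minimal NumPy sketch of how a vision transformer turns an image into a sequence of patch "words"; the image and patch sizes are illustrative.

```python
# Cut an image into square patches, as a vision transformer does before
# treating the patches like words in a sentence.
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Turn an (H, W, C) image into a sequence of flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    image = image.reshape(h // patch, patch, w // patch, patch, c)
    image = image.transpose(0, 2, 1, 3, 4)        # (h/p, w/p, p, p, c)
    return image.reshape(-1, patch * patch * c)   # one row per patch

img = np.zeros((224, 224, 3))       # a typical input resolution
tokens = patchify(img, 16)
print(tokens.shape)  # (196, 768): 14x14 patches, each a 768-number "word"
```

After this step, the 196 patch vectors are processed exactly like a 196-word sentence.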
The transformer’s ability to handle everything is being harnessed to create joint vision-and-language models that allow users to search for images (as in Google Images), describe them, and even answer questions about them.
What is “attention”?
Attention in ML allows the model to learn how much importance should be given to different inputs. In the translation example, attention allows the model to select or weight words from its memory bank when deciding which word to generate next. When describing an image, attention allows the model to look at the relevant parts of the image when generating the next word.
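The weighting described above has a compact mathematical form, scaled dot-product attention from the 2017 paper: each query scores every entry in the memory bank and takes a weighted average of their values. A minimal NumPy sketch, with illustrative sizes:

```python
# Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
import numpy as np

def attention(Q, K, V):
    """Weight memory-bank values V by how well keys K match query Q."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # relevance of each entry
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax: sums to 1
    return weights @ V                                # weighted average

rng = np.random.default_rng(0)
Q = rng.normal(size=(1, 8))   # one query: the word being generated
K = rng.normal(size=(5, 8))   # five memory-bank entries (keys)
V = rng.normal(size=(5, 8))   # their stored values
out = attention(Q, K, V)
print(out.shape)  # (1, 8)
```

The softmax weights are the "importance" the model assigns to each input; entries with higher scores contribute more to the output.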
An attractive aspect of attention-based models is their ability to discover such associations on their own by parsing large amounts of data. For translation, the model is never told that the word “dog” in English means “कुत्ता” in Hindi. Instead, it finds this association by seeing many training sentence pairs in which “dog” and “कुत्ता” appear together.
A similar observation applies to image captions. For an image of a “bird flying over water”, the model is never told which areas of the image correspond to the “bird” and which to the “water”. Instead, by training on many image-caption pairs containing the word “bird”, it discovers common patterns in images, associating flying objects with “birds”.
Transformers are attention models on steroids. They stack several layers of attention within the encoder to build meaningful context across the input sentence or image, and use attention from the decoder to the encoder when generating a translation or describing an image.
Billion- and trillion-parameter scale
In the last year, transformer models have scaled up and are being trained on more data than ever before. When these colossi are trained on written text, they are called large language models (LLMs). The model behind ChatGPT uses hundreds of billions of parameters, while GPT-4 is reported to use on the order of a trillion.
These models are trained on simple tasks such as filling in blanks and predicting the next word, yet they can answer questions, compose stories, summarise documents, write code, and even solve mathematical word problems step by step. Transformers are also the driving force behind models that generate realistic images and sounds. Their applicability across so many areas makes them powerful, general-purpose models.
However, there are concerns. The scientific community has yet to find a way to rigorously evaluate these models. There are also instances of ‘hallucination’, where models make confident but false claims. We urgently need to address societal concerns arising from how training data is used, such as data privacy and attribution to creative works.
At the same time, given the impressive progress and ongoing efforts to create guardrails that guide their use and leverage these models for positive outcomes (in health, education, agriculture, etc.), optimism is not misplaced.
Dr. Makarand Tapaswi is Senior Machine Learning Scientist at Wadhwani AI, a non-profit organization working on the use of AI for social welfare, and Assistant Professor in the Computer Vision Group at IIIT Hyderabad, India.