- Google’s PaLM 2 large language model uses nearly five times as much text data for training as its predecessor LLM, CNBC has learned.
- In announcing PaLM 2 last week, Google said the model is smaller than its predecessor but uses a more efficient “technique.”
- The lack of transparency around training data for artificial intelligence models has become a growing topic of discussion among researchers.
Sundar Pichai, CEO of Alphabet Inc., at the Google I/O developer conference in Mountain View, Calif., Wednesday, May 10, 2023.
David Paul Morris | Bloomberg | Getty Images
Google’s new large language model, which the company announced last week, uses almost five times as much training data as its 2022 predecessor, allowing it to perform more advanced coding, math and creative writing tasks, CNBC has learned.
According to internal documents seen by CNBC, PaLM 2, the company’s new general-purpose large language model (LLM) announced at Google I/O, has been trained on 3.6 trillion tokens. Tokens, which are strings of words, are a key building block for training LLMs, as they teach the model to predict the next word that will appear in a sequence.
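In practice, models like PaLM use subword tokenizers rather than whole words, and the details of Google’s pipeline are not public; the following is only a rough Python sketch of the idea, with a simple whitespace split standing in for a real tokenizer.

```python
# Rough sketch of tokens and next-token prediction (illustrative only;
# real LLMs use subword tokenizers, not whitespace splitting, and
# Google's actual training pipeline is not public).
text = "the model learns to predict the next word in a sequence"
tokens = text.split()  # stand-in for a real tokenizer

# Each training example pairs a context with the token that follows it.
examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, next_token in examples[:3]:
    print(context, "->", next_token)
```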
Google’s previous version of PaLM (short for Pathways Language Model) was released in 2022 and trained on 780 billion tokens.
Google has been eager to showcase the capabilities of its artificial intelligence technology and how it can be embedded into search, email, word processors and spreadsheets, but the company has been unwilling to disclose the size or other details of its training data. OpenAI, the creator of the Microsoft-backed ChatGPT, has also kept details of its latest LLM, called GPT-4, secret.
Both companies have said the lack of disclosure is due to the competitive nature of their businesses. Google and OpenAI are racing to win over users who want to find information using conversational chatbots instead of traditional search engines.
But as the AI arms race heats up, the research community is calling for greater transparency.
Since announcing PaLM 2, Google has said the new model is smaller than its predecessor LLM, which is significant because it means the company’s technology is becoming more efficient while performing more advanced tasks. According to internal documents, PaLM 2 was trained on 340 billion parameters, a measure of the model’s complexity. The original PaLM was trained on 540 billion parameters.
Google did not immediately comment on the matter.
In a blog post about PaLM 2, Google said the model uses a “new technique” called “compute-optimal scaling.” This makes the LLM “more efficient with faster inference, fewer parameters to serve, lower serving costs, and better overall performance.”
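Google has not disclosed the methodology behind that scaling approach, but the figures reported in this article give a rough sense of the shift toward more training data per parameter. The back-of-the-envelope calculation below is purely illustrative.

```python
# Back-of-the-envelope ratios from the figures reported in this article.
# (Illustrative only; Google has not disclosed its scaling methodology.)
palm_tokens, palm_params = 780e9, 540e9      # PaLM (2022)
palm2_tokens, palm2_params = 3.6e12, 340e9   # PaLM 2, per internal documents

print(f"PaLM:   {palm_tokens / palm_params:.1f} training tokens per parameter")
print(f"PaLM 2: {palm2_tokens / palm2_params:.1f} training tokens per parameter")
# PaLM:   1.4 training tokens per parameter
# PaLM 2: 10.6 training tokens per parameter
```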
In announcing PaLM 2, Google confirmed CNBC’s earlier reporting that the model is trained on 100 languages and performs a broad range of tasks. It is already being used to power 25 features and products, including the company’s experimental chatbot Bard. It comes in four sizes, from smallest to largest: Gecko, Otter, Bison and Unicorn.
Based on public disclosures, PaLM 2 is more powerful than any existing model. The LLM Facebook announced in February, called LLaMA, was trained on 1.4 trillion tokens. The last time OpenAI shared a training size was for GPT-3, which the company said at the time was trained on 300 billion tokens. OpenAI released GPT-4 in March and said it showed “human-level performance” on many professional tests.
LaMDA, the conversational LLM Google introduced two years ago and promoted alongside Bard in February, was trained on 1.5 trillion tokens, according to the latest documents seen by CNBC.
As new AI applications quickly become mainstream, the controversy surrounding the underlying technology is intensifying.
El Mahdi El Mhamdi, a senior Google Research scientist, resigned in February, citing the company’s lack of transparency. On Tuesday, OpenAI CEO Sam Altman testified before the Senate Judiciary subcommittee on privacy and technology, agreeing with lawmakers that a new system is needed to deal with AI.
“Very new technologies require new frameworks,” Altman said. “Certainly, companies like ours bear a great deal of responsibility for the tools we put out there.”
— CNBC’s Jordan Novet contributed to this report.
WATCH: OpenAI CEO Sam Altman calls for oversight of AI