Stack Overflow charges AI giants for training data

Hundreds of millions of dollars are spent developing the AI systems behind tools like ChatGPT and the image generator Dall-E.

OpenAI, Google, and other companies building large-scale AI projects have traditionally obtained much of their training data for free by scraping it from the web. But Stack Overflow, a popular internet forum for computer programming help, will start charging large AI developers later this year for access to the 50 million questions and answers on its service, CEO Prashanth Chandrasekar says. The site has over 20 million registered users.

Stack Overflow’s decision to seek compensation from companies that use its data, part of a broader generative AI strategy, has not been previously reported. It follows Reddit’s announcement earlier this week that it will begin charging some AI developers to access its content starting in June.

The two community sites are not the only ones that want a share. The News/Media Alliance, a US trade association of publishers including Condé Nast, which owns WIRED, today issued principles calling on generative AI developers to negotiate any use of publishers’ data for training and other purposes and to respect their right to fair compensation.

Meta, Google, and ChatGPT maker OpenAI have all developed their AI systems using data sets compiled from thousands of online sources, including Stack Overflow and Reddit. Feeding text ranging from online jokes to expert discussions about programming into machine learning algorithms known as large language models (LLMs) helps AI text generators and chatbots become more fluent, knowledgeable, and helpful. Using LLMs to generate programming code is seen as one of the technology’s greatest opportunities, with Microsoft even charging $19 per person per month for its code generator GitHub Copilot.

“Community platforms that fuel LLMs absolutely should be compensated for their contributions so that companies like ours can reinvest in our communities and keep them thriving,” Stack Overflow’s Chandrasekar says. “We are very supportive of Reddit’s approach.”

Chandrasekar describes the potential additional revenue as essential for Stack Overflow to keep attracting users and maintaining high-quality information. He argues that this will also help future chatbots, which need new knowledge to be generated. But fencing off valuable data could also hinder some AI training and slow the improvement of LLMs, which threaten any service that people rely on for information and conversation. Chandrasekar says proper licensing will only help accelerate the development of high-quality LLMs.

AI developers are all trying to contain the enormous cost of developing large AI systems, which require vast amounts of expensive computing power. Having to pay for data they once got for free could extend an already uncertain timeline to profiting from the emerging technology. Meta and Google did not immediately comment.

A large language model can generate strings of text based on word patterns learned from the web pages, books, and other bodies of text in its training data. Besides ChatGPT, these programs form the backbone of search chatbots such as Microsoft’s Bing Chat and Google’s Bard, as well as a growing number of applications that produce professional and creative copy in seconds. Comparable AI systems that compose illustrations and generate video draw patterns from image data sets, such as photos collected from Pinterest and Flickr.

Data sets used in AI development are often constructed through informal means, such as dispatching software to scrape content from websites. In the United States, this practice is generally considered legal, but it remains contested on copyright grounds and under websites’ terms of use.

Some websites, including Reddit and Stack Overflow, make this easier by providing downloadable “data dumps” or real-time data portals known as APIs that give software access to their content. In Stack Overflow’s case, LLM developers use a combination of dumps, APIs, and scraping to get their hands on data, Chandrasekar said. All of these are currently free.

But Chandrasekar says the LLM developers have violated Stack Overflow’s terms of service. As outlined in the TOS, users own the content they post to Stack Overflow, but it all falls under a Creative Commons license that requires anyone who later uses the data to mention where it came from. When an AI company sells its model to customers, it cannot credit each community member whose questions and answers were used to train the model, which violates the Creative Commons license, Chandrasekar says.

Neither Stack Overflow nor Reddit has published pricing information. Reddit spokesperson Tim Rathschmidt said: […] Stack Overflow plans to study Reddit’s strategy and consult with its own potential customers, some of whom have already been in touch about data access, Chandrasekar said.

A potential guide for pricing could come from Elon Musk, who this month raised the price of access to Twitter data. It now starts at $42,000 a month for access to 50 million tweets; nearly three times that volume of tweets was previously offered for free. In a tweet this week, Musk accused Microsoft, a leading AI developer and a close partner of OpenAI, of “illegally using Twitter data” to train its algorithms. Without elaborating, he added, “Lawsuit time.”

Both Stack Overflow and Reddit will continue to license their data for free to select individuals and businesses. Chandrasekar said Stack Overflow is seeking compensation only from companies developing LLMs for large-scale commercial purposes. “When people start charging for products built on sites built by the community, like ours, it’s not fair use,” he says.

Reddit CEO Steve Huffman told The New York Times this week that he did not want to give freebies to the world’s largest companies.

As expectations grow that ChatGPT-style bots and other products built on LLMs will make huge profits, other companies holding the content needed to train machine learning algorithms also want to be paid. Some news outlets have expressed caution about how Microsoft’s new Bing chatbot handles their content.

So far, however, very few deals for access to training data have been publicly announced. One example is stock photo provider Shutterstock, which has agreed to license content to OpenAI. Its rival Getty Images is suing OpenAI competitor Stability AI for not seeking a license before using more than 12 million photos. The AI startup’s response is due next week in US federal court.

Not everyone is pressing AI developers to pay yet. Some companies with large volumes of academic texts and casual conversations have said they have no plans to start charging for APIs or similar data portals. PLOS, a publisher of scientific research whose content is being used to train AI, is “unlikely” to change its fairly open-ended terms of use, spokesman David Knutson said. Discord, an online community platform, has no plans to change its API service, which is offered free of charge under conditions that ban AI training, spokeswoman Swaleha Carlson said.

At Stack Overflow, charging for API access is just one part of a broader AI strategy the company plans to announce in the coming months. About 10% of Stack Overflow’s nearly 600 staff are focused on this initiative, which includes developing its own generative AI services. One example is an assistant feature that helps guide users in composing questions to post.

Until now, the Stack Overflow community’s main response to generative AI has been to ban users from posting AI-generated answers. Chandrasekar said the spike in incorrect answers after the launch of ChatGPT challenged the company’s hundreds of moderators.

Founded in 2008, Stack Overflow generates nearly half of its revenue from selling ads and from licensing its Q&A software on a subscription basis to over 1,200 organizations for internal use. The company’s revenue increased 33% to $45 million in the six months ended Sept. 30, 2022. During that time, an average of about 200,000 new users registered each month.

Those users could reasonably claim compensation for themselves if Stack Overflow succeeds in licensing to AI makers the questions and answers they wrote for free. “We are absolutely thinking about the best way to take care of our community members and the people who make the site what it is today in the context of what is happening here,” Chandrasekar says.
