Reddit wants to get paid for helping to teach large AI systems

Reddit has long been a hotspot of conversation on the internet. About 57 million people visit the site every day to chat about a variety of topics, including makeup, video games, and car wash directions.

In recent years, Reddit’s conversations have also served as free training material for companies such as Google, OpenAI, and Microsoft. These companies use Reddit conversations to develop massive artificial intelligence systems that many in Silicon Valley believe are on their way to becoming the tech industry’s next big thing.

Now Reddit wants to be paid. The company announced on Tuesday that it will start charging businesses for access to its application programming interface (API), the mechanism through which outside entities can download and process the social network’s vast collection of person-to-person conversations.

Reddit founder and CEO Steve Huffman said in an interview: “But you don’t have to give all of that value to the world’s biggest companies for free.”

The move is one of the first significant examples of a social network charging for access to the conversations it hosts for the purpose of developing AI systems like ChatGPT, OpenAI’s popular program. These new AI systems could one day become big businesses, but they are unlikely to do much for companies like Reddit. In fact, they could be used to create competitors: automated duplicates of Reddit conversations.

Reddit is also gearing up for an initial public offering on Wall Street this year. Founded in 2005, the company makes most of its revenue through advertising and e-commerce transactions on its platform. Reddit said it is still finalizing pricing details for API access and will announce pricing in the coming weeks.

Reddit’s conversational forums have become a valuable commodity as large language models (LLMs) have become an integral part of creating new AI technologies.

LLMs are essentially sophisticated algorithms developed by companies such as Google and OpenAI, a close partner of Microsoft. To these algorithms, conversations on Reddit are data, part of the vast pool of material fed into LLMs to develop them.

The underlying algorithms that helped build Bard, Google’s conversational AI service, were partially trained on Reddit data. OpenAI’s ChatGPT cites Reddit data as one of the sources it was trained on.

Other companies are starting to see value in the conversations and images they host. Shutterstock, the image hosting service, sold image data to OpenAI to help create DALL-E, an AI program that generates vivid graphical images from nothing more than a text prompt.

Last month, Twitter owner Elon Musk said he was cracking down on the use of Twitter’s API, which thousands of companies and independent developers use to track the millions of conversations on the network. He did not cite LLMs as a reason for the change, but the new fees could run into the tens or even hundreds of thousands of dollars.

To keep improving their models, artificial intelligence makers need two key ingredients: a huge amount of computing power and a huge amount of data. Some large AI developers have plenty of computing power but must look outside their own networks for the data they need to improve their algorithms. Those sources include Wikipedia, millions of digitized books, scholarly articles, and Reddit.

Representatives of Google, OpenAI, and Microsoft have not yet responded to requests for comment.

Reddit has long had a symbiotic relationship with search engines from companies like Google and Microsoft. Search engines “crawl” Reddit’s web pages to index information and make it available in search results. This crawling, or “scraping,” is not welcomed by every site on the internet, but Reddit benefits from appearing higher in search results.

For LLMs, the dynamic is different. They take in as much data as they can to create new AI systems such as chatbots.

Reddit considers its data particularly valuable because it is continuously updated. According to Huffman, that freshness and relevance are what large language models need to produce the best results.

“More than anywhere else on the internet, Reddit is home for real conversation,” Huffman said. “There are a lot of things on this site that you would only say in therapy or AA, or not at all.”

Huffman said Reddit’s API will remain free for developers who want to build applications that help people use Reddit. For example, they can use it to build bots that automatically check whether users’ comments follow posting rules. Researchers who want to study Reddit data for academic or non-commercial purposes will continue to have free access.

Reddit also wants to incorporate more so-called machine learning into how the site itself operates. It could be used, for example, to identify AI-generated text on Reddit and add a label notifying users that a comment came from a bot.

The company has also promised to improve the software tools available to its moderators, the users who volunteer their time to keep the site’s forums running smoothly and to improve conversations between users. It will also continue to support third-party bots that help moderators monitor forums.

But for AI makers, it’s time to pay.

“Crawling Reddit to create value and not return that value to users is a problem for us,” Huffman said. “It’s a good time to tighten things up.”

“I think it’s fair,” he added.
