Reddit blocks AI bots from crawling its website

AI News


In the coming weeks, Reddit will begin blocking most automated bots from accessing public data, and anyone using Reddit content to train models or for other commercial purposes will have to enter into a licensing agreement, as Google and OpenAI have done.

While this was technically already Reddit's policy, the company is now enforcing it by updating its robots.txt file, a core part of the web that specifies how web crawlers can access a site. “This is a signal to people who don't have a contract with us that they shouldn't have access to Reddit data,” said Ben Lee, the company's chief legal officer. “This is also a signal to bad actors that the word 'allow' in robots.txt does not, and never has, meant they can do whatever they want with the data,” he told me.

My colleague David Pearce recently called robots.txt “the text file that runs the Internet.” Since it was conceptualized in the early days of the web, this file has primarily controlled whether search engines like Google can crawl websites and index their results. For the past 20 years or so, the give-and-take of letting Google crawl in exchange for the traffic it sent back made sense for everyone involved. Then AI companies started taking all the data they could find online and training their models on it.
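Mechanically, robots.txt is just a plain-text list of per-crawler allow/disallow rules that well-behaved bots are expected to check before fetching pages. A minimal sketch of how a compliant crawler evaluates those rules, using Python's standard `urllib.robotparser` against a hypothetical rule set (not Reddit's actual file):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only: a blanket
# "Disallow: /" under "User-agent: *" refuses every path to any crawler
# that has no more specific rule group of its own.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A generic, unnamed bot matches the "*" group and is refused everywhere.
allowed = parser.can_fetch("SomeAIBot", "https://www.reddit.com/r/all/")
print(allowed)  # False
```

As Lee notes, though, this is purely advisory: the file signals intent, and nothing in the protocol technically stops a crawler that chooses to ignore it.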



