Microsoft removes tutorial on training AI with pirated copies of Harry Potter



TL;DR

  • Blog deleted: Microsoft has removed a developer tutorial after a viral Hacker News thread revealed that it instructed users to train an AI on pirated Harry Potter books.
  • 15 months live: The post was published by Pooja Kamith, a Microsoft Senior Product Manager, in November 2024 and remained live for 15 months without any internal copyright review.
  • 10,000 downloads: A linked Kaggle dataset containing all seven Harry Potter books accumulated over 10,000 downloads while the tutorial was live.
  • Legal exposure: A law professor points to Microsoft’s potential contributory copyright liability, noting that statutory damages for willful infringement can reach $150,000 per work.
  • Corporate contradiction: In the same month the post was published, Microsoft signed a book licensing deal with HarperCollins, exposing inconsistencies in the company’s copyright practices.


Microsoft on Thursday removed a developer blog post that instructed users to train an AI on pirated Harry Potter books, less than 24 hours after a Hacker News thread pointing out the copyright issues went viral. The tutorial opened with Harry imagining himself on the Hogwarts Express, where a friend pitches him on the capabilities of Azure SQL. It stayed online for 15 months without any internal copyright review.

The post was written by Pooja Kamith, a Senior Product Manager with more than 10 years at Microsoft. She trained a model on the pirated books to generate fan fiction, created an AI image of Harry Potter stamped with the Microsoft logo, and published the tutorial on the Azure SQL developer blog in November 2024. The takedown is the latest episode in Microsoft’s growing exposure to piracy lawsuits.

Tutorial based on pirated books

The guide, archived here, does not itself indicate that it was built on contested material. The tutorial, titled “LangChain Integration for Vector Support for SQL-based AI Applications,” provided step-by-step instructions for downloading a Kaggle dataset containing all seven Harry Potter books, uploading the text files to Azure Blob Storage, and training a question-answering model using Azure SQL DB and LangChain. Microsoft deleted the post less than 24 hours after a Hacker News thread pointing out the copyright issues went viral.

Shubham Maindola, an Indian data scientist who is apparently not affiliated with Microsoft, uploaded the Kaggle dataset linked in the tutorial. He downloaded Harry Potter e-books, converted them to text files, and labeled the collection as public domain, which was an incorrect classification. J.K. Rowling’s Harry Potter series remains fully under copyright.

However, Maindola told Ars Technica that the dataset was “incorrectly marked as public domain” and that there was no intention to misrepresent the license status. Kaggle did not respond to requests for comment.


