TL;DR
- Blog deleted: Microsoft removed a developer tutorial after a viral Hacker News thread revealed that it instructed users to train an AI on pirated Harry Potter books.
- 15 months live: The post, published by Pooja Kamath, a Microsoft senior product manager, in November 2024, stayed live for 15 months without any internal copyright review.
- 10,000 downloads: A linked Kaggle dataset containing all seven Harry Potter books accumulated over 10,000 downloads while the tutorial was live.
- Legal exposure: A law professor points to Microsoft’s potential contributory copyright liability, noting that statutory damages for willful infringement can reach $150,000 per work.
- Corporate contradiction: That same month, Microsoft signed a book-licensing deal with HarperCollins, exposing an inconsistency in the company’s copyright practices.
Microsoft on Thursday removed a developer blog post that instructed users to train an AI on pirated Harry Potter books, less than 24 hours after a Hacker News thread flagging the copyright issues went viral. The tutorial opened with a scene of Harry on the Hogwarts Express listening to a friend pitch the capabilities of Azure SQL, and it had remained live for 15 months without any internal copyright review.
The post was written by Pooja Kamath, a senior product manager with over 10 years at Microsoft. She trained a model on the pirated books to generate fan fiction, created an AI image of Harry Potter stamped with the Microsoft logo, and published the tutorial on the Azure SQL developer blog in November 2024. The takedown is the latest episode in Microsoft’s growing exposure to copyright litigation over pirated training data.
Tutorial based on pirated books
The archived copies of the guide give no indication that it was built on contested material. The tutorial, titled “LangChain Integration for Vector Support for SQL-based AI Applications,” provided step-by-step instructions for downloading a Kaggle dataset containing all seven Harry Potter books, uploading the text files to Azure Blob Storage, and training a question-answering model using Azure SQL DB and LangChain. Microsoft deleted the post less than 24 hours after the Hacker News thread went viral.
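The pipeline the tutorial described, chunking a text, embedding the chunks, storing the vectors, and answering questions by retrieving the nearest chunk, can be sketched without any external services. The following is a simplified, dependency-free stand-in, not the deleted tutorial’s actual code: a bag-of-words counter replaces a real embedding model, an in-memory list replaces Azure Blob Storage and Azure SQL’s vector store, and the sample text is original filler rather than any copyrighted corpus.

```python
# Toy retrieval pipeline: chunk -> embed -> store -> retrieve.
# Stand-ins: Counter vectors instead of a real embedding model,
# a Python list instead of Azure SQL's vector store.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # "Embedding" here is just a lowercase bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def chunk(text: str, size: int = 8) -> list[str]:
    # Split the source text into fixed-size word windows.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# "Vector store": a list of (chunk, vector) pairs.
corpus = (
    "Azure SQL now supports vector search over embedded documents. "
    "A question answering model retrieves the most relevant chunk "
    "and passes it to a language model as grounding context."
)
store = [(c, embed(c)) for c in chunk(corpus)]

def retrieve(question: str) -> str:
    # Return the stored chunk most similar to the question.
    qv = embed(question)
    return max(store, key=lambda pair: cosine(qv, pair[1]))[0]

print(retrieve("what does vector search retrieve?"))
```

In the real tutorial, the embedding step and the similarity search would have been delegated to an embedding model and to Azure SQL’s vector functions via LangChain; the shape of the flow, however, is the same.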
Shubham Maindola, an Indian data scientist with no apparent affiliation with Microsoft, uploaded the Kaggle dataset linked in the tutorial. He downloaded Harry Potter e-books, converted them to text files, and labeled the collection as public domain, an incorrect classification: J.K. Rowling’s Harry Potter series remains fully under copyright.
Maindola told Ars Technica that the dataset was “incorrectly marked as public domain” and that he had no intention of misrepresenting its license status. Kaggle did not respond to requests for comment.
This mislabeling illustrates how AI copyright risk propagates through the developer ecosystem: a pirated collection incorrectly labeled “public domain” by one contributor becomes a de facto training resource for thousands of people once a major company’s branded tutorial links to it as authoritative.
Microsoft’s own Harry Potter upload
Beyond the Kaggle link, the tutorial’s working demo was built on Microsoft’s own Azure-hosted dataset containing Harry Potter and the Philosopher’s Stone. As a Hacker News commenter pointed out, another sample in the Azure-Samples GitHub repository included Isaac Asimov’s Foundation series, which is likewise under copyright.
The copyright review failures thus spanned two repositories, which makes them hard to dismiss as a single author’s lapse in judgment. For a developer advocacy operation that produces technical tutorials at scale, the absence of any internal flag for 15 months points to a structural gap in content review rather than an individual employee’s error.
15 months, 10,000 downloads
This structural gap had visible effects over time. As Ars Technica reported, the post on Microsoft’s Azure SQL blog was live for more than 15 months, from November 2024 to February 2026, during which the linked Kaggle dataset was downloaded more than 10,000 times. At no point in that period did any internal compliance review flag the post.
The oversight came to light only through external pressure. Ars Technica contacted Maindola directly on February 20, and he removed the Kaggle dataset the same day, before Microsoft had taken any action of its own. The blog post itself came down a few hours later.
For over a year, the company’s developer advocacy infrastructure directed users to pirated content without any internal check. The failure looks even more remarkable alongside what Microsoft was doing in parallel.
Corporate double standards
The incident contrasts sharply with Microsoft’s own published research. In 2023, Microsoft Research released a paper titled “Who’s Harry Potter? Approximate Unlearning in LLMs,” citing copyright liability as a motivation. The team demonstrated that a model which had absorbed a book could be made to forget it in roughly one GPU-hour of fine-tuning, and published the resulting model on Hugging Face.
In November 2024, the same month the Kamath tutorial was published, Microsoft signed a book-licensing deal with HarperCollins for AI training, an arrangement that explicitly assumes rights are cleared before use. That deal carries legal weight here: evidence that a company knew licensing was required is relevant to willfulness, the finding that pushes statutory damages toward their maximum. Kamath’s post, published the same month, directed users to copyrighted books through an unauthorized pirated dataset.
Meanwhile, courts have generally held that AI training on copyrighted material may qualify as fair use, though whether pirated source material changes that analysis remains an open question. Separately, Microsoft faces legal challenges over its use of pirated books to train its Megatron models, part of a broader wave of litigation: at least 75 lawsuits have been filed against AI companies since 2022, including against Meta, OpenAI, Google, and Microsoft.
Legal exposure
Against the backdrop of that litigation, the specific facts here carry clear legal weight. “There are multiple levels of responsibility,” said Cathay Y.N. Smith, a law professor at Chicago-Kent who co-directs the school’s intellectual property program.
Specifically, U.S. copyright law allows statutory damages of up to $150,000 per work for willful infringement. With seven Harry Potter books involved, the theoretical maximum is $1.05 million in statutory damages alone, a significant potential exposure.
Smith acknowledged, however, that developers may reasonably rely on datasets labeled “public domain”: someone with technical or literary training may have no idea how long copyright lasts, especially when a reputable platform presents the material as freely available. But that good-faith argument does not resolve Microsoft’s secondary exposure as the tutorial’s publisher. Smith identified contributory liability as a core risk.
“The end result is they create infringing material by saying, ‘Hey, here you go, take that infringing material and use it in our system.’ Encouraging others to download it and use it for training purposes could incur some type of secondary, contributory liability for copyright infringement.”
Cathay Y.N. Smith, co-director of the Intellectual Property Law Program at Chicago-Kent College of Law (via Ars Technica)
Smith also flagged fan-fiction output as a separate concern: an AI model that generates content based on Rowling’s characters and plot lines could reproduce protected elements of expression.
“Both regurgitation and fan-fiction creation can be copyright issues,” Smith noted, and the reproduced output itself “may be infringing.”
This dual liability distinguishes the case from AI copyright disputes that focus solely on training data. If both theories hold up in court, the damages exposure grows for Microsoft and for the developers who followed the tutorial into shipped products.
Who is responsible?
Assigning responsibility involves overlapping failures. Smith offered a candid explanation of why a technically skilled Microsoft employee reached for Harry Potter rather than a genuinely public-domain text.
“If I had cleared this for Microsoft, I would have been concerned, but at the same time, I completely understand what this employee was doing. No one wants to write fan fiction about books that are in the public domain.”
Cathay Y.N. Smith, co-director of the Intellectual Property Law Program at Chicago-Kent College of Law (via Ars Technica)
Meanwhile, a Hacker News commenter who identified himself as a former Microsoft employee called the incident a bad decision and noted that the post was deleted as soon as someone noticed. The commenter also said Microsoft allows employees to publish blog posts without editorial review. If that description of policy is accurate, the issue shifts from individual negligence to a systemic weakness in how the company vets developer-facing content.
Microsoft declined to comment and, as of publication, had not issued a statement explaining the editorial failure.
The silence leaves similar questions hanging over those who followed the tutorial. The more than 10,000 developers who trained AI models on the pirated dataset now sit in the same uncertain legal territory as Microsoft, though no rights holder has yet moved against them. As AI copyright litigation keeps growing, will any of these questions be resolved?
Smith said: “I’m concerned about it, but I’m not saying it’s automatically a violation.” For now, developer advocacy blogs that promote AI capabilities with Kaggle datasets, and the developers who follow their instructions, operate in legal territory the courts have not yet fully mapped.
