How can developers train AI models without infringing copyright?

Building a large language model (LLM) requires hundreds of terabytes (if not petabytes) of training data. But where do you, as a developer, get all this data from? And what can you do to keep copyright infringement out of it?

In some cases, AI developers have been found to have harvested or scraped hundreds of gigabytes of pirated ebooks, proprietary code, or personal data from online sources without the consent of their subjects or stakeholders. Given that the standard for LLMs today is being able to recite poetry, write Python, and explain quantum physics, there’s a competitive incentive for companies to build the biggest models possible.

In the race to reach a certain number of parameters, this not only increases the likelihood of collecting copyrighted training data, it also increases environmental damage and can lead to inaccurate results. In many cases, what you want instead of an LLM is a smart language model (SLM): a model with a horizontal knowledge base that uses a reasonable amount of ethically sourced training data and is tailored to solve a specific business problem.

Avoid copyrighted or illegal datasets

If you want to ensure that your AI models can weather the storm of AI regulation in the years to come, the easiest way to do that is to make sure you’ve researched and validated all your training data sources. That’s easier said than done.

The nature of the tech landscape makes it much easier for hyperscalers like Amazon and Microsoft to build and train their own models. They have vast amounts of user data, collected across the different arms of their businesses, to feed their neural networks. For start-ups looking to find their market niche by training new models, collecting the same amount of data while avoiding copyrighted material can feel like an impossible task.

First, follow standard due diligence. Make sure you have the necessary permissions or licenses to access and use the datasets you choose, and set rules governing the collection and storage of user data.
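One way to make that due diligence enforceable is to keep a provenance record for every candidate dataset and admit only those whose license is on an approved list. The sketch below is a minimal illustration under stated assumptions: the `SourceRecord` fields, the `partition_sources` helper, and the contents of `ALLOWED_LICENSES` are all hypothetical, and which licenses are actually acceptable is a question for legal counsel, not for code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed allowlist for illustration only; the real list is a legal decision.
ALLOWED_LICENSES = {"cc0-1.0", "cc-by-4.0", "mit", "apache-2.0"}


@dataclass(frozen=True)
class SourceRecord:
    """Hypothetical provenance record kept for every candidate dataset."""
    name: str
    url: str
    license_id: Optional[str]  # None means the license is unknown


def partition_sources(
    sources: List[SourceRecord],
) -> Tuple[List[SourceRecord], List[SourceRecord]]:
    """Split sources into (approved, rejected) based on the allowlist.

    Sources with an unknown license are rejected rather than assumed
    to be permissive.
    """
    approved, rejected = [], []
    for src in sources:
        license_id = (src.license_id or "").strip().lower()
        if license_id in ALLOWED_LICENSES:
            approved.append(src)
        else:
            rejected.append(src)
    return approved, rejected
```

Keeping the rejected list, rather than silently dropping entries, leaves an audit trail of what was considered and why it was excluded.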

Also, consider whether training your model on a smaller dataset, or fine-tuning an existing open-source alternative, would be a more effective solution. This makes it easier to collect enough data and verify its origin. While such a model may not have the broad applicability of ChatGPT or Bard, it is an opportunity to build credibility in a particular domain or industry.

Of course, there is another option. Organic training data comes with many issues, including copyright, accuracy, and bias. As such, many in the AI community have become proponents of synthetic training data. If we can synthesize data for a specific problem, we can train the model with much higher accuracy while avoiding copyright issues entirely.
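To make the idea of synthetic data concrete, here is a minimal sketch of its simplest form: filling hand-written templates to produce labeled examples for a toy intent-classification task. Everything in it (`TEMPLATES`, `TOPICS`, `generate_examples`) is a hypothetical illustration; real synthetic-data pipelines are typically far more elaborate, and this is not a description of any particular team's method.

```python
import random

# Hypothetical templates and topics, written by hand for this example.
TEMPLATES = {
    "find_paper": [
        "Find papers about {topic}.",
        "Which studies cover {topic}?",
    ],
    "summarize": [
        "Summarize recent work on {topic}.",
        "Give me an overview of {topic}.",
    ],
}
TOPICS = ["perovskite solar cells", "protein folding", "battery electrolytes"]


def generate_examples(n, seed=0):
    """Return n synthetic {"text", "label"} records.

    A seeded random generator keeps the output reproducible, so the
    exact dataset can be regenerated and audited later.
    """
    rng = random.Random(seed)
    labels = sorted(TEMPLATES)
    examples = []
    for _ in range(n):
        label = rng.choice(labels)
        template = rng.choice(TEMPLATES[label])
        text = template.format(topic=rng.choice(TOPICS))
        examples.append({"text": text, "label": label})
    return examples
```

Because every record is produced from material you wrote yourself, its origin is known by construction, which is the property the paragraph above is after.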

This kind of resourceful thinking is essential. After all, for every model called smart, its builders were even smarter in how they leveraged existing models, data points, and data analysis to prepare, scale, and manipulate data.

Think about the specific problem you want to solve, such as finding the right papers in a large body of scientific research, and then train your model on focused, labeled datasets from trusted sources in that domain, such as open-source academic research.

Again, the quality of your model is directly related to how smart you can be as a developer. The level of care and resourcefulness with which you acquire the data is reflected in the level of quality you can expect from your model.

Avoid misinformation and inaccurate answers

Another advantage of curating a vetted, high-quality training dataset is that users can trust the model to generate accurate and well-informed responses. This reduces the spread of misinformation and hallucinated responses.

Every day we read about models like ChatGPT and Bard producing inaccurate or completely wrong answers to questions. If you want to build models that are adaptive, efficient, and accurate, and that stand the test of time, fact verification should be an important part of your model’s architecture.

To prioritize accuracy and high-quality training, there is an opportunity to change the underlying mechanisms of neural networks. To this day, these models have been built to ingest a lot of information and spit it back out, but without any internal sense of how the pieces fit together.

We need to build models that are more selective in unsupervised learning, have better attention spans, and can focus more easily. Use an internal mechanism to filter the data before feeding it to the training process.
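As a rough illustration of filtering data before it reaches the training process, the sketch below applies three simple checks: a minimum length, a minimum share of alphabetic characters, and exact-duplicate removal after normalization. It is an external preprocessing step rather than a mechanism inside the network, and the thresholds `MIN_WORDS` and `MIN_ALPHA_RATIO` are assumed values chosen for the example, not recommendations.

```python
import hashlib

MIN_WORDS = 5          # assumed threshold: drop very short fragments
MIN_ALPHA_RATIO = 0.6  # assumed threshold: drop mostly non-text content


def normalize(text):
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def filter_documents(docs):
    """Return the documents that pass all checks, in original order."""
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < MIN_WORDS:
            continue
        # Share of characters that are letters or spaces.
        alpha = sum(ch.isalpha() or ch.isspace() for ch in norm)
        if alpha / len(norm) < MIN_ALPHA_RATIO:
            continue
        # Exact-duplicate check on the normalized text.
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept
```

Hashing the normalized text catches copies that differ only in case or spacing; catching near-duplicates would need a fuzzier technique than the one shown here.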

A smarter way to build language models

Today, LLMs built by hyperscalers are consuming the power and resources of small cities, and this is only increasing. Training GPT-3 alone would take 355 years of compute time on a single processor and 284,000 kWh of energy. This is 10 times more than GPT-2. Aside from the harm this does to our planet, it’s highly inefficient. Upgrading the training process and narrowing down the list of specific use cases can help build future-proof and sustainable models.

If there is a specific use case where AI can be useful (for example, scanning new scientific patents for potential infringement), why should the model be able to recite Shakespeare? More data does not always lead to better systems. Quality is much more important than quantity in specialized technical fields such as materials science and medical literature.

Here is another idea to help you avoid copyright issues related to LLM training. Use a swarm of smart language model agents to address the multiple aspects of a business problem, each with more autonomy in how it achieves its goal, instead of contorting and permuting a single LLM to make it relevant to all of them at once.
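The swarm idea can be sketched as a small orchestrator that hands each aspect of a problem to its own narrow agent. In the hypothetical example below, the three agents are plain placeholder functions standing in for small, specialized models, and the patent-screening steps are assumptions made for illustration; a real system would need to decide how agents share intermediate results.

```python
from typing import Callable, Dict, List, Tuple

Agent = Callable[[str], str]


def patent_search_agent(task: str) -> str:
    # Placeholder for a small model specialized in patent retrieval.
    return f"[search] candidates for: {task}"


def claim_comparison_agent(task: str) -> str:
    # Placeholder for a small model specialized in comparing claims.
    return f"[compare] overlap analysis for: {task}"


def summary_agent(task: str) -> str:
    # Placeholder for a small model specialized in report writing.
    return f"[summary] report for: {task}"


# Hypothetical pipeline: one narrow agent per aspect of the problem.
PIPELINE: List[Tuple[str, Agent]] = [
    ("search", patent_search_agent),
    ("compare", claim_comparison_agent),
    ("summary", summary_agent),
]


def run_swarm(task: str, pipeline: List[Tuple[str, Agent]] = PIPELINE) -> Dict[str, str]:
    """Run every agent on the task and collect the results by step name."""
    results: Dict[str, str] = {}
    for name, agent in pipeline:
        results[name] = agent(task)
    return results
```

Because each agent covers one narrow job, each can in principle be trained on a small, well-sourced dataset for that job alone, which is where the copyright benefit would come from.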

Industry leaders like Andrew Ng are calling for the development of “data-centric AI,” which focuses on engineering the data needed to build specific AI models. The move aims to enhance data quality and labeling to match the efficiency and methods of modern algorithms.

From a copyright standpoint, if you want to build your AI models in a way that keeps you out of hot water, stick to the basics and prioritize quality over quantity. Research your sources, understand how much data you need to collect for your particular use case, and create fact-based verification mechanisms to ensure accuracy.

Let’s work together to build smarter language models, not just large language models.

Lead image: dream studio

Iris.ai co-founder and CEO Anita Schjøll Brede will attend the Tech.eu Summit in Brussels on May 24th. Tickets are on sale now.
