Cohere for AI Enhances Large Language Models (LLMs) with Active Inheritance: Steering Synthetic Data Generation for Optimal Performance and Reduced Bias


https://arxiv.org/abs/2407.01490

Synthetic data generation has gained popularity in machine learning. The technique creates large datasets when real-world data is limited or expensive to collect. By generating synthetic data, researchers can train machine learning models more effectively and improve their performance across applications. The generated data can be crafted to exhibit specific characteristics that benefit the model's learning process.

However, integrating synthetic data into machine learning models poses several challenges, especially with regard to the biases and attributes that synthetic data may introduce. Understanding how these inherited characteristics affect the behavior and performance of large language models (LLMs) is crucial. A primary concern is whether synthetic data introduces unintended biases or other attributes that affect the model's outputs. This understanding is essential to ensure that models trained on synthetic data are effective and fair, and that negative characteristics from the data generation process are not perpetuated.

Current methods for optimizing the data space include data augmentation, pseudo-labeling, data weighting, data pruning, and curriculum learning. Data augmentation extends the dataset by creating modified versions of existing data. Pseudo-labeling generates labels for unlabeled data, effectively enlarging the dataset. Data weighting assigns different importance to different data points, and data pruning improves the quality of the remaining dataset by removing less useful data. Curriculum learning structures the training process by gradually introducing more complex data. While these methods are useful, they are limited by the characteristics inherent to the initial dataset. They often cannot introduce new desirable attributes, which limits their effectiveness in optimizing a model for specific characteristics.

Researchers from Cohere for AI and Cohere propose a method called active inheritance. This method aims to intentionally direct synthetic data generation towards specific, non-differentiable objectives, such as high lexical diversity or low toxicity. By guiding the data generation process, researchers can directly influence the properties of the resulting model. Active inheritance involves selecting proxy labels based on the desired characteristics, generating multiple samples per prompt, and keeping the samples that maximize the desired attributes. This approach, called targeted sampling, allows models to be fine-tuned towards specific goals using synthetic datasets curated to enrich for these attributes.
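The selection step in targeted sampling can be illustrated with a minimal Python sketch. This is not the researchers' implementation: the function names are hypothetical, the candidate texts are toy stand-ins for samples drawn from a generator model, and a simple type-token ratio is assumed here as a crude proxy for lexical diversity.

```python
from typing import Callable, Dict, List


def type_token_ratio(text: str) -> float:
    """Unique words divided by total words (assumed proxy for lexical diversity)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def targeted_sampling(
    candidates_per_prompt: Dict[str, List[str]],
    score: Callable[[str], float],
) -> Dict[str, str]:
    """For each prompt, keep the candidate with the highest proxy score."""
    return {
        prompt: max(candidates, key=score)
        for prompt, candidates in candidates_per_prompt.items()
        if candidates  # skip prompts that have no candidates
    }


# Toy candidates standing in for multiple samples per prompt (hypothetical data).
samples = {
    "Describe the sea.": [
        "The sea is big and the sea is blue and the sea is wide.",
        "Restless grey water stretches toward a pale horizon.",
    ],
}

# The curated prompt/response pairs would then serve as fine-tuning data.
curated = targeted_sampling(samples, type_token_ratio)
print(curated)
```

To steer towards a lower value of an attribute, such as a toxicity score, the same function can be called with a negated scoring function so that the least toxic candidate is kept.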

Active inheritance has shown considerable promise. Targeted sampling effectively steers model behavior towards desirable attributes: models showed increases of up to 116% in response length and 43% in linguistic diversity, and toxicity fell by up to 40%. These results demonstrate the potential of active inheritance to improve the quality and safety of language models. By focusing on specific properties, researchers can encourage desirable characteristics in models and minimize negative ones.

The study also explored passive inheritance, where a model inherits characteristics from synthetic data without explicit guidance, and how it affects model behavior. The study found that models are sensitive to the characteristics of the synthetic data used for training, even when the prompts used to generate that data appear neutral. This sensitivity raises concerns that unintended biases or attributes may be introduced into the model, and it underscores the importance of carefully managing synthetic data to avoid undesirable outcomes.
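One way to check for this kind of sensitivity is to profile a model's outputs before and after fine-tuning on synthetic data and compare the averages. The Python sketch below assumes two simple text metrics and uses made-up outputs purely for illustration; the metric choices, names, and data are hypothetical and are not taken from the study.

```python
from statistics import mean
from typing import Callable, Dict, List

Metric = Callable[[str], float]


def word_count(text: str) -> float:
    """Number of whitespace-separated words."""
    return float(len(text.split()))


def unique_word_ratio(text: str) -> float:
    """Unique words divided by total words."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def profile(outputs: List[str], metrics: Dict[str, Metric]) -> Dict[str, float]:
    """Average each metric over a non-empty set of model outputs."""
    return {name: mean(fn(o) for o in outputs) for name, fn in metrics.items()}


def relative_change(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, float]:
    """Percent change per metric, relative to the 'before' profile."""
    return {
        name: 100.0 * (after[name] - before[name]) / before[name]
        for name in before
        if before[name] != 0  # avoid dividing by zero
    }


metrics = {"length": word_count, "unique_word_ratio": unique_word_ratio}

# Hypothetical outputs from a model before and after fine-tuning.
base_outputs = ["The cat sat.", "It rained all day today."]
tuned_outputs = [
    "The old cat sat quietly by the door.",
    "Rain fell steadily from dawn until dusk.",
]

shift = relative_change(profile(base_outputs, metrics), profile(tuned_outputs, metrics))
print(shift)
```

A large shift in a metric that the training prompts never targeted would be a signal worth investigating before the fine-tuned model is used further.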

In conclusion, this study shows that synthetic data has a significant impact on the attributes of large language models. By introducing the concept of active inheritance, the Cohere researchers provide a framework for guiding synthetic data generation towards desired characteristics. The method enriches for specific attributes, such as higher lexical diversity and lower toxicity, helping ensure that models trained on synthetic data are effective and safe. The results show that desired attributes can be instilled in a model's generations efficiently and with minimal effort. Active inheritance is a promising approach to optimizing machine learning models, offering a path towards more sophisticated and reliable AI systems.


Check out the paper. All credit for this research goes to the researchers of this project.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His latest endeavor is the launch of Marktechpost, an Artificial Intelligence media platform. The platform stands out for its in-depth coverage of Machine Learning and Deep Learning news in a manner that is technically accurate yet easily understandable to a wide audience. The platform has gained popularity among its audience with over 2 million views every month.
