Achieve 10,000x training data reduction with high-fidelity labels



Experiment

We wanted to understand which models and tasks would benefit most from the curation process. As baselines for the experiment, we used crowdsourced labels to fine-tune two LLMs of different sizes (Gemini Nano-1 with 1.8B parameters and Nano-2 with 3.25B parameters) on two tasks of different complexity (lower and higher, as measured by expert alignment). Each crowdsourced dataset has ~100K annotations and a strong class imbalance, with around 95% benign labels on average.

Each of these four baseline conditions was compared with the corresponding curation condition, in which each model (Nano-1 and Nano-2) is fine-tuned over multiple rounds using the curation process described above. At each iteration, we selected a set of curated examples and used them for model evaluation and fine-tuning, as described above. All models plateaued before reaching parity with the experts' internal alignment, so we stopped at six iterations (~400 fine-tuning and ~250 evaluation samples) for the lower complexity task and five iterations (~250 fine-tuning and ~150 evaluation samples) for the higher complexity task. (Note that the lower complexity task had a larger variety of examples, which may explain the longer time needed to converge.) Both final datasets had a class balance of approximately 40% positive examples.

The following table provides an overview of the scale and quality of the data used in each condition. Through the curation process, experts reached an average pairwise Cohen's kappa of […] (lower complexity task) and .78 (higher complexity task). We consider these values to be the ceiling for model performance. To assess the quality of the crowdsourced data, we calculated the kappa alignment between the crowdsourced annotations and the experts on the complete curation set, which was .59 (lower complexity) and .41 (higher complexity).
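For readers unfamiliar with the metric, here is a minimal sketch of how an average pairwise Cohen's kappa can be computed. It is not the code used in the experiment: the function names, the rater names, and the toy labels ("benign" and "flagged") are all hypothetical and chosen only for illustration. Kappa compares the observed agreement between two raters with the agreement expected by chance from each rater's label frequencies.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items on which the two raters match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: derived from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        # Kappa is undefined when both raters use one identical label
        # throughout; returning 1.0 is a convention assumed for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def mean_pairwise_kappa(ratings):
    """Average kappa over every pair of raters; `ratings` maps rater -> labels."""
    pairs = list(combinations(ratings.values(), 2))
    if not pairs:
        raise ValueError("need at least two raters")
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Toy data (hypothetical): three raters labeling the same six items.
toy_ratings = {
    "expert_1": ["benign", "benign", "flagged", "benign", "flagged", "benign"],
    "expert_2": ["benign", "benign", "flagged", "benign", "benign", "benign"],
    "expert_3": ["benign", "flagged", "flagged", "benign", "flagged", "benign"],
}
print(round(mean_pairwise_kappa(toy_ratings), 2))
```

The same `cohen_kappa` function could be applied to a crowdsourced label set against an expert label set to obtain an alignment score of the kind reported above, assuming both sets cover the same items in the same order.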


