Experiment
The experiments were conducted on four datasets: three with downstream generation tasks and one with a downstream classification task. Generation tasks are usually more difficult than classification tasks, because generation is evaluated by next-token prediction accuracy, so the synthetic data must preserve fine-grained textual information from the private data. In contrast, classification tasks only require preserving the co-occurrence patterns between labels and words in the private data.
The three generation tasks are chosen to cover a diverse set of practical scenarios: PubMed (medical paper abstracts), Chatbot Arena (human-machine interactions), and Multi-Session Chat (daily human-human conversations). To assess the quality of the generated synthetic data, we follow the AUG-PE setup: a small downstream language model is trained on the synthetic data, and its next-token prediction accuracy is measured on the real test data.
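As a concrete illustration of this evaluation, the sketch below computes next-token prediction accuracy on real test texts for a causal language model that has already been fine-tuned on the synthetic corpus. It is a minimal reference implementation, not the exact evaluation code: the Hugging Face API, the model checkpoint path, and the truncation settings are assumptions, and the fine-tuning step itself is omitted.

```python
# Sketch: next-token prediction accuracy of a downstream LM on real test data.
# Assumptions: a Hugging Face causal LM already fine-tuned on the synthetic
# corpus and saved to `model_dir`; `test_texts` are plain strings from the
# real test set. Names and settings are placeholders, not the paper's code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def next_token_accuracy(model_dir: str, test_texts: list[str]) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir).eval()

    correct, total = 0, 0
    with torch.no_grad():
        for text in test_texts:
            ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids
            logits = model(ids).logits              # (1, seq_len, vocab_size)
            preds = logits[:, :-1].argmax(dim=-1)   # predicted next token at each position
            targets = ids[:, 1:]                    # ground-truth next tokens
            correct += (preds == targets).sum().item()
            total += targets.numel()
    return correct / total
```

The key point of the protocol is that the downstream model never sees real training data; only the synthetic data is used for training, while accuracy is always reported on the real test split.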
The classification task is performed on the OpenReview dataset. To assess the quality of the generated synthetic data, we train a downstream classifier on the synthetic data and measure its classification accuracy on the real test data.
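The classification evaluation follows the same train-on-synthetic, test-on-real pattern. The sketch below uses a TF-IDF plus logistic-regression pipeline purely as a lightweight stand-in; the actual downstream classifier and its hyperparameters may differ, and the variable names are placeholders.

```python
# Sketch: downstream classification accuracy, trained only on synthetic
# (text, label) pairs and evaluated only on real test data. The classifier
# here is an assumed lightweight stand-in, not necessarily the one used
# in the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

def downstream_classification_accuracy(syn_texts, syn_labels, test_texts, test_labels):
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(syn_texts, syn_labels)            # train on synthetic data only
    return accuracy_score(test_labels, clf.predict(test_texts))  # evaluate on real data
```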
The selected datasets were carefully analyzed to alleviate concerns about data contamination; our analysis found no overlap between the pre-training data and the downstream datasets.
