Abstract
Data Augmentation (DA) has become a widely adopted strategy for addressing data scarcity in numerous NLP tasks, especially in scenarios with limited resources or imbalanced classes. However, many existing augmentation techniques rely on randomness or additional resources, presenting challenges in both performance and practical implementation. Furthermore, there is a lack of exploration into what constitutes effective augmentation. In this paper, we systematically evaluate existing DA methods across a comprehensive range of text-classification benchmarks. The empirical analysis highlights that the most significant change resulting from augmentation is observed in the data variance. This observation inspires the proposed approach, termed Mask-for-Data Augmentation (M4DA), which strategically masks tokens from original samples for augmentation. Specifically, M4DA consists of a Variance-Oriented Masker Module (VMM), which ensures an increase in data variances, and a Complexity-Enhanced Selection Module (CSM), designed to select the augmented sample with the highest semantic complexity. The effectiveness of the proposed method is empirically validated across various text-classification benchmarks, including scenarios with limited or full resources and imbalanced classes. Experimental results demonstrate considerable improvements over state-of-the-arts.
Original language | English |
---|---|
Pages (from-to) | 3997-4010 |
Number of pages | 14 |
Journal | Proceedings of Machine Learning Research |
Volume | 244 |
Publication status | Published - 2024 |
Event | 40th Conference on Uncertainty in Artificial Intelligence, UAI 2024 - Barcelona, Spain Duration: 15 Jul 2024 → 19 Jul 2024 |