Masking the unknown: leveraging masked samples for enhanced Data Augmentation

Xun Yao, Zijian Huang, Xinrong Hu, Jie Yang, Yi Guo

Research output: Contribution to journal › Article › peer-review

Abstract

Data Augmentation (DA) has become a widely adopted strategy for addressing data scarcity in numerous NLP tasks, especially in scenarios with limited resources or imbalanced classes. However, many existing augmentation techniques rely on randomness or additional resources, presenting challenges in both performance and practical implementation. Furthermore, there is a lack of exploration into what constitutes effective augmentation. In this paper, we systematically evaluate existing DA methods across a comprehensive range of text-classification benchmarks. The empirical analysis highlights that the most significant change resulting from augmentation is observed in the data variance. This observation inspires the proposed approach, termed Mask-for-Data Augmentation (M4DA), which strategically masks tokens from original samples for augmentation. Specifically, M4DA consists of a Variance-Oriented Masker Module (VMM), which ensures an increase in data variance, and a Complexity-Enhanced Selection Module (CSM), designed to select the augmented sample with the highest semantic complexity. The effectiveness of the proposed method is empirically validated across various text-classification benchmarks, including scenarios with limited or full resources and imbalanced classes. Experimental results demonstrate considerable improvements over state-of-the-art methods.
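
The two-stage idea in the abstract (mask tokens, keep candidates that raise data variance, then pick the most complex one) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how VMM measures variance or how CSM scores semantic complexity, so everything below is an assumption made for illustration. The hash-based `toy_embed`, the mean-pooled sample vector, the single-token masking scheme, and the use of token entropy as a complexity proxy are all hypothetical stand-ins.

```python
import hashlib
import math
from collections import Counter

MASK = "[MASK]"
DIM = 8  # size of the toy embedding (arbitrary, illustration only)


def toy_embed(token):
    """Hypothetical deterministic token vector built from an MD5 digest."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return [b / 255.0 - 0.5 for b in digest[:DIM]]


def embed_sample(tokens):
    """Mean-pool the toy token vectors into one sample vector."""
    vecs = [toy_embed(t) for t in tokens]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def total_variance(vectors):
    """Mean squared distance of the vectors to their centroid."""
    centroid = [sum(col) / len(vectors) for col in zip(*vectors)]
    return sum(
        sum((x - c) ** 2 for x, c in zip(v, centroid)) for v in vectors
    ) / len(vectors)


def token_entropy(tokens):
    """Shannon entropy of the token distribution (assumed complexity proxy)."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_candidates(tokens):
    """All variants of the sample with exactly one token replaced by MASK."""
    return [tokens[:i] + [MASK] + tokens[i + 1:] for i in range(len(tokens))]


def augment(sample, dataset):
    """Stand-in for the two stages described in the abstract.

    1. Variance filter: keep masked candidates whose addition to the
       dataset increases the total variance of the sample vectors.
    2. Complexity selection: among those, return the candidate with the
       highest token entropy, or None if no candidate passes the filter.
    """
    base_vecs = [embed_sample(s) for s in dataset]
    base_var = total_variance(base_vecs)
    kept = [
        c for c in mask_candidates(sample)
        if total_variance(base_vecs + [embed_sample(c)]) > base_var
    ]
    if not kept:
        return None
    return max(kept, key=token_entropy)


if __name__ == "__main__":
    data = [
        "the movie was great".split(),
        "the plot was dull".split(),
        "a great cast and a dull script".split(),
    ]
    print(augment(data[0], data))
```

In a realistic setting the toy embedding would be replaced by a pretrained encoder and the entropy proxy by whatever complexity measure the paper defines; the sketch only shows how a variance filter and a selection step can be chained.
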
Original language: English
Pages (from-to): 3997-4010
Number of pages: 14
Journal: Proceedings of Machine Learning Research
Volume: 244
Publication status: Published - 2024
Event: 40th Conference on Uncertainty in Artificial Intelligence, UAI 2024 - Barcelona, Spain
Duration: 15 Jul 2024 to 19 Jul 2024
