TY - CHAP
T1 - MIRS: [MASK] Insertion Based Retrieval Stabilizer for Query Variations
AU - Liu, Junping
AU - Gong, Mingkang
AU - Hu, Xinrong
AU - Yang, Jie
AU - Guo, Yi
PY - 2023
Y1 - 2023
N2 - Pre-trained Language Models (PLMs) have greatly pushed the frontier of document retrieval tasks. Recent studies, however, show that PLMs are vulnerable to query variations, i.e., queries containing misspellings, re-ordered words, and similar perturbations of the original queries. Despite increasing interest in making retriever performance more robust, the impact of query variations has not been fully explored. To address this problem, this paper revisits Masked Language Modeling (MLM) and proposes a robust fine-tuning algorithm, termed [MASK] Insertion based Retrieval Stabilizer (MIRS). The proposed algorithm differs from existing methods by injecting [MASK] tokens into query variations and encouraging representation similarity between each original query and its variations. Compared with MLM, the traditional [MASK] substitution-then-prediction objective is de-emphasized in MIRS. Additionally, an in-depth analysis of the algorithm reveals that: (1) the latent representations (or semantics) of the original query form a convex hull, and the impact of a query variation can be quantified as a "distortion" of this hull through deviation of its vertices; and (2) the inserted [MASK] tokens play a significant role in enlarging the intersection between the newly formed hull (after variation) and the original one, thereby preserving more of the original queries' semantics. With the proposed [MASK] injection, MIRS achieves an average improvement of 1.8 absolute MRR@10 points in retrieval accuracy, verified against 5 baselines across 3 public datasets with 4 types of query variations. We also provide extensive ablation studies to investigate hyperparameter sensitivity, to break down the model into individual components and demonstrate their efficacy, and to evaluate out-of-domain model generalizability.
KW - Document Retrieval
KW - Masked-Language Modeling
KW - Model Robustness
KW - Query Representation
KW - Query Variations
UR - https://www.scopus.com/pages/publications/85174701409
U2 - 10.1007/978-3-031-39847-6_31
DO - 10.1007/978-3-031-39847-6_31
M3 - Chapter
AN - SCOPUS:85174701409
SN - 9783031398469
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 392
EP - 407
BT - Database and Expert Systems Applications: 34th International Conference, DEXA 2023, Penang, Malaysia, August 28–30, 2023, Proceedings, Part I
A2 - Strauss, Christine
A2 - Amagasa, Toshiyuki
A2 - Kotsis, Gabriele
A2 - Khalil, Ismail
A2 - Tjoa, A Min
PB - Springer Nature Switzerland
CY - Switzerland
ER -