Abstract
This study investigates how document structure and preprocessing strategies influence the performance of Retrieval-Augmented Generation (RAG) pipelines in domain-specific question answering, using university curriculum handbooks as a case study. We evaluate multiple data representations, varying in chunking method, punctuation level, and structural format (HTML, plain text, JSON), within a privacy-preserving pipeline that combines semantic vector indexing with a lightweight, locally deployed LLM. A benchmark of curriculum-related queries, covering both simple lookups and multi-hop reasoning, was used to assess performance under controlled retrieval conditions. Results show that semantically coherent chunking, manual or LangChain-based, substantially improves accuracy, with gains exceeding 25 percentage points for medium-difficulty queries, while structurally dense formats such as raw HTML or JSON reduce performance. Fixed-effects modelling confirms chunking and natural-language rephrasing as significant positive predictors of correctness, whereas punctuation density and markup type show no measurable effect. These findings highlight the critical, yet often underexamined, role of input data structure in applied RAG pipelines for curriculum advising, where accurate retrieval is essential for tasks such as prerequisite checking, program planning, and subject eligibility. The study presents a reproducible framework for evaluating preprocessing decisions in academic QA systems and supports the development of reliable, privacy-compliant advising tools.
| Original language | English |
|---|---|
| Title of host publication | Data Science and Machine Learning: 23rd Australasian Conference, AusDM 2025, Brisbane, QLD, Australia, November 26–28, 2025, Proceedings |
| Editors | Quang Vinh Nguyen, Yuefeng Li, Paul Kwan, Yanchang Zhao, Yee Ling Boo, Richi Nayak |
| Place of Publication | Singapore |
| Publisher | Springer |
| ISBN (Electronic) | 9789819567867 |
| ISBN (Print) | 9789819567850 |
| Publication status | Published - 2026 |
| Event | Australasian Conference on Data Science and Machine Learning - Brisbane, Australia; Duration: 26 Nov 2025 → 28 Nov 2025; Conference number: 23rd |
Publication series
| Name | Communications in Computer and Information Science |
|---|---|
| Volume | 2765 |
| ISSN (Print) | 1865-0929 |
| ISSN (Electronic) | 1865-0937 |
Conference
| Conference | Australasian Conference on Data Science and Machine Learning |
|---|---|
| Abbreviated title | AusDM |
| Country/Territory | Australia |
| City | Brisbane |
| Period | 26/11/25 → 28/11/25 |
Keywords
- LLM RAG
- document structuring
- course advising