TY - JOUR
T1 - A study into patient similarity through representation learning from medical records
AU - Memarzadeh, Hoda
AU - Ghadiri, Nasser
AU - Samwald, Matthias
AU - Lotfi Shahreza, Maryam
PY - 2022
Y1 - 2022
N2 - Patient similarity assessment, which identifies patients similar to a given patient, is a fundamental component of many secondary uses of medical data. The assessment can be performed using electronic medical records (EMRs). Patient similarity measurement requires converting heterogeneous EMRs into comparable formats to calculate distance. This study presents a new data representation method for EMRs that considers the information in clinical narratives. To address the limitations of previous approaches in handling complex parts of EMR data, an unsupervised method is proposed for building a patient representation, which integrates unstructured and structured data extracted from patients’ EMRs. We employed a tree structure to model the extracted data and capture the temporal relations of multiple medical events from the EMR. We processed clinical notes to extract medical concepts using Python libraries such as MedspaCy and ScispaCy and mapped entities to the Unified Medical Language System (UMLS). To capture temporal aspects of the extracted events, we developed two new relabeling methods for the non-leaf nodes of the tree. To create an embedding vector for each patient, we traversed the tree to generate sequences that the Doc2vec algorithm would use. A comprehensive evaluation of the proposed method on patient similarity and mortality prediction tasks demonstrated that the proposed model achieves lower mean-squared error (MSE), higher precision, and higher normalized discounted cumulative gain (NDCG) relative to baselines.
AB - Patient similarity assessment, which identifies patients similar to a given patient, is a fundamental component of many secondary uses of medical data. The assessment can be performed using electronic medical records (EMRs). Patient similarity measurement requires converting heterogeneous EMRs into comparable formats to calculate distance. This study presents a new data representation method for EMRs that considers the information in clinical narratives. To address the limitations of previous approaches in handling complex parts of EMR data, an unsupervised method is proposed for building a patient representation, which integrates unstructured and structured data extracted from patients’ EMRs. We employed a tree structure to model the extracted data and capture the temporal relations of multiple medical events from the EMR. We processed clinical notes to extract medical concepts using Python libraries such as MedspaCy and ScispaCy and mapped entities to the Unified Medical Language System (UMLS). To capture temporal aspects of the extracted events, we developed two new relabeling methods for the non-leaf nodes of the tree. To create an embedding vector for each patient, we traversed the tree to generate sequences that the Doc2vec algorithm would use. A comprehensive evaluation of the proposed method on patient similarity and mortality prediction tasks demonstrated that the proposed model achieves lower mean-squared error (MSE), higher precision, and higher normalized discounted cumulative gain (NDCG) relative to baselines.
UR - https://hdl.handle.net/1959.7/uws:68871
U2 - 10.1007/s10115-022-01740-2
DO - 10.1007/s10115-022-01740-2
M3 - Article
SN - 0219-1377
VL - 64
SP - 3293
EP - 3324
JO - Knowledge and Information Systems
JF - Knowledge and Information Systems
IS - 12
ER -