A framework for scalable similarity evaluation in text graphs

Mahdi Samani, Nasser Ghadiri

Research output: Chapter in Book / Conference Paper › Conference Paper › peer-review

Abstract

Graphs and graph databases are applicable across a wide range of domains, including text mining and web mining. Using graphs to represent relationships between entities provides enriched models for emerging tasks in web search and information retrieval. Natural language processing algorithms use graphs to model the structural relationships of texts efficiently, resulting in improved performance. However, increasing the accuracy of graph construction and weight allocation remains a fundamental challenge. Existing methods for these tasks provide limited efficiency and lack scalability for large graphs. In this study, we propose a novel graph-based method for modeling text and running a query to evaluate the similarity of text segments. In this method, the graph corresponding to the text is first created by modeling words and named entities with the state-of-the-art pre-trained BERT model. Graph nodes are then weighted in two stages. In the first stage, nodes with greater generality obtain higher weights. The second weighting stage uses the graph obtained from the query text: nodes are considered important if they are specifically related to the query text. After the important nodes in the graph are determined, the semantic similarity between the query text and the texts in the database is measured. The whole framework runs as a natural language processing pipeline on the scalable Apache Spark platform. The efficiency of the model was evaluated in both distributed and non-distributed configurations, and its scalability was evaluated on a Spark cluster. Evaluation of accuracy using the Pearson correlation coefficient shows that the proposed method outperforms its competitors.
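The abstract reports accuracy as a Pearson correlation coefficient. As a minimal sketch of that kind of evaluation, the Python snippet below computes the coefficient between predicted similarity scores and reference scores. The function name `pearson` and the sample values are hypothetical and are not taken from the paper; the snippet illustrates only the metric, not the authors' pipeline or data.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical data (assumption): similarity scores produced by a model
# and reference scores for the same text pairs.
predicted = [0.9, 0.2, 0.6, 0.4]
gold = [4.5, 1.0, 3.2, 2.5]
r = pearson(predicted, gold)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 means the predicted similarities rise and fall with the reference scores. The coefficient is invariant to positive linear rescaling, so the two score lists do not need to share a range.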
Original language: English
Title of host publication: Proceedings of the 7th International Conference on Web Research (ICWR), 19-20 May 2021, Tehran, Iran
Publisher: IEEE
Pages: 182-190
Number of pages: 9
ISBN (Print): 9781665404266
Publication status: Published - 19 May 2021
Event: International Conference on Web Research
Duration: 19 May 2021 → …

Conference

Conference: International Conference on Web Research
Period: 19/05/21 → …

Bibliographical note

Publisher Copyright:
© 2021 IEEE.
