TY - JOUR
T1 - Efficient storage and retrieval of probabilistic latent semantic information for information retrieval
AU - Park, Laurence A. F.
AU - Ramamohanarao, Kotagiri
PY - 2009
Y1 - 2009
N2 - Probabilistic latent semantic analysis (PLSA) is a method for computing term and document relationships from a document set. The probabilistic latent semantic index (PLSI) has been used to store PLSA information, but unfortunately the PLSI uses excessive storage space relative to a simple term frequency index, which causes lengthy query times. To overcome the storage and speed problems of PLSI, we introduce the probabilistic latent semantic thesaurus (PLST), an efficient and effective method of storing the PLSA information. We show that through methods such as document thresholding and term pruning, we are able to maintain the high-precision results found using PLSA while using a very small fraction (0.15%) of the storage space of PLSI.
AB - Probabilistic latent semantic analysis (PLSA) is a method for computing term and document relationships from a document set. The probabilistic latent semantic index (PLSI) has been used to store PLSA information, but unfortunately the PLSI uses excessive storage space relative to a simple term frequency index, which causes lengthy query times. To overcome the storage and speed problems of PLSI, we introduce the probabilistic latent semantic thesaurus (PLST), an efficient and effective method of storing the PLSA information. We show that through methods such as document thresholding and term pruning, we are able to maintain the high-precision results found using PLSA while using a very small fraction (0.15%) of the storage space of PLSI.
KW - information retrieval
KW - search engines
UR - http://handle.uws.edu.au:8081/1959.7/502762
U2 - 10.1007/s00778-008-0093-2
DO - 10.1007/s00778-008-0093-2
M3 - Article
SN - 1066-8888
VL - 18
SP - 141
EP - 155
JO - VLDB Journal
JF - VLDB Journal
IS - 1
ER -