TY - GEN
T1 - The effect on accuracy of tweet sample size for hashtag segmentation dictionary construction
AU - Park, Laurence A. F.
AU - Stone, Glenn
PY - 2016
Y1 - 2016
N2 - Automatic hashtag segmentation is used when analysing Twitter data, to associate hashtag terms with those used in common language. The most common form of hashtag segmentation uses a dictionary with a probability distribution over the dictionary terms, constructed from sample texts specific to the given hashtag domain. The language used on Twitter differs from the common language found in published literature, most likely due to the tweet character limit; therefore, dictionaries constructed to perform hashtag segmentation should be derived from a random sample of tweets. We ask the question: “How large should our sample of tweets be to obtain a given level of segmentation accuracy?” We found that the Jaccard similarity between the correct segmentation and the predicted segmentation using a unigram model follows a zero-one inflated beta distribution with four parameters. We also found that each of these four parameters is a function of the sample size (tweet count) used for dictionary construction, implying that we can compute the Jaccard similarity distribution once the tweet count of the dictionary is known. Having this model allows us to compute the number of tweets required for a given level of hashtag segmentation accuracy, and also allows us to compare other segmentation models to this known distribution.
KW - computational linguistics
KW - data mining
KW - dictionaries
KW - online social networks
UR - http://handle.uws.edu.au:8081/1959.7/uws:35761
UR - http://pakdd16.wordpress.fos.auckland.ac.nz/
DO - 10.1007/978-3-319-31753-3_31
M3 - Conference Paper
SN - 9783319317526
SP - 382
EP - 394
BT - Advances in Knowledge Discovery and Data Mining: 20th Pacific-Asia Conference, PAKDD 2016: Auckland, New Zealand, April 19-22, 2016: Proceedings, Part I
PB - Springer
T2 - Pacific-Asia Conference on Knowledge Discovery and Data Mining
Y2 - 19 April 2016
ER -