The effect on accuracy of tweet sample size for hashtag segmentation dictionary construction

Laurence A. F. Park, Glenn Stone

Research output: Chapter in Book / Conference Paper › Conference Paper › peer-review

Abstract

Automatic hashtag segmentation is used when analysing Twitter data to associate hashtag terms with those used in common language. The most common form of hashtag segmentation uses a dictionary with a probability distribution over the dictionary terms, constructed from sample texts specific to the given hashtag domain. The language used on Twitter differs from the common language found in published literature, most likely due to the tweet character limit; dictionaries constructed to perform hashtag segmentation should therefore be derived from a random sample of tweets. We ask the question “How large should our sample of tweets be to obtain a given level of segmentation accuracy?” We found that the Jaccard similarity between the correct segmentation and the segmentation predicted by a unigram model follows a Zero-One inflated Beta distribution with four parameters. We also found that each of these four parameters is a function of the sample size (tweet count) used for dictionary construction, implying that we can compute the Jaccard similarity distribution once the tweet count of the dictionary is known. Having this model allows us to compute the number of tweets required for a given level of hashtag segmentation accuracy, and also allows us to compare other segmentation models to this known distribution.
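The pipeline the abstract describes (build a unigram dictionary from a tweet sample, segment a hashtag with it, score the result by Jaccard similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the whitespace tokenisation, the per-character penalty for unknown strings (`unk_logp`), and the choice to compute Jaccard similarity over sets of terms are all assumptions made for illustration.

```python
import math
from collections import Counter


def build_unigram_model(tweets):
    """Estimate term probabilities from a tweet sample (assumed: whitespace tokens, lowercased)."""
    counts = Counter(tok for tweet in tweets for tok in tweet.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def segment(hashtag, probs, max_len=20, unk_logp=-20.0):
    """Return the most probable split of a hashtag under a unigram model.

    Dynamic programming over split points. Substrings missing from the
    dictionary receive a hypothetical penalty of unk_logp per character.
    """
    s = hashtag.lstrip("#").lower()
    n = len(s)
    best = [0.0] + [-math.inf] * n  # best[i]: best log-probability of s[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last term in s[:i]
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            word = s[j:i]
            if word in probs:
                logp = math.log(probs[word])
            else:
                logp = unk_logp * len(word)
            if best[j] + logp > best[i]:
                best[i] = best[j] + logp
                back[i] = j
    # Walk the back-pointers from the end to recover the terms.
    terms = []
    i = n
    while i > 0:
        terms.append(s[back[i]:i])
        i = back[i]
    return terms[::-1]


def jaccard(predicted, correct):
    """Jaccard similarity between two segmentations, treated as sets of terms."""
    a, b = set(predicted), set(correct)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    sample = ["data mining is fun", "big data and data mining", "mining big data"]
    model = build_unigram_model(sample)
    predicted = segment("#bigdata", model)
    print(predicted, jaccard(predicted, ["big", "data"]))
```

Repeating this with dictionaries built from tweet samples of different sizes gives one Jaccard score per hashtag per sample size, which is the kind of data the abstract's distributional model describes.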
Original language: English
Title of host publication: Advances in Knowledge Discovery and Data Mining: 20th Pacific-Asia Conference, PAKDD 2016: Auckland, New Zealand, April 19-22, 2016: Proceedings, Part I
Publisher: Springer
Pages: 382-394
Number of pages: 13
ISBN (Print): 9783319317526
Publication status: Published - 2016
Event: Pacific-Asia Conference on Knowledge Discovery and Data Mining
Duration: 19 Apr 2016 → …

Publication series

ISSN (Print): 0302-9743

Conference

Conference: Pacific-Asia Conference on Knowledge Discovery and Data Mining
Period: 19/04/16 → …

Keywords

  • computational linguistics
  • data mining
  • dictionaries
  • online social networks
