TY - JOUR
T1 - A workflow to create trait databases from collections of textual taxonomic descriptions
AU - Coleman, David
AU - Gallagher, Rachael V.
AU - Falster, Daniel
AU - Sauquet, Herve
AU - Wenk, Elizabeth
PY - 2023/12
Y1 - 2023/12
N2 - There is a wealth of information about the characteristics (traits) of organisms within collections of taxonomic descriptions of plants and animals called a ‘Flora’ or ‘Fauna’ of a region. However, such knowledge is usually encoded as text paragraphs, and is thus unavailable for immediate analysis. In order to make use of the knowledge embedded in taxon descriptions, text must be organised into standardised, queryable datasets. Despite the recent development of natural language processing (NLP) tools to analyse taxonomic descriptions to extract trait values, the complexity and specificity of these methods currently limit broad application. Accessible and flexible methods for extracting traits across large numbers of taxonomic descriptions are therefore needed. Here we present such an R-based workflow, which can be adapted for use on any organismal group using a language familiar to researchers in the biological sciences. We document a way to (1) assemble tens of thousands of taxonomic descriptions into a standardised format, (2) split the taxon descriptions into different topics, (3) extract trait values as defined by the user, and (4) assign traits described at the genus and family level to lower-level taxa to maximise trait coverage. As a case study, we apply the workflow to a collection of taxonomic descriptions drawn from Australia's state and national floras and describe useful techniques for creating workflows and thereby research-grade trait datasets. Using this method, we were able to extract 615,812 trait values from 38 different plant traits. Trait data collated using this method are freely available as part of the AusTraits trait database and have already contributed to analyses in several scientific publications.
AB - There is a wealth of information about the characteristics (traits) of organisms within collections of taxonomic descriptions of plants and animals called a ‘Flora’ or ‘Fauna’ of a region. However, such knowledge is usually encoded as text paragraphs, and is thus unavailable for immediate analysis. In order to make use of the knowledge embedded in taxon descriptions, text must be organised into standardised, queryable datasets. Despite the recent development of natural language processing (NLP) tools to analyse taxonomic descriptions to extract trait values, the complexity and specificity of these methods currently limit broad application. Accessible and flexible methods for extracting traits across large numbers of taxonomic descriptions are therefore needed. Here we present such an R-based workflow, which can be adapted for use on any organismal group using a language familiar to researchers in the biological sciences. We document a way to (1) assemble tens of thousands of taxonomic descriptions into a standardised format, (2) split the taxon descriptions into different topics, (3) extract trait values as defined by the user, and (4) assign traits described at the genus and family level to lower-level taxa to maximise trait coverage. As a case study, we apply the workflow to a collection of taxonomic descriptions drawn from Australia's state and national floras and describe useful techniques for creating workflows and thereby research-grade trait datasets. Using this method, we were able to extract 615,812 trait values from 38 different plant traits. Trait data collated using this method are freely available as part of the AusTraits trait database and have already contributed to analyses in several scientific publications.
UR - https://hdl.handle.net/1959.7/uws:72959
U2 - 10.1016/j.ecoinf.2023.102312
DO - 10.1016/j.ecoinf.2023.102312
M3 - Article
SN - 1574-9541
VL - 78
JO - Ecological Informatics
JF - Ecological Informatics
M1 - 102312
ER -