{"paper_id":"e1c8e4a0-cd62-4641-ba59-8365a7eb506e","body_text":"From literature to biodiversity data: mining arthropod organismal \nand ecological traits with machine learning \n \nJoseph Cornelius 1,2, Harald Detering 1,3, Oscar Lithgow-Serrano 1,2, Donat Agosti 4, Fabio Rinaldi 1,2, Robert M. \nWaterhouse*,1,3 \n \n1 SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland \n2 Dalle Molle Institute for Artificial Intelligence Research, Scuola Universitaria Professionale della Svizzera Italiana, Università della \nSvizzera Italiana, Lugano, Switzerland \n3 Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland \n4 Plazi, Bern, Switzerland \n* Corresponding author: robert.waterhouse@sib.swiss \n \nAbstract \nThe fields of taxonomy and biodiversity research have witnessed an exponential growth in published \nliterature. This vast corpus of articles holds information on the diverse biological traits of organisms \nand their ecologies. However, access to and extraction of relevant data from this extensive resource \nremain challenging. Advances in text and data mining (TDM) and Natural Language Processing \n(NLP) techniques offer new opportunities for liberating such information from the literature. Testing \nand using such approaches to annotate articles in machine actionable formats is therefore \nnecessary to enable the exploitation of existing knowledge in new biology, ecology, and evolution \nresearch. Here we explore the potential of these methods to annotate and extract organismal and \necological trait data for the most diverse animal group on Earth, the arthropods. The article \nprocessing workflow uses manually curated trait dictionaries with trained NLP models to perform \nlabelling of entities and relationships of thousands of articles. 
A subset of manually annotated \ndocuments facilitated the formal evaluation of the performance of the workflow in terms of entity \nrecognition and normalisation, and relationship extraction, highlighting several important technical \nchallenges. The results are made available to the scientific community through an interactive web \ntool and queryable resource, the ArTraDB Arthropod Trait Database. These methodological \nexplorations provide a framework that could be extended beyond the arthropods, where TDM and \nNLP approaches applied to the taxonomy and biodiversity literature will greatly facilitate data \nsynthesis studies and literature reviews, the identification of knowledge gaps and biases, as well as \nthe data-informed investigation of ecological and evolutionary trends and patterns. \n \nKeywords \nArthropods; Biodiversity; Natural Language Processing; Text and Data Mining; Trait Database \n \nbioRxiv preprint doi: https://doi.org/10.1101/2025.02.18.638830; this version posted February 23, 2025. The copyright holder for this preprint \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \navailable under a CC-BY 4.0 International license. \n \nIntroduction \n \nThe existing detailed knowledge on biodiversity and the natural world is contained largely in the \nform of an extensive and growing corpus of scientific publications (McCallen et al. 2019). This \nknowledge harbours important insights to better understand the dynamics and dimensions of \nmajor challenges facing our planet today, such as the global biodiversity crisis and the impact of \nclimate change on the distribution of species. Large portions of that information have been \ndifficult to access because they are unstructured, in printed formats or Portable Document \nFormat (PDF) files that are difficult to process by machine, or behind paywalled \naccess. 
With increasing digitisation of the scientific publishing process, and thanks to \ncomprehensive digitisation efforts of natural history collections (Hedrick et al. 2020), an \nincreasing number of documents have become digitally accessible. For example, \nPubMedCentral (PMC) alone contains millions of machine actionable articles (Rosonovski et al. \n2023), including millions of supplementary data files, and tens of millions of abstracts are \naccessible through PubMed. These documents can be used for text and data mining (TDM) and \nnatural language processing (NLP) to better annotate the literature, offering opportunities to \nliberate data and knowledge from publications. However, even though such literature mining \napproaches are recognised as important tools in biology, ecology, and evolution research, their \npotential is currently far from fully realised (Farrell et al. 2022, 2024). \n \nTo focus the methodological explorations of these opportunities, aiming to annotate and extract \norganismal and ecological trait data for the most numerous and diverse animal group on Earth \nwith estimates of 6.8 million terrestrial species (Stork 2018), the phylum Arthropoda represents \nan excellent case study. Arthropods have fascinated researchers and amateur entomologists for \ncenturies, leading to a vast accumulation of knowledge about how countless evolutionary \nadaptations have enabled them to exploit so many ecological niches (Grimaldi and Engel 2005). \nThe most extensive knowledge is often biased towards the more charismatic species and those \nthat serve as models in research, or that impact human health and agriculture. The decades of \naccrued learning also present substantial variability in terms of what types of trait data have \nbeen collected and with which methodologies (Wong et al. 2019). 
More recently, this biological \nknowledge is being extended through the acquisition of increasing amounts of genomics data to \nexplore the genetics underlying these traits (Feron and Waterhouse 2022b). The motivation is to \nunderstand how genetic and genomic changes relate to observable phenotypic differences, i.e. \ntraits, amongst species. Thanks to ongoing developments in bioinformatics, comparative \ngenomics analyses are generally scalable to increasingly larger datasets, taking advantage of \nrapidly accumulating numbers of species with sequenced genomes (Feron and Waterhouse \n2022a). However, this is not being matched by equivalent advances in the cataloguing of \nspecies traits for which manual collection and curation efforts cannot keep pace with the needs \nto access trait data for large-scale quantitative analyses. Literature mining offers potential \nsolutions to overcoming these challenges, for example, building a database of insect egg size \nand shape for more than 6’700 species relied on information extracted from 1’756 publications \n(Church et al. 2019), cataloguing traits of 12’448 butterfly and moth species involved extracting \ninformation from 117 field guides and species accounts (Shirey et al. 2022), and compiling an \nexpert-curated trait database of 520 subterranean spiders examined 255 taxonomic descriptions \nfrom the World Spider Catalog and the Spiders of Europe repository (Mammola et al. 2022). 
\nTherefore, efforts to develop systematic approaches for mining the literature to build \ncomprehensive open databases of species’ organismal and ecological traits should increasingly \nprovide researchers with direct access to much-needed large-scale biodiversity data. \n \nHere, we present a TDM and NLP framework for the automated labelling of identified arthropod \nspecies (taxa), their organismal and ecological traits, and the associated trait values in \ntaxonomic and biodiversity research articles. We focus on PMC articles containing taxonomic \ntreatments of arthropods, i.e. structured sections of publications that describe and define the \nname and features of species, leveraging the large resource of Plazi’s TreatmentBank (Guidoti \net al. 2021). Using manually curated trait dictionaries, as well as subsets of manually annotated \narticles, we trained NLP models to perform labelling of entities (taxa, traits, values) and \nrelationships (taxon to trait, trait to value) of thousands of articles. We formally evaluated the \nperformance of our approaches, demonstrating their application to 2’000 publications, which \nproduced 656’403 entity and 339’463 relationship annotations, and highlighting several \nimportant technical challenges. Finally, we developed an interactive web tool that makes the \nresults available to the scientific community in the form of the queryable database resource, the \nArTraDB Arthropod Trait Database. Together, these tools and resources serve to advance the \nuse of literature mining approaches in biology, ecology, and evolution research, by semi-\nautomating the building of comprehensive open databases of organismal and ecological traits \nextracted from the literature. \n \n \n
Materials & Methods \n \nSourcing and Processing the Text Corpora \n \nArticles Sourced from PubMedCentral via TreatmentBank \nTo take advantage of the normally highly structured and detailed species information found in \ntaxonomic treatment texts, and at the same time to reduce the complexity of the overall labelling \nprocess, the initial corpus of texts was defined using all arthropod species taxonomic treatment \ntexts available from Plazi’s TreatmentBank (Guidoti et al. 2021). Taxonomic treatments refer to \nsections in scientific publications where the key features describing, distinguishing, and naming \na species are documented (Agosti et al. 2022). Treatments have been the building blocks of \nhow data about taxa are provided ever since the beginning of modern taxonomy, and usually \nfollow a highly structured format. This simplifies the first task of labelling taxa (species) because \neach text already pre-processed by Plazi is directly linked to a known species, meaning at least \none annotated taxon in the document should match the linked species name provided by \nTreatmentBank. From the ~310’000 treatment texts sourced from Plazi’s TreatmentBank, \n~250’000 were linked to digital object identifiers (DOIs) comprising ~24’000 unique publications, \n3’650 of which were linked with PubMedCentral (PMC) identifiers and thereby presented \npublicly accessible texts that could be used for labelling and subsequent mining. Note that \npublications may, and often do, contain many treatments, i.e. descriptions of many species, \nhence the tenfold higher number of treatments compared to the number of publications. 
\n \nProcessing of PubMedCentral Articles \nThe PMC article files were retrieved in Extensible Markup Language (XML) format and \nsubsequently transformed into plain text format, maintaining the original text extraction without \nfurther modifications such as lowercasing. For the Named Entity Recognition (NER) task, these \nPMC text files were subsequently converted into the CoNLL format (Kim Sang and De Meulder \n2003) (a text file with one word per line with sentences separated by an empty line) using the \nIOB2 tagging scheme (Inside–Outside–Beginning) (Ramshaw and Marcus 1999). For the \nRelationship Extraction (RE) task, the same PMC text files were processed into a specialised \nJSON (JavaScript Object Notation) file format compatible with the “Language Understanding \nwith Knowledge-based Embeddings” model (LUKE) (Yamada et al. 2020). This format splits the \ntext up in context windows, which by default encompass six sentences, along with the offsets \nand labels for both the head and tail entities. \n \n \nSourcing and Processing Taxonomy and Trait Data \n \nTaxonomy Data Sourced from the Catalogue of Life \nThe Catalogue of Life (COL) represents an authoritative source of taxonomic data built and \nmaintained through a long-term international collaboration of taxonomists and informaticians \n(Bánki et al. 2024). The COL was therefore selected as the reference taxonomy dataset for \nbuilding a dictionary of arthropod taxa and for filtering the treatments to select only arthropod \nspecies for downstream processing. 
The monthly COL release of July 2022 (COL Version: \n2022-06-23) was processed to extract all accepted taxa (dwc:taxonomicStatus == ‘accepted’) \nthat are hierarchically below Arthropoda (dwc:taxonID == ‘RT’). The dictionary of arthropod taxa \ntherefore contains all accepted arthropod species names along with the taxonomic lineage \nnames ascending the species-genus-family-order-class hierarchy up to the phylum level of \nArthropoda. This resulted in a dictionary containing a total of 1’015’642 species and 118’008 \nhigher-level taxonomic names, for use as the input for downstream NER steps to label taxa \nidentified in the processed documents. The COL processing scripts are available as part of the \nATResourceManager Snakemake workflow (https://github.com/IDSIA-\nNLP/ATResourceManager). \n \nOrganismal and Ecological Trait Data \nNo single, comprehensive, standardised, and machine operable ontology of organismal and \necological traits was available to use to build a dictionary of arthropod trait data. Therefore, \nextensive manual curation of traits defined across several different resources was required to be \nas comprehensive as possible while leveraging existing resources and standards. Trait libraries \nwere developed for three broad categories covering arthropod feeding ecology, habitat, and \nmorphology, always requiring that the trait was defined and/or described in an existing online \nresource. The resources queried included: the Encyclopedia of Life (EOL) (Parr et al. 2014); the \nEnvironment Ontology (ENVO) (Buttigieg et al. 2016); the Relation Ontology (RO) (Mungall et \nal. 2023); the UBERON Anatomy Ontology (Mungall et al. 2012); the BRENDA Tissue Ontology \n(BTO) (Chang et al. 2021); as well as Wikidata, Wikipedia, and Wiktionary for additional relevant \nterms with Uniform Resource Identifiers (URIs) that were not incorporated into a formal \nontology. Trait names and definitions were inherited from the source ontologies/URIs. 
Traits \nwere classified into types: “yes/no”, a taxon exhibits or does not exhibit the trait; “association”, \none taxon is associated with another taxon through the trait; “measurement” for mass-related \ntraits; “length/width” for measurable body parts; “count” for countable body parts. Synonyms of \ntrait names were automatically generated by scraping synonyms from Synonyms.com and \nenhanced with word vectors from PubMed and the Common Core to obtain related terms. This \napproach was refined by implementing an improved search for synonyms, creating a new table \nthat included only those synonyms appearing at least ten times in the 5’000 articles from the \nZooKeys journal that were available in PMC. This method was devised to capture real words \ncommonly used in taxonomic treatments, managing to generate synonyms for most terms. \nHowever, the synonyms contained instances of inaccurate or inappropriate terms so manual \ncuration was applied to reduce redundancy and improve informativeness of the alternative \nphrasings including pluralisations. This resulted in a dictionary containing a total of 390 traits (81 \nfeeding ecology; 184 habitat; 125 morphology, Supplementary File S1), to be used as the input \nfor downstream NER steps to label traits identified in the processed documents. \n \n \nCurating Gold-Standard Annotation Data \n \nTo fine-tune and formally evaluate the performance of the NLP models employed for the NER \nand RE tasks (see below), two entomology domain experts annotated a set of 25 articles \nrandomly selected from the 3’650 obtained from PMC. 
The two annotators employed the tagtog \ntext annotation tool (Cejuela et al. 2014) that provides a user-friendly interface to manually \nannotate and normalise entities in documents imported from PMC, as well as to add entity \nlabels, relationships, and more. The annotators worked independently on their assigned \ndocuments to avoid biasing each other; however, they did develop a set of guidelines during the \nannotation process to describe the steps to follow for dealing with some more complex cases \n(Supplementary File S2). For example, in addition to the required entities “taxon”, “trait”, and \n“value”, and the required relationships “has_trait” and “has_value”, the entity “qualifier” and \nrelationship “has_qualifier” were created in order to label and link taxonomic qualifiers to \nlabelled taxa entities such as “female”, “male”, “juvenile”, or “larva”. To be able to compare \nannotator styles, a subset of five documents was independently annotated by both experts. The \ncompleted annotations of the 25 documents were exported from tagtog. Each document, \noriginally in tagtog’s native HTML format accompanied by annotations in JSON format, was \nconverted into a BioC JSON file (Comeau et al. 2013). This format serves as the base for all \nsubsequent processes. The documents annotated by both experts were used to calculate the \ninter-annotator agreement by means of Cohen’s Kappa score (Cohen 1960). 
The agreement \nwas evaluated by comparing exact and partial matches of named entities, as well as exact \nmatches of relationships within a tolerance of 4 characters for the entity offset boundary in these \ndocuments. To be able to formally evaluate the performance of the NLP models the gold-\nstandard annotation data were randomly split into a training (TRAIN-GOLD) and an evaluation \n(TEST-GOLD) dataset (Supplementary File S3). The TRAIN-GOLD dataset contained four \ndocuments (812 entities, 641 relationships) and the TEST-GOLD dataset contained 21 \ndocuments (4’272 entities, 3’439 relationships), corresponding to a 1’453/7’711 (18.8%) ratio \nbetween TRAIN-GOLD and TEST-GOLD. The data split was deliberately chosen, opting for a \nlarger TEST-GOLD dataset to ensure it is statistically robust for valid analysis. Simultaneously, \nthe TRAIN-GOLD dataset was intentionally kept small enough to train NER and RE models \nunder a low-resource setup. \n \n \nNatural Language Processing for Entity and Relationship Annotation \n \nFor automatic identification of arthropod trait value triples, a two-step pipeline was developed: \nFirst, entities (arthropods, traits, and values) are recognised (identified in the texts) and \nnormalised (mapped to dictionaries of known entities, if possible) and then the underlying \nrelationships between arthropods and traits and between traits and values are extracted \n(annotated) where possible. Different entity recognition systems were therefore implemented to \nautomatically recognise arthropods (taxa), traits, and values, and then normalise the recognised \nentities according to a set of terminologies. To identify the relationships between the recognised \nentities, a relationship extraction model was applied where each relationship connects exactly \ntwo entities and the relationship type is determined by the types of connected entities. 
\n \nNamed Entity Recognition (NER):  The NER steps employed and tested several Bidirectional \nEncoder Representations from Transformers (BERT)-based models, namely BERT-large-\nuncased (Devlin et al. 2019), RoBERTa-large (Liu et al. 2019), and the domain-adapted \nBioBERT (Lee et al. 2020) model initialised on the BERT model and further pre-trained on \nPubMed abstracts and PMC complete articles. For NER of value entities, pre-trained models \nwere used from SpaCy (Montani et al. 2023) and quantulum3 (Mündler 2024), which are \nlibraries that are specialised in flexible matching of measurements and their entities. \n \nNamed Entity Normalisation (Linking) (NEN):  To perform NEN, the OntoGene BioMedical Entity \nRecogniser (OGER) was used, which offers a set of tools for text mining and information \nextraction (Basaldella et al. 2017, Furrer et al. 2022). The normalisation (or linkage) of entities is \nachieved by flexible matching of the recognised entities with the curated arthropod and trait \ndictionaries described above. \n \nRelationship Extraction (RE):  The Transformer-based LUKE (Yamada et al. 2020) model was \nused for the RE task. The LUKE model takes a text string along with the offsets of a head and \ntail entity to perform classification according to a set of relationship labels. \n \n \n \nTechnical Specifications of the ArTraDB Web Resource \n \nArTraDB is a web application designed and built to present the predicted annotations to the \nscientific community. 
The following technology stack was used to build the resource: data are \nstored in a Neo4J database (https://neo4j.com/) and made available through a backend \nApplication Programming Interface (API) based on express.js (https://expressjs.com); the \nfrontend was built using Vue.js (https://vuejs.org) and node.js (https://nodejs.org); for \nvisualisation of annotated documents the TextAE annotation editor \n(https://textae.pubannotation.org) (Lever et al. 2020) was integrated into the web application. \n \n \nResults \n \nA Workflow for Annotating Arthropod Organismal and Ecological Traits \n \nThe analytical workflow for processing and annotating thousands of articles to identify \norganismal and ecological traits of arthropods (Figure 1) consists of several key data \npreparation steps (ATResourceManager and Domain Expert curation) and model training \nprocedures (ATTrainer), in order to subsequently perform the text mining tasks (ATMiner) to \nproduce the predictions and upload them for viewing in ArTraDB. Firstly, the domain expert \ncuration tasks resulted in two key outputs: the Gold Standard Annotations and the Curated Trait \nVocabularies. A set of selected documents was manually annotated by domain experts to \nprovide a resource for downstream training and for assessing the performance of the text \nmining tasks (see Methods). The domain experts also built curated trait vocabularies (including \nsynonyms) covering the three categories of feeding ecology (n=81), habitat (n=184), and \nmorphology (n=125), based on combinations of existing ontologies and online resources (see \nMethods). 
In parallel, the ATResourceManager preparation steps were developed to: (1) \nprocess the taxonomic treatment documents from Plazi and retrieve the corresponding \npublications from PMC; (2) extract from the Catalogue of Life taxonomy all accepted arthropod \nspecies and their higher-level taxonomic names; and (3) extract from the Encyclopedia of Life \ntraits database all available taxon-trait annotations for arthropods (see Methods for details). \nSubsequently, the ATTrainer language model training steps take as input the Gold Standard \nAnnotations (TRAIN-GOLD subset) for the fine-tuning of the BioBERT model and for the training \nof the LUKE model (see Methods). These models are then used in the ATMiner tasks for \nNamed Entity Recognition (NER) with BioBERT and Relationship Extraction (RE) with LUKE, \nalso using the curated trait vocabularies to perform entity normalisations using OGER (see \nMethods). The resulting predicted annotations - the entities of arthropods, traits, and values - \nand the arthropod-trait and trait-value relationships were then imported into the ArTraDB web \nresource where they can be reviewed by the community. \n \n
Figure 1: The arthropod organismal and ecological traits annotation workflow. \nThe workflow starts with curation performed by domain experts resulting in entity and relationship annotations for a selected \nsubset of publications as well as curated vocabularies of sets of organismal and ecological traits. The ATResourceManager \nsteps include the processing of data sourced from the Catalogue of Life (taxonomy) and the Encyclopedia of Life (arthropod-trait \nrelationships) to generate taxon and trait dictionaries, as well as the retrieval of publications for processing from PubMed Central \nbased on the selection of all Plazi TreatmentBank records for arthropods. The expert-generated Gold Standard Annotations are \nused as input to train (ATTrainer steps) Natural Language Processing (NLP) models for the Named Entity Recognition (NER) \nand Relationship Extraction (RE) tasks, with the trait vocabularies and taxa dictionaries being used for entity normalisation \n(ATMiner steps). Finally, the predicted annotations are made available to the scientific community via the ArTraDB web resource \nwhere community curators could potentially provide corrections to the annotations that can later be used for refinement of the \nNER and RE models (dotted lines). \n \n \n
Entity and Relationship Annotation of PubMedCentral Articles \n \nAnnotation Results for Entity and Relationship Discovery \n \nThe application of the workflow presented in Figure 1 to a total of 2’000 publications sourced \nfrom PMC resulted in the annotation of 656’403 entities (arthropods, traits, and values) and \n339’463 relationships (hasTrait, hasValue), summarised in Figure 2. The PMC articles range in \nlength from 173 to 27’466 characters with a median of 15’720 and an interquartile range from \n11’452 to 20’506 (Figure 2A). The densities of entity and relationship predictions are highest for \nentities of type “value” and hasValue relationships, with medians of ~10 annotations per 1’000 \ncharacters; arthropod and trait entities have median densities of four and six annotations per \n1’000 characters, respectively; and hasTrait relationships have the lowest density, with a median \nof zero annotations per 1’000 characters (Figure 2B). In contrast, the 25 articles comprising the \ngold standard annotation dataset show median densities of 4.9, 6.4, and 6.1 annotations per \n1’000 characters for arthropod, trait, and value entities, respectively, and the hasTrait and \nhasValue relationships show medians of 6.4 and 6.1 annotations, respectively (Figure 2C). \nThese manually annotated documents (17 of which were complete articles and eight only \nabstracts) contained in total 4’990 named entities (1’069 arthropods, 2’078 traits, and 1’843 \nvalues) and 3’628 relationships (1’777 hasTrait and 1’851 hasValue). For the predicted \nannotations, the total numbers of entities and relationships are generally higher in longer \ndocuments, reaching over 900 and over 600 annotations, respectively (Figure 2D). 
This trend is \nreplicated when considering each entity and relationship subtype separately, with the largest \nnumbers of annotations identified in some of the longest documents, reaching maxima of 456, \n393, and 669 for arthropods, traits, and values, and 100 and 717 for hasTrait and hasValue, \nrespectively (Figure 2E). At minimum, a publication should contain one taxonomic treatment \ndescribing a single arthropod species, e.g. “Pachybrachis sassii, a new species from the \nMediterranean Giglio Island (Italy) (Coleoptera, Chrysomelidae, Cryptocephalinae)” (Montagna \n2011) (length: 13’550 characters; annotated entities: 63 arthropod, 79 trait, 149 value). \nHowever, the much longer articles generally describe a whole group of species for a particular \nregion, e.g. “The dipteran family Celyphidae in the New World, with discussion of and key to \nworld genera (Insecta, Diptera)” (length: 27’381 characters; annotated entities: 172 arthropod, \n141 trait, 246 value) contains 92 taxonomic treatments (Gaimari 2017). \n \nFigure 2: Distributions of PMC article properties and the resulting entity and relationship annotations. \n(A) The input dataset consisted of 2’000 PMC articles exhibiting a broad character (chars) length distribution. The relative \nnumber of resulting predictions of annotated entities (arthropods, traits, values) and relationships (taxon to trait - hasTrait, trait to \nvalue - hasValue) are shown for the whole dataset (B) and for the 25 gold-standard manually annotated documents (C). The \nabsolute number of predicted entity and relationship annotations compared to the document lengths in characters is shown for \nannotation types (D) and subtypes (E). 
Box plots in panels A, B, and C show the median, first and third quartiles, and lower and \nupper extremes of the distribution (1.5 × Interquartile range). \n \n \nAssessing the Complexity of the Task by Examining Inter-Annotator Agreement \n \nTo begin to interpret the prediction results from the workflow it is important to understand the\ncomplexity of the annotation task itself, insights into which can be gained by examining the\nlevels of agreement between the curated annotations generated by the two domain experts.\nThe five documents that were annotated by both domain experts included two complete\npublications and three abstract-only articles (Figure 3). In total for these five documents, the two\nexperts annotated 1’477 named entities (161 arthropods, 764 traits, 552 values), with annotator\n1 identifying 80 arthropods, 416 traits, and 334 values, and with annotator 2 identifying 81\narthropods, 348 traits, and 218 values. 
They also annotated a total of 1’094 relationships (553 hasTrait, 541 hasValue), with annotator 1 identifying 343 hasTrait and 343 hasValue relationships, and with annotator 2 identifying 210 hasTrait and 198 hasValue relationships. Cohen’s kappa is used to measure inter-annotator reliability, or concordance, to assess the degree of agreement amongst independent observers who rate, label, or classify the same phenomenon, with values below 0.6 generally indicating inadequate agreement (McHugh 2012). Amongst the five documents curated by both annotators, Cohen’s kappa values reflect varying levels of inter-annotator agreement for entities (Figure 3B), from poor agreement (~0.35), to moderate agreement (~0.5), to substantial agreement (~0.8), with lower agreement levels for relationships (Figure 3C). While Cohen’s kappa scores provide a standardised measure of agreement, their values must be carefully interpreted within the study’s context, where they serve to highlight the complexity of the annotation tasks. \n \nFigure 3: Inter-annotator agreement of five documents annotated by both experts. \n(A) The bars show document lengths in characters (chars). (B) The bars show the level of annotation agreement for entities (taxon, trait, or value) between the two annotators as measured using Cohen’s kappa. (C) The bars show the level of annotation agreement for relations (hasTrait or hasValue) between the two annotators as measured using Cohen’s kappa. 
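
The agreement measure described above can be sketched in a few lines. This is a minimal illustration, assuming that the two annotators' labels have already been aligned unit by unit (for example one label per token); the text does not state how annotation units were aligned, so that alignment and the toy label lists below are hypothetical.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    # Cohen's kappa for two equally long sequences of categorical labels.
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError('label sequences must be non-empty and of equal length')
    n = len(labels_a)
    # Observed agreement: fraction of units given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[lab] * freq_b[lab] for lab in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical token-level labels from two annotators ('O' = not an entity).
annotator_1 = ['Trait', 'Trait', 'Value', 'O', 'Arthropod', 'O', 'O', 'Value']
annotator_2 = ['Trait', 'O', 'Value', 'O', 'Arthropod', 'O', 'Trait', 'Value']
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 2))  # roughly 0.65 for this toy example
```

In this toy case the annotators agree on six of eight units (0.75), but kappa is lower because part of that agreement is expected by chance.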
\n \nAssessing Entity Normalisation with the Taxon and Trait Dictionaries \n \nEntity normalisation, or linking, which was performed using OGER, is the process of matching the entities that were annotated in the articles with the dictionaries of arthropods (taxa from the Catalogue of Life) and traits (the collated sets of feeding ecology, habitat, and morphology traits), with the goal of assigning to each labelled entity a unique identifier from one of the input resources. It is important to understand the performance of the normalisation task in order to interpret the quality of the entity prediction results from the workflow. The taxon dictionary contained a total of 1’015’642 arthropod species and 118’008 higher-level taxonomic names and the trait dictionary contained a total of 390 traits: 81 feeding ecology; 184 habitat; 125 morphology (see Methods). Focusing on the identification and quantification of taxon and trait entities within the article corpus, the coverage and frequency of entities mapped to the predefined dictionaries and those that could not be mapped provide an assessment of the entity normalisation process and the comprehensiveness and relevance of the dictionaries (Figure 4). \n \nAcross the 2’000 articles processed by the workflow, a total of 128’149 taxon entities were annotated with 63% matching terms in the dictionary (mapped to Concept IDs), comprising 24’207 species and 56’532 higher-level taxonomic names (Figure 4A). Notably, taxa such as the order Hymenoptera (sawflies, wasps, bees, and ants), or the genus Tipula (crane flies), were amongst the most frequently annotated entities, with 388 and 384 occurrences, respectively. 
Reviewing some of the 47’312 non-mapped taxon entities revealed examples of correctly identified arthropod taxa which nevertheless are not included in the Catalogue of Life and are therefore not in the dictionary, e.g. the genera Micrencaustes (beetles) and Deinodryinus (parasitoid wasps). Of the 199’276 (28’348 unique) trait entities annotated, 12.7% were successfully mapped to the trait dictionary (linked to Concept IDs), with 1’366, 85, and 23’816 entities mapping to feeding ecology, habitat, and morphology terms, respectively (Figure 4A). Note that the NER task labels entities as taxa, traits, or values; further categorisation of traits into feeding ecology, habitat, and morphology terms is only possible when entity normalisation is successful. Feeding and morphology traits such as “host” (273), “mouth” (16), and “legs” (1’663) were amongst the most prevalent annotated entities. Many annotated trait entities could not be mapped (168’237), one of the most frequent being “distribution”, with 2’506 occurrences. Considering instead the numbers of unique terms in the dictionaries that could be annotated and linked in the articles, only 14’243 of the 1.2 million arthropods in the dictionary (1.2%) were identified, and from the trait dictionary of feeding ecology, habitat, and morphology terms, 28.4%, 3.3%, and 60%, respectively, were annotated and linked (Figure 4B). 
Failure to identify dictionary terms in the annotated articles may be because the terms are simply not present (it cannot be expected that a small subset of 2’000 articles will contain mentions of all 1.2 million described arthropod species), or because normalisation was unable to link terms and phrases recognised as entities in the texts with the terms, phrases, and synonyms that make up the concepts of the dictionaries. \n \nFigure 4: Taxon and trait dictionaries compared with annotated entities. \nFor the 2’000 PMC articles analysed: the proportions of all annotated entities that could be mapped to the corresponding “Concept IDs” of the taxon and trait dictionaries (A), and the proportions of taxon and trait dictionary terms that were matched with annotated entities in any article (B). In (A) ‘m’ represents the total numbers of taxon and trait entities and ‘k’ indicates how many of these were mapped to Concept IDs in the dictionary termlists (for trait entities divided into feeding ecology, habitat, and morphology). In (B) ‘n’ represents the total numbers of taxa and traits in the dictionary termlists and ‘l’ indicates how many of these were matched in the articles. 
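
The two proportions summarised in Figure 4 (the share of annotated entities linked to a Concept ID, and the share of dictionary concepts found in any article) can be illustrated with a small sketch. It assumes a plain case-insensitive exact lookup, which is much simpler than the OGER-based linking used in the workflow; the termlist, the Concept IDs, and the function name are hypothetical.

```python
def normalisation_coverage(mentions, termlist):
    # mentions: annotated entity strings; termlist: dict of term or synonym -> Concept ID.
    lookup = {term.lower(): concept_id for term, concept_id in termlist.items()}
    linked = []
    for mention in mentions:
        concept_id = lookup.get(mention.strip().lower())
        if concept_id is not None:
            linked.append(concept_id)
    # Share of annotated mentions that could be linked (cf. panel A).
    mention_rate = len(linked) / len(mentions) if mentions else 0.0
    # Share of dictionary concepts matched by at least one mention (cf. panel B).
    concepts = set(lookup.values())
    concept_rate = len(set(linked)) / len(concepts) if concepts else 0.0
    return mention_rate, concept_rate

# Hypothetical trait termlist; synonyms share one Concept ID.
termlist = {'legs': 'T1', 'leg': 'T1', 'host': 'T2', 'mouth': 'T3', 'cave': 'T4'}
mentions = ['Legs', 'host', 'distribution', 'legs']
mention_rate, concept_rate = normalisation_coverage(mentions, termlist)
print(mention_rate, concept_rate)
```

The two rates answer different questions: the first reflects how well the dictionary covers what the articles mention, the second how much of the dictionary is actually used by the corpus.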
\n \nPerformance Comparisons of Natural Language Processing Models \n \n \nNamed Entity Recognition Baseline Performance \n \nAn evaluation of a Named Entity Recognition (NER) baseline was conducted across various configurations. Several general and domain-specific pre-trained language models were fine-tuned on the TRAIN-GOLD dataset. To train the models, the dataset was converted to IOB2 format. Two evaluation methods were employed for the results presented in Figure 5: the Conference on Natural Language Learning (CoNLL) evaluation and strict metrics. The reported results are based on the F1-score (F) and corresponding Precision (P) and Recall (R). Under the CoNLL evaluation, the baseline demonstrated a macro-average of 0.56 F (0.55 P / 0.57 R) and a weighted-average of 0.52 F (0.53 P / 0.53 R) across all entity types. Notably, entities classified as ‘Arthropod’ achieved the highest F1-score at 0.74 F (0.7 P / 0.78 R), signifying superior recognition capabilities in comparison to other categories. Conversely, ‘Value’ entities posed greater challenges, with the lowest score of 0.37 F (0.32 P / 0.43 R). This indicates substantial difficulties in the precise identification of these entities. ‘Value’ entities encompassed a diverse array of concepts, ranging from measurements (e.g., ‘56.6 mm’) and colour descriptors (e.g., ‘brownish-yellow’) to locations (e.g., ‘China’). This disparity highlights the model’s varied performance across different entity types. When evaluated using the strict metric, a notable enhancement in both precision and F1-scores was observed for most entity types, compared to the CoNLL evaluation metric (Figure 5). ‘Arthropod’ entities maintained the highest score of 0.78 F (0.78 P / 0.78 R), consistent with the previous evaluation. 
The overall macro- and weighted-average scores increased to 0.59 F (0.63 P / 0.57 R) and 0.56 F (0.6 P / 0.52 R), respectively, indicating that the baseline scores higher when the strict metric is applied. This comparison not only underscores the baseline’s strengths and weaknesses in recognising various entities but also highlights the impact of evaluation criteria on perceived performance. \n \nFigure 5: CoNLL evaluation and strict F1-score baseline results for the named entity recognition. \nThe F1-score performance of the baseline Named Entity Recognition (NER) model on the test set of PubMedCentral (PMC) articles using the Conference on Natural Language Learning (CoNLL) evaluation (conlleval) and strict methods. The scores are shown explicitly for three entity types: Arthropod, Trait, and Value. Additionally, for the three types combined, the macro-average F1-score, which does not account for class imbalances, and the weighted average F1-score, which adjusts for the imbalance of different classes, are also presented. A detailed listing of all baseline results can be found in Supplementary File S4. 
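
To make the IOB2 conversion and span-level scoring concrete, the sketch below tags a toy sentence and scores a prediction by exact span matching. It assumes that an entity counts as correct only when both its boundaries and its type match the gold annotation, which is one common reading of a strict metric; the exact scoring rules used for Figure 5 are not spelled out here, and the tokens, spans, and function names are hypothetical.

```python
def spans_to_iob2(tokens, spans):
    # spans: (start_token, end_token_exclusive, label) tuples; returns one IOB2 tag per token.
    tags = ['O'] * len(tokens)
    for start, end, label in spans:
        tags[start] = 'B-' + label
        for i in range(start + 1, end):
            tags[i] = 'I-' + label
    return tags

def iob2_to_spans(tags):
    # Recover the set of (start, end_exclusive, label) spans from IOB2 tags.
    # An I- tag that does not continue an open span of the same label is ignored.
    spans = set()
    start, label = None, None
    for i, tag in enumerate(list(tags) + ['O']):  # trailing 'O' closes the last span
        continues = tag.startswith('I-') and label == tag[2:]
        if not continues:
            if label is not None:
                spans.add((start, i, label))
            start, label = (i, tag[2:]) if tag.startswith('B-') else (None, None)
    return spans

def strict_prf(gold_tags, pred_tags):
    # Precision, recall and F1 where a predicted span must match boundaries and type exactly.
    gold, pred = iob2_to_spans(gold_tags), iob2_to_spans(pred_tags)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

tokens = ['Tipula', 'legs', 'brownish-yellow', ',', 'length', '56.6', 'mm']
gold_spans = [(0, 1, 'Arthropod'), (1, 2, 'Trait'), (2, 3, 'Value'),
              (4, 5, 'Trait'), (5, 7, 'Value')]
gold_tags = spans_to_iob2(tokens, gold_spans)
# Hypothetical prediction that cuts the measurement short ('56.6' without 'mm').
pred_tags = gold_tags[:-1] + ['O']
precision, recall, f1 = strict_prf(gold_tags, pred_tags)
print(gold_tags)
print(precision, recall, f1)
```

The truncated ‘Value’ span earns no credit under exact matching, which illustrates why entities with loosely defined boundaries, such as values, are penalised most by boundary-sensitive metrics.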
\n \nRelationship Extraction Baseline Performance \n \nFigure 6 outlines the outcomes of the Relationship Extraction (RE) baseline across three different configurations of the LUKE model, namely “NCB” (None-Class Balanced), which limits the number of ‘none’ relationships during training to match the majority class, “Tag”, which uses XML to tag the entities inline, and “Long-range”, which captures long-range relationships via a different training setup. By default the LUKE model was used with a shifting context window spanning 1 to 6 consecutive sentences to detect relationships. In contrast, for the Long-range approach a version of the LUKE model was fine-tuned by extracting and merging the two target entities with their 500 surrounding characters each. The NCB approach, even after balancing the frequency of ‘none’ classes with the most common relationship (‘hasValue’), continued to face challenges in accurately identifying specific relationships like ‘hasTrait’. This indicates persistent difficulties in detecting nuanced or less common relationships. Furthermore, applying the Tag approach in addition to the NCB approach improved the macro-average score of the RE baseline from 0.57 F (0.62 P / 0.66 R) for the standard NCB configuration to 0.65 F (0.66 P / 0.69 R). This suggests that entity tag information contributes positively to relationship extraction performance. 
Using NCB and Tag combined with the Long-range RE setup demonstrated an interesting pattern with ‘hasValue’ relationships, where a perfect recall (1.00) but very low precision (0.02) resulted in a low F1-score (0.03). This indicates the model’s tendency to over-identify ‘hasValue’ instances, leading to numerous false positives. In this configuration the ‘hasTrait’ relationship scored 0.2 F (0.13 P / 0.52 R) and the ‘none’ class 0.92 F (1.0 P / 0.86 R), giving a macro-average of 0.38 F (0.38 P / 0.79 R) and making it the worst-performing baseline. \n \nFigure 6: NONE-class balanced (NCB), entity tagged (Tag), and long range baseline results for relationship extraction. \nThe F1-score performance of three baseline configurations of the Relationship Extraction (RE) model on a test set of PubMedCentral (PMC) articles. The configurations are: None-Class Balanced (NCB), which limits the number of ‘none’ relationships during training to match the majority class; XML Inline Tag Entities (Tag); and Long-range Context, which provides a context of 250 characters around each target entity rather than using a sliding context window. Performance scores are shown for the two relationship types, ‘hasTrait’ between Arthropod and Trait and ‘hasValue’ between Trait and Value, and for ‘none’, which indicates the absence of a relationship. Additionally, the figure includes both the macro average F1-score, which does not account for class imbalances, and the weighted average F1-score, which compensates for these imbalances. A detailed listing of all baseline results can be found in Supplementary Table S4. 
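
The none-class balancing idea can be sketched as a simple down-sampling step applied to the training examples. This is an illustration only, assuming that ‘majority class’ means the most frequent labelled relationship (‘hasValue’ in the data described above) and that surplus ‘none’ pairs are dropped at random; the function name, the seed, and the toy examples are hypothetical.

```python
import random
from collections import Counter

def balance_none_class(examples, seed=0):
    # examples: (entity_pair_text, label) tuples with label in {'hasTrait', 'hasValue', 'none'}.
    labelled = [ex for ex in examples if ex[1] != 'none']
    none_examples = [ex for ex in examples if ex[1] == 'none']
    # Cap 'none' at the size of the most frequent labelled relationship.
    counts = Counter(label for _, label in labelled)
    cap = max(counts.values(), default=0)
    if len(none_examples) > cap:
        none_examples = random.Random(seed).sample(none_examples, cap)
    return labelled + none_examples

# Hypothetical candidate pairs: 'none' heavily outnumbers the labelled relationships.
examples = ([('trait-value pair %d' % i, 'hasValue') for i in range(3)]
            + [('taxon-trait pair 0', 'hasTrait')]
            + [('unrelated pair %d' % i, 'none') for i in range(10)])
balanced = balance_none_class(examples)
print(Counter(label for _, label in balanced))
```

Only the ‘none’ class is reduced, so the rarer ‘hasTrait’ class remains under-represented after balancing, in line with the difficulties reported for that relationship.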
\n \nThe Arthropod Trait Database ArTraDB Web Resource \n \nThe annotation predictions obtained from applying the workflow to the PMC articles are made available to the community through the dedicated web application, ArTraDB: the Arthropod Trait Database (https://artradb.unil.ch). The results are presented in a simple table-like view where each row represents a single entity annotation, pairs of entities connected by either a hasTrait or hasValue relationship, or complete trio annotations of connected Arthropod-Trait-Value entities (Figure 7). The ArTraDB resource was designed and developed to provide two main functionalities: (i) Browse and search facilities enabling the identification of predicted species/taxa and/or traits and/or values within the set of annotated documents; and (ii) Browsable visual displays of the predicted entity and relationship annotations in the local context of the corresponding document. \n \nArTraDB Browse and Search Functionalities \nThe browsable table view of the workflow-predicted annotations for the set of processed articles provides a paginated display of rows of annotated entities together with the NER confidence scores assigned by the prediction algorithm (Figure 7). Where both hasTrait and hasValue relationships connect an arthropod entity with a trait entity and that same trait entity to a value entity the row represents a complete Arthropod-Trait-Value trio annotation. When either hasTrait or hasValue relationships are lacking, the row displays only the Arthropod-Trait or Trait-Value pairs. 
If no relationships were predicted, the arthropod and trait entities are displayed as single annotations with their corresponding scores, while the value entities are omitted. When entity normalisation was able to successfully link annotated arthropods and traits to Concept IDs in the corresponding dictionaries, these are hyperlinked to the corresponding source definition, e.g. the Catalogue of Life for arthropod entities, and the Encyclopedia of Life or ontology and Wiki resources for trait entities. Additional columns in the table view include a clickable icon to open the popup Annotation Viewer window (Figure 8), the PMC identifier hyperlinked to the fully annotated document, and the source of the annotation (i.e. version of the workflow that made the predictions). The table view can be browsed page by page with a user-configurable number of rows to display per page. The data in each column are indexed to enable rapid user searches using the simple search box above the table that filters the results to contain only rows matching the entered search term. This simple table view of the annotations provides an intuitive browsable and searchable interface to the thousands of annotations produced by the workflow. \n \nFigure 7: ArTraDB table view browse and search functionalities. 
\nThe table view of annotations displays rows of annotated entities with their corresponding Named Entity Recognition (NER) confidence scores, a clickable icon to open the Annotation Viewer window, the hyperlinked PubMedCentral (PMC) article identifier, and the source of the annotation, in a browsable paginated format with a user-configurable number of items to display per page. Annotated arthropods and traits that were successfully linked to Concept IDs in the corresponding dictionaries are hyperlinked to the corresponding source definition, e.g. the Catalogue of Life (COL) for arthropods, and the Encyclopedia of Life (EOL) or other resources for trait entities. Indexing of the ArTraDB data allows for rapid user searches to filter the complete table to rows with entries matching terms entered in the simple search box above the table. \n \n \nArTraDB Document Context View of Annotations \n \nFor each row of predicted entity or entity and relationship annotations, a clickable icon in the Context column allows users to open an annotation viewer window that shows these labelled entities and any predicted relationships in the local context of the source text. Alternatively, clicking the hyperlinked PMC identifier opens a document view of the fully annotated article showing all entities and relationships. In both viewers, the entities are highlighted and labelled as Arthropod, Trait, or Value entities, with the predicted relationships indicated by lines connecting the relevant entities (Figure 8). 
These visualisation functionalities allow users to view the predictions in their local and global contexts, so that they can manually assess the reliability of the automatically generated annotations. \n \nFigure 8: ArTraDB document view for visualising all predicted annotations in an article. \nClicking the hyperlinked PubMedCentral (PMC) identifier in the ArTraDB table view opens a document view of the fully annotated article showing all entities and relationships. The entities are highlighted and labelled as Arthropod, Trait, or Value entities, with the predicted relationships indicated by lines connecting the relevant entities. This provides users with the full-text context of each predicted entity or relationship annotation within the document. \n \nDiscussion \n \nFocusing on the methodological explorations of the opportunities presented by modern literature mining approaches, we aimed to annotate available arthropod organismal and ecological trait data to facilitate the automated extraction of knowledge from publications. This involved building and testing an analytical workflow for processing and annotating thousands of PMC articles containing taxonomic treatments of arthropods. It also required the collation of curated trait vocabularies and the development of model training procedures to perform the text mining tasks and formally test the performance of various approaches. 
Finally, the resulting predicted annotations of the entities of arthropods, traits, and values, as well as the arthropod-trait and trait-value relationships were made available for community review through the deployment of the open online ArTraDB web resource. Here we discuss the key findings and important challenges encountered while exploring the utility and performance of the different methodologies and approaches we tested to achieve our aims. \n \n \nResults are variable because annotation is an inherently complex task \n \nThe process used to annotate the articles and train the models presents several methodological challenges that influence the efficiency and accuracy of predicting arthropod-trait-value triples. A key challenge was the generation of training data through the manual annotation of documents by domain experts. Compared to the predictions, the annotators identified higher median densities of arthropod and trait entities and hasTrait relationships, while the automated processes produced higher median densities for value and hasValue annotations (Figure 2). Nevertheless, assessing inter-annotator agreements highlighted the complexity of the annotation tasks and the challenge of uniformity in defining entity boundaries or relationships between entities (Figure 3). Defining entity boundaries and relationships in complex cases involves subjective decisions that can diverge based on the annotator’s understanding and experience. For example, “mesoscutum silvery setae evenly distributed” could be simply annotated as trait=“mesoscutum” and value=“silvery setae evenly distributed”, or by introducing a discontinuous trait “mesoscutum setae” with two associated values “silvery” and “evenly distributed”. The variability in manual annotations introduces conflicts in the input data that serve as the basis for training the NER and RE models, which therefore negatively impacts the model performance. 
While independence is required to formally assess inter-annotator agreement for such tasks, the recommendation from this work is to build a set of guidelines using examples encountered in the documents. The guidelines represent the consensus of the domain experts on how to proceed technically with annotating complex cases, and therefore serve to enhance consistency and ultimately provide better input data for training the models. Indeed, guidelines and annotator training in the context of constructing NER and RE benchmarking corpora of abstracts and metadata files from biodiversity datasets achieved inter-annotator entity agreements of 0.76 and 0.70 for two pairs of annotators (Abdelmageed et al. 2022). As well as the manual annotation variability, the small amount of training data - the 25 articles comprising the gold standard annotation dataset - also posed challenges. One strategy to increase the amount of training data could be to use distant supervision techniques, such as employing large language models (LLMs) to generate structured annotated texts based on the manually-annotated examples (Li et al. 2024). This could introduce biases and would likely be limited to shorter text bodies and smaller annotation context ranges than those encountered in the published articles, but the approach remains worth considering to augment training data. Ultimately, more consistent and better quality annotations across many more articles could be achieved through community review of the NER and RE predictions. 
The integration of such community-reviewed annotations into future training datasets (Figure 1) could therefore serve to iteratively improve model performance with a growing corpus of articles in the gold standard annotation dataset. \n \n \nEntity recognition performs better than relationship extraction \n \nThe NER and RE tasks present distinct challenges as evidenced by the performance disparities in the baselines (Figure 5, Figure 6). NER focuses on identifying and classifying entities in the text, with results showing a varied performance across different entity types, where “Arthropod” entities were consistently easier to correctly recognise than “Value” entities, with “Trait” entities showing intermediate performance (Figure 5). In contrast, RE aims to identify relationships between entities, which is an inherently more complex task due to the need to understand context and entity interactions where the entities themselves may not be correctly or completely identified. The results indicate that while the baseline is proficient in identifying the absence of a relationship (“none”), it struggles with more specific and low-frequency relationships. The balancing of classes in the training and the introduction of entity tags slightly improved the performance but also revealed the model’s limitations in generalising across different types of relationships (Figure 6). Of particular interest with respect to the processing of taxonomic treatments is the question of context sizes when attempting to identify entity relationships. 
The hasTrait relationships can often be long-range and few-to-many because the arthropod may be mentioned a few times near the start of an article where the text that follows is then dense with many specific trait entities, e.g. a paragraph with detailed descriptions of the morphology of the arthropod. In contrast, the hasValue relationships are more likely to be short-range and one-to-one as the values of the mentioned traits are usually presented in very close proximity (same sentence). The close proximity can still present substantial challenges, e.g. in taxonomic treatments where lists of traits are followed by corresponding lists of values: “Antennal segments III–VIII length 38, 47, 43, 41, 33, 21” (here with the added complexity of an inferred list of segments III, IV, V, VI, VII and VIII). Given these differences, alternative strategies where training and RE steps are carried out separately for hasTrait and hasValue relationships might perform better. Overall, the distinction between the NER and RE tasks is evident here in their respective challenges and baseline performances. NER requires accurate classification of individual entities, while RE demands a deeper understanding of the context and the interactions between entities. This difference in complexity was also reflected in the lower inter-annotator agreement scores achieved for RE than for NER annotations (Figure 3), which would have reduced the effectiveness of training the LUKE model. While the use of NER and RE in mining biodiversity literature has shown promising results, the complexities associated with training data quality and annotation consistency remain substantial challenges. Addressing these issues will be essential for advancing the capabilities of NLP applications in biodiversity research. 
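
The inferred-list example above (“Antennal segments III–VIII length 38, 47, 43, 41, 33, 21”) shows why even short-range hasValue relationships can be many-to-many. The sketch below illustrates the inference a reader makes implicitly: expanding the Roman-numeral range and pairing each segment with its value. It is not part of the described workflow, and the function names and the restriction to numerals I to XII are assumptions made for illustration.

```python
ROMAN = ['I', 'II', 'III', 'IV', 'V', 'VI', 'VII', 'VIII', 'IX', 'X', 'XI', 'XII']

def expand_roman_range(text):
    # Expand a range such as 'III–VIII' (en dash or hyphen) into its individual numerals.
    first, last = text.replace('-', '–').split('–')
    i, j = ROMAN.index(first.strip()), ROMAN.index(last.strip())
    if i > j:
        raise ValueError('range must be ascending')
    return ROMAN[i:j + 1]

def pair_segments_with_values(segment_range, values):
    # Pair each inferred segment with the value at the same list position.
    segments = expand_roman_range(segment_range)
    if len(segments) != len(values):
        raise ValueError('number of segments and values differ')
    return list(zip(segments, values))

pairs = pair_segments_with_values('III-VIII', [38, 47, 43, 41, 33, 21])
for segment, length in pairs:
    print('antennal segment', segment, 'length', length)
```

Each resulting pair corresponds to one trait-value relationship that an RE model would otherwise need to recover from a single compact phrase.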
\n \n \nEntity normalisation is a critical yet challenging process \n \nThe task of normalisation involves linking the identified entities to defined dictionaries, vocabularies, or ontologies with accompanying descriptions and associated information. Without linking, the annotation simply represents a hypothesis that a given entity can be classed as either an arthropod, a trait, or a value. Taxonomic names are structured terms deriving from academic consensus; they are also relatively consistent and used in a similar format in most publications, therefore normalisation should usually be feasible. Indeed, 63% of taxon entities matched terms in the dictionary (Figure 4), i.e. they could be linked to species and higher-level taxonomic names from the Catalogue of Life (COL). This result is promising when compared to the performance of other NER systems developed specifically for taxonomic name recognition, which, depending on the corpus used for assessment, can range in precision from 23% to 96% (Le Guillarme and Thuiller 2022). In preparing the arthropod taxa dictionary only COL accepted names were considered (see Methods); however, given the diversity of sources and ages of the processed publications, a universal taxon dictionary including all ever-recorded names and synonyms would have resulted in higher normalisation levels. With respect to traits, normalisation levels were considerably lower than for taxa, reflecting the much less structured manner in which traits are usually described in publications (Figure 4). 
Of the three categories of traits, morphology achieved the highest level of linking, likely explained by the expectation for taxonomic treatment publications to be dense in morphological descriptions that are used to define and distinguish between species. While nearly a third of the feeding ecology traits from the dictionary could be linked to trait entities in the documents, this represents only a small fraction of all annotated trait entities. In contrast, linking of habitat traits proved very challenging, with only six terms being linked to annotated trait entities. This likely reflects considerable differences between the formal style of term names used by the Environment Ontology and the more variable descriptions of habitats used in natural language. In summary, while entity normalisation remains challenging, iterative extension, refinement, and curation of the dictionaries should lead to higher levels of linking, e.g. by expanding the taxon dictionary to include all known names and synonyms, and by extending, revising, and curating the feeding ecology vocabulary and habitat ontology to add synonyms and additional terms that align better with natural language usage. Additionally, while this work focused on arthropod organismal and ecological traits, enhancing the trait dictionaries, which are key for the normalisation process, by aligning them with other existing ontologies developed for biodiversity research more generally, such as the BiodivOnto (Abdelmageed et al. 2021), could improve both entity recognition and the linking of entities to formalised concepts. \n \nIntegrating community feedback into future ArTraDB functionalities \n \nIn addition to enhancing the breadth and utility of the trait dictionaries and the performance of the workflow, future work will also need to extend the set of gold standard annotations to improve NER and RE performance through better training and benchmarking. 
An important \nsource of new annotations could be user curation of the workflow-predicted annotations, \nenabled by further developing the ArTraDB resource functionalities. This would require \nintegration of curation tools into the Annotation Viewer window (Figure 8) that allow community \nusers to edit, add, confirm, or delete entity and relationship annotations. This would enable \n“community curators”, i.e. scientifically literate individuals who are not necessarily domain experts, to \nconfirm or modify the annotations, as well as to delete incorrect or add missing annotations, in a \ncrowdsourcing-like operation. The development of this functionality would open up opportunities to \nincrementally fine-tune the prediction models by providing them with human-validated \nannotations as additional training data. Such an interactive functionality is technically feasible and has indeed been prototyped in the \ndevelopment version of ArTraDB, but before deployment it would be \nimportant to first build the infrastructure for the long-term preservation of community-sourced \ncurated annotations. This could involve the publication of community-confirmed entities and \nrelationships to infrastructures such as the Encyclopedia of Life or Wikidata; however, this is \ntechnically challenging and would only cover normalised entities. 
Alternatives could be to \npublish Journal Article Tag Suite (JATS) XML and/or BioC JSON files on repositories like \nZenodo (CERN and OpenAIRE 2024), or even to archive individual annotations as \nnanopublications with detailed provenance information to identify the source document and \nlocation within the document. A more comprehensive approach might instead be to focus on \ninfrastructures supporting the central indexing of biodiversity-related literature (Pasche et al. \n2023a, 2023b) that facilitate the addition of annotations to the articles using JATS XML and/or \nBioC JSON formats. Once a sustainable solution is in place, ArTraDB could begin to \ncollect community feedback for use as part of improved training and benchmarking datasets, \nand for collation into versioned annotation sets for archiving or integration into open biodiversity \nliterature services. \n \n \nPerspectives on literature mining for biology, ecology, and evolution research \n \nOur methodological explorations of tools and resources to advance the use of text \nmining approaches in biology, ecology, and evolution research demonstrate the feasibility of \nsemi-automating the building of open databases of organismal and ecological traits extracted \nfrom the literature. Even if the annotated arthropod taxon-trait-value triples are sparse, they \nenable researchers to quickly locate documents pertaining to specific species and traits. This \nnot only accelerates the initial stages of data curation but also points researchers to the exact \nlocations within documents where relevant data can be found, and thus has the potential to \nenhance the efficiency of research workflows. 
While there remain several technical challenges \nto overcome, including how to best leverage the power of modern LLMs in these processes \n(Farrell et al. 2024, Marcos et al. 2024, Keck et al. 2025), the results provide a framework that \ncould be extended beyond the focus on arthropods. These and other TDM and NLP initiatives in \nthe biodiversity domain will enhance data synthesis studies, make literature reviews more \nreproducible, greatly facilitate identification of research knowledge gaps and biases, as well as \ndrive data-informed investigations of ecological and evolutionary trends and patterns (Farrell et \nal. 2022, 2024). When a trait or set of traits has been carefully curated, and the relevant group \nof species is well represented by genomic data, researchers can begin to ask how genetic \nand genomic changes relate to observable phenotypic differences. For example, swallowtail butterfly \nlineages with host-plant shifts have more genes under positive selection than non-shifting \nlineages (Allio et al. 2021), and transitions to parthenogenesis (asexual reproduction) in stick \ninsects are accompanied by greatly reduced genetic diversity and reduced rates of positive \nselection (Jaron et al. 2022). As genomics and other “omics” data become more accessible, and \nas catalogues of species traits become more comprehensive (relying on automation for scale \nand curation for quality control), new opportunities for studying complex evolutionary processes \nwill emerge (Cornwallis and Griffin 2024). There are also implications beyond fundamental \nresearch, e.g. 
in the context of detecting and reporting biodiversity change globally, data and \nknowledge are critical for the measurement framework of Essential Biodiversity Variables \n(EBVs), where TDM and NLP tools and services could contribute especially to informing \n“species traits” EBVs (Kissling et al. 2018). While much of the knowledge about biodiversity \ncollected and published over centuries is still not machine-readable, digitisation efforts \nand open science initiatives are contributing to the opening up of biodiversity literature (Agosti et \nal. 2024). Therefore, the continued development and enhancement of specialist and generalist \nbiodiversity literature mining tools and resources is required to serve researcher needs as well \nas to inform assessments and guide policy decisions on the protection and restoration of \nbiological diversity for a sustainable future. \n \n \nAcknowledgements \nThe authors thank Morgane Massy from the University of Lausanne for participating in the \nannotation of the gold-standard set of curated articles used for training and assessing \nperformance, and Guido Sautter from Plazi for preparing and sharing the arthropod taxonomic \ntreatments from TreatmentBank. \n \n \nFunding \nThis work was primarily funded through Swiss National Science Foundation (SNSF) Spark grant \n196125 to RMW. The authors acknowledge additional support from SNSF grant 202669 (to \nRMW). \n \n \nConflict of interest \nThe authors declare no conflicts of interest. 
\n \n \nReferences \nAbdelmageed N, Algergawy A, Samuel S, König-Ries B (2021) BiodivOnto: Towards a core \nontology for biodiversity. In: Verborgh R, Dimou A, Hogan A, d’Amato C, Tiddi I, Bröring \nA, Mayer S, Ongenae F, Tommasini R, Alam M (Eds), The Semantic Web: ESWC 2021 \nSatellite Events. Lecture Notes in Computer Science. Springer International Publishing, \nCham, 3–8. https://doi.org/10.1007/978-3-030-80418-3_1  \nAbdelmageed N, Löffler F, Feddoul L, Algergawy A, Samuel S, Gaikwad J, Kazem A, \nKönig-Ries B (2022) BiodivNERE: Gold standard corpora for named entity recognition and \nrelation extraction in the biodiversity domain. Biodiversity Data Journal 10: e89481. \nhttps://doi.org/10.3897/BDJ.10.e89481  \nAgosti D, Bénichou L, Casino A, Nielsen L, Ruch P, Kishor P, Penev L, Mergen P, Arvanitidis C \n(2024) Liberate the power of biodiversity literature as FAIR digital objects. Research \nIdeas and Outcomes 10: e126586. https://doi.org/10.3897/rio.10.e126586 \n \nAgosti D, Benichou L, Addink W, Arvanitidis C, Catapano T, Cochrane G, Dillen M, Döring M, \nGeorgiev T, Gérard I, Groom Q, Kishor P, Kroh A, Kvaček J, Mergen P, Mietchen D, \nPauperio J, Sautter G, Penev L (2022) Recommendations for use of annotations and \npersistent identifiers in taxonomy and biodiversity publishing. Research Ideas and \nOutcomes 8: e97374. https://doi.org/10.3897/rio.8.e97374 \n \nAllio R, Nabholz B, Wanke S, Chomicki G, Pérez-Escobar OA, Cotton AM, Clamens A-L, \nKergoat GJ, Sperling FAH, Condamine FL (2021) Genome-wide macroevolutionary \nsignatures of key innovations in butterflies colonizing new host plants. 
Nature \nCommunications 12: 354. https://doi.org/10.1038/s41467-020-20507-3 \n \nBánki O, Roskov Y, Döring M, Ower G, Robles DRH, Corredor CAP, Jeppesen TS, Örn A, Pape \nT, Hobern D, Garnett S, Little H, DeWalt RE, Ma K, Miller J, Orrell T, Aalbu R, Abbott J, \nAedo C, Aescht E, Alexander S, Alonso-Zarazaga MA, Alvarez B, Andrella GC, \nAntonietto LS, Arango C, Artois T, Burgos MA, Atkinson S, Atwood JJ, Sartori ÂLB, \nBailly N, Baixeras J, Baker E, Balan A, Bamber R, Bandyopadhyay S, Barber-James H, \nPinto RB, Barrett R, Bartolozzi L, Bartsch I, Beccaloni G, Bellamy CL, Bellan-Santini D, \nBellinger PF, Ben-Dov Y, Blasco-Costa I, Boatwright JS, Bock P, Bolton B, Borges LM, \nBortoluzzi R, Bossard RL, Bota-Sierra C, Bouchard P, Bourgoin T, Boury-Esnault N, \nBoxshall G, Boyko C, Brandão S, Braun H, Bray R, Brehm G, Brinda JC, Brock PD, \nBroich SL, Brown J, Brown S, Bruce N, Brullo S, Bruneau A, Bush L, Büscher T, \nBłażewicz-Paszkowycz M, Cabras A, Cairns S, Calonje M, Cardinal-McTeague W, \nCardoso D, Cardoso L, Castilho RC, Silva ICC, Cervantes A, Chernyshev A, Chevillotte \nH, Choo LM, Christiansen KA, Cianferoni F, Cigliano MM, Clarke R, Monteiro TC e, \nCollins A, Compton J, Copilaș-Ciocianu D, Corbari L, Cordeiro R, Cortés-Hernández K, \nCostello M, Crameri S, Cruz-López JA, Cárdenas P, Daly M, Daneliya M, Dauvin J-C, \nDavie P, Broyer CD, Grave SD, Lima HCD, Prins JD, Prins WD, Sousa FD, Estrella MD \nla, DeSalle R, Decker P, Decock W, Delgado-Salinas A, Deliry C, Dellapé PM, Heyer \nJD, Dijkstra K-D, Dmitriev DA, Dohrmann M, Dorado Ó, Dorkeld F, Downey R, Duan L, \nDíaz M-C, Eades DC, Egan AN, Eitel M, Nagar AE, Emig CC, Engel MS, Garrote PE, \nEvans GA, Evenhuis NL, Falcão M, Farruggia F, Fauchald K, Fautin D, Favret C, Fisher \nB, Fišer C, Forró L, Fortuna-Perez AP, Fortune-Hopkins H, Fritsch P, Froese R, Fuchs \nA, Fujimoto S, Furuya H, Gagnon E, Garic R, Gasca R, Gattolliat J-L, Gerken S, Lima \nAG de, Gibson D, Gielis C, Gilligan T, Giribet 
G, Duque JCG, Gittenberger A, Galdo GG \ndel, Gofas S, Goncharov M, Gondim AI, Goodwin C, Govaerts R, Grabowski M, \nGranado A de A, Gregório B de S, Grehan JR, Grether R, Grimaldi DA, Gross O, \nGuerra-García JM, Guglielmone A, Guilbert E, Frøslev TG, Gusenleitner J, Haas F, \nHadfield KA, Hajdu E, Hassler M, Hastriter MW, Hauser C, Hausmann A, Hayward BW, \nHendrycks E, Henry TJ, Hernandes FA, Hernández-Crespo JC, Hine A, Ho B-C, Hodson \nA, Hoeksema B, Hoenemann M, Holstein J, Hooge M, Hooper J, Hopkins H, Horak I, \nHorton T, Hošek J, Hughes C, Hughes L, Huys R, Häuser C, Janssens F, Jaume D, \nJavadi F, Jazdzewski K, Jersabek CD, Johnson KP, Jordão L, Jóźwiak P, Kajihara H, \nKakui K, Kallies A, Kamiński MJ, Kanda K, Karanovic I, Kathirithamby J, Kelly M, \nKim Y-H, King R, Kirk P, Kitching I, Klautau M, Klitgaard BB, Koenemann S, Korovchinsky NM, \nKotov A, Kramina T, Krapp-Schickel T, Kremenetskaia A, Krishna K, Krishna V, Kroh A, \nKroupa AS, Kury AB, Kury MS, Kvaček J, Lachenaud O, Lado C, Lambert G, Atunes \nLLC, Lavin M, Lazarus D, Coze FL, Roux ML, LeCroy S, Linares JL, Lee S, Leitner MF, \nLewis GP, Li S-J, Li-Qiang J, Lichtwardt R(†), Lim S-C, Littlewood T, Lohrmann V, \nLonghorn SJ, Lorenz W, Lowry J, Lozano F, Lumen R, Lyal CH, Lörz A-N, Madin L, \nMagnien P, Mah C, Mal N, Mamos T, Manconi R, Mansano V, Markello K, Martens K, \nMartin JH, Martin P, Mashego KS, Maslakova S, Maslin B, Mattapha S, McFadden C, \nMcKamey S, McMurtry JA, Medrano MA, Mees J, Mendes AC, Merrin K, Mesa NC, \nMessing C, Mielke CGC, Migeon A, Miller DR, Mills C, Minelli A, Mitchell D, Molodtsova \nT, Valls JFM, Mooi R, 
Morandini A, Rocha RM da, Morrow C, Moteetee A, Murillo-\nRamos L, Murphy B, Narita JPZ, Nery DG, Neu-Becker U, Neuhaus B, Newton A, Lin \nPNK, Nicolson D, Nielsen JE, Nijhof A, Nishikawa T, Norenburg J, O’Hara T, Ochoa R, \nOhashi H, Ohashi K, Ollerenshaw J, Oosterbroek P, Opresko D, Osborne R, Osigus H-J, \nOswald JD, Ota Y, Otte D, Ouvrard D, Queiroz LP de, Pandey A, Paulay G, Paulson D, \nPauly D, Pennington RT, Pereira J da S, Perez-Gelabert D, Petrusek A, Phillipson P, \nPinheiro U, Morim MP, Pisera A, Pitkin B, Plotkin D, Pierezan BP, Poore G, Povydysh \nM, Praxedes RA, Pulawski WJ, Pyle R, Pühringer F, Rajaei H, Rakotonirina N, Ramos \nG, Rando J, Filardi FR, Raz L, Read G, Rees T, Reich M, Reimer JD, Rein JO, \nReynolds J, Rincón J, Rius M, Robertson T, Robinson G, Robinson GS(†), Rodríguez E, \nRuggiero M, Ríos P, Rützler K, Sanborn A, Sanjappa M, Santos SG, Santos-Guerra A, \nSartori M, Sattler K, Schierwater B, Schilling S, Schley R, Schmid-Egger C, Schmidt-\nRhaesa A, Schoolmeesters P, Schorr M, Schrire B, Schuchert P, Schuh RT, Schönberg \nC, Rodrigues RS, Scoble M, Seijo G, Seleme EP, Senna A, Serejo C, Sforzi A, Shenkar \nN, Shimizu G, Siegel V, Sierwald P, Sihvonen P, Flores AS, Carvalho CS de, Simon MF, \nSimonsen T, Simpson CE, Sinniger F, Sirichamorn Y, Skvarla M, Smith AD, Smith VS, \nGissi DS, Sokoloff D, Sotuyo S, Soulier-Perkins A, South EJ, Souza-Filho JF, Spearman \nL, Spelda J, Steiner A, Stemme T, Sterrer W, Stevenson D, Stiewe MBD, Stirton CH, \nStraub S, Stueber G, Stöhr S, Subramaniam S, Swalla B, Swedo J, Sánchez-Ruiz M, \nSørensen MV, Taiti S, Takiya DM, Tandberg AH, Tavakilian G, Taylor K, Thessen A, \nThomas JD, Thomas P, Thomson S, Thuesen E, Thulin M, Thurston M, Thuy B, Todaro \nA, Torke BM, Tsai S-Y, Turiault M, Turner JRG, Turner T, Turon X, Tyler S, Uetz P, \nUlmer JM, Vacelet J, Vachard D, Vader W, Domedel GV, Burgt XV der, Vandepitte L, \nVanhoorne B, Vatanparast M, Verhoeff T, Vonk R, Väinölä R, Walker-Smith G, Walter 
\nTC, Wambiji N, Wanke D, Watling L, Weaver H, Webb J, Welbourn WC, Whipps C, \nWhite K, Wilding N, Williams G, Wilson AJG, Wing P, Winitsky S, Wirth CC, \nWojciechowski M, Woodman S, Xavier J, Yi T, Yoder M, Yu DSK, Yunakov N, Zahniser \nJ, Zeidler W, Zhang R, Zhang ZQ, Zinetti F, d’Hondt J-L, Moraes GJ de, Oliveira ABR \nde, Voogd N de, Río MG del, Haaren T van, Nieukerken EJ van, Ofwegen L van, Soest \nR van, Şentürk O (2024) Catalogue of Life. Version 2024-12-19. \nhttps://doi.org/10.48580/dglq4  \nBasaldella M, Furrer L, Tasso C, Rinaldi F (2017) Entity recognition in the biomedical domain \nusing a hybrid approach. Journal of Biomedical Semantics 8: 51. \nhttps://doi.org/10.1186/s13326-017-0157-6  \nButtigieg PL, Pafilis E, Lewis SE, Schildhauer MP, Walls RL, Mungall CJ (2016) The \nenvironment ontology in 2016: bridging domains with increased scope, semantic density, \nand interoperation. Journal of Biomedical Semantics 7: 57. \nhttps://doi.org/10.1186/s13326-016-0097-6 \n \nCejuela JM, McQuilton P, Ponting L, Marygold SJ, Stefancsik R, Millburn GH, Rost B, the \nFlyBase Consortium (2014) Tagtog: Interactive and text-mining-assisted annotation of \ngene mentions in PLOS full-text articles. Database 2014: bau033–bau033. \nhttps://doi.org/10.1093/database/bau033  \nCERN, OpenAIRE (2024) Zenodo. https://doi.org/10.25495/7GXK-RD71  \nChang A, Jeske L, Ulbrich S, Hofmann J, Koblitz J, Schomburg I, Neumann-Schaal M, Jahn D, \nSchomburg D (2021) BRENDA, the ELIXIR core data resource in 2021: new \ndevelopments and updates. Nucleic Acids Research 49: D498–D508. 
\nhttps://doi.org/10.1093/nar/gkaa1025 \n \nChurch SH, Donoughe S, De Medeiros BAS, Extavour CG (2019) A dataset of egg size and \nshape from more than 6,700 insect species. Scientific Data 6: 104. \nhttps://doi.org/10.1038/s41597-019-0049-y  \nCohen J (1960) A coefficient of agreement for nominal scales. Educational and Psychological \nMeasurement 20: 37–46. https://doi.org/10.1177/001316446002000104  \nComeau DC, Islamaj Dogan R, Ciccarese P, Cohen KB, Krallinger M, Leitner F, Lu Z, Peng Y, \nRinaldi F, Torii M, Valencia A, Verspoor K, Wiegers TC, Wu CH, Wilbur WJ (2013) BioC: \na minimalist approach to interoperability for biomedical text processing. Database 2013: \nbat064–bat064. https://doi.org/10.1093/database/bat064  \nCornwallis CK, Griffin AS (2024) A guided tour of phylogenetic comparative methods for \nstudying trait evolution. Annual Review of Ecology, Evolution, and Systematics 55: 181–\n204. https://doi.org/10.1146/annurev-ecolsys-102221-050754  \nDevlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional \ntransformers for language understanding. Proceedings of the 2019 Conference of the \nNorth: 4171–4186. https://doi.org/10.18653/v1/N19-1423  \nFarrell MJ, Brierley L, Willoughby A, Yates A, Mideo N (2022) Past and future uses of text \nmining in ecology and evolution. Proceedings of the Royal Society B: Biological \nSciences 289: 20212721. https://doi.org/10.1098/rspb.2021.2721  \nFarrell MJ, Le Guillarme N, Brierley L, Hunter B, Scheepens D, Willoughby A, Yates A, Mideo N \n(2024) The changing landscape of text mining: a review of approaches for ecology and \nevolution. Proceedings of the Royal Society B: Biological Sciences 291: 20240423. 
\nhttps://doi.org/10.1098/rspb.2024.0423 \n \nFeron R, Waterhouse RM (2022a) Assessing species coverage and assembly quality of rapidly \naccumulating sequenced genomes. GigaScience 11: giac006. \nhttps://doi.org/10.1093/gigascience/giac006  \nFeron R, Waterhouse RM (2022b) Exploring new genomic territories with emerging model \ninsects. Current Opinion in Insect Science 51: 100902. \nhttps://doi.org/10.1016/j.cois.2022.100902  \nFurrer L, Cornelius J, Rinaldi F (2022) Parallel sequence tagging for concept recognition. BMC \nBioinformatics 22: 623. https://doi.org/10.1186/s12859-021-04511-y  \nGaimari SD (2017) The dipteran family Celyphidae in the New World, with discussion of and key \nto world genera (Insecta, Diptera). ZooKeys 711: 113–130. \nhttps://doi.org/10.3897/zookeys.711.20840  \nGrimaldi DA, Engel MS (2005) Evolution of the insects. Cambridge University Press, Cambridge.  \nGuidoti M, Sokolowicz C, Simoes F, Gonçalves V, Ruschel T, Alvares D, Agosti D (2021) \nTreatmentBank: Plazi’s strategies and its implementation to most efficiently liberate data \nfrom scholarly publications. Biodiversity Information Science and Standards 5: e75690. \nhttps://doi.org/10.3897/biss.5.75690  \nHedrick BP, Heberling JM, Meineke EK, Turner KG, Grassa CJ, Park DS, Kennedy J, Clarke \nJA, Cook JA, Blackburn DC, Edwards SV, Davis CC (2020) Digitization and the future of \nnatural history collections. BioScience 70: 243–251. 
https://doi.org/10.1093/biosci/biz163  \nJaron KS, Parker DJ, Anselmetti Y, Tran Van P, Bast J, Dumas Z, Figuet E, François CM, \nHayward K, Rossier V, Simion P, Robinson-Rechavi M, Galtier N, Schwander T (2022) \nConvergent consequences of parthenogenesis on stick insect genomes. Science \nAdvances 8: eabg3842. https://doi.org/10.1126/sciadv.abg3842 \n \nKeck F, Broadbent H, Altermatt F (2025) Extracting massive ecological data on state and \ninteractions of species using large language models. \nhttps://doi.org/10.1101/2025.01.24.634685 \n \nKim Sang EFT, De Meulder F (2003) Introduction to the CoNLL-2003 shared task: \nLanguage-independent named entity recognition. In: Proceedings of the Seventh Conference on \nNatural Language Learning at HLT-NAACL 2003, 142–147. Available from: \nhttps://aclanthology.org/W03-0419/. \n \nKissling WD, Walls R, Bowser A, Jones MO, Kattge J, Agosti D, Amengual J, Basset A, Van \nBodegom PM, Cornelissen JHC, Denny EG, Deudero S, Egloff W, Elmendorf SC, \nAlonso García E, Jones KD, Jones OR, Lavorel S, Lear D, Navarro LM, Pawar S, Pirzl \nR, Rüger N, Sal S, Salguero-Gómez R, Schigel D, Schulz K-S, Skidmore A, Guralnick \nRP (2018) Towards global data products of Essential Biodiversity Variables on species \ntraits. Nature Ecology & Evolution 2: 1531–1540. \nhttps://doi.org/10.1038/s41559-018-0667-3 \n \nLe Guillarme N, Thuiller W (2022) TaxoNERD: Deep neural models for the recognition of \ntaxonomic entities in the ecological and evolutionary literature. Methods in Ecology and \nEvolution 13: 625–641. https://doi.org/10.1111/2041-210X.13778  \nLee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J (2020) BioBERT: a pre-trained biomedical \nlanguage representation model for biomedical text mining. Wren J (Ed.). Bioinformatics \n36: 1234–1240. https://doi.org/10.1093/bioinformatics/btz682  \nLever J, Altman R, Kim J-D (2020) Extending TextAE for annotation of non-contiguous entities. \nGenomics & Informatics 18: e15. https://doi.org/10.5808/GI.2020.18.2.e15  \nLi Y, Ramprasad R, Zhang C (2024) A simple but effective approach to improve structured \nlanguage model output for information extraction. \nhttps://doi.org/10.48550/ARXIV.2402.13364  \nLiu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V \n(2019) RoBERTa: A robustly optimized BERT pretraining approach. Available from: \nhttp://arxiv.org/abs/1907.11692 (December 16, 2024).  \nMammola S, Pavlek M, Huber BA, Isaia M, Ballarin F, Tolve M, Čupić I, Hesselberg T, Lunghi E, \nMouron S, Graco-Roza C, Cardoso P (2022) A trait database and updated checklist for \nEuropean subterranean spiders. Scientific Data 9: 236. \nhttps://doi.org/10.1038/s41597-022-01316-3 \n \nMarcos D, van de Vlasakker R, Athanasiadis IN, Bonnet P, Goeau H, Joly A, Kissling WD, \nLeblanc C, van Proosdij ASJ, Panousis KP (2024) Fully automatic extraction of \nmorphological traits from the Web: utopia or reality? \nhttps://doi.org/10.48550/ARXIV.2409.17179  \nMcCallen E, Knott J, Nunez-Mir G, Taylor B, Jo I, Fei S (2019) Trends in ecology: shifts in \necological research themes over the past four decades. Frontiers in Ecology and the \nEnvironment 17: 109–116. https://doi.org/10.1002/fee.1993  \nMcHugh ML (2012) Interrater reliability: the kappa statistic. Biochemia Medica 22: 276–282.  \nMontagna M (2011) Pachybrachis sassii, a new species from the Mediterranean Giglio Island \n(Italy) (Coleoptera, Chrysomelidae, Cryptocephalinae). ZooKeys 155: 51–60. 
\nhttps://doi.org/10.3897/zookeys.155.1951  \nMontani I, Honnibal M, Boyd A, Van Landeghem S, Peters H (2023) explosion/spaCy: v3.7.2: \nFixes for APIs and requirements. https://doi.org/10.5281/ZENODO.1212303  \nMündler N (2024) nielstron/quantulum3. Available from: https://github.com/nielstron/quantulum3 \n(November 25, 2024).  \nMungall C, Matentzoglu N, Balhoff J, Osumi-Sutherland D, Duncan B, Pgaudet, Tan S, Hoyt CT, \nPilgrim C, Overton JA, Lauren, Caron A, Nomi Harris, Moxon S, Lschriml, Vasilevsky N, \nToro S, Goutte-Gattat D, Brush M, Vasundra Touré, Bretaudeau A, Cain S, Haendel M, \nDiatomsRcool, Bide Zhang, Dowland C, Dooley D, Actions-User, Hammock J (2023) \nThe OBO relation ontology, http://purl.obolibrary.org/obo/ro.owl. \nhttps://doi.org/10.5281/ZENODO.593101 \n \nMungall CJ, Torniai C, Gkoutos GV, Lewis SE, Haendel MA (2012) Uberon, an integrative \nmulti-species anatomy ontology. Genome Biology 13: R5. \nhttps://doi.org/10.1186/gb-2012-13-1-r5 \n \nParr CS, Wilson N, Leary P, Schulz K, Lans K, Walley L, Hammock J, Goddard A, Rice J, \nStuder M, Holmes J, Corrigan, Jr. R (2014) The Encyclopedia of Life v2: Providing global \naccess to knowledge about life on Earth. Biodiversity Data Journal 2: e1079. \nhttps://doi.org/10.3897/BDJ.2.e1079  \nPasche E, Agosti D, Penev L, Groom Q, Flament A, Gobeill J, Ruch P (2023a) Towards \n“Biodiversity PMC.” Biodiversity Information Science and Standards 7: e111647. 
\nhttps://doi.org/10.3897/biss.7.111647 \n \nPasche E, Gobeill J, Agosti D, Penev L, Groom Q, Georgiev T, Gaillac E, Flament A, \nCaucheteur D, Michel P-A, Ruch P (2023b) From SIBiLS to Biodiversity PMC: \nFoundations for the One Health Library. Biodiversity Information Science and Standards \n7: e111660. https://doi.org/10.3897/biss.7.111660  \nRamshaw LA, Marcus MP (1999) Text chunking using transformation-based learning. In: \nArmstrong S, Church K, Isabelle P, Manzi S, Tzoukermann E, Yarowsky D (Eds), \nNatural Language Processing Using Very Large Corpora. Text, Speech and Language \nTechnology. Springer Netherlands, Dordrecht, 157–176. \nhttps://doi.org/10.1007/978-94-017-2390-9_10  \nRosonovski S, Levchenko M, Ide-Smith M, Faulk L, Harrison M, McEntyre J (2023) Searching \nand evaluating publications and preprints using Europe PMC. Current Protocols 3: e694. \nhttps://doi.org/10.1002/cpz1.694  \nShirey V, Larsen E, Doherty A, Kim CA, Al-Sulaiman FT, Hinolan JD, Itliong MGA, Naive MAK, \nKu M, Belitz M, Jeschke G, Barve V, Lamas G, Kawahara AY, Guralnick R, Pierce NE, \nLohman DJ, Ries L (2022) LepTraits 1.0 A globally comprehensive dataset of butterfly \ntraits. Scientific Data 9: 382. https://doi.org/10.1038/s41597-022-01473-5 \n \nStork NE (2018) How many species of insects and other terrestrial arthropods are there on \nEarth? Annual Review of Entomology 63: 31–45. \nhttps://doi.org/10.1146/annurev-ento-020117-043348 \n \nWong MKL, Guénard B, Lewis OT (2019) Trait-based ecology of terrestrial arthropods. \nBiological Reviews 94: 999–1022. https://doi.org/10.1111/brv.12488  \nYamada I, Asai A, Shindo H, Takeda H, Matsumoto Y (2020) LUKE: Deep contextualized entity \nrepresentations with entity-aware self-attention. In: Proceedings of the 2020 Conference \non Empirical Methods in Natural Language Processing (EMNLP). Association for \nComputational Linguistics, Online, 6442–6454. 
https://doi.org/10.18653/v1/2020.emnlp-main.523 \n \n \nSupplementary materials \n \nSupplementary File S1: Curated trait dictionaries \nAn MS Excel spreadsheet presenting the lists of trait dictionaries for feeding ecology, habitat, \nand morphology, with links to the source resources, synonyms, and definitions. \n \nSupplementary File S2: Annotator guidelines \nA PDF file of the notes and guidelines developed by the annotators during the curation of the \ngold-standard annotation data. \n \nSupplementary File S3: Gold-standard annotated documents \nAn MS Excel spreadsheet listing the annotated files, the number of annotations for each type, \nand the corresponding annotators. \n \nSupplementary File S4: NER and RE baseline results \nAn MS Excel spreadsheet containing five tables with exact scores for all configurations, in terms \nof recall, precision, and F-score values, along with the corresponding support for each class and \nthe macro and weighted averages. \n","source_license":"CC-BY-4.0","license_restricted":false}