Results
A Workflow for Annotating Arthropod Organismal and Ecological Traits
The analytical workflow for processing and annotating thousands of articles to identify
organismal and ecological traits of arthropods (Figure 1) consists of several key data
preparation steps (ATResourceManager and Domain Expert curation) and model training
procedures (ATTrainer), in order to subsequently perform the text mining tasks (ATMiner) to
produce the predictions and upload them for viewing in ArTraDB. Firstly, the domain expert
curation tasks resulted in two key outputs: the Gold Standard Annotations and the Curated Trait
Vocabularies. A set of selected documents was manually annotated by domain experts to
provide a resource for downstream training and for assessing the performance of the text
mining tasks (see Methods). The domain experts also built curated trait vocabularies (including
synonyms) covering the three categories of feeding ecology (n=81), habitat (n=184), and
morphology (n=125), based on combinations of existing ontologies and online resources (see
Methods). In parallel, the ATResourceManager preparation steps were developed to: (1)
process the taxonomic treatment documents from Plazi and retrieve the corresponding
publications from PMC; (2) extract from the Catalogue of Life taxonomy all accepted arthropod
species and their higher-level taxonomic names; and (3) extract from the Encyclopedia of Life
traits database all available taxon-trait annotations for arthropods (see Methods for details).
Subsequently, the ATTrainer language model training steps take as input the Gold Standard
Annotations (TRAIN-GOLD subset) for the fine-tuning of the BioBERT model and for the training
of the LUKE model (see Methods). These models are then used in the ATMiner tasks for
Named Entity Recognition (NER) with BioBERT and Relationship Extraction (RE) with LUKE,
also using the curated trait vocabularies to perform entity normalisations using OGER (see
Methods). The resulting predicted annotations - the entities of arthropods, traits, and values -
and the arthropod-trait and trait-value relationships were then imported into the ArTraDB web
resource where they can be reviewed by the community.
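To make the shape of these predicted annotations concrete, the following minimal Python sketch models the three entity types and the two relationship types named above. The class names, field names, and the example sentence are illustrative assumptions; they are not the ArTraDB schema or the ATMiner output format.

```python
from dataclasses import dataclass
from typing import Optional

# Entity and relationship types named in the text above.
ENTITY_TYPES = {"arthropod", "trait", "value"}
RELATION_SIGNATURES = {
    "hasTrait": ("arthropod", "trait"),  # arthropod -> trait
    "hasValue": ("trait", "value"),      # trait -> value
}


@dataclass(frozen=True)
class Entity:
    text: str
    type: str
    start: int  # character offset of the mention in the document
    end: int    # exclusive end offset
    concept_id: Optional[str] = None  # filled only when normalisation succeeds

    def __post_init__(self) -> None:
        if self.type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.type}")


@dataclass(frozen=True)
class Relation:
    type: str
    head: Entity
    tail: Entity

    def __post_init__(self) -> None:
        expected = RELATION_SIGNATURES.get(self.type)
        if expected is None:
            raise ValueError(f"unknown relation type: {self.type}")
        if (self.head.type, self.tail.type) != expected:
            raise ValueError(f"{self.type} must link {expected[0]} to {expected[1]}")


def mention(sentence: str, text: str, entity_type: str) -> Entity:
    """Build an Entity from the first occurrence of `text` in `sentence`."""
    start = sentence.index(text)
    return Entity(text, entity_type, start, start + len(text))


# Invented example sentence, for illustration only.
sentence = "Pachybrachis sassii: legs yellow."
species = mention(sentence, "Pachybrachis sassii", "arthropod")
legs = mention(sentence, "legs", "trait")
yellow = mention(sentence, "yellow", "value")
relations = [
    Relation("hasTrait", species, legs),
    Relation("hasValue", legs, yellow),
]
```

Constraining each relationship type to a fixed pair of entity types mirrors the arthropod-trait and trait-value pairing described above; a hasValue link from an arthropod directly to a value is rejected.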
bioRxiv preprint doi: https://doi.org/10.1101/2025.02.18.638830; this version posted February 23, 2025. The copyright holder for this preprint
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
available under a CC-BY 4.0 International license.
Figure 1: The arthropod organismal and ecological traits annotation workflow.
The workflow starts with curation performed by domain experts resulting in entity and relationship annotations for a selected
subset of publications as well as curated vocabularies of sets of organismal and ecological traits. The ATResourceManager
steps include the processing of data sourced from the Catalogue of Life (taxonomy) and the Encyclopedia of Life (arthropod-trait
relationships) to generate taxon and trait dictionaries, as well as the retrieval of publications for processing from PubMed Central
based on the selection of all Plazi TreatmentBank records for arthropods. The expert-generated Gold Standard Annotations are
used as input to train (ATTrainer steps) Natural Language Processing (NLP) models for the Named Entity Recognition (NER)
and Relationship Extraction (RE) tasks, with the trait vocabularies and taxa dictionaries being used for entity normalisation
(ATMiner steps). Finally, the predicted annotations are made available to the scientific community via the ArTraDB web resource
where community curators could potentially provide corrections to the annotations that can later be used for refinement of the
NER and RE models (dotted lines).
Entity and Relationship Annotation of PubMedCentral Articles
Annotation Results for Entity and Relationship Discovery
The application of the workflow presented in Figure 1 to a total of 2’000 publications sourced
from PMC resulted in the annotation of 656’403 entities (arthropods, traits, and values) and
339’463 relationships (hasTrait, hasValue), summarised in Figure 2. The PMC articles range in
length from 173 to 27’466 characters, with a median of 15’720 and an interquartile range from
11’452 to 20’506 (Figure 2A). The densities of entity and relationship predictions are highest for
entities of type “value” and hasValue relationships, with medians of ~10 annotations per 1’000
characters. Arthropod and trait entities have median densities of four and six annotations per
1’000 characters, respectively, and hasTrait relationships have the lowest density, with a median
of zero annotations per 1’000 characters (Figure 2B). In contrast, the 25 articles comprising the
gold standard annotation dataset show median densities of 4.9, 6.4, and 6.1 annotations per
1’000 characters for arthropod, trait, and value entities, respectively, and the hasTrait and
hasValue relationships show medians of 6.4 and 6.1 annotations, respectively (Figure 2C).
These manually annotated documents (17 of which were complete articles and eight only
abstracts) contained in total 4’990 named entities (1’069 arthropods, 2’078 traits, and 1’843
values) and 3'628 relationships (1’777 hasTrait and 1’851 hasValue). For the predicted
annotations, the total numbers of entities and relationships are generally higher in longer
documents, reaching over 900 and over 600 annotations, respectively (Figure 2D). This trend is
replicated when considering each entity and relationship subtype separately, with the largest
numbers of annotations identified in some of the longest documents, reaching maxima of 456,
393, and 669 for arthropods, traits, and values, and 100 and 717 for hasTrait and hasValue,
respectively (Figure 2E). At minimum, a publication should contain one taxonomic treatment
describing a single arthropod species, e.g. “Pachybrachis sassii, a new species from the
Mediterranean Giglio Island (Italy) (Coleoptera, Chrysomelidae, Cryptocephalinae)” (Montagna
2011) (length: 13’550 characters; annotated entities: 63 arthropod, 79 trait, 149 value).
However, the much longer articles generally describe a whole group of species for a particular
region, e.g. “The dipteran family Celyphidae in the New World, with discussion of and key to
world genera (Insecta, Diptera)” (Gaimari 2017) (length: 27’381 characters; annotated entities:
172 arthropod, 141 trait, 246 value), which contains 92 taxonomic treatments.
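As a worked illustration of the density measure used above (annotations per 1’000 characters), the short Python sketch below applies it to the counts and length reported for the Montagna (2011) article. The function and variable names are assumptions made for this example and are not taken from the workflow code.

```python
def density_per_1000_chars(n_annotations: int, n_chars: int) -> float:
    """Number of annotations per 1'000 characters of document text."""
    if n_chars <= 0:
        raise ValueError("document length must be positive")
    return 1000.0 * n_annotations / n_chars


# Length and entity counts reported above for Montagna (2011).
montagna_length = 13550
montagna_counts = {"arthropod": 63, "trait": 79, "value": 149}

montagna_density = {
    entity_type: round(density_per_1000_chars(count, montagna_length), 2)
    for entity_type, count in montagna_counts.items()
}
# montagna_density -> {"arthropod": 4.65, "trait": 5.83, "value": 11.0}
```

For this single-treatment article the densities are roughly 4.6, 5.8, and 11.0 annotations per 1’000 characters for arthropods, traits, and values, close to the corpus-wide medians of four, six, and ~10 given above.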
Figure 2: Distributions of PMC article properties and the resulting entity and relationship annotations.
(A) The input dataset consisted of 2’000 PMC articles exhibiting a broad character (chars) length distribution. The relative
number of resulting predictions of annotated entities (arthropods, traits, values) and relationships (taxon to trait - hasTrait, trait to
value - hasValue) are shown for the whole dataset (B) and for the 25 gold-standard manually annotated documents (C). The
absolute number of predicted entity and relationship annotations compared to the document lengths in characters is shown for
annotation types (D) and subtypes (E). Box plots in panels A, B, and C show the median, first and third quartiles, and lower and
upper extremes of the distribution (1.5 × Interquartile range).
Assessing the Complexity of the Task by Examining Inter-Annotator Agreement
To begin to interpret the prediction results from the workflow it is important to understand the
complexity of the annotation task itself, insights into which can be gained by examining the
levels of agreement between the curated annotations generated by the two domain experts.
The five documents that were annotated by both domain experts included two complete
publications and three abstract-only articles (Figure 3). In total for these five documents, the two
experts annotated 1’477 named entities (161 arthropods, 764 traits, 552 values), with annotator
1 identifying 80 arthropods, 416 traits, and 334 values, and with annotator 2 identifying 81
arthropods, 348 traits, and 218 values. They also annotated a total of 1’094 relationships (553
hasTrait, 541 hasValue), with annotator 1 identifying 343 hasTrait and 343 hasValue
relationships, and with annotator 2 identifying 210 hasTrait and 198 hasValue relationships.
Cohen’s kappa is used to measure inter-annotator reliability, or concordance, to assess the
degree of agreement amongst independent observers who rate, label, or classify the same
phenomenon, with values below 0.6 generally indicating inadequate agreement (McHugh 2012).
Amongst the five documents curated by both annotators, Cohen’s kappa values reflect varying
levels of inter-annotator agreement for entities (Figure 3B), from poor agreement (~0.35), to
moderate agreement (~0.5), to substantial agreement (~0.8), with lower agreement levels for
relationships (Figure 3C). While Cohen’s kappa scores provide a standardised measure of
agreement, their values must be carefully interpreted within the study’s context; here they
serve to highlight the complexity of the annotation tasks.
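For readers unfamiliar with the statistic, Cohen’s kappa compares the observed agreement p_o between two annotators with the agreement p_e expected by chance from their label frequencies: kappa = (p_o - p_e) / (1 - p_e). The Python sketch below computes it for two label sequences over the same items. Treating each token as the unit of comparison, and the example labels themselves, are assumptions made for illustration; the units actually compared in this study are described in the Methods.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a.keys() | freq_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical token-level labels from two annotators ("O" = not an entity).
annotator_1 = ["taxon", "taxon", "O", "trait", "value", "O", "O", "trait", "value", "O"]
annotator_2 = ["taxon", "taxon", "O", "trait", "O", "O", "O", "O", "value", "O"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this invented example the annotators agree on 8 of 10 tokens (p_o = 0.8) while chance agreement is p_e = 0.32, giving a kappa of roughly 0.71. Raw agreement therefore overstates concordance when one label, here "O", dominates.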
Figure 3: Inter-annotator agreement of five documents annotated by both experts.
(A) The bars show document lengths in characters (chars). (B) The bars show the level of annotation agreement for entities
(taxon, trait, or value) between the two annotators as measured using Cohen’s kappa. (C) The bars show the level of annotation
agreement for relations (hasTrait or hasValue) between the two annotators as measured using Cohen’s kappa.
Assessing Entity Normalisation with the Taxon and Trait Dictionaries
Entity normalisation, or linking, which was performed using OGER, is the process of matching
the entities that were annotated in the articles with the dictionaries of arthropods (taxa from the
Catalogue of Life) and traits (the collated sets of feeding ecology, habitat, and morphology
traits), with the goal of assigning to each labelled entity a unique identifier from one of the input
resources. It is important to understand the performance of the normalisation task in order to
interpret the quality of the entity prediction results from the workflow. The taxon dictionary
contained a total of 1’015’642 arthropod species and 118’008 higher-level taxonomic names
and the trait dictionary contained a total of 390 traits: 81 feeding ecology; 184 habitat; 125
morphology (see Methods). Focusing on the identification and quantification of taxon and trait
entities within the article corpus, the coverage and frequency of entities mapped to the
predefined dictionaries and those that could not be mapped provide an assessment of the entity
normalisation process and the comprehensiveness and relevance of the dictionaries (Figure 4).
Across the 2’000 articles processed by the workflow, a total of 128’149 taxon entities were
annotated with 63% matching terms in the dictionary (mapped to Concept IDs), comprising
24’207 species and 56’532 higher-level taxonomic names (Figure 4A). Notably, taxa such as the
order Hymenoptera (sawflies, wasps, bees, and ants), or the genus Tipula (crane flies), were
amongst the most frequently annotated entities, with 388 and 384 occurrences, respectively.
Reviewing some of the 47’312 non-mapped taxon entities revealed examples of correctly
identified arthropod taxa which nevertheless are not included in the Catalogue of Life and are
therefore not in the dictionary, e.g. the genera Micrencaustes (beetles) and Deinodryinus
(parasitoid wasps). Of the 199’276 (28’348 unique) trait entities annotated, 12.7% were
successfully mapped to the trait dictionary (linked to Concept IDs), with 1’366, 85, and 23’816
entities mapping to feeding ecology, habitat, and morphology terms, respectively (Figure 4A).
Note that the NER task labels entities only as taxa, traits, or values; further categorisation of traits
into feeding ecology, habitat, and morphology terms is only possible when entity normalisation
is successful. Feeding and morphology traits such as “host” (273), “mouth” (16), and “legs”
(1’663) were amongst the most prevalent annotated entities. Many annotated trait entities could
not be mapped (168’237), one of the most frequent being “distribution”, with 2’506 occurrences.
Considering instead the numbers of unique terms in the dictionaries that could be annotated
and linked in the articles, only 14’243 of the 1.2 million arthropods in the dictionary (1.2%) were
identified, and from the trait dictionary of feeding ecology, habitat, and morphology terms,
28.4%, 3.3%, and 60%, respectively, were annotated and linked (Figure 4B). Failure to identify
dictionary terms in the annotated articles may be because the terms are simply not present (it
cannot be expected that a small subset of 2’000 articles will contain mentions of all 1.2 million
described arthropod species), or because normalisation was unable to link terms and phrases
recognised as entities in the texts with the terms, phrases, and synonyms that make up the
concepts of the dictionaries.
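The two complementary views of normalisation described above (the share of annotated entities that map to a Concept ID, and the share of dictionary concepts that are matched at least once) can be sketched in a few lines of Python. This is a deliberately simplified stand-in that uses exact, case-insensitive lookup of terms and synonyms; it is not OGER, and the concept identifiers, dictionary entries, and mentions are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def build_lookup(dictionary: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each lower-cased term or synonym to its concept ID."""
    lookup: Dict[str, str] = {}
    for concept_id, terms in dictionary.items():
        for term in terms:
            lookup[term.lower()] = concept_id
    return lookup


def normalise(mention: str, lookup: Dict[str, str]) -> Optional[str]:
    """Return the concept ID for a mention, or None if it cannot be mapped."""
    return lookup.get(mention.strip().lower())


def coverage(mentions: Sequence[str], dictionary: Dict[str, List[str]]) -> Tuple[float, float]:
    """Return (fraction of mentions mapped, fraction of concepts matched)."""
    if not mentions or not dictionary:
        raise ValueError("mentions and dictionary must be non-empty")
    lookup = build_lookup(dictionary)
    mapped = [cid for cid in (normalise(m, lookup) for m in mentions) if cid is not None]
    return len(mapped) / len(mentions), len(set(mapped)) / len(dictionary)


# Hypothetical trait dictionary: concept ID -> preferred term and synonyms.
trait_dictionary = {
    "TRAIT:0001": ["leg", "legs"],
    "TRAIT:0002": ["host", "host plant"],
    "TRAIT:0003": ["cave", "caves"],
    "TRAIT:0004": ["predator"],
}
trait_mentions = ["legs", "Legs", "host", "distribution", "hind legs"]
entity_coverage, dictionary_coverage = coverage(trait_mentions, trait_dictionary)
```

Here three of five mentions are mapped (0.6) and two of four concepts are matched (0.5). The unmapped mentions show both failure modes discussed above: "distribution" has no corresponding concept, whereas "hind legs" refers to a concept that is present but cannot be linked by exact matching.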
Figure 4: Taxon and trait dictionaries compared with annotated entities.
For the 2'000 PMC articles analysed: the proportions of all annotated entities that could be mapped to the corresponding
“Concept IDs” of the taxon and trait dictionaries (A), and the proportions of taxon and trait dictionary terms that were matched
with annotated entities in any article (B). In (A) ‘m’ represents the total numbers of taxon and trait entities and ‘k’ indicates how
many of these were mapped to Concept IDs in the dictionary termlists (for trait entities divided into feeding ecology, habitat, and
morphology). In (B) ‘n’ represents the total numbers of taxa and traits in the dictionary termlists and ‘l’ indicates how many of
these were matched in the articles.
Performance Comparisons of Natural Language Processing Models
Named Entity Recognition Baseline Performance
An evaluation of a Named Entity Recognition (NER) baseline was conducted across various
configurations. Several general and domain-specific pre-trained language models were fine-
tuned on the TRAIN-GOLD dataset. To train the models, the dataset was converted to IOB2
format. Two evaluation methods were employed for the results presented in Figure 5: the
Conference on Natural Language Learning (CoNLL) evaluation and strict metrics. The reported
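
As an illustration of the IOB2 format mentioned above, in which the first token of every entity is tagged B-<type>, subsequent tokens of the same entity I-<type>, and all other tokens O, the following Python sketch converts token-level entity spans into IOB2 tags. The function name, the span representation, the label names, and the example sentence are assumptions for illustration and do not reproduce the conversion code used for TRAIN-GOLD.

```python
from typing import List, Tuple


def spans_to_iob2(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start_token, end_token_exclusive, label) spans to IOB2 tags."""
    tags = ["O"] * n_tokens
    for start, end, label in sorted(spans):
        if not (0 <= start < end <= n_tokens):
            raise ValueError(f"span out of range: {(start, end, label)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError("overlapping spans cannot be encoded in flat IOB2")
        tags[start] = f"B-{label}"  # IOB2: every entity starts with a B- tag
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# Invented example sentence and annotations, for illustration only.
tokens = ["Pachybrachis", "sassii", "has", "yellow", "legs", "."]
spans = [(0, 2, "Arthropod"), (3, 4, "Value"), (4, 5, "Trait")]
tags = spans_to_iob2(len(tokens), spans)
# tags -> ["B-Arthropod", "I-Arthropod", "O", "B-Value", "B-Trait", "O"]
```

Because adjacent entities each begin with their own B- tag, neighbouring mentions such as "yellow" and "legs" remain distinguishable even when no O token separates them.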
References
Abdelmageed N, Algergawy A, Samuel S, König-Ries B (2021) BiodivOnto: Towards a core
ontology for biodiversity. In: Verborgh R, Dimou A, Hogan A, d’Amato C, Tiddi I, Bröring
A, Mayer S, Ongenae F, Tommasini R, Alam M (Eds), The Semantic Web: ESWC 2021
Satellite Events. Lecture Notes in Computer Science. Springer International Publishing,
Cham, 3–8. https://doi.org/10.1007/978-3-030-80418-3_1
Abdelmageed N, Löffler F, Feddoul L, Algergawy A, Samuel S, Gaikwad J, Kazem A, König-
Ries B (2022) BiodivNERE: Gold standard corpora for named entity recognition and
relation extraction in the biodiversity domain. Biodiversity Data Journal 10: e89481.
https://doi.org/10.3897/BDJ.10.e89481
Agosti D, Bénichou L, Casino A, Nielsen L, Ruch P, Kishor P, Penev L, Mergen P, Arvanitidis C
(2024) Liberate the power of biodiversity literature as FAIR digital objects. Research
Ideas and Outcomes 10: e126586. https://doi.org/10.3897/rio.10.e126586
Agosti D, Benichou L, Addink W, Arvanitidis C, Catapano T, Cochrane G, Dillen M, Döring M,
Georgiev T, Gérard I, Groom Q, Kishor P, Kroh A, Kvaček J, Mergen P, Mietchen D,
Pauperio J, Sautter G, Penev L (2022) Recommendations for use of annotations and
persistent identifiers in taxonomy and biodiversity publishing. Research Ideas and
Outcomes 8: e97374. https://doi.org/10.3897/rio.8.e97374
Allio R, Nabholz B, Wanke S, Chomicki G, Pérez-Escobar OA, Cotton AM, Clamens A-L,
Kergoat GJ, Sperling FAH, Condamine FL (2021) Genome-wide macroevolutionary
signatures of key innovations in butterflies colonizing new host plants. Nature
Communications 12: 354. https://doi.org/10.1038/s41467-020-20507-3
Bánki O, Roskov Y, Döring M, Ower G, Robles DRH, Corredor CAP, Jeppesen TS, Örn A, Pape
T, Hobern D, Garnett S, Little H, DeWalt RE, Ma K, Miller J, Orrell T, Aalbu R, Abbott J,
Aedo C, Aescht E, Alexander S, Alonso-Zarazaga MA, Alvarez B, Andrella GC,
Antonietto LS, Arango C, Artois T, Burgos MA, Atkinson S, Atwood JJ, Sartori ÂLB,
Bailly N, Baixeras J, Baker E, Balan A, Bamber R, Bandyopadhyay S, Barber-James H,
Pinto RB, Barrett R, Bartolozzi L, Bartsch I, Beccaloni G, Bellamy CL, Bellan-Santini D,
Bellinger PF, Ben-Dov Y, Blasco-Costa I, Boatwright JS, Bock P, Bolton B, Borges LM,
Bortoluzzi R, Bossard RL, Bota-Sierra C, Bouchard P, Bourgoin T, Boury-Esnault N,
Boxshall G, Boyko C, Brandão S, Braun H, Bray R, Brehm G, Brinda JC, Brock PD,
Broich SL, Brown J, Brown S, Bruce N, Brullo S, Bruneau A, Bush L, Büscher T,
Błażewicz-Paszkowycz M, Cabras A, Cairns S, Calonje M, Cardinal-McTeague W,
Cardoso D, Cardoso L, Castilho RC, Silva ICC, Cervantes A, Chernyshev A, Chevillotte
H, Choo LM, Christiansen KA, Cianferoni F, Cigliano MM, Clarke R, Monteiro TC e,
Collins A, Compton J, Copilaș-Ciocianu D, Corbari L, Cordeiro R, Cortés-Hernández K,
Costello M, Crameri S, Cruz-López JA, Cárdenas P, Daly M, Daneliya M, Dauvin J-C,
Davie P, Broyer CD, Grave SD, Lima HCD, Prins JD, Prins WD, Sousa FD, Estrella MD
la, DeSalle R, Decker P, Decock W, Delgado-Salinas A, Deliry C, Dellapé PM, Heyer
JD, Dijkstra K-D, Dmitriev DA, Dohrmann M, Dorado Ó, Dorkeld F, Downey R, Duan L,
Díaz M-C, Eades DC, Egan AN, Eitel M, Nagar AE, Emig CC, Engel MS, Garrote PE,
Evans GA, Evenhuis NL, Falcão M, Farruggia F, Fauchald K, Fautin D, Favret C, Fisher
B, Fišer C, Forró L, Fortuna-Perez AP, Fortune-Hopkins H, Fritsch P, Froese R, Fuchs
A, Fujimoto S, Furuya H, Gagnon E, Garic R, Gasca R, Gattolliat J-L, Gerken S, Lima
AG de, Gibson D, Gielis C, Gilligan T, Giribet G, Duque JCG, Gittenberger A, Galdo GG
del, Gofas S, Goncharov M, Gondim AI, Goodwin C, Govaerts R, Grabowski M,
Granado A de A, Gregório B de S, Grehan JR, Grether R, Grimaldi DA, Gross O,
Guerra-García JM, Guglielmone A, Guilbert E, Frøslev TG, Gusenleitner J, Haas F,
Hadfield KA, Hajdu E, Hassler M, Hastriter MW, Hauser C, Hausmann A, Hayward BW,
Hendrycks E, Henry TJ, Hernandes FA, Hernández-Crespo JC, Hine A, Ho B-C, Hodson
A, Hoeksema B, Hoenemann M, Holstein J, Hooge M, Hooper J, Hopkins H, Horak I,
Horton T, Hošek J, Hughes C, Hughes L, Huys R, Häuser C, Janssens F, Jaume D,
Javadi F, Jazdzewski K, Jersabek CD, Johnson KP, Jordão L, Jóźwiak P, Kajihara H,
Kakui K, Kallies A, Kamiński MJ, Kanda K, Karanovic I, Kathirithamby J, Kelly M, Kim Y-H,
King R, Kirk P, Kitching I, Klautau M, Klitgaard BB, Koenemann S, Korovchinsky NM,
Kotov A, Kramina T, Krapp-Schickel T, Kremenetskaia A, Krishna K, Krishna V, Kroh A,
Kroupa AS, Kury AB, Kury MS, Kvaček J, Lachenaud O, Lado C, Lambert G, Atunes
LLC, Lavin M, Lazarus D, Coze FL, Roux ML, LeCroy S, Linares JL, Lee S, Leitner MF,
Lewis GP, Li S-J, Li-Qiang J, Lichtwardt R(†), Lim S-C, Littlewood T, Lohrmann V,
Longhorn SJ, Lorenz W, Lowry J, Lozano F, Lumen R, Lyal CH, Lörz A-N, Madin L,
Magnien P, Mah C, Mal N, Mamos T, Manconi R, Mansano V, Markello K, Martens K,
Martin JH, Martin P, Mashego KS, Maslakova S, Maslin B, Mattapha S, McFadden C,
McKamey S, McMurtry JA, Medrano MA, Mees J, Mendes AC, Merrin K, Mesa NC,
Messing C, Mielke CGC, Migeon A, Miller DR, Mills C, Minelli A, Mitchell D, Molodtsova
T, Valls JFM, Mooi R, Morandini A, Rocha RM da, Morrow C, Moteetee A, Murillo-
Ramos L, Murphy B, Narita JPZ, Nery DG, Neu-Becker U, Neuhaus B, Newton A, Lin
PNK, Nicolson D, Nielsen JE, Nijhof A, Nishikawa T, Norenburg J, O’Hara T, Ochoa R,
Ohashi H, Ohashi K, Ollerenshaw J, Oosterbroek P, Opresko D, Osborne R, Osigus H-J,
Oswald JD, Ota Y, Otte D, Ouvrard D, Queiroz LP de, Pandey A, Paulay G, Paulson D,
Pauly D, Pennington RT, Pereira J da S, Perez-Gelabert D, Petrusek A, Phillipson P,
Pinheiro U, Morim MP, Pisera A, Pitkin B, Plotkin D, Pierezan BP, Poore G, Povydysh
M, Praxedes RA, Pulawski WJ, Pyle R, Pühringer F, Rajaei H, Rakotonirina N, Ramos
G, Rando J, Filardi FR, Raz L, Read G, Rees T, Reich M, Reimer JD, Rein JO,
Reynolds J, Rincón J, Rius M, Robertson T, Robinson G, Robinson GS(†), Rodríguez E,
Ruggiero M, Ríos P, Rützler K, Sanborn A, Sanjappa M, Santos SG, Santos-Guerra A,
Sartori M, Sattler K, Schierwater B, Schilling S, Schley R, Schmid-Egger C, Schmidt-
Rhaesa A, Schoolmeesters P, Schorr M, Schrire B, Schuchert P, Schuh RT, Schönberg
C, Rodrigues RS, Scoble M, Seijo G, Seleme EP, Senna A, Serejo C, Sforzi A, Shenkar
N, Shimizu G, Siegel V, Sierwald P, Sihvonen P, Flores AS, Carvalho CS de, Simon MF,
Simonsen T, Simpson CE, Sinniger F, Sirichamorn Y, Skvarla M, Smith AD, Smith VS,
Gissi DS, Sokoloff D, Sotuyo S, Soulier-Perkins A, South EJ, Souza-Filho JF, Spearman
L, Spelda J, Steiner A, Stemme T, Sterrer W, Stevenson D, Stiewe MBD, Stirton CH,
Straub S, Stueber G, Stöhr S, Subramaniam S, Swalla B, Swedo J, Sánchez-Ruiz M,
Sørensen MV, Taiti S, Takiya DM, Tandberg AH, Tavakilian G, Taylor K, Thessen A,
Thomas JD, Thomas P, Thomson S, Thuesen E, Thulin M, Thurston M, Thuy B, Todaro
A, Torke BM, Tsai S-Y, Turiault M, Turner JRG, Turner T, Turon X, Tyler S, Uetz P,
Ulmer JM, Vacelet J, Vachard D, Vader W, Domedel GV, Burgt XV der, Vandepitte L,
Vanhoorne B, Vatanparast M, Verhoeff T, Vonk R, Väinölä R, Walker-Smith G, Walter
TC, Wambiji N, Wanke D, Watling L, Weaver H, Webb J, Welbourn WC, Whipps C,
White K, Wilding N, Williams G, Wilson AJG, Wing P, Winitsky S, Wirth CC,
Wojciechowski M, Woodman S, Xavier J, Yi T, Yoder M, Yu DSK, Yunakov N, Zahniser
J, Zeidler W, Zhang R, Zhang ZQ, Zinetti F, d’Hondt J-L, Moraes GJ de, Oliveira ABR
de, Voogd N de, Río MG del, Haaren T van, Nieukerken EJ van, Ofwegen L van, Soest
R van, Şentürk O (2024) Catalogue of Life. Version 2024-12-19.
https://doi.org/10.48580/dglq4
Basaldella M, Furrer L, Tasso C, Rinaldi F (2017) Entity recognition in the biomedical domain
using a hybrid approach. Journal of Biomedical Semantics 8: 51.
https://doi.org/10.1186/s13326-017-0157-6
Buttigieg PL, Pafilis E, Lewis SE, Schildhauer MP, Walls RL, Mungall CJ (2016) The
environment ontology in 2016: bridging domains with increased scope, semantic density,
and interoperation. Journal of Biomedical Semantics 7: 57.
https://doi.org/10.1186/s13326-016-0097-6
Cejuela JM, McQuilton P, Ponting L, Marygold SJ, Stefancsik R, Millburn GH, Rost B, the
FlyBase Consortium (2014) Tagtog: Interactive and text-mining-assisted annotation of
gene mentions in PLOS full-text articles. Database 2014: bau033–bau033.
https://doi.org/10.1093/database/bau033
CERN, OpenAIRE (2024) Zenodo. https://doi.org/10.25495/7GXK-RD71
Chang A, Jeske L, Ulbrich S, Hofmann J, Koblitz J, Schomburg I, Neumann-Schaal M, Jahn D,
Schomburg D (2021) BRENDA, the ELIXIR core data resource in 2021: new
developments and updates. Nucleic Acids Research 49: D498–D508.
https://doi.org/10.1093/nar/gkaa1025
Church SH, Donoughe S, De Medeiros BAS, Extavour CG (2019) A dataset of egg size and
shape from more than 6,700 insect species. Scientific Data 6: 104.
https://doi.org/10.1038/s41597-019-0049-y
Cohen J (1960) A coefficient of agreement for nominal scales. Educational and Psychological
Measurement 20: 37–46. https://doi.org/10.1177/001316446002000104
Comeau DC, Islamaj Dogan R, Ciccarese P, Cohen KB, Krallinger M, Leitner F, Lu Z, Peng Y,
Rinaldi F, Torii M, Valencia A, Verspoor K, Wiegers TC, Wu CH, Wilbur WJ (2013) BioC:
a minimalist approach to interoperability for biomedical text processing. Database 2013:
bat064–bat064. https://doi.org/10.1093/database/bat064
Cornwallis CK, Griffin AS (2024) A guided tour of phylogenetic comparative methods for
studying trait evolution. Annual Review of Ecology, Evolution, and Systematics 55: 181–204.
https://doi.org/10.1146/annurev-ecolsys-102221-050754
Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional
transformers for language understanding. Proceedings of the 2019 Conference of the
North: 4171–4186. https://doi.org/10.18653/v1/N19-1423
Farrell MJ, Brierley L, Willoughby A, Yates A, Mideo N (2022) Past and future uses of text
mining in ecology and evolution. Proceedings of the Royal Society B: Biological
Sciences 289: 20212721. https://doi.org/10.1098/rspb.2021.2721
Farrell MJ, Le Guillarme N, Brierley L, Hunter B, Scheepens D, Willoughby A, Yates A, Mideo N
(2024) The changing landscape of text mining: a review of approaches for ecology and
evolution. Proceedings of the Royal Society B: Biological Sciences 291: 20240423.
https://doi.org/10.1098/rspb.2024.0423
Feron R, Waterhouse RM (2022a) Assessing species coverage and assembly quality of rapidly
accumulating sequenced genomes. GigaScience 11: giac006.
https://doi.org/10.1093/gigascience/giac006
Feron R, Waterhouse RM (2022b) Exploring new genomic territories with emerging model
insects. Current Opinion in Insect Science 51: 100902.
https://doi.org/10.1016/j.cois.2022.100902
Furrer L, Cornelius J, Rinaldi F (2022) Parallel sequence tagging for concept recognition. BMC
Bioinformatics 22: 623. https://doi.org/10.1186/s12859-021-04511-y
Gaimari SD (2017) The dipteran family Celyphidae in the New World, with discussion of and key
to world genera (Insecta, Diptera). ZooKeys 711: 113–130.
https://doi.org/10.3897/zookeys.711.20840
Grimaldi DA, Engel MS (2005) Evolution of the insects. Cambridge university press, Cambridge.
Guidoti M, Sokolowicz C, Simoes F, Gonçalves V, Ruschel T, Alvares D, Agosti D (2021)
TreatmentBank: Plazi’s strategies and its implementation to most efficiently liberate data
from scholarly publications. Biodiversity Information Science and Standards 5: e75690.
https://doi.org/10.3897/biss.5.75690
Hedrick BP, Heberling JM, Meineke EK, Turner KG, Grassa CJ, Park DS, Kennedy J, Clarke
JA, Cook JA, Blackburn DC, Edwards SV, Davis CC (2020) Digitization and the future of
natural history collections. BioScience 70: 243–251. https://doi.org/10.1093/biosci/biz163
Jaron KS, Parker DJ, Anselmetti Y, Tran Van P, Bast J, Dumas Z, Figuet E, François CM,
Hayward K, Rossier V, Simion P, Robinson-Rechavi M, Galtier N, Schwander T (2022)
Convergent consequences of parthenogenesis on stick insect genomes. Science
Advances 8: eabg3842. https://doi.org/10.1126/sciadv.abg3842
Keck F, Broadbent H, Altermatt F (2025) Extracting massive ecological data on state and
interactions of species using large language models.
https://doi.org/10.1101/2025.01.24.634685
Kim Sang EFT, De Meulder F (2003) Introduction to the CoNLL-2003 shared task: Language-
independent named entity recognition. In: Proceedings of the Seventh Conference on
Natural Language Learning at HLT-NAACL 2003, 142–147. Available from:
https://aclanthology.org/W03-0419/.
Kissling WD, Walls R, Bowser A, Jones MO, Kattge J, Agosti D, Amengual J, Basset A, Van
Bodegom PM, Cornelissen JHC, Denny EG, Deudero S, Egloff W, Elmendorf SC,
Alonso García E, Jones KD, Jones OR, Lavorel S, Lear D, Navarro LM, Pawar S, Pirzl
R, Rüger N, Sal S, Salguero-Gómez R, Schigel D, Schulz K-S, Skidmore A, Guralnick
RP (2018) Towards global data products of Essential Biodiversity Variables on species
traits. Nature Ecology & Evolution 2: 1531–1540.
https://doi.org/10.1038/s41559-018-0667-3
Le Guillarme N, Thuiller W (2022) TaxoNERD: Deep neural models for the recognition of
taxonomic entities in the ecological and evolutionary literature. Methods in Ecology and
Evolution 13: 625–641. https://doi.org/10.1111/2041-210X.13778
Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J (2020) BioBERT: a pre-trained biomedical
language representation model for biomedical text mining. Wren J (Ed.). Bioinformatics
36: 1234–1240. https://doi.org/10.1093/bioinformatics/btz682
Lever J, Altman R, Kim J-D (2020) Extending TextAE for annotation of non-contiguous entities.
Genomics & Informatics 18: e15. https://doi.org/10.5808/GI.2020.18.2.e15
Li Y, Ramprasad R, Zhang C (2024) A simple but effective approach to improve structured
language model output for information extraction.
https://doi.org/10.48550/ARXIV.2402.13364
Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V
(2019) RoBERTa: A robustly optimized BERT pretraining approach. Available from:
http://arxiv.org/abs/1907.11692 (December 16, 2024).
Mammola S, Pavlek M, Huber BA, Isaia M, Ballarin F, Tolve M, Čupić I, Hesselberg T, Lunghi E,
Mouron S, Graco-Roza C, Cardoso P (2022) A trait database and updated checklist for
European subterranean spiders. Scientific Data 9: 236.
https://doi.org/10.1038/s41597-022-01316-3
Marcos D, van de Vlasakker R, Athanasiadis IN, Bonnet P, Goeau H, Joly A, Kissling WD,
Leblanc C, van Proosdij ASJ, Panousis KP (2024) Fully automatic extraction of
morphological traits from the Web: utopia or reality?
https://doi.org/10.48550/ARXIV.2409.17179
McCallen E, Knott J, Nunez-Mir G, Taylor B, Jo I, Fei S (2019) Trends in ecology: shifts in
ecological research themes over the past four decades. Frontiers in Ecology and the
Environment 17: 109–116. https://doi.org/10.1002/fee.1993
McHugh ML (2012) Interrater reliability: the kappa statistic. Biochemia Medica 22: 276–282.
Montagna M (2011) Pachybrachis sassii, a new species from the Mediterranean Giglio Island
(Italy) (Coleoptera, Chrysomelidae, Cryptocephalinae). ZooKeys 155: 51–60.
https://doi.org/10.3897/zookeys.155.1951
Montani I, Honnibal M, Boyd A, Van Landeghem S, Peters H (2023) explosion/spaCy: v3.7.2:
Fixes for APIs and requirements. https://doi.org/10.5281/ZENODO.1212303
Mündler N (2024) nielstron/quantulum3. Available from: https://github.com/nielstron/quantulum3
(November 25, 2024).
Mungall C, Matentzoglu N, Balhoff J, Osumi-Sutherland D, Duncan B, Pgaudet, Tan S, Hoyt CT,
Pilgrim C, Overton JA, Lauren, Caron A, Nomi Harris, Moxon S, Lschriml, Vasilevsky N,
Toro S, Goutte-Gattat D, Brush M, Vasundra Touré, Bretaudeau A, Cain S, Haendel M,
DiatomsRcool, Bide Zhang, Dowland C, Dooley D, Actions-User, Hammock J (2023)
The OBO relation ontology, http://purl.obolibrary.org/obo/ro.owl.
https://doi.org/10.5281/ZENODO.593101
Mungall CJ, Torniai C, Gkoutos GV, Lewis SE, Haendel MA (2012) Uberon, an integrative
multi-species anatomy ontology. Genome Biology 13: R5.
https://doi.org/10.1186/gb-2012-13-1-r5
Parr CS, Wilson N, Leary P, Schulz K, Lans K, Walley L, Hammock J, Goddard A, Rice J,
Studer M, Holmes J, Corrigan, Jr. R (2014) The Encyclopedia of Life v2: Providing global
access to knowledge about life on Earth. Biodiversity Data Journal 2: e1079.
https://doi.org/10.3897/BDJ.2.e1079
Pasche E, Agosti D, Penev L, Groom Q, Flament A, Gobeill J, Ruch P (2023a) Towards
“Biodiversity PMC.” Biodiversity Information Science and Standards 7: e111647.
https://doi.org/10.3897/biss.7.111647
Pasche E, Gobeill J, Agosti D, Penev L, Groom Q, Georgiev T, Gaillac E, Flament A,
Caucheteur D, Michel P-A, Ruch P (2023b) From SIBiLS to Biodiversity PMC:
Foundations for the One Health Library. Biodiversity Information Science and Standards
7: e111660. https://doi.org/10.3897/biss.7.111660
Ramshaw LA, Marcus MP (1999) Text chunking using transformation-based learning. In:
Armstrong S, Church K, Isabelle P, Manzi S, Tzoukermann E, Yarowsky D (Eds),
Natural Language Processing Using Very Large Corpora. Text, Speech and Language
Technology. Springer Netherlands, Dordrecht, 157–176.
https://doi.org/10.1007/978-94-017-2390-9_10
Rosonovski S, Levchenko M, Ide-Smith M, Faulk L, Harrison M, McEntyre J (2023) Searching
and evaluating publications and preprints using Europe PMC. Current Protocols 3: e694.
https://doi.org/10.1002/cpz1.694
Shirey V, Larsen E, Doherty A, Kim CA, Al-Sulaiman FT, Hinolan JD, Itliong MGA, Naive MAK,
Ku M, Belitz M, Jeschke G, Barve V, Lamas G, Kawahara AY, Guralnick R, Pierce NE,
Lohman DJ, Ries L (2022) LepTraits 1.0 A globally comprehensive dataset of butterfly
traits. Scientific Data 9: 382. https://doi.org/10.1038/s41597-022-01473-5
Stork NE (2018) How many species of insects and other terrestrial arthropods are there on
Earth? Annual Review of Entomology 63: 31–45.
https://doi.org/10.1146/annurev-ento-020117-043348
Wong MKL, Guénard B, Lewis OT (2019) Trait-based ecology of terrestrial arthropods.
Biological Reviews 94: 999–1022. https://doi.org/10.1111/brv.12488
Yamada I, Asai A, Shindo H, Takeda H, Matsumoto Y (2020) LUKE: Deep contextualized entity
representations with entity-aware self-attention. In: Proceedings of the 2020 Conference
on Empirical Methods in Natural Language Processing (EMNLP). Association for
Computational Linguistics, Online, 6442–6454.
https://doi.org/10.18653/v1/2020.emnlp-main.523
Supplementary materials
Supplementary File S1: Curated trait dictionaries
An MS Excel spreadsheet presenting the lists of trait dictionaries for feeding ecology, habitat,
and morphology, with links to the source resources, synonyms, and definitions.
Supplementary File S2: Annotator guidelines
A PDF file of the notes and guidelines developed by the annotators during the curation of the
gold-standard annotation data.
Supplementary File S3: Gold-standard annotated documents
An MS Excel spreadsheet listing the annotated files, the number of annotations for each type,
and the corresponding annotators.
Supplementary File S4: NER and RE baseline results
An MS Excel spreadsheet containing five tables with exact scores for all configurations, in terms
of recall, precision, and F-score values, along with the corresponding support for each class and
the macro and weighted averages.