Extracting specimen label data rapidly with a smartphone – a great help for simple digitization in taxonomy and collection management

preprint OA: closed CC-BY-4.0

Abstract

Here we provide short tutorials for reading out specimen label data from typewritten as well as handwritten labels rapidly and easily with a mobile phone. The approach is generally applicable, but we test it in particular on insect specimen labels, which are generally quite small. We provide alternative instructions for Android- and Apple-based environments, as well as protocols for single and bulk scans. We expect that this way of data capture will be of great help for simple digitization in taxonomy and collection management, outside the large industrial digitization pipelines. By omitting the step of taking and maintaining images of the labels, this approach is more rapid, cheaper, and environmentally more sustainable, because no storage with its carbon footprint is required for label images. The biggest advantage of this protocol is the use of readily available commercial devices, which are easy to handle because they are used on a daily basis and which can be replaced at relatively low cost when they become technologically outdated, which is also a matter of cyber security.

Author-formatted, not peer-reviewed document posted on 06/11/2024. DOI: https://doi.org/10.3897/arphapreprints.e141113

Keywords

Collection digitization, labels, label transcription, taxonomic revisions, artificial intelligence, citizen science, taxonomic impediment, data science

Introduction

Immense efforts are currently under way to digitize natural history collections on a large scale, including the associated information and metadata (e.g., Smith & Blagoderov 2012; Hardisty et al. 2020; Belot et al. 2023; Groom et al. 2023; Ong et al. 2023). In these endeavors, the automatic capture of label data plays a central role (e.g., Beaman et al. 2006; Heidorn & Wei 2008; Lafferty & Landrum 2009; Granzow-de la Cerda & Beach 2010; Haston et al. 2012; Agarwal et al. 2018; Alzuru et al. 2019, 2020; Alzuru 2020; Owen et al. 2020; Belot et al. 2023; Zhang 2023; Takano et al. 2024). However, many of these very promising activities have long been exclusive to large companies, museums, or institutions with specialized technical infrastructure and specially trained staff (e.g., Blagoderov et al. 2012) for the highly customized implementations used (e.g., https://picturae.com/). Most current digitization initiatives aim at a one-off retro-digitization of large collections (Engledow et al. 2018; Hardisty et al. 2020; Helminger et al. 2020; De Smedt et al. 2024). However, this approach has limitations: 1) collections are continuously growing and developing (see also Balke et al. 2013); 2) through its ongoing research on specimens, the scientific community produces a large amount of high-quality biodiversity data independently of the collection institutions, and amateur scientists are largely involved in this work (Löbl et al. 2022). The latter point is connected with the often remote study of collection material, away from the collections and the large digitization pipelines. Especially in insects, taxonomic specialists are rare, and specimens are often loaned and shipped overseas to obtain the best identifications from world-leading specialists.
In this respect, working processes differ considerably from those for vertebrates or plants, which often take the lead in new methodologies such as large-scale digitization. However, the resulting data often do not yet end up in big data repositories, partly because taxonomists lack time and incentive and are overloaded with work. Therefore, more flexible solutions are needed that allow more efficient data processing, speed up biodiversity and species discovery, and help to overcome the taxonomic impediment. This would be perfectly in line with the idea of integrating specimen databases and revisionary systematics (Schuh 2012). In revision-based digitization (see also Meier & Dikow 2004), in contrast to retro-digitization, biodiversity data come from taxonomic revisionary studies rather than from uncritical digitizing of museum specimen data. Its advantages are the following (extended from Meier & Dikow (2004) and Schuh (2012)): 1) the data are provided in association with the most accurate identifications; 2) the data have the most complete taxonomic and geographic coverage; 3) the data satisfy these points in a cost-effective way; 4) data for occurrences and images are citable and acknowledgeable (therefore, errors can be retracted and corrected). We recently realized that mobile devices, which are now in the hands of almost every person, may help to speed up data collection and digitization, including biodiversity discovery. Through simple, playful experimenting, we discovered how useful mobile phones can be in association with cloud-like environments (such as Google or Apple). Since we think that these “workflows” can be really useful for a large audience, we prepared this short paper to disseminate simple tutorials on how to read out specimen label data rapidly and easily with a smartphone.
Most digitization approaches envision the capture of digital metadata (e.g., labels) via the intermediate step of digital images (Nelson et al. 2012). This brings further difficulties and considerable costs for image processing and storage (Tann & Flemons 2008; Hardisty et al. 2020b). In terms of cost-benefit balance, it would therefore be more sustainable to skip this step if the data can be read out and spell-checked in the same moment, without the burden of images. For non-type specimens, such images are scientifically and practically largely unnecessary (in terms of cost-benefit balance).

Material and methods

Resources needed
1) A mobile phone (i.e., a smartphone; a reasonably recent model with macro photography options).
2) A stable internet connection for the phone via WLAN or mobile network.
3) A computer connected to the internet and logged into a Google account (via the Google Chrome browser) or an Apple ID account.
4) A database or text file in which to insert the specimen data.
5) Google Lens or Google Translate installed on the mobile phone.

For our testing, we used a “Motorola G5g plus” (system: Android 10) and an “iPhone 15 Pro Max” (system: iOS 17.7). We explored the data extraction from the labels with different approaches and under alternative label conditions (Figure 1). Each of the tutorials may prove more suitable for a particular technical situation of the user. We describe all of them below as simplified step-by-step tutorials, accompanied by screenshots and examples of the resulting data sheets.

1. Open operating system environment

Variant 1
1) Open the app on your mobile phone: Google Translate (or Google Lens).
2) Focus on the label to be scanned; if necessary, zoom in digitally via the touch screen so that the label(s) fill the screen as completely as possible (the image need not be perfectly focused, but the important letters must be recognizable).
3) Scan (take a snapshot with the “photo” button).
4) Mark the label text (Figure 1) by selecting it on the touch screen of the mobile phone.
5) Select “Copy to Computer”.
6) Confirm the selected device (the computer on which you are logged into your Google account) by choosing “select”.
7) On your computer, simply paste from the clipboard into your target document (verbatim label citation).
8) Finally, proofread the scan (while the specimen is still in front of you) and manually correct misspellings and misreadings.
9) Finished.
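The proofreading in step 8 could be supported by a quick automatic check of the pasted text. The following Python lines are a minimal sketch, not part of the tested workflow: they assume that collecting dates follow the Roman-numeral month style seen on the labels in Table 1 (e.g., "2.V.2018", "XI.48"), and the names `ROMAN_DATE` and `has_roman_date` are hypothetical. A pasted text without any such date is merely flagged for a closer look.

```python
import re

# Assumed date style: optional day, Roman-numeral month, 2- or 4-digit year,
# e.g. "2.V.2018", "3.XII.1985" or "XI.48". Other date styles are not covered.
ROMAN_DATE = re.compile(
    r"\b(?:\d{1,2}\.)?(?:I{1,3}|IV|V|VI{1,3}|IX|X|XI{1,2})\.(?:\d{4}|\d{2})\b"
)


def has_roman_date(text: str) -> bool:
    """Return True if the text contains at least one Roman-numeral month date."""
    return ROMAN_DATE.search(text) is not None


# Pasted label texts; the second one is the uncorrected scan from Table 1 (Figure 1B).
for pasted in ("Duhok, Akre, Bjeel 2.V.2018", "Bolivia Buengvista Pereira X198"):
    status = "date found" if has_roman_date(pasted) else "no date found, check label"
    print(f"{pasted!r}: {status}")
```

Such a check only points to labels where the date may have been misread; it does not replace reading the scan against the specimen.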
*Alternatively, “copy” can be chosen in step 5, and the copied content can be pasted into an open Google Docs document on the mobile device, which can then be accessed directly on the computer synchronized via the Google account. This detour is only rarely necessary, namely when the network is overloaded or the internet connection is too slow (see Results below). Transfer also works outside of the Google Cloud environment but is a little more complex: files can be shared between Android, Windows, or Mac devices using the KDE Connect app (https://kdeconnect.kde.org). All devices must be in the same Wi-Fi network. After installing the KDE Connect app, the text can be transferred to the computer.

Variant 2 (bulk scans)
1) Open the Google Keep – Notes and Lists app on your mobile phone.
2) Click on the picture icon. Focus on the label you want to scan.
3) Click on “take photo” to capture the image and then on the checkmark icon to save it.
4) Click on the image, then on the three dots in the upper right corner, and then on “grab image text”. The text will appear as a note and can be manually corrected for spelling or reading errors. A title for the note can also be added.
5) Repeat steps 2-4 for each label you want to scan. The labels will be saved as separate notes.
6) Select all notes, click on the three dots in the upper right corner, and then on “copy to Google Docs”. A single document containing all images and texts will be generated. (This step can be done on your mobile phone or on a computer logged into your Google account; see Figure 5.)
7) On your computer, open your Google Docs file and make the final corrections.

2. With an “Apple-only” environment

Requirements: Make sure you have a recent iPhone or iPad model with macro photography capabilities and a recent operating system (preferably iOS 15 or later). You will also need a Mac computer and an Apple iCloud account (at least the free version).
An internet connection for the phone (e.g., via WLAN) is not necessary for data collection if you first collect the data from the specimen labels on your phone (bulk scans) and return to your Mac computer later.

a) Using the Notes app:
1) Open the Notes app on your iPhone and set up a new note for your current project.
2) In your note, tap the camera symbol at the bottom and choose “scan text” from the pop-up menu. A camera window opens in the bottom part of your note.
3) Aim your camera at the text block you want to scan. Yellow brackets show which text block the software sees as the target. Once the desired target text is within the brackets, press the insert button at the bottom of the camera window. The targeted text will be read and automatically transferred to your note.
4) Briefly check the result in your note.
5) Go to the next line in your note and scan the next target text in the same way, thus accumulating information from multiple specimen labels or multiple specimens as you like.
6) Once you are happy with the collected data, return to your desktop Mac computer. If the phone had a connection to your mobile provider while you took the scans or on your way back to your desktop computer, the Notes app should synchronize automatically with your Apple account in the background, so that you find all the scanned data when you open the Notes app on your desktop computer.
7) Copy and paste the information accumulated in your Notes app into the document or database of your choice.

b) Using the Shortcuts app:
The Shortcuts app of iOS can be used to program an automated process that takes the photo, extracts the text, and fills a table in Apple’s spreadsheet app Numbers. Make sure that your Shortcuts and Numbers apps are synchronized across all of your devices via your iCloud Drive.
We assembled a Shortcuts algorithm as a proof of concept (Fig. 4).

3. Without internet connection, using a Bluetooth approach (with a Windows PC and an Android mobile phone)
1) Download the app (Google Keep – Notes and Lists) onto the mobile phone.
2) Open the “Bluetooth” options on the computer.
3) Pair the devices (computer and mobile phone).
4) Click on “receive files via Bluetooth”.
5) Open the app and click on the picture icon.
6) Click on “take photo” and take the photo.
7) Click on the captured picture.
8) Click on the three dots in the upper right corner.
9) Click on “grab image text” and select the extracted text.
10) Click on the three dots in the lower right corner and click on “send”.
11) Click on “send via other apps” and choose the Bluetooth symbol.
12) Choose a folder on the computer in which to save the HTML file.
13) Copy the text from the HTML file into a text editor for final spelling corrections.

We expect this approach to work in a similar way in the Apple environment.
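Step 13 of the Bluetooth workflow could also be scripted instead of copying by hand. The following Python lines are a minimal sketch using only the standard library; the internal structure of the transferred HTML file is not documented here, so the code makes no assumption about it beyond being HTML and simply collects all visible text. The names `TextExtractor` and `html_to_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from an HTML document, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in ("br", "p", "div"):
            self.chunks.append("\n")  # treat block-level tags as line breaks

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string, one non-empty line per row."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = [line.strip() for line in "".join(parser.chunks).splitlines()]
    return "\n".join(line for line in lines if line)


# Invented HTML snippet for illustration; a real file would be read from disk.
sample = "<html><body><p>Bolivia Buenavista</p><div>Museum Frey Tutzing<br>Ex Coll. Frey</div></body></html>"
print(html_to_text(sample))
```

The resulting plain text still needs the same manual spelling check as in the other workflows.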

Results

Table 1 summarizes some major characteristics of the data captured with these methods, showing the directly pasted content and the amount of real-time spelling correction the data required. While printed labels needed minimal subsequent spelling correction, handwritten labels often needed more, depending on the size and style of the handwriting. In these cases, scanning the labels separately from the pin, without distortion, helped considerably (Fig. 1A, B). For printed labels, neither the direction (Fig. 2C) nor the distortion of the labels mattered much (Fig. 2D). We were able to scan up to three labels (from a distorted side view) while they were still mounted on the pin, without spreading out or removing them (Fig. 2D). Since low image resolution was not a problem, we could zoom in digitally on the labels with the mobile phone until they almost filled the frame. However, the initial testing was also done successfully with much smaller images (Fig. 1A-D). Processing was very fast: the estimated time for full data capture (including spelling correction) was 3-10 seconds per specimen. Processing often took a little longer for badly handwritten labels, when an insect pin or other labels covered parts of the label text, or when the internet connection was slow. The total time gain per label was larger for labels containing much information or for multiple labels. For example, for the labels shown in Figure 2B, typing the data by hand into the computer required 60 seconds (including spell check), whereas the label scan with approach 1 took 8 seconds (including manual spell check). For the data of Figure 2C, manual typing and spell check required 121 seconds, while the label scan with approach 1 took 10 seconds (including manual spell check).
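The two timing examples can be turned into speed-up factors with simple arithmetic. The short Python sketch below uses only the four timings reported above; the function name `speedup` is hypothetical.

```python
# Timings reported in the text, in seconds: (manual typing, label scan with approach 1).
timings = {
    "Figure 2B": (60, 8),
    "Figure 2C": (121, 10),
}


def speedup(manual_s: float, scan_s: float) -> float:
    """Factor by which the scan is faster than manual typing."""
    return manual_s / scan_s


for label, (manual_s, scan_s) in timings.items():
    saved = manual_s - scan_s
    print(f"{label}: {speedup(manual_s, scan_s):.1f}x faster, {saved} s saved")
# prints:
# Figure 2B: 7.5x faster, 52 s saved
# Figure 2C: 12.1x faster, 111 s saved
```

These factors rest on two labels and a single typist, so they illustrate the order of magnitude rather than a measured average.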
We refrained from larger experiments comparing the timings, since the duration of typing depends strongly on the typing skills of the person. The comparative numbers given here refer to a scientist without typing training (performed by D.A.). In some instances in approach 1, we had to take the detour via a Google document* because of a bad internet connection, when the copy process failed due to slow data transfer. This was usually two “clicks” (or seconds) slower, which is not a big delay compared with the time required for manual typing.

The iPhone workflow test was done with a larger label (Fig. 3B). In workflow 2a) above, using the Notes app on the Apple iPhone, the image recognition tried to identify and focus on blocks of text within the label but did not capture the label as a whole. To capture multiple pieces of information, the process had to be repeated accordingly. Once the data have been collected in Notes, further copy-and-paste editing is necessary to transfer them to a database. Workflow 2b), using Shortcuts app automation (Fig. 4), scans the whole label and also stores the data with a timestamp directly in a spreadsheet app. Furthermore, the photos are stored in the user's Apple iCloud account (as backups for potential later reference), but this step is optional in the algorithm. The result of the scan is shown in Fig. 3B, C. Note that incomplete text in the original caused interpretation problems (truncated third line and partially hidden bottom part of "image 0355"). Scanning the label and filling the cells in the spreadsheet took less than 10 s. The algorithm analyzed the label as lines of text and allocated one cell per line in the spreadsheet, so cells had to be edited whenever several lines belonged to the same block of information.
This means that the locality information in our example was split across two cells in our test. Depending on the further tasks the user wants to accomplish, such splits will need copy-and-paste processing.

The approach using a Bluetooth connection between the mobile phone and the computer took slightly longer (because of the number of “device clicks”) but still saved a great deal of time in capturing the label data. This matters given the widespread situation that many collection storerooms are partly or entirely offline and that remotely working taxonomists may have difficulty obtaining a good internet connection.

Bulk approaches are available in both the Google and the Apple environment (see Fig. 4), with the Google Keep and Notes applications, respectively. In both, images are temporarily stored on the mobile device and can subsequently be either saved or discarded. While bulk scans save time on data transfer, they have the disadvantage that potentially incomplete scans are only discovered when the specimens are no longer at hand.
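The one-cell-per-line behaviour of workflow 2b) and the subsequent merging of split cells can be illustrated outside the Shortcuts app. The following Python lines are a minimal sketch under stated assumptions: it is not the Shortcuts algorithm of Fig. 4, the function names `label_to_row` and `merge_cells` are hypothetical, and the label lines are invented placeholders rather than the content of the label in Fig. 3B.

```python
import csv
import io
from datetime import datetime


def label_to_row(lines, timestamp=None):
    """One spreadsheet row per label: a timestamp, then one cell per recognized line."""
    stamp = (timestamp or datetime.now()).isoformat(timespec="seconds")
    return [stamp] + [line.strip() for line in lines if line.strip()]


def merge_cells(row, first, last, sep=" "):
    """Join cells first..last (inclusive) into one cell, e.g. a split locality."""
    return row[:first] + [sep.join(row[first:last + 1])] + row[last + 1:]


# Placeholder lines standing in for the text recognized on one label.
recognized = ["Collection name", "Locality line 1", "Locality line 2"]

row = label_to_row(recognized, datetime(2024, 11, 6, 12, 0, 0))
merged = merge_cells(row, 2, 3)  # the locality was split across cells 2 and 3

buffer = io.StringIO()
csv.writer(buffer).writerow(merged)
print(buffer.getvalue(), end="")
```

Which cells belong together still has to be decided by the person processing the label; the sketch only shows that the merge itself is a small editing step.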

Discussion

While new technology, including artificial intelligence, is entering our daily life, its use in biodiversity research is still rather limited, although approaches to AI-powered label recognition have been developed (Johaadien & Torma 2023; Weaver et al. 2023; Takano et al. 2024). Similar smartphone tutorials have already been provided for specimen photography (Riyaz & Ignacimuthu 2023), although such methods may already be in wide use without having been formally addressed in the scientific literature. Here we addressed the scanning of label information using a smartphone under different operating systems. To our knowledge, this has not so far been explored and applied specifically to insect collection specimens.

There are solutions for large-scale mass digitization of collections (Blagoderov et al. 2012; Tegelberg et al. 2014; Engledow et al. 2018; Belot et al. 2023). All of these solutions require manual separation of specimens and labels in order to photograph them separately. Initial trials with robotic technology (e.g., Dupont & Price 2019) are promising but can only be used by larger institutions with the appropriate budget.

By partly omitting the hitherto obligatory step of taking and permanently storing images of the labels, this direct approach to data capture is more rapid and environmentally more sustainable. In some of our procedures, images are nevertheless taken in the background without delay, and there is the option to retain or discard them. Especially for simple extraction of distribution data in the framework of taxonomic revisions or faunistic studies, there is no scientific necessity to keep images of the labels of every specimen in the long term.
Moreover, the scanned and extracted data can be spell-checked while the specimen is still at hand, so that the data are finalized once and for all after the first processing. However, depending on individual needs and working conditions, users can choose their own workflow. It is possible to scan 50 labels in a row (i.e., the bulk workflow) before transferring the data to the computer. In some critical cases, a backup photo is then useful for quality assessment and spell-checking.

Another great advantage is that these protocols use commercial devices that are simple to handle and inexpensive to replace when they become technologically outdated, which is also a matter of cybersecurity. Unfortunately, in biosystematics we have often experienced that customized devices are overpriced and lag behind technological advances (e.g., in computer operating systems), often requiring expensive updates and service. Since a large share of biodiversity research is done by amateur scientists (and even professionals always lack funding for their “descriptive research”), many of those involved do not have access to large or continuous funding.

Other consequences: The high reliability of text recognition and the rapid data transfer make the use of (only) machine-readable barcode labels and QR codes in collection management superfluous, since connected data can easily be retrieved from numerical voucher numbers on labels. The solutions and tutorials proposed here are very well suited for the fast and secure recording of small quantities of collection objects, e.g., when visiting a collection or when selecting individual objects. We are aware that habits, skills, and specific workflows influence the way such devices and text recognition capabilities are integrated. We are convinced that they will make a significant contribution and help to alleviate the taxonomic impediment (e.g., de Carvalho et al. 2005, 2007; Engel et al. 2021), as the workload of taxonomists recording the material they study in databases will be reduced roughly tenfold. Finally, there may be even more options for scanning labels with mobile devices, and these may evolve as quickly as mobile phones and artificial intelligence technology in general. We therefore hope that potential users will take this paper as an inspiration to continue exploring how to apply this technology successfully in their established workflows.

Acknowledgements

We thank the numerous colleagues who helped by discussing the topic and whose feedback encouraged us to pursue this article.

References

Agarwal N, Ferrier N, Hereld M (2018) Towards automated transcription of label text from pinned insect collections. 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, pp. 189-198, doi: 10.1109/WACV.2018.00027. Alzuru I (2020) Human-machine extraction of Information from biological collections. PhD thesis, University Florida, 160pp. Alzuru I, Malladi A, Matsunaga A, Tsugawa M, José FAB. (2019) Human-Machine Information Extraction Simulator for Biological Collections. 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, pp. 4565-4572, doi: 10.1109/BigData47090.2019.9005601. Alzuru I, Matsunaga A, Tsugawa M, Fortes JAB (2020) General Self-aware Information Extraction from Labels of Biological Collections. 2020 IEEE International Author-formatted, not peer-reviewed document posted on 06/11/2024. DOI:  https://doi.org/10.3897/arphapreprints.e141113 13 Conference on Big Data (Big Data), Atlanta, GA, USA, pp. 3035-3044, doi: 10.1109/BigData50022.2020.9377737. Balke M, Schmidt S, Hausmann A et al. (2013) Biodiversity into your hands - A call for a virtual global natural history ‘metacollection’. Frontiers in Zoology 10: 55 https://doi.org/10.1186/1742-9994-10-55 Beaman RS, Cellinese N. Heidorn PB, Guo Y, Green AM, Thiers B (2006) HERBIS: Integrating digital imaging and label data capture for herbaria [Abstract]. Botany 2006, California State University – Chico. 28 July–2 August 2006. http://www.2006.botanyconference.org/engine/search/index.php?func=detail&aid =402. Belot M, Preuss L, Tuberosa J, Claessen M, Svezhentseva O, Schuster F, Bölling C, Léger T (2023) High Throughput Information Extraction of Printed Specimen Labels from Large-Scale Digitization of Entomological Collections using a Semi- Automated Pipeline. Biodiversity Information Science and Standards 7: e112466. 
https://doi.org/10.3897/biss.7.112466 Blagoderov V, Kitching I, Livermore L, Simonsen T, Smith VS (2012) No specimen left behind: industrial scale digitization of natural history collections. ZooKeys 209: 133-146. https://doi.org/10.3897/zookeys.209.3178 de Carvalho MR, Bockmann FA, Amorim DS, de Vivo M, de Toledo-Piza M, Menezes NA, de Figueiredo JL, McEachran JD (2005) Revisiting the taxonomic impediment. Science 307: 353-353. DOI:10.1126/science.307.5708.353b de Carvalho MR, Bockmann FA, Amorim DS et al. (2007) Taxonomic Impediment or Impediment to Taxonomy? A Commentary on systematics and the cybertaxonomic-automation paradigm. Evolutionary Biology 34: 140–143 https://doi.org/10.1007/s11692-007-9011-6 De Smedt S, Bogaerts A, De Meeter N, Dillen M, Engledow H, Van Wambeke P, Leliaert F, Groom Q (2024) Ten lessons learned from the mass digitisation of a herbarium collection. PhytoKeys 244: 23-37. https://doi.org/10.3897/phytokeys.244.120112 Dupont S, Price BW (2019) ALICE, MALICE and VILE: High throughput insect specimen digitisation using angled imaging techniques. Biodiversity Information Science and Standards 3: e37141. https://doi.org/10.3897/biss.3.37141 Author-formatted, not peer-reviewed document posted on 06/11/2024. DOI:  https://doi.org/10.3897/arphapreprints.e141113 14 Engel MS, Ceríaco LMP, Daniel GM, et al. (2021) The taxonomic impediment: a shortage of taxonomists, not the lack of technical approaches. Zoological Journal of the Linnean Society 193(2): 381–387. https://doi.org/10.1093/zoolinnean/zlab072 Engledow H, De Smedt S, Groom Q, Bogaerts A, Stoffelen P, Sosef M, Van Wambeke P (2018) Managing a mass digitization project at Meise Botanic Garden: From start to finish. Biodiversity Information Science and Standards 2: e25912. https://doi.org/10.3897/biss.2.25912 Granzow-de la Cerda Í, Beach JH (2010) Semi-automated workflows for acquiring specimen data from label images in herbarium collections. Taxon 59: 1830-1842. 
https://doi.org/10.1002/tax.596014 Groom Q, Dillen M, Addink W, Ariño AHH, Bölling C, Bonnet P, Cecchi L, Ellwood ER, Figueira R, Gagnier P-Y, Grace OM, Güntsch A, Hardy H, Huybrechts P, Hyam R, Joly AAJ, Kommineni VK, Larridon I, Livermore L, Lopes RJ, Meeus S, Miller JA, Milleville K, Panda R, Pignal M, Poelen J, Ristevski B, Robertson T, Rufino AC, Santos J, Schermer M, Scott B, Seltmann KC, Teixeira H, Trekels M, Gaikwad J (2023) Envisaging a global infrastructure to exploit the potential of digitised collections. Biodiversity Data Journal 11: e109439. https://doi.org/10.3897/BDJ.11.e109439 Haston E, Cubey RWN, Pullan M, Atkins H, Harris D (2012) Developing integrated workflows for the digitisation of herbarium specimens using a modular and scalable approach. ZooKeys 209: 93-102. https://doi.org/10.3897/zookeys.209.3121 Hardisty A, Saarenmaa H, Casino A, Dillen M, Gödderz K, Groom Q, Hardy H, Koureas D, Nieva de la Hidalga A, Paul DL, Runnel V, Vermeersch X, van Walsum M, Willemse L (2020a) Conceptual design blueprint for the DiSSCo digitization infrastructure - DELIVERABLE D8.1. Research Ideas and Outcomes 6: e54280. https://doi.org/10.3897/rio.6.e54280 Hardisty A, Livermore L, Walton S, Woodburn M, Hardy H (2020b) Costbook of the digitisation infrastructure of DiSSCo. Research Ideas and Outcomes 6: e58915. https://doi.org/10.3897/rio.6.e58915 Author-formatted, not peer-reviewed document posted on 06/11/2024. DOI:  https://doi.org/10.3897/arphapreprints.e141113 15 Heidorn PB, Wei Q (2008) Automatic metadata extraction from museum specimen labels. Pp. 57–68 in: Greenberg, J. & Klas, W. (eds.), Metadata for semantic and social applications: Proceedings of the International Conference on Dublin Core and Metadata Applications, Berlin, 22–26 September 2008, DC 2008: Berlin, Germany. Göttingen: Universitätsverlag Göttingen. Helminger T, Weber O, Braun P (2020) Digitisation of the LUX herbarium collection of the National Museum of Natural History Luxembourg. 
Bulletin de la Société des naturalists luxembourgeois 122: 147-152. Johaadien R, Torma M (2023) “Publish First”: A Rapid, GPT-4 based digitisation system for small institutes with minimal resources. Biodiversity Information Science and Standards 7: e112428. https://doi.org/10.3897/biss.7.112428 Lafferty D, Landrum LR (2009) SALIX, a semi-automatic label information extraction system using OCR [Abstract]. Botany & Mycology 2009, Snowbird, Utah, 25–29 July 2009. http://2009.botanyconference.org/engine/search/index.php?func=detail&aid=130 (accessed 21.X.2024). Löbl I, Klausnitzer B, Hartmann M (2022) Das stille Aussterben von Arten und Taxonomen – ein Appell an Wissenschaftspolitik und Legislative. Entomologische Nachrichten und Berichte 66(3): 217-226. Meier R & Dikow T (2004) Significance of specimen databases from taxonomic revisions for estimating and mapping the global species diversity of invertebrates and repatriating reliable specimen data. Conservation Biology 18: 478-488. https://doi.org/10.1111/j.1523-1739.2004.00233.x Nelson G, Paul D, Riccardi G, Mast A (2012) Five task clusters that enable efficient and effective digitization of biological collections. ZooKeys 209: 19-45. https://doi.org/10.3897/zookeys.209.3135 Ong S-Q, Mat Jalaluddin, NS, Yong KT, Ong SP, Lim KF, Azhar S (2023) Digitization of natural history collections: A guideline and nationwide capacity building workshop in Malaysia. Ecology and Evolution 13: e10212. https://doi.org/10.1002/ece3.10212 Author-formatted, not peer-reviewed document posted on 06/11/2024. DOI:  https://doi.org/10.3897/arphapreprints.e141113 16 Owen D, Groom Q, Hardisty A, Leegwater T, Livermore L, van Walsum M, Wijkamp N, Spasić I (2020) Towards a scientific workflow featuring Natural Language Processing for the digitisation of natural history collections. Research Ideas and Outcomes 6: e58030. 
https://doi.org/10.3897/rio.6.e58030 Riyaz M, Ignacimuthu S (2023) Smart phone-macro lens setup (SPMLS): a low-cost and portable photography device for amateur taxonomists, biodiversity researchers, and citizen enthusiasts. Bulletin of the National Research Centre 47: 143 https://doi.org/10.1186/s42269-023-01120-y Smith V, Blagoderov V (2012) Bringing collections out of the dark. ZooKeys 209: 1-6. https://doi.org/10.3897/zookeys.209.3699 Schuh R (2012) Integrating specimen databases and revisionary systematics. ZooKeys 209: 255-267. https://doi.org/10.3897/zookeys.209.3288 Takano A, Cole TCH, Konagai H (2024) A novel automated label data extraction and data base generation system from herbarium specimen images using OCR and NER. Scientific Reports 14(1): 112. https://doi.org/10.1038/s41598-023-50179-0 Tann J, Flemons P (2008) Data capture of specimen labels using volunteers. Australian Museum. http://australianmuseum.net.au/Uploads/Documents/23183/Data%20Capture%2 0of%20specimen%20labels%20using%20volunteers%20- %20Tann%20and%20Flemons%202008.pdf [accessed 21.X.2024] Tegelberg R, Mononen T, Saarenmaa H (2014) High-Performance digitization of natural history collections: Automated imaging lines for herbarium and insect specimens. Taxon 63(6): 1307–1313. https://doi.org/10.12705/636.13 Weaver WN, Ruhfel BR, Lough KJ & Smith SA (2023) Herbarium specimen label transcription reimagined with large language models: Capabilities, productivity, and risks. American Journal of Botany, 110(12). https://doi.org/10.1002/ajb2.16256 Zhang Y (2023) Use of artificial intelligence (AI) in historical records transcription: Opportunities, challenges, and future directions. Master thesis, McGill University, 24pp. Author-formatted, not peer-reviewed document posted on 06/11/2024. DOI:  https://doi.org/10.3897/arphapreprints.e141113 17 Table 1: Summary of label configuration/ view (with reference to Figure 1 and 2) and the obtained resulting text in the final database. 
Text corrected by real-time manual correction is indicated in bold.

Label configuration/view | Text as pasted from computer’s clipboard | Verbatim finalized data (after manual correction)
Figure 1A (labels scanned on pin, distorted) | Belivr vista Peretra、インタ Museum Frey Tutzing Ex Coll. Frey, Basel, Switzer | “Bolivia Buenavista Pereira XI.48 / Museum Frey Tutzing/ Ex Coll. Frey, Basel, Switzerland” (CF).
Figure 1B (labels scanned separately, not distorted) | Bolivia Buengvista Pereira X198 Ex Coll. Frey, Basel, Switzerland Museum Frey Tutzing | “Bolivia Buenavista Pereira XI.48 / Ex Coll. Frey, Basel, Switzerland/ Museum Frey Tutzing” (CF).
Figure 1C (partly handwritten labels scanned on pin, distorted) | North IRAQ, KURDISTAN Duhok, Akre, Bjeel 2.V.2018, leg.1.H.Mudhafar Maladera del. D. Ahrens 2023 | “North IRAQ, KURDISTAN Duhok, Akre, Bjeel 2.V.2018, leg.1.H.Mudhafar/ Maladera insanbilis (Brsk.) det. D. Ahrens 2023”
Figure 1D (partly handwritten labels scanned separately, not distorted) | Maladus dusanabilis (Boy) det. D. Ahrens 2023 | Maladera insanabilis (Brsk) det. D. Ahrens 2023
Figure 2C | Tucuman: Argentina. H.E.Box. Β.Μ.1930-238. Est. Expt Agric. No 2486 TUCUMAN 101/ AHRosenfeld Collector Astaena argentina Moser Ex Coll. Frey, Basel, Switzerland Museum Frey Tutzing | “Tucuman: Argentina. H.E.Box. Β.Μ.1930-238./ Est. Expt. Agric. No 2486/ TUCUMAN XI-I 191/ A H Rosenfeld Collector/ Astaena argentina Moser/ Ex Coll. Frey, Basel, Switzerland/ Museum Frey Tutzing”
Figure 2D | Argentiniel w.Wittmer L. Cabral Coral Salta 1160m 3.XII.1985 Ex Coll.NHM Basel, Switzerland | “Argentinien W. Wittmer/ L. Cabral Coral Salta 1160m 3.XII.1985/ Ex Coll. NHM Basel, Switzerland” (NHMB)
Figure 3A | 四川:峨嵋山چہ 19573131 中國科學院 | “四川:峨嵋山 1957.VII.31 中國科學院”

Figure 1. Exemplary specimens used for experimental real-time label scans: A – printed labels scanned on pin; B – printed labels scanned separately; C – partly handwritten labels scanned on pin; D – partly handwritten labels scanned separately.

Figure 2. Steps of scanning (exemplified by screenshots from a mobile phone) for real-time data collection, and examples of labels: A – step 1: marking the text to be captured via the touch screen of the mobile phone (example: printed labels scanned on pin); B – step 2: selecting “Copy to computer” from the menu bar (at the right side under the three dots) (example: printed labels scanned separately). As can be seen, labels at different levels on the pin can be scanned simultaneously and do not need to be removed from the pin; C – screenshot showing the capture of multidirectional printed labels scanned separately from the specimen in Google Lens; D – screenshot showing the capture of multiple distorted, printed labels scanned on the pinned specimen in Google Lens; E – screenshot showing the initial capture of a printed label scanned separately from the specimen in Google Keep; F – screenshot showing the extracted data resulting from E.

Figure 3. Other exemplary specimens used for experimental label scans: A – Chinese-language labels (printed); B – the printed Herpetology collection label that was scanned in the test of the Apple Shortcuts app algorithm. Note the incomplete text in the third text line and the cut-off text "image 0355" below (compare with the corresponding data entries in C); C – screenshot of the automatically scanned collection label as transferred into cells of the spreadsheet app Numbers. Although the text scan was very reliable, incomplete text will need editing: the partly cut-off text "image 0355" of the label was interpreted as "Tmaee 0355". The time stamp in the first column corresponds to the file name of the respective photo saved as a backup in the Shortcuts directory.

Figure 4. iOS Shortcuts app algorithm. From top to bottom: The first step opens the iPhone's Camera app and lets you photograph the label. The photo (“LABEL”) is then resized (optional, to reduce file size) and saved in the background to the Shortcuts directory in your iCloud account, with the current date (and time) as the file name. The text is then extracted from the photo and stored in a text container. The next step opens the spreadsheet "Test" in the Numbers app; an empty target spreadsheet file (here: "Test") must be prepared beforehand and placed in the Shortcuts folder of your iCloud account. The Current Date and Text items are then collected in the “List”. Finally, the List items are entered into different columns of the sheet named "A" in the spreadsheet file "Test".

Figure 5. Screenshot of bulk-scanned labels in Google Keep, inspected afterwards directly in the computer interface, during the step of copying the label text to a Google document (interface here in Portuguese).
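The Shortcuts workflow described in the Figure 4 caption (photograph, extract text, append the date and text as one spreadsheet row) can be read as a small pipeline. The following is a minimal Python sketch of that logic, not the authors' Shortcuts implementation. The function name `record_label`, the `extract_text` callable, and the use of a CSV file in place of the Numbers spreadsheet are all assumptions made for illustration. The text-recognition step is passed in as a function because the paper relies on the phone's built-in recognition, and the optional photo resizing and backup steps are left out.

```python
import csv
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Callable, List, Optional


def record_label(
    photo_path: Path,
    extract_text: Callable[[Path], str],
    sheet_path: Path,
    now: Optional[datetime] = None,
) -> List[str]:
    """Append one [timestamp, label text] row to a CSV file (hypothetical helper)."""
    # The time stamp plays the role of "Current Date" in Figure 4.
    stamp = (now or datetime.now()).strftime("%Y-%m-%d %H:%M:%S")
    # Collapse line breaks and repeated spaces so the label fits in one cell.
    text = " ".join(extract_text(photo_path).split())
    row = [stamp, text]
    # Unlike the Shortcuts setup, append mode creates the file if it is missing.
    with sheet_path.open("a", newline="", encoding="utf-8") as fh:
        csv.writer(fh).writerow(row)
    return row


if __name__ == "__main__":
    # Stand-in for the text-recognition step; it returns fixed sample text.
    def fake_ocr(_photo: Path) -> str:
        return "Bolivia Buenavista\nPereira XI.48\n"

    with tempfile.TemporaryDirectory() as tmp:
        sheet = Path(tmp) / "Test.csv"
        print(record_label(Path("label.jpg"), fake_ocr, sheet))
```

Keeping the recognition step as a parameter means the same row-writing logic could be reused with any text-recognition tool, and the raw text stays unedited in the sheet so that manual correction can happen afterwards, as in Table 1.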

License: CC-BY-4.0