{"paper_id":"2323f6d3-37d5-4a82-8fa4-252df5fcec4f","body_text":"PREPRINT\nAuthor-formatted, not peer-reviewed document posted on 06/11/2024\nDOI: https://doi.org/10.3897/arphapreprints.e141113\n\nShort Communication: \n \nExtracting specimen label data rapidly with a smartphone – a great \nhelp for simple digitization in taxonomy and collection management \n \nDirk Ahrens1*, Alexander Haas2, Thaynara L. Pacheco1 & Peter Grobe1 \n \n1Museum A. Koenig; Leibniz Institute for the Analysis of Biodiversity Change (LIB), \nAdenauerallee 127, 53113 Bonn, Germany \n2 Museum of Nature Hamburg; Leibniz Institute for the Analysis of Biodiversity Change, \nMartin-Luther-King-Platz 3, 20146 Hamburg, Germany \n*Corresponding author; E-Mail: D.Ahrens@leibniz-lib.de; ahrens.dirk_col@gmx.de \n \nDirk Ahrens: https://orcid.org/0000-0003-3524-7153 \nAlexander Haas: https://orcid.org/0000-0002-3961-518X \nThaynara L. Pacheco: https://orcid.org/0000-0001-9503-7751 \nPeter Grobe: https://orcid.org/0000-0003-4991-5781 \n \nAbstract \nHere we provide short tutorials on how to read out specimen label data from typed as well as \nhandwritten labels rapidly and easily with a mobile phone. The approach is generally \napplicable, but we test it in particular on insect specimen labels, which are generally quite \nsmall. We provide alternative instructions for Android- and Apple-based \nenvironments, as well as protocols for single and bulk scans. We expect that this way of \ndata capture will be of great help for simple digitization in taxonomy and collection \nmanagement, outside the large industrial digitization pipelines. 
By omitting the step of \ntaking and maintaining images of the labels, this approach is more rapid, cheaper, and \nenvironmentally more sustainable, because no storage with its carbon footprint is required \nfor label images. The biggest advantage of this protocol is the use of readily available \ncommercial devices, which are easy to handle as they are used on a daily basis and \ncan be replaced at relatively low cost when they become technologically outdated, which is \nalso a matter of cyber security. \n \n \nKeywords \nCollection digitization, labels, label transcription, taxonomic revisions, artificial \nintelligence, citizen science, taxonomic impediment, data science \n \n \nIntroduction \n \nCurrently, immense efforts are under way to digitize natural history collections on \na large scale, including the associated information and metadata (e.g., Smith & \nBlagoderov 2012; Hardisty et al. 2020; Belot et al. 2023; Groom et al. 2023; Ong et al. \n2023). In these endeavors, the automatic capture of label data plays \na central role (e.g., Beaman et al. 2006; Heidorn & Wei 2008; Lafferty and Landrum \n2009; Granzow-de la Cerda and Beach 2010; Haston et al. 2012; Agarwal et al. 2018; \nAlzuru et al. 2019, 2020; Alzuru 2020; Owen et al. 2020; Belot et al. 2023; Takano et al. \n2024; Zhang 2023). However, many of these very promising activities have long been \nexclusive to large companies, museums or institutions with specialized technical \ninfrastructure and specially trained staff (e.g., Blagoderov et al. 2012) for the highly \ncustomized implementations used (e.g., https://picturae.com/). \nMost of the current digitization initiatives aim at a one-go retro-digitization of large \ncollections (Engledow et al. 2018; Hardisty et al. 2020; Helminger et al. 2020; De Smedt \net al. 2024). 
However, this approach comes with limitations: 1) collections are \ncontinuously growing and developing (see also Balke et al. 2013); 2) the scientific \ncommunity produces a large amount of high-quality biodiversity data independently of \nthe collection institutions through its ongoing research on the specimens, in which \namateur scientists are also largely involved (Löbl et al. 2023). The latter often involves \nstudying the collection material remotely, away from the collections and the large \ndigitization pipelines. Especially in insects, taxonomic specialists are rare, and \nspecimens are often shipped overseas on loan to obtain the best identifications from world-leading \nspecialists. In this respect, working processes differ considerably from those for vertebrates or \nplants, which often take the lead in new methodologies such as large-scale digitization. \nHowever, these data often do not yet end up in big data repositories, partly because of the \nlack of time and incentive as well as the work overload of the taxonomists.  \nTherefore, more flexible solutions are needed that allow more efficient data \nprocessing, speed up biodiversity and species discovery, and help to \novercome the taxonomic impediment. This would be perfectly in line with the idea of \nintegrating specimen databases and revisionary systematics (Schuh 2012). Advantages \nof a revision-based digitization (see also Meier and Dikow 2004) in contrast to a retro-\ndigitization, i.e. 
that biodiversity data come from taxonomic revisionary studies rather \nthan from uncritical digitizing of museum specimen data, are the following (extended, \nbased on Meier and Dikow (2004) as well as Schuh (2012)): 1) the data are provided in \nassociation with the most accurate identifications, 2) the data have the most complete \ntaxonomic and geographic coverage, 3) the data satisfy these points in a cost-\neffective way, and 4) data for occurrences and images are citable and acknowledgeable \n(therefore, errors can be retracted and corrected). \n \nRecently we found that mobile devices, which are now in the hands of almost every \nperson, may help to speed up data collection and digitization, including \nbiodiversity discovery. Through simple, playful experimentation, we discovered how useful \nmobile phones can be in association with cloud-like environments (such as Google or \nApple). Since we think that these “workflows” can be really useful for a large audience, \nwe prepared this short paper to disseminate these simple tutorials on how to read \nout specimen label data rapidly and easily with a smartphone. \nMost digitization approaches envision the capture of digital metadata (e.g., labels) with \nthe intermediate step of digital images (Nelson et al. 2012). This comes with other \ndifficulties and quite considerable costs for image processing and storage (Tann and \nFlemons 2008; Hardisty et al. 2020b). For an optimal cost-benefit ratio, it \nwould therefore be more sustainable to skip this step if the data can be read out and \nspell-checked in the same moment, without the burden of images. The latter are \nscientifically and practically quite unnecessary (in terms of cost-benefit balance) for non-\ntype specimens. 
\n \n \nMaterial and methods \n \nResources needed \n \n1) A mobile phone (i.e. a smartphone; a reasonably recent model with macro photography \noptions).  \n2) A stable internet connection for the phone via WLAN or mobile telephone signal.  \n3) A computer connected to the internet and logged into a Google account (via the \nGoogle Chrome browser) or an Apple ID account \n4) A database or text file into which the specimen data are entered. \n5) Google Lens or Google Translate installed on the mobile phone. \n \nFor our testing here, we used a “Motorola G5g plus (system: Android 10)” and “iPhone \n15 Pro Max” (system: iOS 17.7). \nWe explored the data extraction from the labels with different approaches and \nalternative label conditions (Figure 1). Each of the different tutorials may prove to be \nmore suitable for different technical situations of the user. We subsequently describe all \nof these in simplified step-by-step tutorials. Tutorials are accompanied by screenshots \nand examples of resulting data sheets. \n \n \n1. Operating system open: \n \nVariant 1 \n1) Open the app on your mobile phone: Google Translate (or Google Lens) \n2) Focus on the label to be scanned; if necessary, zoom in digitally via the touch \nscreen of your mobile phone so that the label(s) to scan fill the screen as 
\nbest as possible (the image need not be perfectly focused as long as the important letters are \nrecognizable) \n3) Scan (take a snapshot with the ‘photo’ button) \n4) Mark the label text (Figure 1) via cursor selection on the touch screen of the \nmobile phone \n5) Select “Copy to Computer” \n6) Confirm the selected device (the computer on which you are logged into your \nGoogle account) by choosing “select” \n7) On your computer: simply paste from the clipboard into your target document \n(verbatim label citation) \n8) Finally, you may proofread the scan (while still having your specimen in front of \nyou) and manually correct misspellings and misreadings \n9) Finished. \n \n*Alternatively, “copy” can be chosen in step 5 and the copied content \npasted into an open Google Docs document on the mobile device. The latter can be \ndirectly accessed on the computer synchronized via the Google account. This detour is rarely \nnecessary, namely only when the internet is overloaded or the internet connection is too slow (see \nresults below). The transfer also works outside of the Google Cloud environment but is a little \nmore complex: files can be shared between Android, Windows or Mac devices using \nthe KDE Connect app (https://kdeconnect.kde.org). All devices must be in the same \nWi-Fi network. After installing the KDE Connect app, the text can be transferred to the \ncomputer. \n \nVariant 2 (bulk scans) \n1) Open the Google Keep – Notes and Lists app on your mobile phone. \n2) Click on the picture icon. Focus on the label you want to scan. \n3) Click on “take photo” to capture the image and then on the checkmark icon to \nsave it. \n4) Click on the image, then on the three dots in the upper right corner, and then on \n“grab image text”. The text will appear as a note and can be manually corrected \nfor spelling or reading errors. 
A title for the note can also be added. \n5) Repeat steps 2-4 for each different label you want to scan. They will be saved as \ndifferent notes. \n6) Select all notes, click on the three dots in the upper right corner, and then on \n“copy to Google Docs” (this step can be done on your mobile phone or on a \ncomputer logged into your Google account; see Figure 5). A single Google Docs \ndocument containing all images and texts will be generated. \n7) On your computer: open your Google Docs file and make the final corrections. \n \n \n2. With an “Apple-only” environment \n \nRequirements: Make sure you have a recent iPhone or iPad model with macro \nphotography capabilities and a recent operating system (preferably iOS 15 or \nlater). You will also need a Mac computer and an Apple iCloud account (at least the free \nversion). An internet connection for the phone (e.g., via WLAN) is not necessary for data \ncollection if you first collect your data from the specimen labels on your phone (bulk \nscans) and go back to your Mac computer later.  \n \na) Using the Notes app: \n1) Open the Notes app on your iPhone and set up a new note for your current \nproject. \n2) In your note, tap the camera symbol at the bottom and choose “scan text” from \nthe pop-up menu. A camera window opens in the bottom part of your note. \n3) Aim your camera at the text block you want to scan. Yellow brackets will show \nyou which text block the software sees as the target. Once the desired target text is \nwithin the brackets, press the insert button at the bottom of the camera window. \nThe targeted text will be read and automatically transferred to your note.  \n4) Briefly check the result in your note. 
\n5) Go to the next line in your note and scan the next target text in the same way, \nthus accumulating information from multiple specimen labels or multiple \nspecimens as you like. \n6) Once happy with the collected data, return to your desktop Mac computer. If the \nphone had a mobile network connection while you took the scans or \non your way back to your desktop computer, the Notes app should have automatically \nsynchronized with your Apple account in the background, so that when you open \nthe Notes app on your desktop computer, you should find all the scanned data \nthere. \n7) Copy and paste the information accumulated in your Notes app into the \ndocument or database of your choice. \n \nb) Using the Shortcuts app: \nThe Shortcuts app of iOS can be used to program an automated process that covers \ntaking the photo, extracting the text and filling a table in Apple’s spreadsheet app \nNumbers. Make sure that your Shortcuts and Numbers apps are synchronized across \nall of your devices via your iCloud Drive. We assembled a Shortcuts algorithm as \na proof of concept. Fig. 4 shows the algorithm.  \n \n \n3. Without internet connection using a Bluetooth approach (using a Windows PC and a \nmobile phone with Android system) \n \n1) Download the app (Google Keep – Notes and Lists) on the mobile phone \n2) Open the “Bluetooth” options on the computer \n3) Pair the devices (computer and mobile phone) \n4) Click on receive files via Bluetooth \n5) Open the app and click on the picture icon 
\n6) Click on “take photo” and take the photo \n7) Click on the captured picture \n8) Click on the three dots in the upper right corner \n9) Click on “grab image text” and select the extracted text \n10) Click on the three dots in the lower right corner and click on “send” \n11) Click on “send via other apps” and choose the Bluetooth symbol \n12) Choose a folder on the computer in which to save the html file \n13) Copy the text from the html file into a text editor for final spelling corrections \n \nWe expect this approach to work in a similar way in the Apple environment. \n \n \nResults \n \nIn Table 1 we summarize some major characteristics of the data capture with these \nmethods, showing the directly pasted content and the amount of real-time spelling \ncorrection needed for the data. While for printed labels the need for subsequent spelling \ncorrections was minimal, handwritten labels often needed more corrections, depending \non the size and style of handwriting. In these cases, scanning the labels separately from \nthe pin without distortion helped considerably (Fig. 1A, B). In printed labels, direction (Fig \n2C) and distortion of labels did not matter much (Fig 2D). We were able to scan up to \nthree labels (from the distorted side view) still mounted on a pin, without spreading out \nthe labels or removing them (Fig 2D). \nSince low image resolution was not a problem, we could zoom in digitally with the \nmobile phone on the labels until these almost filled the frame. However, the initial \ntesting was also done successfully with much smaller images (Fig. 1A-D).  \nProcessing was very fast: the estimated time for full \ndata capture (including spelling correction) was 3-10 seconds per specimen. 
Processing \ntime was often a little longer for badly handwritten labels, when an insect pin or other \nlabels covered parts of the label text, or when the overall internet connection was slow. \nThe total time gain per label was larger for labels containing much information or for \nmultiple labels. For example, for the labels shown in Figure 2B, typing the data by hand \ninto the computer took 60 seconds (including spell check), whereas the label scan with \napproach 1 took 8 seconds (including manual spell check). For the data of Figure 2C, \nmanual typing and spell check required 121 seconds, while the label scan with \napproach 1 took 10 seconds (including manual spell check). We refrained from larger \nexperiments comparing times, since the duration of typing data \ndepends much on the typing skills of the person. The comparative numbers given here \nrefer to a scientist untrained in typing (performed by D.A.). \nIn some instances, in approach 1, we had to use the detour via a Google document* \nbecause of a bad internet connection, when the copy process failed due to slow data transfer. \nThis was then usually two “clicks” (or seconds) slower, but not really a big delay \ncompared to the amount of time required for manual typing. \n \nThe iPhone workflow test was done with a larger label (Fig. 3B). In workflow 2a) \nabove, using the Notes app on the Apple iPhone, the image recognition tried to identify \nand focus on blocks of text within the label, but did not capture the label as a whole. To \ncapture multiple bits of information, the process had to be repeated accordingly. Once \nthe data have been collected in Notes, further copy-paste editing is necessary to transfer \nthe data to a database. Workflow 2b), using Shortcuts app automation (Fig. 
4) scans the \nwhole label and also stores the data with a timestamp directly in a spreadsheet app. \nFurthermore, the photos are stored in the user's Apple iCloud account (as backups for \npotential later reference), but this step is optional in the algorithm. The result of the \nscanning is shown in Fig. 3B, C. Note that incomplete text in the original caused \ninterpretation problems (truncated third line and partially hidden bottom part of \"image \n0355\"). The scan of the label and the filling of the cells in the spreadsheet took less \nthan 10 s. The algorithm analyzed the label as lines of text and allocated one cell per \nline in the spreadsheet. This means that the locality information in our example was split \nup into two cells in our test. If several lines belong to the same block of information, the cells \nhave to be edited; depending on which further tasks the user wants to \naccomplish, copy-paste processing of such splits will be necessary.  \n \nThe approach using a Bluetooth connection between the mobile phone and the \ncomputer was slightly slower (in terms of the number of “device clicks”), yet it still \nsaved a great deal of time in capturing the label data. It is useful given the common \nsituation that many collection storerooms are partly or entirely offline, or that \nremotely working taxonomists may have difficulty obtaining a good internet connection. \n \nBulk approaches are available under the Google and Apple environments (see Fig. 4) \nwith the Google Keep and Notes applications, respectively. In both, images are \ntemporarily stored on the mobile devices and can subsequently be either saved \nor discarded. 
While they save time on the data transfer, they have the disadvantage \nthat potentially incomplete scans are only discovered when the specimens are already \nout of hand. \n \n \nDiscussion \n \nWhile new technology, including artificial intelligence, is entering our daily life, its use \nand application in biodiversity research are still rather limited, although approaches \nto AI-powered label recognition have been developed (Johaadien & Torma 2023, Takano \net al. 2024, Weaver et al. 2023). Similar smartphone tutorials have already been \nprovided for specimen photography (Riyaz & Ignacimuthu 2023), although such methods may \nalready be widely in use without being formally addressed in the scientific literature. \nHere we addressed the scanning of label information using a smartphone under \ndifferent operating systems. To our knowledge, this has not so far been \nexplored and applied specifically to insect collection specimens. There are solutions \nfor large-scale mass digitization of collections (Belot et al. 2023; Blagoderov et al. 2012; \nEngledow et al. 2018; Tegelberg et al. 2014). All of these solutions require manual \nseparation of specimens and labels in order to photograph them separately. Initial trials \nwith robotic technology (e.g. Dupont & Price 2019) are promising but can only be used \nby larger institutions with the appropriate budget. \nBy partly omitting the hitherto obligatory step of taking and permanently storing images \nof the labels, this direct approach to data capture is more rapid and environmentally \nmore sustainable. In some of our procedures, images are nevertheless captured without delay \nin the background, and there is the option to retain the images or to discard them. 
\nEspecially for the simple extraction of distribution data in the framework of taxonomic \nrevisions or faunistic studies, there is scientifically no necessity to hold images of the \nmetadata labels of every specimen long term. Moreover, the spell-checking of the \nscanned and extracted data can be done while the specimen is still at hand, with the data \nfinalized once and for all after the first processing.  \nHowever, depending on individual needs and working conditions, the user can \nchoose the workflow. It is possible to scan 50 labels in a row (i.e., bulk \nworkflow) before transferring the data to the computer. Then, in critical cases, \na backup photo is useful for quality assessment and spell-checking. \nAnother great advantage is that these protocols use commercial devices that are \nsimple to handle and inexpensive to replace when they become technologically \noutdated, which is also a matter of cybersecurity. Unfortunately, in biosystematics \nwe have often found that customized devices are overpriced and often \nlag behind technological advances (e.g. computer operating systems), frequently requiring \nexpensive updates and service.  \nThis matters because a large part of biodiversity research is done by amateur scientists, \nwho do not have access to large or continuous funding (and \neven professionals always lack funding for their “descriptive research”). \nOther consequences: The high reliability of text recognition and the rapid data transfer \nmake the use of (only) machine-readable barcode labels and QR codes in collection \nmanagement superfluous, since connected data can be easily inferred from numerical \nvoucher numbers on labels. \nThe solutions and tutorials proposed here are very well suited for the fast and secure \nrecording of small quantities of collection objects, e.g. when visiting a collection or when \nselecting individual objects. 
We are aware that habits, skills and specific workflows \ninfluence the way we integrate such devices and text recognition capabilities. We are \nconvinced that they will make a significant contribution and help to alleviate the \ntaxonomic impediment (e.g., de Carvalho et al. 2005, 2007; Engel et al. 2021), as the \nworkload of taxonomists recording the material they study in databases will be reduced \nroughly tenfold. \nFinally, it should be said that there may be even more options and possibilities to scan \nlabels with mobile devices. These options may evolve as quickly as mobile phones \nand artificial intelligence technology in general. Nevertheless, we expect potential \nusers to take this paper as an inspiration to continue exploring how to apply \nthis technology successfully in their established workflows. \n \n \nAcknowledgements \n \nWe thank the numerous colleagues who helped by discussing the topic and \nwhose input encouraged us to pursue this article. \n \n \nReferences \n \nAgarwal N, Ferrier N, Hereld M (2018) Towards automated transcription of label text \nfrom pinned insect collections. 2018 IEEE Winter Conference on Applications of \nComputer Vision (WACV), Lake Tahoe, NV, USA, pp. 189-198, doi: \n10.1109/WACV.2018.00027. \nAlzuru I (2020) Human-machine extraction of Information from biological collections. \nPhD thesis, University of Florida, 160pp. \nAlzuru I, Malladi A, Matsunaga A, Tsugawa M, Fortes JAB (2019) Human-Machine \nInformation Extraction Simulator for Biological Collections. 2019 IEEE \nInternational Conference on Big Data (Big Data), Los Angeles, CA, USA, pp. \n4565-4572, doi: 10.1109/BigData47090.2019.9005601. \nAlzuru I, Matsunaga A, Tsugawa M, Fortes JAB (2020) General Self-aware Information \nExtraction from Labels of Biological Collections. 
2020 IEEE International \nConference on Big Data (Big Data), Atlanta, GA, USA, pp. 3035-3044, doi: \n10.1109/BigData50022.2020.9377737. \nBalke M, Schmidt S, Hausmann A et al. (2013) Biodiversity into your hands - A call for a \nvirtual global natural history ‘metacollection’. Frontiers in Zoology 10: 55 \nhttps://doi.org/10.1186/1742-9994-10-55 \nBeaman RS, Cellinese N, Heidorn PB, Guo Y, Green AM, Thiers B (2006) HERBIS: \nIntegrating digital imaging and label data capture for herbaria [Abstract]. Botany \n2006, California State University – Chico. 28 July–2 August 2006. \nhttp://www.2006.botanyconference.org/engine/search/index.php?func=detail&aid=402. \nBelot M, Preuss L, Tuberosa J, Claessen M, Svezhentseva O, Schuster F, Bölling C, \nLéger T (2023) High Throughput Information Extraction of Printed Specimen \nLabels from Large-Scale Digitization of Entomological Collections using a Semi-\nAutomated Pipeline. Biodiversity Information Science and Standards 7: e112466. \nhttps://doi.org/10.3897/biss.7.112466 \nBlagoderov V, Kitching I, Livermore L, Simonsen T, Smith VS (2012) No specimen left \nbehind: industrial scale digitization of natural history collections. ZooKeys 209: \n133-146. https://doi.org/10.3897/zookeys.209.3178 \nde Carvalho MR, Bockmann FA, Amorim DS, de Vivo M, de Toledo-Piza M, Menezes \nNA, de Figueiredo JL, McEachran JD (2005) Revisiting the taxonomic \nimpediment. Science 307: 353-353. DOI:10.1126/science.307.5708.353b \nde Carvalho MR, Bockmann FA, Amorim DS et al. (2007) Taxonomic Impediment or \nImpediment to Taxonomy? A Commentary on systematics and the \ncybertaxonomic-automation paradigm. 
Evolutionary Biology 34: 140–143 \nhttps://doi.org/10.1007/s11692-007-9011-6  \nDe Smedt S, Bogaerts A, De Meeter N, Dillen M, Engledow H, Van Wambeke P, Leliaert \nF, Groom Q (2024) Ten lessons learned from the mass digitisation of a herbarium \ncollection. PhytoKeys 244: 23-37. https://doi.org/10.3897/phytokeys.244.120112 \nDupont S, Price BW (2019) ALICE, MALICE and VILE: High throughput insect specimen \ndigitisation using angled imaging techniques. Biodiversity Information Science \nand Standards 3: e37141. https://doi.org/10.3897/biss.3.37141 \nEngel MS, Ceríaco LMP, Daniel GM, et al. (2021) The taxonomic impediment: a \nshortage of taxonomists, not the lack of technical approaches. Zoological Journal \nof the Linnean Society 193(2): 381–387. \nhttps://doi.org/10.1093/zoolinnean/zlab072 \nEngledow H, De Smedt S, Groom Q, Bogaerts A, Stoffelen P, Sosef M, Van Wambeke P \n(2018) Managing a mass digitization project at Meise Botanic Garden: From start \nto finish. Biodiversity Information Science and Standards 2: e25912. \nhttps://doi.org/10.3897/biss.2.25912 \nGranzow-de la Cerda Í, Beach JH (2010) Semi-automated workflows for acquiring \nspecimen data from label images in herbarium collections. Taxon 59: 1830-1842. \nhttps://doi.org/10.1002/tax.596014 \nGroom Q, Dillen M, Addink W, Ariño AHH, Bölling C, Bonnet P, Cecchi L, Ellwood ER, \nFigueira R, Gagnier P-Y, Grace OM, Güntsch A, Hardy H, Huybrechts P, Hyam R, \nJoly AAJ, Kommineni VK, Larridon I, Livermore L, Lopes RJ, Meeus S, Miller JA, \nMilleville K, Panda R, Pignal M, Poelen J, Ristevski B, Robertson T, Rufino AC, \nSantos J, Schermer M, Scott B, Seltmann KC, Teixeira H, Trekels M, Gaikwad J \n(2023) Envisaging a global infrastructure to exploit the potential of digitised \ncollections. Biodiversity Data Journal 11: e109439. 
\nhttps://doi.org/10.3897/BDJ.11.e109439  \nHaston E, Cubey RWN, Pullan M, Atkins H, Harris D (2012) Developing integrated \nworkflows for the digitisation of herbarium specimens using a modular and \nscalable approach. ZooKeys 209: 93-102. \nhttps://doi.org/10.3897/zookeys.209.3121 \nHardisty A, Saarenmaa H, Casino A, Dillen M, Gödderz K, Groom Q, Hardy H, Koureas \nD, Nieva de la Hidalga A, Paul DL, Runnel V, Vermeersch X, van Walsum M, \nWillemse L (2020a) Conceptual design blueprint for the DiSSCo digitization \ninfrastructure - DELIVERABLE D8.1. Research Ideas and Outcomes 6: e54280. \nhttps://doi.org/10.3897/rio.6.e54280 \nHardisty A, Livermore L, Walton S, Woodburn M, Hardy H (2020b) Costbook of the \ndigitisation infrastructure of DiSSCo. Research Ideas and Outcomes 6: e58915. \nhttps://doi.org/10.3897/rio.6.e58915  \nHeidorn PB, Wei Q (2008) Automatic metadata extraction from museum specimen \nlabels. Pp. 57–68 in: Greenberg, J. & Klas, W. (eds.), Metadata for semantic and \nsocial applications: Proceedings of the International Conference on Dublin Core \nand Metadata Applications, Berlin, 22–26 September 2008, DC 2008: Berlin, \nGermany. Göttingen: Universitätsverlag Göttingen. \nHelminger T, Weber O, Braun P (2020) Digitisation of the LUX herbarium collection of \nthe National Museum of Natural History Luxembourg. Bulletin de la Société des \nnaturalistes luxembourgeois 122: 147-152. \nJohaadien R, Torma M (2023) “Publish First”: A Rapid, GPT-4 based digitisation system \nfor small institutes with minimal resources. Biodiversity Information Science and \nStandards 7: e112428. https://doi.org/10.3897/biss.7.112428 \nLafferty D, Landrum LR (2009) SALIX, a semi-automatic label information extraction \nsystem using OCR [Abstract]. Botany & Mycology 2009, Snowbird, Utah, 25–29 \nJuly 2009. 
\nhttp://2009.botanyconference.org/engine/search/index.php?func=detail&aid=130 \n(accessed 21.X.2024). \nLöbl I, Klausnitzer B, Hartmann M (2022) Das stille Aussterben von Arten und \nTaxonomen – ein Appell an Wissenschaftspolitik und Legislative. Entomologische \nNachrichten und Berichte 66(3): 217-226. \nMeier R, Dikow T (2004) Significance of specimen databases from taxonomic \nrevisions for estimating and mapping the global species diversity of invertebrates \nand repatriating reliable specimen data. Conservation Biology 18: 478-488. \nhttps://doi.org/10.1111/j.1523-1739.2004.00233.x \nNelson G, Paul D, Riccardi G, Mast A (2012) Five task clusters that enable efficient and \neffective digitization of biological collections. ZooKeys 209: 19-45. \nhttps://doi.org/10.3897/zookeys.209.3135 \nOng S-Q, Mat Jalaluddin NS, Yong KT, Ong SP, Lim KF, Azhar S (2023) Digitization of \nnatural history collections: A guideline and nationwide capacity building workshop \nin Malaysia. Ecology and Evolution 13: e10212. \nhttps://doi.org/10.1002/ece3.10212 \nOwen D, Groom Q, Hardisty A, Leegwater T, Livermore L, van Walsum M, Wijkamp N, \nSpasić I (2020) Towards a scientific workflow featuring Natural Language \nProcessing for the digitisation of natural history collections. Research Ideas and \nOutcomes 6: e58030. https://doi.org/10.3897/rio.6.e58030 \nRiyaz M, Ignacimuthu S (2023) Smart phone-macro lens setup (SPMLS): a low-cost \nand portable photography device for amateur taxonomists, biodiversity \nresearchers, and citizen enthusiasts. Bulletin of the National Research Centre \n47: 143 https://doi.org/10.1186/s42269-023-01120-y \nSmith V, Blagoderov V (2012) Bringing collections out of the dark. ZooKeys 209: 1-6. \nhttps://doi.org/10.3897/zookeys.209.3699 \nSchuh R (2012) Integrating specimen databases and revisionary systematics. 
ZooKeys \n209: 255-267. https://doi.org/10.3897/zookeys.209.3288 \nTakano A, Cole TCH, Konagai H (2024) A novel automated label data extraction and \ndata base generation system from herbarium specimen images using OCR and \nNER. Scientific Reports 14(1): 112. https://doi.org/10.1038/s41598-023-50179-0 \nTann J, Flemons P (2008) Data capture of specimen labels using volunteers. Australian \nMuseum. \nhttp://australianmuseum.net.au/Uploads/Documents/23183/Data%20Capture%20of%20specimen%20labels%20using%20volunteers%20-%20Tann%20and%20Flemons%202008.pdf [accessed 21.X.2024] \nTegelberg R, Mononen T, Saarenmaa H (2014) High-Performance digitization of natural \nhistory collections: Automated imaging lines for herbarium and insect specimens. \nTaxon 63(6): 1307–1313. https://doi.org/10.12705/636.13 \nWeaver WN, Ruhfel BR, Lough KJ, Smith SA (2023) Herbarium specimen label \ntranscription reimagined with large language models: Capabilities, productivity, \nand risks. American Journal of Botany 110(12). \nhttps://doi.org/10.1002/ajb2.16256 \nZhang Y (2023) Use of artificial intelligence (AI) in historical records transcription: \nOpportunities, challenges, and future directions. Master’s thesis, McGill University, \n24 pp. \n  \nTable 1: Summary of label configuration/view (with reference to Figure 1 and 2) and \nthe resulting text in the final database. Text changed by real-time manual \ncorrection is indicated in bold. \n \nLabel configuration/view: Figure 1A (labels scanned on pin, distorted) \nText as pasted from computer’s clipboard: \nBelivr vista Peretra、インタ \nMuseum Frey \nTutzing \nEx Coll. Frey, Basel, Switzer \nVerbatim finalized data (after manual correction): \n“Bolivia Buenavista Pereira XI.48 / Museum Frey Tutzing/ Ex Coll. Frey, Basel, Switzerland” (CF). \n \nLabel configuration/view: Figure 1B (labels scanned separately, not distorted) \nText as pasted from computer’s clipboard: \nBolivia Buengvista Pereira X198 \nEx Coll. Frey, Basel, \nSwitzerland \nMuseum Frey Tutzing \nVerbatim finalized data (after manual correction): \n“Bolivia Buenavista Pereira XI.48 / Ex Coll. Frey, Basel, Switzerland/ Museum Frey Tutzing” (CF). \n \nLabel configuration/view: Figure 1C (partly handwritten labels scanned on pin, distorted) \nText as pasted from computer’s clipboard: \nNorth IRAQ, KURDISTAN Duhok, Akre, Bjeel 2.V.2018, leg.1.H.Mudhafar \nMaladera \ndel. D. Ahrens 2023 \nVerbatim finalized data (after manual correction): \n“North IRAQ, KURDISTAN Duhok, Akre, Bjeel 2.V.2018, leg.1.H.Mudhafar/ Maladera insanbilis (Brsk.) det. D. Ahrens 2023” \n \nLabel configuration/view: Figure 1D (partly handwritten labels scanned separately, not distorted) \nText as pasted from computer’s clipboard: \nMaladus dusanabilis (Boy) \ndet. D. Ahrens 2023 \nVerbatim finalized data (after manual correction): \nMaladera insanabilis (Brsk) det. D. Ahrens 2023 \n \nLabel configuration/view: Figure 2C \nText as pasted from computer’s clipboard: \nTucuman: \nArgentina. H.E.Box. Β.Μ.1930-238. \nEst. Expt \nAgric. No 2486 \nTUCUMAN 101/ \nAHRosenfeld Collector \nAstaena argentina Moser \nEx Coll. Frey, Basel, Switzerland \nMuseum Frey Tutzing \nVerbatim finalized data (after manual correction): \n“Tucuman: Argentina. H.E.Box. Β.Μ.1930-238./ Est. Expt. Agric. No 2486/ TUCUMAN XI-I 191/ A H Rosenfeld Collector/ Astaena argentina Moser/ Ex Coll. Frey, Basel, Switzerland/ Museum Frey Tutzing” \n \nLabel configuration/view: Figure 2D \nText as pasted from computer’s clipboard: \nArgentiniel w.Wittmer \nL. Cabral Coral \nSalta 1160m \n3.XII.1985 \nEx Coll.NHM \nBasel, Switzerland \nVerbatim finalized data (after manual correction): \n“Argentinien W. Wittmer/ L. Cabral Coral Salta 1160m 3.XII.1985/ Ex Coll. NHM Basel, Switzerland” (NHMB) \n \nLabel configuration/view: Figure 3A \nText as pasted from computer’s clipboard: \n四川:峨嵋山چہ \n19573131 \n中國科學院 \nVerbatim finalized data (after manual correction): \n“四川:峨嵋山 1957.VII.31 中國科學院” \n \n \nFigure 1. 
Exemplary specimens used for experimental real-time label scans: A - \n(printed labels scanned on pin); B - (printed labels scanned separately); C - (partly \nhandwritten labels scanned on pin); D - (partly handwritten labels scanned separately). \n \nFigure 2. Steps of scanning (exemplified by a screenshot from mobile phone) of real-\ntime data collection, and examples of labels: A – step 1: marking of the text to be \ncaptured via touch screen of the mobile phone (example - printed labels scanned on \npin); B – step 2: select from menu bar (at the right side under three dots) “Copy to \ncomputer” (example - printed labels scanned separately). As can be seen, different labels \nat different levels on the pin can be scanned simultaneously and do not need to be \nremoved from the pin; C – Screenshot showing the capture of multidirectional printed \nlabels scanned separately from the specimen in Google Lens; D - Screenshot showing \nthe capture of multiple distorted, printed labels scanned on the pinned specimen in \nGoogle Lens; E - Screenshot showing the initial capture of a printed label scanned \nseparately from the specimen in Google Keep; F - Screenshot showing the extracted \ndata resulting from E. \n \nFigure 3. Other exemplary specimens used for experimental label scans: A – for \nChinese language labels (printed); B - The printed Herpetology collection label that was \nscanned in the test of the Apple Shortcuts app algorithm. Note the incomplete text in the \nthird text line and the cut off text \"image 0355\" below (compare to the corresponding \ndata entries in C); C - Screenshot of the automatically scanned collection label as \ntransferred into cells of the spreadsheet app Numbers. 
Although the text scan was very \nreliable, incomplete text will need editing: the somewhat cut off text \"image 0355\" of the \nlabel was interpreted as \"Tmaee 0355\". The time stamp in the first column corresponds \nto the file name of the respective photo saved as backup in the Shortcuts directory.  \n \nFigure 4. iOS Shortcuts app algorithm. From top to bottom: The first step opens the \niPhone's Camera app and lets you photograph the label. The photo (“LABEL”) is then \nresized (optional, to reduce space) and saved in the background to the Shortcuts \ndirectory in your iCloud account with the current date (and time) as file name. Then the \ntext is extracted from the photo and stored to a text container. The next step opens the \nspreadsheet \"Test\" in the Numbers app; an empty target spreadsheet file (here: \"Test\") must \nbe prepared beforehand and waiting in the Shortcuts folder of your iCloud account. \nCurrent Date and Text items are then collected in the “List”. The List items are finally \nentered into different columns in the spreadsheet file \"Test\", in a sheet named \n\"A\".  \n \nFigure 5. Screenshot of labels bulk-scanned via Google Keep, inspected afterwards \ndirectly from the computer interface, during the step of copying the label text to a \nGoogle document (interface here in Portuguese).","source_license":"CC-BY-4.0","license_restricted":false}