{"paper_id":"cba56158-4e9c-4255-9d64-8e5f4cdcfae4","body_text":"On the path to reference genomes for all biodiversity: lessons learned and \nlaboratory protocols created in the Sanger Tree of Life core laboratory over the \nfirst 2000 species  \n \nCaroline Howard 1,*, Amy Denton 1, Benjamin Jackson 1, Adam Bates 1,2, Jessie Jay 1,3, Halyna \nYatsenko1, Priyanka Sethu Raman 1, Abitha Thomas 1, Graeme Oatley 1, Raquel Vionette do \nAmaral1, Zeynep Ene Göktan 1, Juan Pablo Narváez Gómez 1,4, Isabelle Clayton Lucey 1, \nElizabeth Sinclair1, Michael A. Quail5, Mark Blaxter1, Kerstin Howe1, Mara K. N. Lawniczak1 \n \n1. Tree of Life, Wellcome Sanger Institute, Hinxton, Cambridge. CB10 1SA, United Kingdom \n2. Department of Psychiatry, Warneford Hospital, University of Oxford, Oxford, OX3 7JX, \nUnited Kingdom \n3. Ancient Genomics Lab, The Francis Crick Institute, 1 Midland Road, London, NW1 1AT, \nUnited Kingdom \n4. Eukaryotic Annotation Team, EMBL-EBI, Wellcome Genome Campus, Hinxton, \nCambridge, CB10 1SD, United Kingdom \n5. Scientific Operations, Wellcome Sanger Institute, Hinxton, CB10 1SA, United Kingdom \n* ch25@sanger.ac.uk \n \nKeywords: reference genome, HMW DNA, extraction, sequencing, biodiversity, plant, \narthropod, fungi, chordate, protist, long read, Hi-C, protocol \nAbstract \nSince its inception in 2019, the Tree of Life programme at the Wellcome Sanger Institute has \nreleased high-quality, chromosomally-resolved reference genome assemblies for over 2000 \nspecies. Tree of Life has at its core multiple teams, each of which is responsible for key \ncomponents of the ‘genome engine’. One of these teams is the Tree of Life core laboratory, \nwhich is responsible for processing tissues across a wide range of species into high quality, \nhigh molecular weight DNA and intact RNA, and preparing tissues for Hi-C. 
Here, we detail \nthe different workflows we have developed to successfully process a wide variety of species, \ncovering plants, fungi, chordates, protists, arthropods, meiofauna and other metazoa. We \nsummarise our success rates and describe how to best apply and combine the suite of \ncurrent protocols, which are all publicly available at protocols.io. \nbioRxiv preprint doi: https://doi.org/10.1101/2025.04.11.648334; this version posted April 11, 2025. The copyright holder for this preprint \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is \nmade available under a CC-BY 4.0 International license. \n\nBackground \nIn recent years, advances in long read sequencing technologies have enabled genome \nassembly to an unprecedented quality and quantity. These advances underpin the goal of \nthe Earth BioGenome Project (EBP), which is to create high quality reference genomes for \nall described eukaryotic species [1]. This ambitious project faces many challenges, from \ncollecting and identifying species at scale, to extracting sufficiently high quality and quantity \nof DNA and RNA from a wide range of taxa, to sequencing, assembling and annotating \nextraordinarily diverse genomes. It is this central DNA extraction challenge that we address \nhere, alongside sharing all protocols that enable our work. The EBP goal will only be met \nthrough open and rapid sharing of key protocols and pipelines.  \nThe Tree of Life (ToL) programme at the Wellcome Sanger Institute is a major contributor to \nEBP goals. Over the past five years, we have extracted DNA and RNA from 41 phyla \nrepresenting 4883 species under projects such as Darwin Tree of Life [2] and Aquatic \nSymbiosis Genomics [3]. 
We have released dozens of protocols at the Sanger Tree of Life \nWorkspace on protocols.io to assist others in their efforts to carry out the laboratory work \nnecessary to generate high quality reference genomes. These protocols for tissue \npreparation, high molecular weight (HMW) DNA extraction, fragmentation and clean-up, and \nRNA extraction have been applied at scale with standardised quality control (QC) \nmeasurements at key stages. Here we share both the routine processes that we employ as \na first pass for organisms from a variety of different taxonomic groups as well as the \napproaches we take when we encounter failures. We also share what we have learned \nalong the way regarding specific challenges presented by different taxonomic groups, \nsample types, and species. The work presented here provides a summary of our first five \nyears of work, with a frozen data set [4] used to enable us to provide success rates and \nreview progress. Work in all of these areas is ongoing, with new species being \nprocessed daily and new protocols developed to improve output and efficiency. \nDeveloping a standardised pipeline for the processing of diverse biological samples \nfor reference genome assembly \nThe path from specimen to genome assembly requires optimal execution of complex wet \nlab, sequencing and informatic processes. In ToL, we use long read genomic sequencing \nand Hi-C chromatin conformation sequencing for reference genome assembly, and produce \ntranscriptomic data through short read RNA-seq for primary annotation of completed \ngenomes. Focussing on the wet-laboratory work, we have standardised the processes to \nallow progress at pace. In general, the laboratory steps required for generation of \nhigh-quality reference genomes are: 
\n1. Sample Preparation: This step involves assessment of extremely diverse samples \nvarying in size, density, morphology, and chemistry. Tissue will typically undergo \nsome form of homogenisation and be aliquoted for different pipelines (RNA \nextraction, DNA extraction, and Hi-C).  \n2. HMW DNA extraction: The protocols associated with this step comprise the most \ndiverse set of protocols depending on the target taxon and the nature of the available \ntissue. \n3. HMW DNA Fragmentation: Our primary long read sequencing approach over the \npast five years has been PacBio HiFi, and this requires molecules in the 12-22 kb \nrange, which is shorter than typical HMW DNA extractions.  \n4. Fragmented DNA clean up: After shearing, it is important to perform a clean up to \nremove low molecular weight (LMW) DNA as well as compounds and inhibitors that \nmay have co-extracted with or bound to DNA to achieve good sequencing results. \n5. Hi-C: Samples are crosslinked to preserve the 3D structure of the genome, digested \nwith restriction enzymes, biotin-labeled, and proximity-ligated before short read \nsequencing. \n6. RNA extraction: RNA of sufficient quantity and quality for genome annotation is \nextracted and sequenced. \nWe have adapted and further developed protocols for the steps above from a wide range of \nprimary sources. These protocols have been written in a modular way so they can each be \nused in conjunction with one another, depending on the taxonomy, tissue type and mass of \nthe sample. They have all been published on protocols.io [5] in the Sanger Tree of Life \nWorkspace [6], where we will continue to publish new protocols as we develop and deploy \nthem. 
We encourage people who modify or improve these protocols to “fork” them on \nprotocols.io and make them publicly available to the wider biodiversity genomics community. \nThe datasets that have been produced during this work are provided as supplementary \nmaterial [4], with one file containing data pertaining to the DNA extraction results, one to the \nDNA fragmentation results, and another one to the RNA extraction results as well as a data \ndictionary to facilitate interpretation. These files provide a more detailed view into the \nperformance of various protocols on a wide range of species (e.g. there are nearly 5000 \nspecies in the DNA extraction results). As work continues, access to the ever growing data \nset has been made available via a searchable online ‘Portal’ at \nlinks.tol.sanger.ac.uk/datasets/tol-lab-data. All statistical analyses presented were performed \nusing the statistical programming language R (Version 4.4.1 [7]), with data visualisations \nsupported by Tableau [8] software (Version 2024.2.1).  \nThe typical sample follows a four-step path for HMW DNA extractions and processing \n(Figure 1), with samples branching off for Hi-C and RNAseq. First, a sample is examined and \nweighed and the taxonomy is noted using the information provided by the collector. Based \non these features it is then directed into one of the three homogenisation protocols. The \noutcome of each of these protocols is three samples per species: two ‘tissue prep’ samples \nthat can be directed toward any HMW DNA or RNA extraction protocol, and another sample \nto enter the Hi-C protocol. 
The processing of tissue to enter our Hi-C protocol differs \ndepending on taxonomy (described in Table 1). Separately, HMW DNA is extracted from the \nprepared sample. Currently, we have ten HMW DNA extraction protocols and one \npre-extraction treatment, each optimised for different taxa and tissue types, \ndetailed further below. All of our protocols are version controlled with the version number in \nthe document name. Retired versions remain available on protocols.io but we advise using \nthe most recent version of any given protocol.  \nFigure 1. A tube map of Tree of Life protocols for HMW DNA extraction and \nprocessing.  \nThe publicly available ToL protocols for HMW DNA extraction and processing are \ncategorised into the four numbered steps typically required to generate long read data for a \nhigh quality genome assembly. Each box here refers to a protocol available in the \nprotocols.io Sanger Tree of Life Workspace and also linked to the Earth BioGenome Project \nWorkspace. The current best practice is indicated by the route taken by samples from \ndifferent taxonomic groups, shown by the coloured ‘tube lines’, and decision points to mark \nentry into these lines are discussed in the relevant taxonomic sections of this manuscript. 
\nThe group ‘Other Metazoa’ includes mostly marine non-Chordata and macroalgae, and \nwithin ‘jellies’ are jellyfish and ctenophores. \nThe output of each of these HMW DNA extraction protocols is a sample that can be \nfragmented using either of two methods, the selection of which is dependent on the quality \nand quantity of DNA in the sample, and the intended PacBio library type. There are two main \ntypes of long read library preparation: the conventional Low Input HiFi library (LI PacBio) and \nthe amplification-based Ultra Low Input HiFi library (ULI PacBio). The LI PacBio approach \nrequires at least 500 ng of 12-22 kb fragment size DNA per Gb of genome (i.e. a 2 Gb \ngenome would require at least 1 µg of sheared DNA). The ULI PacBio approach requires a \nshorter fragment size of around 10 kb to enable successful amplification, but can be \nsuccessful with as little as 20 ng of sheared DNA for smaller (< 1 Gb) genomes.  \nGiven these different input quantity and molecule length requirements, once we know the \nyield of DNA we have achieved in the HMW DNA extraction and its initial profile prior to \nshearing, together with an understanding of the predicted or known genome size obtained \nvia GoaT (Genomes on a Tree database; https://goat.genomehubs.org/) [9], we choose the \nappropriate shearing approach. We use g-TUBEs (Covaris, Woburn, MA) for ULI libraries \nand the Megaruptor (Diagenode, S.A.) for LI libraries. The output of these fragmentation \nmethods can be submitted to either of the two clean up protocols depending on the scale of \nthe operation (manual [10] or automated [11]), both of which are Solid Phase Reversible \nImmobilisation (SPRI) [12] methods.  \nFinally, RNA extraction is carried out on a separate tissue aliquot from the same species and \nwhere possible, the same organism. We deploy either a manual TRIzol protocol [13] or an \nautomated MagMAX mirVana protocol (Thermo Fisher Scientific, UK) [14].  
\nThe modularity of these protocols allows for flexibility and a high throughput, while \nmaintaining a standardised workflow. Having processed thousands of samples through these \nprotocols, we have been able to monitor successes and failures, the latter both obvious (e.g. \nthey yielded insufficient quality or quantity of DNA) and less obvious (e.g. where the \nDNA passed QC but still failed to generate sequence data). In practice, monitoring outcomes \nacross diverse taxa has enabled us to generate reference-level genomes at pace for those \ntaxonomic groups that tend to yield good DNA and results with the protocols above, while \nalso highlighting taxonomic groups that do not proceed well and thus require further attention \nand R&D protocol development.  \nSample selection and data generation in an ideal world \nIn an ideal process, all data for a species’ genome assembly would be generated from the \nsame individual, avoiding issues of sequence diversity among individuals. However, this is \noften not possible due to limited tissue availability. Assembly algorithms work best when all \nlong-read data for the contig assembly is produced from a single individual; therefore, we \nalways aim to access sufficient DNA to preclude needing to start long read processes over \nwith a new individual. Generation of Hi-C and RNAseq data from the same individual as was \nused for long read data generation is optimal, but for these processes, using different \nindividuals is viable. 
When two or more different individuals must be used for data \ngeneration, ideally the long read and Hi-C data should come from individuals of the same sex such that \nthe Hi-C data represents the full complement of chromosomes present in the long read data. \nAny individual can be used to produce transcriptomic data, and in instances where several \nindividuals are available it is possible to start all of these lab processes in parallel. \nIndividuals should be selected bearing in mind their biology, e.g. the heterogametic sex and \nnon-polyploid samples are preferred following EBP guidance [15]. When specimens are \nlarge enough for dissection, or where multiple tissue types are available for a species, \ndifferent tissues can be selected for different processes. For example, in insects, we would \nusually generate long read data and Hi-C from the head and thorax, and only use the \nabdomen for RNAseq if necessary. This avoids sequencing the microbiome present in the \ngut, and, in parous females, any sperm or embryos present in the reproductive tract. Our \ngeneral rule is to avoid tissues that might contain organisms in addition to the target species. \nWhile it is interesting to assemble the cobionts in a sample, the additional sequencing data \nrequired and the complexity of the subsequent assembly task argue against these tissues \nas sources for genomic DNA isolation.  \nThe amount of long read data required for genome assembly is dependent on genome size \nand is typically described in terms of coverage (e.g. 25x coverage of a diploid 1 Gb genome \n= 25 Gb of data required to give 12.5x coverage per haplotype). These calculations increase \nwhere polyploidy is present, as 12.5x coverage per haplotype is the minimum required. For \nthis reason, the predicted haploid genome size and ploidy for a species are retrieved from \nGoaT [9] to help determine the initial amount of sequencing required. 
For most species a \ndirectly measured genome size is not available and estimates from an average of the \nnearest taxonomic neighbours are used. Whilst these estimates can be inaccurate, they \nprovide a reasonable starting point for sequencing efforts, which can be adjusted based on \nk-mer based genome size and ploidy estimates obtained from initial data. \n \nFigure 2. Process flow from species samples to the full data set required for genome \nassembly depending on tissue availability.  \nIdeally, one individual specimen should provide tissue for generation of all data types. For \nmany smaller organisms this is not possible, and a second or third individual may be \nrequired (indicated as individuals 1, 2 and another). Importantly, all long-read data for the \ninitial contig assembly must be produced from one individual. If insufficient coverage is \nachieved from initial sequencing, this needs to be topped up with additional data generated \nfrom DNA from the same specimen. If there is very little DNA remaining, it may be possible \nto make a ULI library from what remains. Otherwise, long read data generation must start \nafresh from a new individual. The stated coverage follows the recommendations of the Tree \nof Life assembly pipeline at the time of writing. \nSample Preparation  \nMost samples are provided as small pieces of tissue, cell culture pellets, or whole small \norganisms, snap frozen at collection in 1.9 mL FluidX (barcoded) tubes, transported and \nstored at -70°C, in line with EBP guidance [15]. 
Where tissue is abundant, such as for vascular \nplants, larger tubes (7.6 mL) are used to collect as much tissue as possible without \ncompromising the integrity of the tissue. Once a sample has been selected for work in the \nlaboratory, a process is followed with the aim of normalising the biologically diverse samples \nas much as possible, resulting in the production of a tube containing sufficient material for \nthe next downstream process. The ideal amount of starting material is usually 25 mg for \nanimals, protists and fungi and 50 mg for plants. Although many organisms weigh \nless than 25 mg in total, we progress these through the protocols and are often successful. \nAll samples are weighed and divided based on their taxonomy, tissue type and size/mass of \nmaterial for disruption following the sample triage protocol [16]. Tissue homogenisation is a \ncrucial step prior to HMW DNA extraction. We have used the Powermasher (Nippi, Japan), \ncryoPREP (Covaris, Woburn, MA) and FastPrep-96 (MPBio, CA) at scale following the \nguidelines set out in Table 1. In general, smaller samples are weighed and then \npowermashed [17] in the extraction lysis buffer at room temperature. The benefits of this \nmethod are the ability to adapt the duration of the treatment to the requirements of the \nsample structure, and to direct the disruption toward different parts of the sample, \ni.e. concentrating on more resistant pieces of tissue as they become apparent \nduring the process. 
Importantly, there is no loss of tissue since the process occurs within the \nlysis buffer, and all material is immediately put into nucleic acid extraction without any tube \ntransfer or pipetting. The drawback of this technique is its low throughput nature, with each \nsample requiring individual powermashing.   \nHomogenisation at extremely low temperatures can be achieved using a pestle and mortar, \nand liquid nitrogen. This approach is inherently low throughput and it can be hard to avoid \ncross-contamination of samples through residual tissue on instruments. The cryoPREP \ninstrument [18] (Covaris, Woburn, MA) solves the issue of cross contamination. Samples are \nplaced into proprietary bags (TissueTUBEs) made of material resistant to extremely low \ntemperatures and force (Figure 3). The whole bag containing only the tissue sample (no lysis \nbuffer) is submerged in liquid nitrogen and then placed on the machine to be smashed \nbetween metal plates. The cryoPREP can be used repeatedly on the same sample, and the \nstrength is adjustable. Therefore, the process can be continued until the sample is reduced \nto a fine powder. The bag can be repeatedly submerged in liquid nitrogen between \npulverisations, maintaining low temperature and preventing degradation of nucleic acids \nwithin the sample due to the action of endogenous nucleases. We note that repeated \nprocessing can become labour intensive on the cryoPREP, and when processing a large \nnumber of samples maintaining the cold temperature is a challenge. Additionally, the \nproprietary bags cannot be reused and add significant cost to sample prep. 
\nFigure 3. TissueTUBE assembly for the cryoPREP.  \n(a) Left to right: Covaris TT01 TissueTUBE, TT01/1.9 mL FluidX adapter and 1.9 mL FluidX \ntube. (b) assembly with sample in place. The sample is stored within the FluidX tube and \nonce all parts are assembled, the assembly is inverted to allow the sample to move into the \nTT1 bag. The TT1 Adapter shown was 3D printed within the Sanger Institute. \nThe FastPrep-96 bead beating approach is useful for plant tissue disruption [19]. FluidX \ntubes containing snap-frozen dry tissue samples (50-90 mg) are selected, and 3 x 3 mm \nstainless steel grinding balls added. Up to 48 of these sample tubes are then assembled into \na rack which is submerged in liquid nitrogen to cool, and then carefully lifted allowing any \nexcess to drain. This chilled rack of tubes is then placed on the FastPrep-96 instrument to be \nshaken at 1600 rpm for 30 seconds. This process, including the submersion, is repeated \nthree times, after which all samples are reduced to a homogenous powder. There is no need \nto remove the beads from the tube before starting the lysis process, and performing this in \nthe same tube prevents loss of tissue. Two racks of tubes can be processed in parallel, \nenabling a throughput of 96 samples. This technique has proven extremely successful for \nplant tissue disruption and is showing promise for other organisms and tissue types. For a \nset of test species (oak, ladybird, snail, yeast, and marine fungus), tissue disruption with \nFastPrep-96 achieved similar results to the cryoPREP and Powermasher, but with the \nprocessing advantage of scale and a standardised approach.  \nWhile Figure 1 shows the ideal sample weight and homogenisation method for our HMW \nDNA protocols, Table 1 shows the ideal preparation for RNA and Hi-C, each split by \ntaxonomic grouping. 
The ideal preparation of samples is dependent on the process for which \nthe sample is intended, which can make standardising decisions difficult when balanced \nagainst the diverse nature of the samples. For some sample types, such as protists, the \nideal method of disruption has not yet been ascertained so the current best practice is \nshared here. Details are provided in the taxon specific sections of this manuscript to further \ndescribe the observations that can be drawn from our work so far. For other groups such as \nchordates, arthropods and plants, our methods are robust and routine, but the FastPrep-96 \napproach may replace powermashing and the cryoPREP; testing is underway.  \nTable 1. Sample Preparation Guidelines for RNA Extraction and Hi-C for each \ntaxonomic group \nSample taxonomy | Ideal input for RNA extraction | Ideal input for Hi-C \nArthropods | <10 mg: powermash; >10 mg: cryoPREP | whole head or up to 20 mg tissue, no homogenisation \nChordates | <10 mg: powermash; >10 mg: cryoPREP | up to 20 mg tissue, no homogenisation \nPlants | ≥10 mg, bead beaten | 50 mg, bead beaten \nFungi | ≥10 mg, bead beaten | 50 mg, bead beaten \nProtists | ≥10 mg, bead beaten | 20 mg, bead beaten \nOther metazoa and macroalgae | <10 mg: powermash; >10 mg: cryoPREP | up to 20 mg tissue, no homogenisation \n \nHigh Molecular Weight DNA Extraction \nFollowing the sample preparation and appropriate disruption process, samples progress to \nHMW DNA extraction. The ideal input weight and disruption method for different sample \ntypes is shown in Figure 1. 
A proportion of samples do not meet the minimum mass criteria, \nand it is therefore not possible to standardise the input for these samples. This does not \nprevent these samples from entering the process and contributes to our understanding of \nperformance outside the ideal parameters. \nIn order to minimise the number of different extraction protocols we use, our approach has \nbeen to first test a sample from every species using one standardised protocol. Samples that \npass well through this extraction protocol will go forward to produce sequence data, and \nthose that fail highlight the species groups that require further investigation. We use the \nQiagen (Hilden, Germany) MagAttract HMW DNA extraction method [20] as our default first \nprotocol owing to its track record in our laboratory and the ability to automate it on the \nKingFisher Apex [21]. For plant samples, an accompanying Plant MagAttract protocol [22] \nhas been implemented. Because the samples received are diverse, if a first extraction fails a \nsecond attempt is made with the same protocol. This allows for the selection of a new \nindividual, or a different tissue type, and on many occasions this results in a successful DNA \nextraction. This may be due to many factors including individual differences within species, \nor to factors relating to the sample collection and preservation. A 10 µL molecular voucher \n(aliquot) of every DNA extraction performed is retained for deposition to museums, and in \nsome cases, this voucher can be used in part for “top up” when slightly more long-read \ncoverage is required.  
\nStandard quality metrics are collected for each sample after DNA extraction. Nucleic acid \nquantity is measured using the Qubit® dsDNA assay (Thermo Fisher Scientific, UK). We \nalso assess DNA purity through spectrophotometry using a Lunatic spectrophotometer \n(Unchained Labs, Pleasanton, CA.). We measure the ratio of absorbances at 260 nm:280 \nnm, which is ideally ~1.8, and 260 nm:230 nm, which is ideally between 2.0 and 2.2. \nDeviation from the optimum for either of these measures indicates the presence of \ncontaminants in the extraction, for example phenols or carbohydrates, that may interfere with \ndownstream processes. The fragment length distribution of the HMW DNA is assessed using \nthe FemtoPulse System (Agilent Technologies, Santa Clara, CA.) and their Genomic DNA \n165 kb Kit. This pulsed-field capillary electrophoresis system measures concentration \n(through spectrophotometry) and length (based on retention time relative to standards) of the \nextracted DNA, and provides accurate sizing of fragments up to 165 kb. Above this size, \nultra HMW fragments are visible, but the sizing is not accurate.  \nThe quality of DNA extracted from samples is highly variable, resulting in diverse \nFemtoPulse profiles. To assess the traces routinely and standardise decision making \nbetween users, we developed a categorisation system. We defined five profile classes: \n“LMW DNA”, “smear bulk <50 kb”, “smear bulk >50 kb”, “HMW band plus smear” and “HMW \nband”. Model profiles representing each of these categories are shown in Figure 4. Over time, \nas we become more familiar with certain taxonomic groups (such as Lepidoptera), some \nsamples are processed at scale without the routine labelling of profiles [4]. For more \nchallenging sample groups still under active R&D, different aspects can be noted using a \nmulti-select approach - for example “HMW band” and “LMW DNA” could both be selected for \none sample [4]. 
\n \nFigure 4. Representative FemtoPulse molecular weight profiles of five DNA \nextractions, modelling the categories used within the Tree of Life.  \nThe X axis shows fragment size. Ideally most DNA would sit above 50 kb. This is \ndemonstrated in the example traces for categories ‘HMW band’ (black), ‘HMW band plus \nsmear’ (blue) and ‘Smear bulk >50 kb’ (red), where peaks in the trace are visible at \napproximately 160 kb. The profile ‘Smear, bulk <50 kb’ (yellow) is common and can be \nprogressed best when it is possible to remove the smaller fragments - ideally removing \neverything below 10 kb. Finally, the category ‘LMW DNA’ (green) is a failure for downstream \nlong read sequencing. \nIn case of good DNA quality but low yield on first extraction, there are two options. In some \ncases, it may be possible to extract again from the same specimen and pool samples in \norder to achieve the required quality and quantity of DNA. If this is not possible, samples \nwith < 1 Gb predicted genome size yielding over 100 ng of HMW DNA of sufficient quality are \nrouted toward ULI library prep, aiming for >20 ng of DNA after shearing and SPRI cleanup. \nSamples with <100 ng of DNA may be progressed along this route if there is no option to \nrepeat the DNA extraction, i.e. there is no tissue remaining. With the new PacBio Ampli-Fi \nkit, these thresholds are likely to be 1 ng per 3 Gb genome size. For species where even \nthis quantity of DNA is not achievable, picogram-input methods like PiMmS are available \n[23]. 
The ULI option is restricted to species with a genome size of <1 Gb because \namplification bias results in poor coverage of specific genomic regions, which affects \ndownstream assembly quality.  \nWe have introduced a 0.45X SPRI step directly after extraction to remove DNA fragments \n<10 kb [24]. Because the FemtoPulse trace shows relative absorbance normalised to the \nmaximal value, it can be difficult to assess the fragment distribution for samples with \nsignificant amounts of LMW DNA. The removal of shorter fragments via the SPRI cleanup \nenables more accurate analysis of the HMW DNA profile, demonstrated by analysis of \nMartes martes (pine marten) samples (Figure 5). Improving the analysis and \ndecision-making process at this point enabled a reduction in the quantity threshold for passing \nextractions through to fragmentation. \n \nFigure 5. Overlaid FemtoPulse profile for DNA extractions from Martes martes heart \ntissue (Mammalia; pine marten).  \nThe overlaid profiles show the impact of performing a 0.45X SPRI after DNA extraction. The \nblack trace (manual DNA extraction with no SPRI) shows very little detail due to the large \nLMW peak, whilst the blue trace (automated DNA extraction with a SPRI) reflects the profile \nof the remaining DNA with significantly more detail, allowing for informed decision making. \nThe overall success rates of samples within laboratory extraction processes can be broken \ndown by taxonomic group (Figure 6). 
Overall we find that chordates and plants progress \nwell, showing the highest HMW DNA extraction pass rates, of 96% (91.2% Pass, 4.0% Pass \nULI, 1.1% Pooling) and 91% (84.3% Pass, 5.5% Pass ULI, 0.9% Pooling), respectively. The \nnumber of species within the arthropods dwarfs the other sample groups, with 2373 species \nhaving been processed with a total pass rate of 85%. The highest HMW DNA extraction fail \nrate is observed in fungi, at 34.2%.  \n \nFigure 6. Success rates of DNA extractions per species from six taxonomic groups.  \nThe bar chart summarises the DNA extraction success per species across the six taxonomic \ngroups. The results are categorised as: Pass – DNA sufficient for sequencing achieved; \nPass ULI - DNA sufficient for sequencing with ultra low input achieved; Pooling - two DNA \nextractions were pooled to meet QC threshold; Fail - extractions have failed to provide \nDNA of sufficient quality and/or quantity to proceed. The results represent the best DNA \nextraction outcome per species, determined using the hierarchy: Pass > Pass ULI > Pooling \n> Fail. The number to the right of each bar represents the total number of species processed \nwithin each taxonomic group, and the percentages inside the bars indicate the proportion of \nspecies in each category. \nHMW DNA fragmentation \nMegaruptor fragmentation \nThe DNA fragmentation protocol [25] using the Megaruptor 3 (Diagenode, S.A.) instrument \nforces extracted DNA through a single-use hydropore connected to a syringe at a controlled \nrate, enacting mechanical shearing upon the DNA within the solution. 
The system can be \nused at various speeds, producing fragments of different median length. For LI PacBio \nlibraries the sheared DNA lengths should be in the range of 12-22 kb, ideally forming a tight peak (e.g. \nmost DNA at 18 kb). The main challenge with Megaruptor shearing is pore blocking, where a \nsyringe fails to pull the sample back through the hydropore after the initial pass. This occurs \nwithout warning and is largely unpredictable, although it is more common with visibly viscous \nDNA extractions. Blockages can be overcome by diluting the sample and running portions \nof it through the syringe over multiple stages, or by transferring the extract to a \ndifferent hydropore syringe designed for viscous samples (Cat. No. E07020001). \nOccasionally a sample does not shear fully during this process, which is clear only after \nobserving the fragment size distribution on the FemtoPulse. A second attempt at shearing is \nthen made, which typically completes the shearing. If a minority of longer fragments \npersist, the samples are still progressed to library prep, where these fragments will be removed in later \nsize selection processes. \ng-TUBE fragmentation \nThe incorporation of DNA amplification into the ULI library prep method allows for a \nsignificantly lower input (as little as 20 ng post shearing), and a shorter fragment length of \n9-11 kb. This fragment length can be routinely and reliably achieved using a g-TUBE \n(Covaris, Woburn, MA) with our standard protocol [26], which shears the DNA by \nforcing it through a narrow aperture membrane.  
While Megaruptor shearing can also \nbe used to generate the smaller fragments needed for ULI, g-TUBEs are preferable because \nof faster processing time and more reliable output. Unlike the Megaruptor syringes, the \ng-TUBE only requires the use of a microcentrifuge for shearing. Occasionally a sample fails \nto pass through the g-TUBE and requires a second spin, but this is the extent of \ntroubleshooting required for this method.  \nFragmented DNA clean up  \nAfter fragmentation with either method, the DNA is cleaned and concentrated again using \nSPRI beads, either manually [10] or automated on the Kingfisher APEX [11]. This process \nboth purifies the DNA and removes shorter fragments. Following this, the DNA is evaluated \nusing Qubit and Lunatic spectrophotometry and FemtoPulse electrophoresis. Samples \nmeeting the criteria for LI or ULI sizes and yields progress through to PacBio library \npreparation. \nThe QC results inform a decision-making process, resulting in an output of either a ‘Pass’ or \n‘Fail’ post-shearing. The pass rates for shearing are relatively high – from 89% for chordates \nto 67% for arthropods (calculated from data in [4]). The ULI path provides a route for low \nyield samples, and also those that fail because their short fragment length is not ideal for LI. \nProtist samples exemplify this, with a pass rate of 20% in LI fragmentation and 54% in the \nULI method (calculated from data in [4]). DNA samples that fail in both shearing options \nwould require another DNA extraction event. Since the introduction of a SPRI after DNA \nextraction and before fragmentation, there has been an increase in the fragmentation pass \nrate. Samples that would previously have failed at this point are now removed earlier in the \nprocess, meaning that less time is spent processing samples that are not suitable for \nsequencing. 
\nLong Read Sequencing  \nUltimately, the success or failure of a HMW DNA extraction for the purposes of long read \nsequencing is judged by sequencing yield, read quality and read length. For the PacBio \nplatforms, raw sequencing reads from circular library molecules that contain multiple reads \nacross both strands of the insert DNA are automatically error corrected to generate circular \nconsensus sequencing (CCS) output reads of high per-base quality. On the Sequel IIe \nplatform, a CCS yield over 20 Gb is considered good, while a yield of 10-20 Gb is not ideal \nbut still potentially adequate depending on genome size. Finally, a yield below 10 Gb is \nconsidered poor and a target for improvement. The Revio platform is designed to yield three \ntimes this output, and performance is judged in line with this. With the aim of producing \n12.5x coverage per haplotype, library multiplexing on the Revio is advised to avoid \noverproduction of data, unless the organism has a large genome (e.g. > 2 Gb).  \nIf a sample is sequencing well but has not reached the required coverage, a ‘top-up’ of data \nfrom the same genetic individual is needed. Recent work in this area has shown that the \nlongevity of both LI and ULI libraries is greater than had been anticipated, with examples of \nboth surviving storage at -70°C for over 9 months before performing equally well on a \nsecond run. Where no library or DNA remains, the DNA voucher (a 10 µL aliquot taken from \nall extractions) can be valuable for ULI prep. \nSmall diploid genomes (<0.5 Gb) can reach 25x coverage from low cell yield (i.e. <10 Gb), \nbut this poor performance remains a target for improvement. 
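The yield bands and coverage target described in this section can be expressed as a small calculation. The sketch below is illustrative only: the function names are hypothetical, the Revio thresholds are assumed to be exactly three times the Sequel IIe values (the text says only that performance is judged in line with a three-fold higher designed output), and coverage is approximated as CCS yield divided by genome size.

```python
# Sequel IIe bands from the text: <10 Gb poor, 10-20 Gb adequate, >20 Gb good.
SEQUEL_IIE_POOR_BELOW_GB = 10.0
SEQUEL_IIE_GOOD_ABOVE_GB = 20.0

# ASSUMPTION: Revio bands are the Sequel IIe bands scaled by three.
PLATFORM_SCALE = {'sequel_iie': 1.0, 'revio': 3.0}


def classify_ccs_yield(yield_gb: float, platform: str = 'sequel_iie') -> str:
    '''Return good, adequate or poor for a single-cell CCS yield in Gb.'''
    scale = PLATFORM_SCALE[platform]
    if yield_gb > SEQUEL_IIE_GOOD_ABOVE_GB * scale:
        return 'good'
    if yield_gb >= SEQUEL_IIE_POOR_BELOW_GB * scale:
        return 'adequate'
    return 'poor'


def approx_coverage(yield_gb: float, genome_size_gb: float) -> float:
    '''Approximate total coverage as yield divided by genome size.'''
    if genome_size_gb <= 0:
        raise ValueError('genome size must be positive')
    return yield_gb / genome_size_gb


if __name__ == '__main__':
    # A 0.5 Gb diploid genome reaches 25x (12.5x per haplotype) at 12.5 Gb.
    print(classify_ccs_yield(12.5), approx_coverage(12.5, 0.5))
```

Under these assumptions a cell can be classed as poor and still deliver the target coverage for a small genome, which is the situation described for diploid genomes below 0.5 Gb.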
As the data are collected and \naccumulated, trends in lower CCS yield for different taxonomic groups become indicators of \nR&D need.  \nThe data yield required for successful assembly is based on the genome size of the target \norganism within a sample. However, as samples are collected from wild environments, other \nspecies are often present within a sample (e.g. the microbiome, pathogens and parasites). \nThe bioinformatics pipelines in place to process data are capable of filtering out data that \nderives from these “cobionts”. In many cases the reads from non-target organisms are \nsufficient to generate cobiont genome assemblies as a by-product of the attempts to \nsequence the target species. However, occasionally non-target species can be present in \nsuch abundance as to prevent the sequencing of the target species or make it too costly to \ncontinue sequencing to achieve required coverage for the target species. This scenario may \narise from cultured species that require the presence of other organisms to grow, for \nexample protists that feed on bacterial species. These cases present a challenge for \nreference genome pipelines and purification of the sample upstream of DNA extraction is \nrecommended. \nThe standardised workflow described here produces a range of CCS yields that \nvaries between taxonomic groups for both ULI and LI submissions (Figure 7). We \ncompared data production for all libraries run as one species’ library per cell on both PacBio \nplatforms, Sequel IIe and Revio. To present standardised results, data from multiplexed \nsamples were not included. 
Comparing submission types, the average yield of runs on the \nRevio instrument for LI libraries ranged between 48 and 69 Gb, and from 65 to 74 Gb for ULI \nsubmissions. On the Sequel instruments, the yield ranged between 17 and 25 Gb for LI \nsubmissions, and 19 to 24 Gb for ULI submissions. Arthropod species had a fairly consistent \nyield regardless of instrument or library preparation techniques. However, fungi had more \nvariable yields: on the Sequel IIe platform, the average yield was low (17 Gb) with the LI \nlibrary prep method but improved to 23 Gb when ULI libraries were sequenced. \nOur experience with sequencing fungi on the Revio is limited but in line with the \nSequel IIe, with approximately three-fold higher yields for LI libraries (50 Gb). Most fungal \nspecies are now directed to the ULI library pipeline because of low DNA yields. Overall, ULI \nlibraries show less variation in yield within each taxonomic group than LI libraries, as would \nbe expected for PCR-amplified DNA when compared with native. \n \nFigure 7. Distribution of CCS yield per taxonomic group. \nThe distribution of the CCS yields across taxonomic groups is shown for each instrument \ntype (Revio and Sequel) and library type (LI: Low Input; ULI: Ultra Low Input). Each box \nspans the interquartile range, with the lower and upper edges indicating the 25th and 75th \npercentiles, respectively. The horizontal line within each box represents the median CCS \nyield value per taxonomic group. 
Whiskers extend to the furthest data points within 1.5 times \nthe interquartile range, while data points outside of this range are considered outliers. Data \nanalysed included all single-specimen (i.e. non-multiplexed) sequencing runs, \nregardless of extraction method. The label under each subplot refers to n, the number of \nsequencing runs within each sub-category. \nCCS yields vary not only across species but also within a species. For \nexample, for the newt Lissotriton vulgaris, which has a large genome (24 Gb), a \nsingle library was made from DNA extracted from muscle. This library was run on seven \nSequel IIe cells, with CCS yields ranging from 18 to 32 Gb per cell. This variation and \nunpredictability in yield presents challenges for scaling production. The recent introduction of \nSPRQ preloading normalisation on the Revio system will, we hope, reduce this unwanted \nvariability. To more accurately target required coverage based on predicted genome size, \nmany libraries are now multiplexed, with 2, 4 or 8 libraries run on one Revio cell in parallel. \nIdeal plexing in terms of molarity, taxonomy and fragment length is not possible as each \nspecimen varies in its final library insert size profile.  \n \nHi-C Library prep and sequencing \nA tissue aliquot to be used for Hi-C library prep is created for each species during the \nsample preparation process. The guidelines for the amount and disruption of each sample \ntype are shown in Table 1. We use the Arima Genomics (Carlsbad, CA, US) Hi-C v2 kit. 
Three \ndistinct fixation protocols are used depending on the taxon group. For animals, we follow the \nArima high coverage kit recommendation for animal tissue, which involves fixation by 2% \nformaldehyde for 20 minutes in TC buffer (Arima). For plant and algal samples, we follow the \nArima high coverage recommendation for mammalian cell lines, which involves nuclei \nisolation using the Qiagen Qproteome Cell Compartment kit followed by fixation with 2% \nformaldehyde for 10 minutes in 1x PBS buffer. Finally, for sponges (Porifera) that have been \nprepared via the “squeeze” method [27] to create a cell pellet, we carry out fixation with 2% \nformaldehyde for 10 minutes in 1x PBS buffer.  \nHi-C is performed according to manufacturer’s recommendations except that the number of \nPCR cycles used in Illumina library amplification is directed by the DNA concentration post \nadapter ligation and streptavidin enrichment as measured using Qubit dsDNA high sensitivity \nkit (Thermo Fisher Scientific, UK), rather than determining amplification cycles by qPCR as \nin the Arima QC2 procedure. The following PCR cycle guidelines are used, based on the DNA \nconcentration in the post streptavidin enrichment quantification: >8 ng/µL, 8 cycles of PCR; \n>2 ng/µL, 10 cycles; >0.5 ng/µL, 12 cycles; >0.1 ng/µL, 14 cycles; for lower concentrations, \n16 cycles. \nLibraries are sequenced using Illumina (San Diego, CA, US) short read technology on the \nNovaSeqX, with 150 bp paired end reads on the 25B flow cell. Libraries are multiplexed such that \n25x coverage per haplotype of the genome is aimed for each sample. Grouping together \nsamples with genome sizes of <1.5 Gb, 1.5-2.5 Gb, and >4 Gb can be useful for achieving \ndesired plexing levels. 
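The Qubit-based PCR cycle guidelines given earlier in this section amount to a threshold lookup, sketched below. The function name `hic_pcr_cycles` is hypothetical, and the rejection of negative concentrations is an added assumption. The thresholds and cycle numbers are those stated in the text, applied as strict greater-than comparisons.

```python
# (threshold in ng/uL, PCR cycles), checked from highest to lowest threshold.
PCR_CYCLE_GUIDE = (
    (8.0, 8),
    (2.0, 10),
    (0.5, 12),
    (0.1, 14),
)
# Used when the concentration is at or below the lowest threshold.
FALLBACK_CYCLES = 16


def hic_pcr_cycles(conc_ng_per_ul: float) -> int:
    '''Return the number of library amplification cycles for a Hi-C sample.

    conc_ng_per_ul is the Qubit dsDNA HS concentration measured after
    adapter ligation and streptavidin enrichment.
    '''
    if conc_ng_per_ul < 0:
        # ASSUMPTION: negative readings are treated as invalid input.
        raise ValueError('concentration cannot be negative')
    for threshold, cycles in PCR_CYCLE_GUIDE:
        if conc_ng_per_ul > threshold:
            return cycles
    return FALLBACK_CYCLES


if __name__ == '__main__':
    for conc in (10.0, 3.0, 1.0, 0.2, 0.05):
        print(conc, hic_pcr_cycles(conc))
```

A concentration exactly on a threshold (for example 8 ng/µL) falls into the next band down, because the guidelines are written with a greater-than sign.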
\nRNA extraction \nFor all species we also extract and sequence mRNA to provide data for gene annotation. \nThese data would ideally be produced from several different tissue types for each species, \nas this is most beneficial for gene annotation, but this is often not possible due to small \norganism size or restricted number of tissues collected. Originally, a manual TRIzol method \n[13] was applied that achieved a high success rate from an extremely wide range of \nsamples, typically using 25 mg of tissue. Tissue prepared by either the cryoPREP or \npowermasher can be used as input for this method, and the resulting yields were \nconsistently significantly greater than the requirements for short read RNAseq (Illumina, San \nDiego, CA.). After isolation, any DNA remaining in the samples was removed using Turbo \nDNase (Thermo Fisher Scientific, UK) and RNA was checked for quality and quantity using \nthe Qubit RNA Broad Range Assay kit (Thermo Fisher Scientific, UK) and the Nanodrop. \nThis extraction method is ideal for a small number of samples, but the ergonomic issues and \nuse of hazardous substances are prohibitive for scaling up. We therefore switched to the \nMirVana (Thermo Fisher Scientific, UK) bead-based extraction protocol [14] and reduced the \namount of tissue input from 25 to 15 mg. 
All taxon groups score close to 100% extraction \nsuccess, with the exception of protists at a 92% pass rate (calculated from data in [4]). \nSuccess here means a total RNA yield over the 100 ng input requirement for our standard library prep \nand sequencing process: poly(A) RNA-Seq libraries are constructed using the NEB Ultra II RNA \nLibrary Prep kit, following the manufacturer’s instructions, and sequenced on the Illumina \nNovaSeq X instrument. For samples that fail, a different individual, different tissue type(s) \nand/or increased input amounts can be used in order to increase the RNA yield or quality \nobtained. Ultimately, RNA quality is highly dependent on the quality of the sample material provided \nand many failed extractions originate from samples not preserved in the ideal way. RNA \nextracted from several different organisms and tissue types using this method has also been \nsuccessful for long read RNA sequencing with the Kinnex (Pacific Biosciences, Menlo Park, \nCA.) methodology. \nTaxonomic specific considerations \nArthropods \nSmall arthropod species are often preserved as whole individuals, requiring several \nindividuals to complete the data required for an assembly (one for long read, one for Hi-C, \none for RNAseq). Larger arthropods are partitioned into different tubes, e.g. head, thorax, \nand abdomen each in separate tubes. Arthropods have an extraction pass rate of almost \n85% across 2374 arthropod species reported on here representing 1575 genera, 453 \nfamilies, and 52 orders (Figure 8). \n \nFigure 8. Arthropod DNA extraction success metrics by order. 
\nThe bar chart summarises the DNA extraction success per species across Arthropod orders. \nThe results are categorised as: Pass – DNA sufficient for sequencing achieved; Pass ULI - \nDNA sufficient for sequencing with PacBio Ultra Low Input achieved; Pooling - two or more \nDNA extractions were pooled to meet QC threshold; Fail - extractions have failed to provide \nDNA of sufficient quality or quantity to proceed. The results represent the best DNA \nextraction outcome per species, determined using the hierarchy: Pass > Pass ULI > Pooling \n> Fail. The number inside each bar represents the percentage of species that have passed \nextraction within the orders, including Pass, Pass ULI and Pooling categories. To account for \nthe wide range in species counts, a logarithmic scale is used, and orders with fewer than five \nspecies are excluded from this visualisation but are available in the supplementary material \n(Figure S1). \nWhilst arthropods usually perform well at extraction, they are not without challenges. One \nchallenge is the disruption of small organisms with chitinous exoskeletons, such as \nAmphipoda where 71% of 24 species have failed extraction (Figure 8). Initial work to apply \nthe bead beating homogenisation method for these samples looks promising. Another \nchallenge is small body size, which makes low HMW DNA yield the most common failure. \nIf the genome size is appropriate, these samples can be successful with ULI.  \nHowever, ULI does not work well on its own for small organisms with large genomes.  \nJumping spiders (Salticidae) are an example of such a group, with small bodies, \ntypically 10-15 mg but ranging down to 3 mg, and large genomes of up to 10 Gb. A modified \nextraction protocol [28] with reduced volumes of buffers was successful in improving DNA \nyields per specimen.  
\nIsopods are another example of species with typically small body size (<1 cm) and large \ngenomes (e.g. Oniscus asellus 8.4 Gb). When tissue is restricted due to the organism size, \nand the DNA sequences poorly, as observed for isopods, reaching sufficient coverage is \nchallenging. Currently, this challenge is being addressed through combining LI and ULI \nlibrary types. This strategy minimises the impact of amplification biases present in the ULI \ndata, as regions of drop out are likely to be compensated by presence in the LI data. The \nlibraries produced from the ULI approach tend to sequence very well, as the DNA has been \namplified. Amplification also shifts the ratio of DNA to any inhibitors present in \nfavour of high sequencing yields.  \nArthropod DNA tends to perform well in fragmentation processes as shown by the high pass \nrates (Figure 9).  \n \nFigure 9. Arthropod fragmentation success by order. \nThe bar chart summarises the DNA fragmentation success per species across Arthropod \norders, processed via the LI (left) or ULI (right) submission types. Species progressed under \nboth submission types are included in both bars. 
The results represent the best DNA \nfragmentation outcome per species and submission type, determined using the hierarchy: \nPass > Fail. The number inside each bar represents the percentage of species that have \npassed fragmentation within the orders. To account for the wide range in species counts, a \nlogarithmic scale is used, and orders with fewer than five species are excluded from this \nvisualisation but are available in the supplementary material (Figure S2). \n \nPlants \nPlant samples are typically fresh leaf material that has been collected in relative abundance, \nsupported by the use of 7.6 mL tubes. Tissue availability is usually not a limiting factor for \nplants, other than particular taxonomic groups such as the Bryophytes. \nWe use the ‘Plant MagAttract’ [29] protocol routinely for DNA extraction from all \nplant species (Figure 1). It is efficient at extracting HMW DNA from a wide range of species \nto an extent adequate for long-read sequencing. Plant samples that fail to produce \nsequenceable HMW DNA from the Plant MagAttract extraction protocol are processed \nthrough the Plant Organic Extraction (POE) protocol [30]. Species extracted with the Plant \nMagAttract v.4 protocol can result in a poor DNA profile that is significantly improved when \nthe same species is extracted with the POE protocol (Figure 10). The POE protocol is \nmid-throughput and requires more time and expertise in the laboratory, and for this reason it \nis employed only as a second attempt for recalcitrant species. 
Work is underway to \nidentify, prior to extraction, which species would most benefit from proceeding directly to the \nPOE extraction method. \n \nFigure 10. Overlaid FemtoPulse molecular weight profiles of two DNA extractions \nfrom Thymus drucei (Lamiales; wild thyme) with two different DNA extraction \nprotocols.  \nThe trace from the Plant MagAttract extract from 67 mg tissue shows primarily a wide LMW \npeak at 5.8 kb, whereas the POE protocol extract from 65 mg tissue shows a strong peak at \n110 kb with a slight smear down to 1.8 kb and a further strong peak in the ultra HMW zone \n(>200 kb). \nWe have processed 998 species covering 63 orders, 166 families and 564 genera within the \nPlant taxonomic group, including 927 vascular (Tracheophyta and Streptophyta) and 71 \nnon-vascular (Bryophyta and Marchantiophyta) species. High success rates have been \nobserved across the plant orders with 91% of all species extracted having passed through to \nsubsequent processes. However, many of the successfully extracted plant species have \nfailed at later stages including fragmentation, library preparation, or sequencing. Plant \nspecies that have failed twice at any of these points have been selected and processed \nthrough the POE protocol. The switch to use of the POE protocol is clearly of benefit to some \ngroups, such as the Saxifragales, where the pass rate changes from 56% with MagAttract to \n100% with POE (Figure 11).  \nFragmentation of DNA extracted from plant material is usually not problematic, with a \nsuccess rate of 82%, calculated using unique species regardless of the submission protocol. 
\nFor species for which both protocols have been used, the LI and ULI submission protocols \nhave had a success rate of 80% and 93% respectively (calculated from data in [4]). \nBryophytes have been challenging due to their low tissue availability as an individual is often \n<15 mg, whereas the usual input for plant DNA extraction methods is 50 mg. Modification of \nprotocols to minimise tissue loss and maximise DNA recovery has been successful, \ncoupled with the ULI library prep method as the genome sizes are often <1 Gb. \nA number of plant species remain challenging, with neither MagAttract nor POE DNA \nextraction protocols providing the required DNA yield or quality for long read sequencing. A \npre-lysis hypertonic sorbitol wash has been developed [31] to remove interfering chemical \ncontaminants present within the cytosol of plant specimens prior to lysis. Sorbitol is an \nosmotically active sugar alcohol capable of ‘drawing out’ the cytosol of plant tissue \nhomogenates without disrupting the nuclear membrane. When a sorbitol wash is \nsuccessful, a previously recalcitrant sample’s lysate should be free of viscosity, \nbrowning and other unfavourable characteristics. Initial results have shown that this protocol \ncan significantly improve the quality of DNA extractions, and the CCS yield.  \n \nFigure 11. Plant MagAttract and Plant Organic Extraction (POE) DNA extraction \nsuccess metrics by order. \nThe bar chart summarises the DNA extraction success per species across Plant orders, \nextracted via MagAttract or Plant Organic Extraction protocols. 
Species extracted with both \nprotocols are included in both bars. Species extracted with protocols other than Plant \nMagAttract or Plant Organic Extraction are not represented. The results are categorised as: \nPass – DNA sufficient for sequencing achieved; Pass ULI - DNA sufficient for sequencing \nwith ultra low input achieved; Pooling - two DNA extractions were pooled to meet QC \nthreshold; Fail - extractions have failed to provide sufficient DNA to proceed. The results \nrepresent the best DNA extraction outcome per species and extraction protocol, determined \nusing the hierarchy: Pass > Pass ULI > Pooling > Fail. The number inside each bar \nrepresents the percentage of species that have passed extraction within the orders, \nincluding Pass, Pooling and Pass ULI categories. To account for the wide range in species \ncounts, a logarithmic scale is used, and orders with fewer than five species are excluded \nfrom this visualisation but are available in the supplementary material (Figure S3). \nFungi (including Lichens) \nMost fungal samples received had been cultured from samples collected in the field. The \nsamples arrived as cell pellets with low tissue mass and size, presenting challenges for DNA \nextraction. Mycelium samples can also be challenging due to low density of nuclei in the \ntissue. DNA extractions for these fungi have been of consistently low yield, and often of low \nquality in terms of fragment length, resulting in a high ‘Fail’ rate (Figure 12). On the occasions \nwhen high yields have been achieved, the sequencing of these samples has been poor. 
Given \nthe small genome size of fungi (typically ~40 Mb) and the relatively low yields \nachieved, the ULI library prep method has been the standard option for fungi samples. The \nideal amount of DNA to start this process is 100 ng, although samples with >25 ng of DNA \nwithin the ULI fragment range are progressed. The optimised automated Plant MagAttract \nprotocol (v.4) [29] has provided increased DNA yield and improved FemtoPulse profile and is \ntherefore now the protocol of choice. Post-extraction, the majority of fungi samples are \ndirected toward the g-TUBE fragmentation method [26]. This is an efficient and effective \nprocess resulting in a high pass rate for unique fungi species of 79% (calculated from data in \n[4]). The amplification in the ULI library generation process is being utilised here to aid \nsequencing, rather than to compensate for a very low DNA input amount, as native fungal DNA \noften produces poor sequencing yields. To optimise the amplification process for this \npurpose, we are currently exploring a reduction in the number of PCR cycles and trialling \ndifferent enzymes. \nFigure 12. Fungi DNA Extraction success metrics by order \nThe bar chart summarises the DNA extraction success per species across Fungi orders. The \nresults are categorised as: Pass – DNA sufficient for sequencing achieved; Pass ULI - DNA \nsufficient for sequencing with ultra low input achieved; Pooling - two DNA extractions were \npooled to meet QC threshold; Fail - extractions have failed to provide sufficient DNA to \nproceed. The results represent the best DNA extraction outcome per species, determined \nusing the hierarchy: Pass > Pass ULI > Pooling > Fail. The number inside each bar \nrepresents the percentage of species that have passed extraction in any way, not those that \nfailed. 
To account for the wide range in species counts, a logarithmic scale is used, and \norders with fewer than five species are excluded from this visualisation but are available in \nthe supplementary material (Figure S4). \nChordates \nThe routine processing of chordates is highly efficient, resulting in a 96% pass rate for \nspecies at DNA extraction (Figure 6). The DNA extraction status for orders within the group \nreveals that this success is general, with no clear trends (Figure 13). At DNA fragmentation, \nchordates achieved an 89% pass rate (calculated from data in [4]) when results \nfor both protocols (g-TUBE and Megaruptor) are combined at the species level. \nThe collection of chordate samples is legally and ethically challenging. As a result, a \nsignificant number of chordate samples come from specimens that have \nbeen found dead, or are small tissues collected from live individuals. The use of preservative \nsolutions in lieu of snap-freezing is common for chordate samples. To ensure proper fixation, \nit is recommended that a 1:10 ratio of tissue:fixative volumes is used. Ear punches and \npunch biopsies are a very useful form of non-lethal sample collection for chordate species; \nthough they are not the most successful tissues for DNA extraction, disruption via the cryoPREP has helped \nmaximise DNA yield and quality. Fish and bird blood are amongst the best performing \ntissues for HMW DNA extraction and are processed using the Nanobind HMW DNA \nextraction - nucleated blood protocol [32]. 
This manual protocol requires inputs ranging from \n5-25 µl of nucleated blood, flash frozen or stored in ethanol at -80°C, from birds, fish or \namphibians, and yields around 10-40 µg of HMW DNA. An automated version of this \nprotocol [33] permits high throughput extraction of nucleated blood samples. Samples \ncollected at necropsy often contain only degraded DNA. In these situations, it is possible to \nextract and then perform a stringent 0.45X SPRI clean-up to remove any remaining RNA or LMW \nDNA, and progress directly to library preparation and sequencing, bypassing shearing, if the \nfragment size profile shows that the DNA is already fragmented. \nFigure 13. Chordate DNA Extraction success metrics by order \nThe bar chart summarises the DNA extraction success per species across Chordata orders. \nThe results are categorised as: Pass - DNA sufficient for sequencing achieved; Pass ULI - \nDNA sufficient for sequencing with ultra low input achieved; Pooling - two DNA extractions \nwere pooled to meet QC threshold; Fail - extractions have failed to provide sufficient DNA to \nproceed. The results represent the best DNA extraction outcome per species, determined \nusing the hierarchy: Pass > Pass ULI > Pooling > Fail. The number inside each bar \nrepresents the percentage of species that have passed extraction in any way, not those that \nfailed. To account for the wide range in species counts, orders with fewer than five species \nare excluded from this visualisation but are available in the supplementary material (Figure \nS5). 
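
The per-species summary rule used in these figure captions (best outcome under the hierarchy Pass > Pass ULI > Pooling > Fail, then the percentage of species per order that passed in any way) can be sketched as follows. This is a minimal illustration in Python, not the analysis code behind the figures: the record layout of (order, species, outcome) tuples, the function names and the example values are all assumptions made for illustration and are not taken from the data freeze [4].

```python
from collections import defaultdict

# Outcome hierarchy as stated in the captions: Pass > Pass ULI > Pooling > Fail.
RANK = {'Pass': 3, 'Pass ULI': 2, 'Pooling': 1, 'Fail': 0}


def best_outcome_per_species(records):
    # records: iterable of (order, species, outcome) tuples (hypothetical layout).
    # Keeps only the highest-ranked outcome seen for each species.
    best = {}
    for order, species, outcome in records:
        key = (order, species)
        if key not in best or RANK[outcome] > RANK[best[key]]:
            best[key] = outcome
    return best


def pass_percentage_by_order(best, min_species=5):
    # Percentage of species per order whose best outcome is anything but Fail.
    # Orders with fewer than min_species species are left out, as in the figures.
    outcomes = defaultdict(list)
    for (order, _species), outcome in best.items():
        outcomes[order].append(outcome)
    result = {}
    for order, values in outcomes.items():
        if len(values) < min_species:
            continue
        passed = sum(1 for value in values if value != 'Fail')
        result[order] = round(100 * passed / len(values))
    return result


# Invented example records, not data from the paper.
example = [
    ('OrderA', 'sp1', 'Fail'),
    ('OrderA', 'sp1', 'Pass ULI'),
    ('OrderA', 'sp2', 'Pooling'),
    ('OrderA', 'sp3', 'Fail'),
    ('OrderA', 'sp4', 'Pass'),
    ('OrderA', 'sp5', 'Fail'),
    ('OrderB', 'sp6', 'Pass'),
]
best = best_outcome_per_species(example)
print(pass_percentage_by_order(best))  # {'OrderA': 60}
```

In this invented example, sp1 counts once with its best outcome (Pass ULI), so three of five OrderA species pass, and OrderB is omitted because it has fewer than five species.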
\nProtists \nThis taxonomic group poses a unique challenge due to the diversity of the species it \ncontains, from microalgae to dinoflagellates, the nature of cell walls and exoskeletons, and \nthe relative size of both individuals and genomes [34]. \nProtist samples have typically been provided as cell pellets from cultured strains. Because \ndifferent species require different culture conditions, the pellets vary widely \nin mass and in cell number per mg weight. This diversity makes it hard to \nstandardise input amounts for DNA extraction. Although not yet fully optimised, the currently \npreferred process begins with a cell pellet of 50 mg, disrupted with the cryoPREP. DNA is \nextracted using the Plant MagAttract v4 extraction protocol [29,35]. The use of the \ncryoPREP provides increased yields compared to power mashed samples, and the adoption \nof Plant MagAttract v4 lysis also increases yield due to the better lysis of cell wall structures \npresent in many protists and microalgae. These measures have resulted in an overall \nsuccess rate in extraction for protists of 83% (Figure 6) with an uneven distribution between \norders (Figure 14). \nFigure 14. Protist DNA extraction success metrics by order \nThe bar chart summarises the DNA extraction success per species across Protist orders. \nThe results are categorised as: Pass - DNA sufficient for sequencing achieved; Pass ULI - \nDNA sufficient for sequencing with ultra low input achieved; Pooling - two DNA extractions \nwere pooled to meet QC threshold; Fail - extractions have failed to provide sufficient DNA to \nproceed. 
The results represent the best DNA extraction outcome per species, determined \nusing the hierarchy: Pass > Pass ULI > Pooling > Fail. The number inside each bar \nrepresents the percentage of species that have passed extraction in any way, not those that \nfailed. To account for the wide range in species counts, orders with fewer than five species \nare excluded from this visualisation but are available in the supplementary material (Figure \nS6). \nWhen sample providers identify a species as cultured in axenic or low bacteria conditions, \nand its predicted genome size is below 1 Gb, we are generally able to \nassemble genomes with ULI sequencing and Hi-C data. Protist samples predominantly \nprogress toward the ULI route, achieving a success rate of 83%. In contrast, we have lower \nsuccess rates (57%) with LI shearing protocols (calculated from data in [4]). Overall, our \ncombined success rate is 74%, with future work aiming towards improvement of the ULI \npipeline. The unusual genome structure of some protist species, for example ciliates, provides \nan extra challenge for fragmentation and size selection. Chromosomes are present in the \nrange of 5 kb to 20 kb and these would be removed using current size selection protocols; \nfuture methods to efficiently sequence these fragments may include fractionation of DNA \nextracts. \nMany protists feed on bacteria, or require their presence for growth, and for this reason \nsamples can be a mixture of protist and bacterial cells in culture. Sequencing yields for \nprotist ULI samples may be very good, achieving over 24 Gb per Sequel IIe cell. 
However, \nup to 99% of these reads can originate from co-cultivated bacteria within the sample rather \nthan the target protist. Working with collectors to reduce this bacterial load \nis therefore fundamental to the progression of protist samples. \nFuture work in this area will focus on assessment of dual extraction protocols, aiming to \nextract easily lysed organisms and remove this DNA in a first pass, followed by a stronger \nchemical or physical cell lysis and DNA extraction for the remaining sample. \nOther Metazoa and Macroalgae \nThe polyphyletic group of “other metazoa and macroalgae” contains a multitude of different \nphyla, predominantly a mix of marine and terrestrial invertebrates, but also including \nmacroalgae. This grouping is largely based on the focus of the species collectors, and their \naccess to species whilst sampling in marine environments. The large diversity of the species \nwithin this grouping provides many challenges and opportunities for new \ndevelopments. \nSamples within this group are homogenised using either cryoPREP or powermashing, based \non the weight of the tissue available as described in the standard guidelines [16] and then \nsubjected to the automated MagAttract extraction process [24]. Species are matched with \nideal extraction protocols with an overall pass rate of 79% (Figure 6). This is relatively low \ncompared with other taxonomic groups and is not spread evenly, with, for example, Mollusca \nshowing a high success rate of 87%, whereas Platyhelminthes only achieve 29% (Figure \n15). 
\nThe results of fragmentation processes for the orders contained in other metazoa show a \npass rate of 74% (calculated from data in [4]). ULI is a useful option when fragmentation \nresults are poor, as many of the fragmentation failures are associated with a low yield or a \npoor profile. ULI libraries dominate for the majority of orders, particularly Cnidaria where 54 \nspecies have been processed via LI and 108 via ULI library prep. Mollusca are the exception \nto this trend, with 186 species processed for LI and only 66 for ULI. \nFigure 15. Other metazoa and macroalgae extraction success metrics by taxon group. \nThe bar chart summarises the DNA extraction success per species across other metazoa \nand macroalgae taxon groups. The results are categorised as: Pass - DNA sufficient for \nsequencing achieved; Pass ULI - DNA sufficient for sequencing with ultra low input \nachieved; Pooling - two DNA extractions were pooled to meet QC threshold; Fail - \nextractions have failed to provide sufficient DNA to proceed. The results represent the best \nDNA extraction outcome per species, determined using the hierarchy: Pass > Pass ULI > \nPooling > Fail. The number inside each bar represents the percentage of species that have \npassed extraction in any way, not those that failed. To account for the wide range in species \ncounts, a logarithmic scale is used, and taxon groups with fewer than five species are \nexcluded from this visualisation but are available in the supplementary material (Figure S7). \nMolluscs \nWhile many molluscs pass via the routine MagAttract protocol [20,24], the Nanobind [36] \nmethod is used as the second option for those that fail (Figure 1). An example of this is \nColus gracilis (Gastropoda; Graceful whelk), which failed consistently for DNA quality when \nextracted using the MagAttract Protocol [24] (Figure 16). 
However, when processed using \nthe Nanobind protocol [36], DNA with a high molecular weight peak and a profile suitable for \nsequencing was obtained. The Nanobind extraction also increased the overall yield tenfold. \nThis result may be confounded by a difference in the tissue preparation for these protocols, as \nthe MagAttract samples were disrupted in the cryoPREP (Covaris, Woburn, MA) whilst the \nNanobind samples were finely diced with a scalpel as per the protocol. \n \nFigure 16. FemtoPulse profiles of the mollusc Colus gracilis (Gastropoda) DNA \nextracts following different protocols \nThe samples extracted using the automated MagAttract protocol (Yellow and Red) yielded \nonly LMW DNA and are not suitable for progression. The results from the Nanobind protocol \n(Black) show a significant improvement, both in the abundance of HMW DNA and in the \nabsence of LMW DNA. After fragmentation and clean up of the Nanobind extracted DNA, the \nresulting peak fragment size of 18 kb (Blue) was ideal for progression to library prep. \nCnidaria \nCnidaria samples have also proved challenging, with corals causing difficulties during \nsample homogenisation, and jellyfish yielding low quality and quantity DNA following routine \nDNA extraction. One of the main challenges of extracting DNA from corals has been \ndisrupting hard stony corals into a fine powder that facilitates extraction. 
The deployment of \nthe FastPrep-96™ has enabled faster and more complete disruption of coral tissue via a \nscalable method using 4 ml polycarbonate vials and a single 6 mm zirconium oxide grinding \nbead, and otherwise following the plant bead beating protocol [19]. This approach to sample \ndisruption has improved the MagAttract extraction success of hard corals, resulting in an \nincrease in samples that yielded DNA suitable for LI or ULI sequencing. The bead-beaten \ncoral samples have also been successfully used for Hi-C cross linking and subsequent \nlibrary preparation, completing the data set required for reference level genome assembly. \nFor salps and jellyfish, a new extraction protocol was developed, using the recommended \nlysis steps of the Omega Bio-Tek E.Z.N.A. Mollusc and Insect DNA kit (Item: D3373-00S \nfrom Omega Bio-Tek, Norcross, GA) combined with the SpeedBead-based extraction \nmethod used in the POE protocol, which generated a higher quality and quantity of DNA. \nThis protocol required inputs of 100-200 mg of fresh frozen tissue; lower input amounts and \nethanol preserved tissues could also be used, although the resulting DNA yield may be \nlower. The DNA extracted using the Modified Omega Bio-Tek E.Z.N.A. protocol [37] was \nsuitable for either ULI or LI sequencing, enabling the reference level assembly of multiple \nsalp and jellyfish genomes. \nPorifera \nInitially, processing through the routine protocols of cryoPREP [18] and MagAttract v2 [38] \nyielded a significant portion of LMW DNA within the extract. 
Research into previously used homogenisation \nmethods identified the “squeeze” method [27], which aims to maintain \nthe integrity of the sponge cells whilst removing them from their skeletons (siliceous and \ncalcareous spicules embedded in collagenous protein matrices). Samples of Eunapius \nfragilis (a freshwater demosponge) extracted with and without “squeezing” showed \nsignificantly increased yields of high molecular weight DNA in the squeezed sample. The \ncells separated via the squeeze method have also been successfully used to generate Hi-C \ndata, so all Porifera are now processed using the squeeze method. \nMacroalgae \nInitial work with macroalgal samples (Chlorophyta, Ochrophyta and Rhodophyta) began with \ntissue disruption via the cryoPREP [18] followed by the POE protocol [30], yielding DNA that \nwas of both poor quality and quantity. The samples were characterised by their tendency to \nbecome very viscous upon cell lysis, forming a gel like substance in the tube during \nextraction which significantly hindered further processing. This is due to the large \npolysaccharide content of the algae. The typical approach for macroalgae with a genome \nsize <1 Gb is therefore the ULI library prep method after the POE extraction method with a \nlower tissue input of 25 mg in order to reduce the amount of contaminants in the sample. \nConclusion \nThe Sanger Tree of Life programme has scaled reference genome assembly production and \nhas released over 2000 chromosomally-resolved genome reference assemblies as of \nFebruary 2025. We aim to further increase genome production year on year, and \nstandardisation, refinement and streamlining of laboratory processes have been fundamental \nto our continual improvement. 
The homogenisation methods, extraction protocols, and \nshearing processes discussed here are enabling genome assemblies from a great diversity \nof species. In addition to release of the data freeze used to produce the summary statistics \npresented here [4], in order to further assist others working in the field, we have also made \nthe raw data from the Tree of Life laboratory work available via a searchable online ‘Portal’ at \nlinks.tol.sanger.ac.uk/datasets/tol-lab-data. The Portal is continuously updated with the work \nunderway and thus provides access to laboratory information as soon as the work has been \ncompleted. We hope this will be useful both for examining the details behind the summaries \noffered in this paper and for exploring the protocols used on future samples. For example, \nif a researcher is working on a challenging species that is a close relative of a species that \nhas come through the Tree of Life, the Portal could be explored to understand which HMW \nDNA extraction protocols worked or did not work and thus save time in testing a variety of \napproaches. Alternatively, where a researcher has access to multiple tissue types for work, \nthe Portal may provide information as to how related species and tissue types have \nperformed in extraction and downstream sequencing, informing decision making. \nOur experience shows that building high quality reference genome assemblies is achievable \nfor the majority of species that have been collected alive and preserved using best practice \n(snap freezing in most cases), and have a suitable tissue availability-to-genome size ratio. 
\nChallenges remain in certain taxonomic areas, especially for species with large genomes \nand small body sizes. The new Ampli-Fi option from PacBio requires only 1 ng of sheared \nDNA to provide data for genomes of up to 3 Gb and may help overcome some of these \nchallenges. Best practice is to avoid amplification whenever possible, and even here, the \nrequirements for input DNA amounts are regularly decreasing, with a recent four-fold \ndecrease in the amount of DNA needed for LI PacBio. The Sanger Tree of Life core lab \nbiobanks all DNA aliquots, including those that did not meet the quality and yield required to \nprogress to sequencing at the time of their extraction. With these recent advances, we will \nnow return to biobanked DNA extracts that previously did not meet required yields, as \nsequencing these to sufficient coverage may now be achievable. \nFor species with picogram level DNA content, phi29 replicase amplification can be used on \nsingle meiofaunal organisms to generate long-insert library DNA. Picogram input Multimodal \nSequencing (PiMmS) [23] delivers a PacBio or ONT long-read compatible amplified DNA \nsample and full length cDNA from a single specimen. This has proven successful for a \nnumber of species [39], and work will continue to standardise and ramp up the use of this \ntype of method. \nFuture work will continue to focus on the species that fail at different stages of the process, \nto develop and implement methods for the processing of smaller samples with as little \namplification as possible, and to explore the merits of different sequencing technologies. 
We \nwill continue to share our protocols and findings as soon as possible in the hope that global \nbiodiversity genomics efforts might benefit. \n \nACKNOWLEDGEMENTS \nAll authors as well as the laboratory work discussed above were funded by the Wellcome \nSanger Institute Quinquennial Review award 2021-2026 to the Wellcome Sanger Institute \n(220540/Z/20/A). In addition, the majority of genome production for species among the first \n2000 discussed here was supported by Wellcome through the Darwin Tree of Life \nDiscretionary Award (218328) and by the Gordon and Betty Moore Foundation through the \nAquatic Symbiosis Genomics Project (Grant ID: GBMF8897, \nhttps://doi.org/10.37807/GBMF8897). \nWe thank the many hundreds of people who collected and identified species on behalf of the \nDarwin Tree of Life and Aquatic Symbiosis Genomics Projects, and the many colleagues in \nthese projects who shared their best methods with us. We also thank the staff of the \nWellcome Sanger Institute Scientific Operations teams who contributed to extractions, and \nconducted library preparation and sequencing. \n \nReferences \n1. Lewin HA, Richards S, Lieberman Aiden E, Allende ML, Archibald JM, Bálint M, et al. The \nEarth BioGenome Project 2020: Starting the clock. Proc Natl Acad Sci U S A. Proceedings \nof the National Academy of Sciences; 2022;119:e2115635118. \n2. Darwin Tree of Life Project Consortium. Sequence locally, think globally: The Darwin Tree \nof Life Project. Proc Natl Acad Sci U S A. Proceedings of the National Academy of Sciences; \n2022;119:e2115642118. \n3. Victoria McKenna, John M. 
Archibald, Roxanne Beinart, Michael N. Dawson, Ute \nHentschel, Patrick J. Keeling, Jose V. Lopez, José M. Martín-Durán, Jillian M. Petersen, \nJulia D. Sigwart, Oleg Simakov, Kelly R. Sutherland, Michael Sweet, Nicholas J. Talbot, \nAnne W. Thompson, Sara Bender, Peter W. Harrison, Jeena Rajan, Guy Cochrane, Matthew \nBerriman, Mara K.N. Lawniczak, Mark Blaxter: The Aquatic Symbiosis Genomics Project: \nprobing the evolution of symbiosis across the Tree of Life [version 2; peer review: 1 \napproved, 1 approved with reservations]. https://wellcomeopenresearch.org/articles/6-254 \nAccessed 2025 Feb 21. \n4. Howard C, Denton A, Jackson B, Bates A, Jay J, Yatsenko H, et al. Supplementary data \nfor: “On the path to reference genomes for all biodiversity: lessons learned and laboratory \nprotocols created in the Sanger Tree of Life core laboratory over the first 2000 genomes.” \nZenodo. \n5. Bring structure to your research. protocols.io. https://www.protocols.io/ Accessed 2024 \nOct 2. \n6. Tree of Life at the Wellcome Sanger Institute - research workspace on. protocols.io. \nhttps://www.protocols.io/workspaces/wellcome-sanger-institute13 Accessed 2025 Mar 13. \n7. The R Project for Statistical Computing. https://www.r-project.org/ Accessed 2024 Dec \n12. \n8. Tableau: Business intelligence and analytics software. Tableau. \nhttps://www.tableau.com/en-gb Accessed 2025 Feb 24. \n9. Challis R, Kumar S, Sotero-Caio C, Brown M, Blaxter M. Genomes on a Tree (GoaT): A \nversatile, scalable search engine for genomic and sequencing project metadata across the \neukaryotic tree of life. Wellcome Open Res. F1000 Research Ltd; 2023;8:24. \n10. Strickland M, Cornwell C, Howard C. Sanger Tree of Life Fragmented DNA clean up: \nManual SPRI. protocols.io. 2023; doi: 10.17504/protocols.io.kxygx3y1dg8j/v1. \n11. Oatley G, Sampaio F, Howard C. Sanger Tree of Life Fragmented DNA clean up: \nAutomated SPRI. protocols.io. 2023; doi: 10.17504/protocols.io.q26g7p1wkgwz/v1. \n12. 
DeAngelis MM, Wang DG, Hawkins TL. Solid-phase reversible immobilization for the \nisolation of PCR products. Nucleic Acids Res. Oxford University Press (OUP); \n1995;23:4742–3. \n13. do Amaral RJV, Cornwell C, Howard C. Sanger Tree of Life RNA Extraction: Manual \nTRIzol™. protocols.io. 2023; doi: 10.17504/protocols.io.yxmvm334nl3p/v1. \n14. do Amaral RJV, Bates AAB, Denton A, Yatsenko H, Jay J, Howard C. Sanger Tree of Life \nRNA Extraction: Automated MagMax™ mirVana. protocols.io. 2023; doi: \n10.17504/protocols.io.6qpvr36n3vmk/v1. \n15. Report on sample collection and processing standards. Earth BioGenome Project. \nhttps://www.earthbiogenome.org/sample-collection-processing-standards Accessed 2024 \nOct 16. \n16. Jay J, Yatsenko H, Narváez-Gómez JP, Mbye H, Morra M, Strickland M, et al. Sanger \nTree of Life Sample Preparation: Triage and Dissection. protocols.io. 2023; doi: \n10.17504/protocols.io.x54v9prmqg3e/v1. \n17. Denton A, Oatley G, Cornwell C, Quail M, Howard C. Sanger Tree of Life Sample \nHomogenisation: PowerMash. protocols.io. 2023; doi: \n10.17504/protocols.io.5qpvo3r19v4o/v1. \n18. Narváez-Gómez JP, Mbye H, Oatley G, Strickland M, Park N, Howard C. Sanger Tree of \nLife Sample Homogenisation: Covaris cryoPREP® Automated Dry Pulverizer. protocols.io. \n2023; doi: 10.17504/protocols.io.eq2lyjp5qlx9/v2. \n19. Jackson B, Howard C. Sanger tree of life sample homogenisation: Cryogenic bead \nbeating of plants with FastPrep-96. protocols.io. 2023; doi: \n10.17504/protocols.io.rm7vzxk38gx1/v1. \n20. Strickland M, Moll R, Cornwell C, Smith M, Howard C. 
Sanger Tree of Life HMW DNA \nExtraction: Manual MagAttract. protocols.io. 2023; doi: \n10.17504/protocols.io.6qpvr33novmk/v1. \n21. Sheerin E, Sampaio F, Oatley G, Todorovic M, Strickland M, do Amaral RJV, et al. \nSanger Tree of Life HMW DNA Extraction: Automated MagAttract v.1. protocols.io. 2023; \ndoi: 10.17504/protocols.io.x54v9p2z1g3e/v1. \n22. Todorovic M, Howard C. Sanger Tree of Life HMW DNA Extraction: Manual Plant \nMagAttract v.1. protocols.io. 2023; doi: 10.17504/protocols.io.n92ldmmx9l5b/v1. \n23. Laumer C. Picogram input multimodal sequencing (PiMmS). protocols.io. 2023; doi: \n10.17504/protocols.io.rm7vzywy5lx1/v1. \n24. Oatley G, Denton A, Howard C. Sanger Tree of Life HMW DNA Extraction: Automated \nMagAttract v.2. protocols.io. 2023; doi: 10.17504/protocols.io.kxygx3y4dg8j/v1. \n25. Bates AAB, Clayton-Lucey I, Howard C. Sanger Tree of Life HMW DNA Fragmentation: \nDiagenode Megaruptor®3 for LI PacBio. protocols.io. 2023; doi: \n10.17504/protocols.io.81wgbxzq3lpk/v1. \n26. Oatley G, Sampaio F, Kitchin L, do Amaral RJV, Howard C. Sanger Tree of Life HMW \nDNA Fragmentation: Covaris g-TUBE for ULI PacBio. protocols.io. 2023; doi: \n10.17504/protocols.io.q26g7pm81gwz/v1. \n27. Lopez J. “Squeeze” enrichment of intact cells (eukaryotic and prokaryotic) from marine \nsponge tissues prior to rou. protocols.io. 2022; \n28. Denton A, Thomas A, Howard C. Sanger Tree of Life HMW DNA extraction: Automated \nMagAttract for small arthropods v1. protocols.io. \n29. Jackson B, Howard C. Sanger Tree of Life HMW DNA Extraction: Automated Plant \nMagAttract v.4. protocols.io. 2023; doi: 10.17504/protocols.io.8epv5xrd5g1b/v1. \n30. Jackson B, Howard C. Sanger Tree of Life HMW DNA Extraction: Plant Organic HMW \ngDNA Extraction (POE). protocols.io. 2023; doi: 10.17504/protocols.io.3byl4qq4zvo5/v1. 
\n31. Jackson B, Howard C. Sanger Tree of Life HMW DNA Extraction: Hypertonic Washing of \nPlant Tissue Homogenates. protocols.io. 2024; doi: 10.17504/protocols.io.yxmvm9n36l3p/v1. \n32. Denton A, Oatley G, Biosciences P, Howard C. Sanger Tree of Life HMW DNA \nExtraction: Manual Nucleated Blood Nanobind®. protocols.io. 2023; doi: \n10.17504/protocols.io.5jyl8p2w8g2w/v1. \n33. Biosciences P, Bates A, Denton A, Howard C. Sanger Tree of life HMW DNA extraction: \nAutomated nucleated blood Nanobind® v1. protocols.io. \n34. LaJeunesse TC, Lambert G, Andersen RA, Coffroth MA, Galbraith DW. SYMBIODINIUM \n(PYRRHOPHYTA) GENOME SIZES (DNA CONTENT) ARE SMALLEST AMONG \nDINOFLAGELLATES. J Phycol. Wiley; 2005;41:880–6. \n35. Jackson B, Howard C. Sanger Tree of Life HMW DNA Extraction: Manual Plant \nMagAttract v.4. protocols.io. 2023; doi: 10.17504/protocols.io.261ged5k7v47/v1. \n36. Biosciences P, Bates AAB, Howard C. Sanger Tree of Life HMW DNA Extraction: Manual \nMollusc Nanobind®. protocols.io. 2023; doi: 10.17504/protocols.io.14egn36nyl5d/v1. \n37. Denton A, Howard C. Sanger Tree of Life HMW DNA extraction: Modified Omega \nBio-Tek E.Z.N.A.® v1. protocols.io. \n38. Todorovic M, Oatley G, Howard C. Sanger Tree of Life HMW DNA Extraction: Automated \nPlant MagAttract v.2. protocols.io. 2023; doi: 10.17504/protocols.io.36wgq3n13lk5/v1. \n39. Stevens L, Martínez-Ugalde I, King E, Wagah M, Absolon D, Bancroft R, et al. Ancient \ndiversity in host-parasite interaction genes in a model parasitic nematode. Nat Commun. 
\nNature Publishing Group; 2023;14:7776.","source_license":"CC-BY-4.0","license_restricted":false}