On the path to reference genomes for all biodiversity: lessons learned and laboratory protocols created in the Sanger Tree of Life core laboratory over the first 2000 species


Keywords

reference genome, HMW DNA, extraction, sequencing, biodiversity, plant, arthropod, fungi, chordate, protist, long read, Hi-C, protocol

Abstract

Since its inception in 2019, the Tree of Life programme at the Wellcome Sanger Institute has released high-quality, chromosomally-resolved reference genome assemblies for over 2000 species. Tree of Life has at its core multiple teams, each of which is responsible for key components of the ‘genome engine’. One of these teams is the Tree of Life core laboratory, which is responsible for processing tissues across a wide range of species into high quality, high molecular weight DNA and intact RNA, and preparing tissues for Hi-C. Here, we detail the different workflows we have developed to successfully process a wide variety of species, covering plants, fungi, chordates, protists, arthropods, meiofauna and other metazoa. We summarise our success rates and describe how to best apply and combine the suite of current protocols, which are all publicly available at protocols.io.

bioRxiv preprint doi: https://doi.org/10.1101/2025.04.11.648334; this version posted April 11, 2025. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Background

In recent years, advances in long read sequencing technologies have enabled genome assembly to an unprecedented quality and quantity. These advances underpin the goal of the Earth BioGenome Project (EBP), which is to create high quality reference genomes for all described eukaryotic species [1]. This ambitious project faces many challenges from collecting and identifying species at scale, to extracting sufficiently high quality and quantity of DNA and RNA from a wide range of taxa, to sequencing, assembling and annotating extraordinarily diverse genomes. It is this central DNA extraction challenge that we address here, alongside sharing all protocols that enable our work. The EBP goal will only be met through open and rapid sharing of key protocols and pipelines. The Tree of Life (ToL) programme at the Wellcome Sanger Institute is a major contributor to EBP goals. Over the past five years, we have extracted DNA and RNA from 41 phyla representing 4883 species under projects such as Darwin Tree of Life [2] and Aquatic Symbiosis Genomics [3]. We have released dozens of protocols at the Sanger Tree of Life Workspace on protocols.io to assist others in their efforts to carry out the laboratory work necessary to generate high quality reference genomes. These protocols for tissue preparation, high molecular weight (HMW) DNA extraction, fragmentation and clean-up, and RNA extraction have been applied at scale with standardised quality control (QC) measurements at key stages. Here we share both the routine processes that we employ as a first pass for organisms from a variety of different taxonomic groups as well as the approaches we take when we encounter failures. We also share things that we have learned along the way regarding specific challenges presented by different taxonomic groups, sample types, and species. 
The work presented here provides a summary of our first five years of work, with a frozen data set [4] used to enable us to provide success rates and review progress. Work in all of these areas is also currently ongoing, with new species being processed daily and new protocols developed to improve output and efficiency.

Developing a standardised pipeline for the processing of diverse biological samples for reference genome assembly

The path from specimen to genome assembly requires optimal execution of complex wet lab, sequencing and informatic processes. In ToL, we use long read genomic sequencing and Hi-C chromatin conformation sequencing for reference genome assembly, and produce transcriptomic data through short read RNA-seq for primary annotation of completed genomes. Focussing on the wet-laboratory work, we have standardised the processes to allow progress at pace. In general, the laboratory steps required for generation of high-quality reference genomes are:

1. Sample Preparation: This step involves assessment of extremely diverse samples varying in size, density, morphology, and chemistry. Tissue typically will undergo some kind of homogenisation and be aliquoted for different pipelines (RNA extraction, DNA extraction, and Hi-C).
2. HMW DNA extraction: The protocols associated with this step comprise the most diverse set of protocols depending on the target taxon and the nature of the available tissue.
3. HMW DNA Fragmentation: Our primary long read sequencing approach over the past five years has been PacBio HiFi, and this requires molecules in the 12-22 kb range, which is shorter than typical HMW DNA extractions.
4. Fragmented DNA clean up: After shearing, it is important to perform a clean up to remove low molecular weight (LMW) DNA as well as compounds and inhibitors that may have co-extracted with or bound to DNA to achieve good sequencing results.
5. Hi-C: Samples are crosslinked to preserve the 3D structure of the genome, digested with restriction enzymes, biotin-labeled, and proximity-ligated before short read sequencing.
6. RNA extraction: RNA of sufficient quantity and quality for genome annotation is extracted and sequenced.

We have adapted and further developed protocols for the steps above from a wide range of primary sources. These protocols have been written in a modular way so they can each be used in conjunction with one another, depending on the taxonomy, tissue type and mass of the sample. They have all been published on protocols.io [5] in the Sanger Tree of Life Workspace [6], where we will continue to publish new protocols as we develop and deploy them. We encourage people who modify or improve these protocols to “fork” them on protocols.io and make them publicly available to the wider biodiversity genomics community. The datasets that have been produced during this work are provided as supplementary material [4], with one file containing data pertaining to the DNA extraction results, one to the DNA fragmentation results, and another one to the RNA extraction results, as well as a data dictionary to facilitate interpretation. These files provide a more detailed view into the performance of various protocols on a wide range of species (e.g. there are nearly 5000 species in the DNA extraction results). As work continues, access to the ever growing data set has been made available via a searchable online ‘Portal’ at links.tol.sanger.ac.uk/datasets/tol-lab-data. All statistical analyses presented were performed using the statistical programming language R (Version 4.4.1 [7]), with data visualisations supported by Tableau [8] software (Version 2024.2.1).

The typical sample follows a four step path for HMW DNA extractions and processing (Figure 1), with samples branching off for Hi-C and RNAseq. First, a sample is examined and weighed and the taxonomy is noted using the information provided by the collector. Based on these features it is then directed into one of the three homogenisation protocols. The outcome of each of these protocols is three samples per species: two ‘tissue prep’ samples that can be directed toward any HMW DNA or RNA extraction protocol, and another sample to enter the Hi-C protocol. The processing of tissue to enter our Hi-C protocol differs depending on taxonomy (described in Table 1). Separately, HMW DNA is extracted from the prepared sample. Currently, we have ten HMW DNA extraction protocols and one pre-extraction treatment, each one optimised for different taxonomy and tissue types, detailed further below.
All of our protocols are version controlled with the version number in the document name. Retired versions remain available on protocols.io but we advise using the most recent version number for any given protocol.

Figure 1. A tube map of Tree of Life protocols for HMW DNA extraction and processing. The publicly available ToL protocols for HMW DNA extraction and processing are categorised into the four numbered steps typically required to generate long read data for a high quality genome assembly. Each box here refers to a protocol available in the protocols.io Sanger Tree of Life Workspace and also linked to the Earth BioGenome Project Workspace. The current best practice is indicated by the route taken by samples from different taxonomic groups, shown by the coloured ‘tube lines’, and decision points to mark entry into these lines are discussed in the relevant taxonomic sections of this manuscript. The group ‘Other Metazoa’ includes mostly marine non-Chordata and macroalgae, and within ‘jellies’ are jellyfish and ctenophores.

The output of each of these HMW DNA extraction protocols is a sample that can be fragmented using either of two methods, the selection of which is dependent on the quality and quantity of DNA in the sample, and the intended PacBio library type.
There are two main types of long read library preparation: the conventional Low Input HiFi library (LI PacBio) and the amplification-based Ultra Low Input HiFi library (ULI PacBio). The LI PacBio approach requires at least 500 ng of 12-22 kb fragment size DNA per Gb of genome (i.e. a 2 Gb genome would require at least 1 µg of sheared DNA). The ULI PacBio approach requires a shorter fragment size of around 10 kb to enable successful amplification, but can be successful with as little as 20 ng of sheared DNA for smaller (< 1 Gb) genomes. Given these different input quantity and molecule length requirements, once we know the yield of DNA we have achieved in the HMW DNA extraction and its initial profile prior to shearing, together with an understanding of the predicted or known genome size obtained via GoaT (Genomes on a Tree database; https://goat.genomehubs.org/) [9], we choose the appropriate shearing approach. We use g-TUBEs (Covaris, Woburn, MA) for ULI libraries and the Megaruptor (Diagenode, S.A.) for LI libraries. The output of these fragmentation methods can be submitted to either of the two clean up protocols depending on the scale of the operation (manual [10] or automated [11]), both of which are Solid Phase Reversible Immobilisation (SPRI) [12] methods. Finally, RNA extraction is carried out on a separate tissue aliquot from the same species and, where possible, the same organism. We deploy either a manual TRIzol protocol [13] or an automated MagMAX mirVana protocol (Thermo Fisher Scientific, UK) [14].

The modularity of these protocols allows for flexibility and a high throughput, while maintaining a standardised workflow. Having processed thousands of samples through these protocols, we have been able to monitor successes and failures that are both obvious (e.g. they yielded insufficient quality or quantity of DNA) as well as less obvious (e.g. where the DNA passed QC but still failed to generate sequence data). In practice, monitoring outcomes across diverse taxa has enabled us to generate reference level genomes at pace for those taxonomic groups that tend to yield good DNA and results with the protocols above, while also highlighting taxonomic groups that do not proceed well and thus require further attention and R&D protocol development.

Sample selection and data generation in an ideal world

In an ideal process, all data for a species’ genome assembly would be generated from the same individual, avoiding issues of sequence diversity among individuals. However, this is often not possible due to limited tissue availability.
Assembly algorithms work best when all long-read data for the contig assembly is produced from a single individual; we therefore always aim to obtain sufficient DNA to avoid restarting the long read processes with a new individual. Generation of Hi-C and RNAseq data from the same individual as was used for long read data generation is optimal, but for these processes, using different individuals is viable. When two or more different individuals must be used for data generation, ideally data from long read and from Hi-C should be from the same sex such that the Hi-C data represents the full complement of chromosomes present in the long read data. Any individual can be used to produce transcriptomic data, and in instances where several individuals are available it is possible to start all of these lab processes in parallel. Individuals should be selected bearing in mind their biology, e.g. the heterogametic sex and non-polyploid samples are preferred following EBP guidance [15].

When specimens are large enough for dissection, or where multiple tissue types are available for a species, different tissues can be selected for different processes. For example, in insects, we would usually generate long read data and Hi-C from the head and thorax, and only use the abdomen for RNAseq if necessary. This avoids sequencing the microbiome present in the gut, and, in parous females, any sperm or embryos present in the reproductive tract. Our general rule is to avoid tissues that might contain organisms in addition to the target species. While it is interesting to assemble the cobionts in a sample, the additional sequencing data required and the complexity of the subsequent assembly task argue against these tissues as sources for genomic DNA isolation.

The amount of long read data required for genome assembly is dependent on genome size and is typically described in terms of coverage (e.g. 25x coverage of a diploid 1 Gb genome = 25 Gb of data required to give 12.5x coverage per haplotype). These calculations increase where polyploidy is present, as 12.5x coverage per haplotype is the minimum required. For this reason, the predicted haploid genome size and ploidy for a species are retrieved from GoaT [9] to help determine the initial amount of sequencing required. For most species a directly measured genome size is not available and estimates from an average of the nearest taxonomic neighbours are used. Whilst these estimates can be inaccurate, they provide a reasonable starting point for sequencing efforts, which can be adjusted based on k-mer based genome size and ploidy estimates obtained from initial data.

Figure 2. Process flow from species samples to the full data set required for genome assembly depending on tissue availability. Ideally, one individual specimen should provide tissue for generation of all data types. For many smaller organisms this is not possible, and a second or third individual may be required (indicated as individuals 1, 2 and another). Importantly, all long-read data for the initial contig assembly must be produced from one individual. If insufficient coverage is achieved from initial sequencing, this needs to be topped up with additional data generated from DNA from the same specimen. If there is very little DNA remaining, it may be possible to make a ULI library from what remains. Otherwise, long read data generation must start afresh from a new individual. The stated coverage follows the recommendations of the Tree of Life assembly pipeline at the time of writing.
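The coverage arithmetic described in this section can be sketched as a short calculation. This is a minimal illustration, not part of the Tree of Life pipeline: the function names are hypothetical, and the linear scaling with ploidy is an assumption drawn from the stated minimum of 12.5x per haplotype. The neighbour-averaging helper only mimics the idea of estimating genome size from related taxa; it does not query GoaT.

```python
from statistics import mean

# Minimum per-haplotype coverage quoted in the text.
PER_HAPLOTYPE_COVERAGE = 12.5


def estimate_genome_size_gb(neighbour_sizes_gb):
    """Hypothetical helper: average the genome sizes (Gb) of related taxa."""
    if not neighbour_sizes_gb:
        raise ValueError("at least one neighbour genome size is required")
    return mean(neighbour_sizes_gb)


def required_long_read_data_gb(haploid_genome_size_gb, ploidy=2,
                               per_haplotype_coverage=PER_HAPLOTYPE_COVERAGE):
    """Gb of long read data needed to reach the per-haplotype coverage.

    Assumes the requirement scales linearly with ploidy.
    """
    if haploid_genome_size_gb <= 0 or ploidy < 1:
        raise ValueError("genome size must be positive and ploidy at least 1")
    return haploid_genome_size_gb * ploidy * per_haplotype_coverage


# Diploid 1 Gb genome, as in the worked example in the text.
diploid_gb = required_long_read_data_gb(1.0, ploidy=2)
# Same haploid size, tetraploid, under the linear-scaling assumption.
tetraploid_gb = required_long_read_data_gb(1.0, ploidy=4)
```

For the diploid 1 Gb case this reproduces the 25 Gb figure given above. In practice the initial estimate would be revised once k-mer based size and ploidy estimates are available.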
Sample Preparation

Most samples are provided as small pieces of tissue, cell culture pellets, or whole small organisms, snap frozen at collection in 1.9 mL FluidX (barcoded) tubes, transported and stored at -70°C, in line with EBP guidance [15]. Where tissue is abundant, such as vascular plants, larger tubes (7.6 mL) are used to collect as much tissue as possible without compromising the integrity of the tissue. Once a sample has been selected for work in the laboratory, a process is followed with the aim of normalising the biologically diverse samples as much as possible, resulting in the production of a tube containing sufficient material for the next downstream process. The ideal amount of starting material is usually 25 mg for animals, protists and fungi and 50 mg for plants. Despite the fact that many organisms weigh less than 25 mg in total, we progress these through the protocols and are often successful. All samples are weighed and divided based on their taxonomy, tissue type and size/mass of material for disruption following the sample triage protocol [16].

Tissue homogenisation is a crucial step prior to HMW DNA extraction. We have used the Powermasher (Nippi, Japan), cryoPREP (Covaris, Woburn, MA) and FastPrep-96 (MPBio, CA) at scale following the guidelines set out in Table 1. In general, smaller samples are weighed and then powermashed [17] in the extraction lysis buffer at room temperature. The benefits of this method are the ability to adapt the duration of the treatment to the requirements of the sample structure, and directing the disruption toward different parts of the sample as it disrupts, i.e. concentrating on more resistant pieces of tissue as they become apparent during the process. Importantly, there is no loss of tissue since the process occurs within the lysis buffer, and all material is immediately put into nucleic acid extraction without any tube transfer or pipetting. The drawback of this technique is its low throughput nature, with each sample requiring individual powermashing.

Homogenisation at extremely low temperatures can be achieved using a pestle and mortar, and liquid nitrogen. This approach is inherently low throughput and it can be hard to avoid cross-contamination of samples through residual tissue on instruments. The cryoPREP instrument [18] (Covaris, Woburn, MA) solves the issue of cross contamination. Samples are placed into proprietary bags (TissueTUBEs) made of material resistant to extremely low temperatures and force (Figure 3). The whole bag containing only the tissue sample (no lysis buffer) is submerged in liquid nitrogen and then placed on the machine to be smashed between metal plates. The cryoPREP can be used repeatedly on the same sample, and the strength is adjustable. Therefore, the process can be continued until the sample is reduced to a fine powder. The bag can be repeatedly submerged in liquid nitrogen between pulverisations, maintaining low temperature and preventing degradation of nucleic acids within the sample due to the action of endogenous nucleases. We note that repeated processing can become labour intensive on the cryoPREP, and when processing a large number of samples maintaining the cold temperature is a challenge. Additionally, the proprietary bags cannot be reused and add significant cost to sample prep.
Figure 3. TissueTUBE assembly for the cryoPREP. (a) Left to right: Covaris TT01 TissueTUBE, TT01/1.9 mL FluidX adapter and 1.9 mL FluidX tube. (b) Assembly with sample in place. The sample is stored within the FluidX tube and once all parts are assembled, the assembly is inverted to allow the sample to move into the TT1 bag. The TT1 Adapter shown was 3D printed within the Sanger Institute.

The FastPrep-96 bead beating approach is useful for plant tissue disruption [19]. FluidX tubes containing snap-frozen dry tissue samples (50-90 mg) are selected, and 3 x 3 mm stainless steel grinding balls are added. Up to 48 of these sample tubes are then assembled into a rack which is submerged in liquid nitrogen to cool, and then carefully lifted, allowing any excess to drain. This chilled rack of tubes is then placed on the FastPrep-96 instrument to be shaken at 1600 rpm for 30 seconds. This process, including the submersion, is repeated three times, after which all samples are reduced to a homogeneous powder. There is no need to remove the beads from the tube before starting the lysis process, and performing this in the same tube prevents loss of tissue. Two racks of tubes can be processed in parallel, enabling a throughput of 96 samples. This technique has proven extremely successful for plant tissue disruption and is showing promise for other organisms and tissue types. For a set of test species (oak, ladybird, snail, yeast, and marine fungus), tissue disruption with FastPrep-96 achieved similar results to cryoPREP and powermasher, but with the processing advantage of scale and a standardised approach.
While Figure 1 shows the ideal sample weight and homogenisation method for our HMW DNA protocols, Table 1 shows the ideal preparation for RNA and Hi-C, each split by taxonomic grouping. The ideal preparation of samples is dependent on the process for which the sample is intended, which can make standardising decisions difficult when balanced against the diverse nature of the samples. For some sample types, such as protists, the ideal method of disruption has not yet been ascertained so the current best practice is shared here. Details are provided in the taxon specific sections of this manuscript to further describe the observations that can be drawn from our work so far. For other groups such as chordates, arthropods and plants, our methods are robust and routine, but the FastPrep-96 approach may replace powermashing and the cryoPREP, and testing is underway.

Table 1. Sample Preparation Guidelines for RNA Extraction and Hi-C for each taxonomic group

Sample taxonomy | Ideal input for RNA extraction | Ideal input for Hi-C
Arthropods | 10 mg cryoPREP | whole head or up to 20 mg tissue, no homogenisation
Chordates | 10 mg cryoPREP | up to 20 mg tissue, no homogenisation
Plants | ≥10 mg bead beaten | 50 mg bead beaten
Fungi | ≥10 mg bead beaten | 50 mg bead beaten
Protists | ≥10 mg bead beaten | 20 mg bead beaten
Other metazoa and macroalgae | 10 mg cryoPREP | up to 20 mg tissue, no homogenisation

High Molecular Weight DNA Extraction

Following the sample preparation and appropriate disruption process, samples progress to HMW DNA extraction. The ideal input weight and disruption method for different sample types are shown in Figure 1.
A proportion of samples do not meet the minimum mass criteria, and it is therefore not possible to standardise the input for these samples. This does not prevent these samples from entering the process and contributes to our understanding of performance outside the ideal parameters. In order to minimise the number of different extraction protocols we use, our approach has been to first test a sample from every species using one standardised protocol. Samples that pass well through this extraction protocol will go forward to produce sequence data, and those that fail highlight the species groups that require further investigation. We use the Qiagen (Hilden, Germany) MagAttract HMW DNA extraction method [20] as our default first protocol due to the track record seen in the laboratory and the ability to automate on the KingFisher Apex [21]. For plant samples, an accompanying Plant MagAttract protocol [22] has been implemented. Because the samples received are diverse, if a first extraction fails a second attempt is made with the same protocol. This allows for the selection of a new individual, or a different tissue type, and on many occasions this results in a successful DNA extraction. This may be due to many factors including individual differences within species, or to factors relating to the sample collection and preservation. A 10 µL molecular voucher (aliquot) of every DNA extraction performed is retained for deposition to museums, and in some cases, this voucher can be used in part for “top up” when slightly more long-read coverage is required. Standard quality metrics are collected for each sample after DNA extraction.
Nucleic acid quantity is measured using the Qubit® dsDNA assay (Thermo Fisher Scientific, UK). We also assess DNA purity through spectrophotometry using a Lunatic spectrophotometer (Unchained Labs, Pleasanton, CA). We measure the ratio of absorbances at 260 nm:280 nm, which is ideally ~1.8, and 260 nm:230 nm, which is ideally between 2.0 and 2.2. Deviation from the optimum for either of these measures indicates the presence of contaminants in the extraction, for example phenols or carbohydrates, that may interfere with downstream processes. The fragment length distribution of the HMW DNA is assessed using the FemtoPulse System (Agilent Technologies, Santa Clara, CA) and their Genomic DNA 165 kb Kit. This pulsed-field capillary electrophoresis system measures concentration (through spectrophotometry) and length (based on retention time relative to standards) of the extracted DNA, and provides accurate sizing of fragments up to 165 kb. Above this size, ultra HMW fragments are visible, but the sizing is not accurate. The quality of DNA extracted from samples is highly variable, resulting in diverse FemtoPulse profiles. To assess the traces routinely and standardise decision making between users, we developed a categorisation system. We defined five profile classes: “LMW DNA”, “smear bulk <50 kb”, “smear bulk >50 kb”, “HMW band plus smear” and “HMW band”. Model profiles representing each of these categories are shown in Figure 4. Over time, as we become more familiar with certain taxonomic groups (such as Lepidoptera), some samples are processed at scale without the routine labelling of profiles [4]. For more challenging sample groups still under active R&D, different aspects can be noted using a multi-select approach - for example “HMW band” and “LMW DNA” could both be selected for one sample [4].
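The purity check on absorbance ratios can be sketched as a simple flagging function. This is an illustrative sketch only: the function name is hypothetical, and the tolerance of 0.2 around the ideal 260:280 ratio is an assumption, since the text gives only the ideal value of ~1.8. The 2.0 to 2.2 window for 260:230 is taken from the text.

```python
# Ideal values from the text; the tolerance around 1.8 is an assumed value.
IDEAL_A260_280 = 1.8
A260_280_TOLERANCE = 0.2  # assumption, not stated in the source
A260_230_RANGE = (2.0, 2.2)


def purity_flags(a260_280, a260_230,
                 tolerance=A260_280_TOLERANCE,
                 a260_230_range=A260_230_RANGE):
    """Return a list of purity warnings for one DNA extraction.

    An empty list means both ratios sit within the assumed windows.
    """
    flags = []
    if abs(a260_280 - IDEAL_A260_280) > tolerance:
        flags.append("A260/280 deviates from ~1.8")
    low, high = a260_230_range
    if not (low <= a260_230 <= high):
        flags.append("A260/230 outside 2.0-2.2")
    return flags


clean_sample = purity_flags(1.82, 2.1)        # no warnings expected
contaminated_sample = purity_flags(1.5, 1.2)  # both ratios flagged
```

A flag here only signals possible contaminants such as phenols or carbohydrates; whether a sample proceeds would still depend on yield and the FemtoPulse profile.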
Figure 4. Representative FemtoPulse molecular weight profiles of five DNA extractions, modelling the categories used within the Tree of Life. The X axis shows fragment size. Ideally most DNA would sit above 50 kb. This is demonstrated in the example traces for categories ‘HMW band’ (black), ‘HMW band plus smear’ (blue) and ‘Smear bulk >50 kb’ (red), where peaks in the trace are visible at approximately 160 kb. The profile ‘Smear bulk <50 kb’ (yellow) is common and can be progressed best when it is possible to remove the smaller fragments - ideally removing everything below 10 kb. Finally, the category ‘LMW DNA’ (green) is a failure for downstream long read sequencing.

In case of good DNA quality but low yield on first extraction, there are two options. In some cases, it may be possible to extract again from the same specimen and pool samples in order to achieve the required quality and quantity of DNA. If this is not possible, samples with 20 ng of DNA after shearing and SPRI cleanup. Samples with <100 ng of DNA may be progressed along this route if there is no option to repeat the DNA extraction, i.e. there is no tissue remaining. With the new PacBio Ampli-Fi kit, these thresholds are likely to be 1 ng per 3 Gb genome size. For species where even this quantity of DNA is not achievable, picogram-input methods like PiMmS are available [23]. The ULI option is restricted to species with a genome size of <1 Gb due to the impact of amplification bias and the resulting poor coverage of specific genomic regions that affects downstream assembly quality.

We have introduced a 0.45X SPRI step directly after extraction to remove DNA fragments <10 kb [24].
Because the FemtoPulse trace shows relative absorbance normalised to the maximal value, it can be difficult to assess the fragment distribution for samples with significant amounts of LMW DNA. The removal of shorter fragments via the SPRI cleanup enables more accurate analysis of the HMW DNA profile, demonstrated by analysis of Martes martes (pine marten) samples (Figure 5). Improving the analysis and decision making process at this point enabled a reduction in the quantity threshold for passing extractions through to fragmentation.

Figure 5. Overlaid FemtoPulse profile for DNA extractions from Martes martes heart tissue (Mammalia; pine marten). The overlaid profiles show the impact of performing a 0.45X SPRI after DNA extraction. The black trace (manual DNA extraction with no SPRI) shows very little detail due to the large LMW peak, whilst the blue trace (automated DNA extraction with a SPRI) reflects the profile of the remaining DNA with significantly more detail, allowing for informed decision making.

The overall success rates of samples within laboratory extraction processes can be broken down by taxonomic group (Figure 6). Overall we find that chordates and plants progress well, showing the highest HMW DNA extraction pass rates of 96% (91.2% Pass, 4.0% Pass ULI, 1.1% Pooling) and 91% (84.3% Pass, 5.5% Pass ULI, 0.9% Pooling), respectively. The number of species within the arthropods dwarfs the other sample groups, with 2373 species having been processed with a total pass rate of 85%. The highest HMW DNA extraction fail rate is observed in fungi, at 34.2%.
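The per-species success rates are based on the best extraction outcome for each species, using the hierarchy Pass > Pass ULI > Pooling > Fail given in the Figure 6 caption. A minimal sketch of that summarisation is shown here. The record layout, function names and placeholder species labels are all hypothetical, and the analyses in the manuscript were performed in R rather than Python.

```python
from collections import Counter, defaultdict

# Lower rank means a better outcome, following Pass > Pass ULI > Pooling > Fail.
OUTCOME_RANK = {"Pass": 0, "Pass ULI": 1, "Pooling": 2, "Fail": 3}


def best_outcome_per_species(records):
    """Keep the best outcome for each (group, species) pair.

    records: iterable of (taxonomic_group, species, outcome) tuples,
    one per extraction attempt.
    """
    best = {}
    for group, species, outcome in records:
        key = (group, species)
        if key not in best or OUTCOME_RANK[outcome] < OUTCOME_RANK[best[key]]:
            best[key] = outcome
    return best


def outcome_percentages(best):
    """Percentage of species in each outcome category, per taxonomic group."""
    counts = defaultdict(Counter)
    for (group, _species), outcome in best.items():
        counts[group][outcome] += 1
    return {
        group: {o: 100.0 * n / sum(c.values()) for o, n in c.items()}
        for group, c in counts.items()
    }


# Placeholder records for illustration only (not data from the study).
example_records = [
    ("Fungi", "species_a", "Fail"),
    ("Fungi", "species_a", "Pass ULI"),
    ("Fungi", "species_b", "Fail"),
    ("Plants", "species_c", "Pass"),
]
example_summary = outcome_percentages(best_outcome_per_species(example_records))
```

Counting by best outcome means a species with one failed and one successful extraction contributes only to the successful category, which is why these rates are per species rather than per extraction.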
Figure 6. Success rates of DNA extractions per species from six taxonomic groups. The bar chart summarises the DNA extraction success per species across the six taxonomic groups. The results are categorised as: Pass – DNA sufficient for sequencing achieved; Pass ULI - DNA sufficient for sequencing with ultra low input achieved; Pooling - two DNA extractions were pooled to meet QC threshold; Fail - extractions have failed to provide DNA of sufficient quality and/or quantity to proceed. The results represent the best DNA extraction outcome per species, determined using the hierarchy: Pass > Pass ULI > Pooling > Fail. The number to the right of each bar represents the total number of species processed within each taxonomic group, and the percentages inside the bars indicate the proportion of species in each category.

HMW DNA fragmentation

Megaruptor fragmentation

The DNA fragmentation protocol [25] using the Megaruptor 3 (Diagenode, S.A.) instrument forces extracted DNA through a single-use hydropore connected to a syringe at a controlled rate, mechanically shearing the DNA in solution. The system can be used at various speeds, producing fragments of different median length. For LI PacBio libraries the sheared DNA lengths should be in the range of 12-22 kb, with a tight peak (e.g. most DNA at 18 kb). The main challenge with Megaruptor shearing is pore blocking, where a syringe fails to pull the sample back through the hydropore after the initial pass. This occurs without warning and is largely unpredictable, although it is more common with visibly viscous DNA extractions.
One way to overcome blockage is by diluting samples and running portions of the sample through the syringe over multiple stages, or transferring the extract to a different hydropore syringe designed for viscous samples (Cat. No. E07020001). Occasionally a sample does not shear fully during this process, which is clear only after observing the fragment size distribution on the FemtoPulse. A second attempt at shearing is then made, which typically completes the shearing. However, if a minority of longer fragments persist, the samples are still progressed to library prep, where these fragments will be removed in later size selection processes.

g-TUBE fragmentation

The incorporation of DNA amplification into the ULI library prep method allows for a significantly lower input (as little as 20 ng post shearing), and a shorter fragment length of 9-11 kb. This fragment length can be routinely and reliably achieved using a g-TUBE (Covaris, Woburn, MA) with our standard protocol [26], which shears the DNA by forcing it through a narrow aperture membrane. While Megaruptor shearing can also be used to generate the smaller fragments needed for ULI, g-TUBEs are preferable because of faster processing time and more reliable output. Unlike the Megaruptor syringes, the g-TUBE only requires the use of a microcentrifuge for shearing. Occasionally a sample fails to pass through the g-TUBE and requires a second spin, but this is the extent of troubleshooting required for this method.
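The target fragment ranges for the two library types (12-22 kb for LI, 9-11 kb for ULI) lend themselves to a simple post-shearing check. The sketch below is a hedged illustration only: it assumes a single modal fragment size read from the FemtoPulse trace is used as the summary statistic, and the function names and returned labels are hypothetical rather than part of the protocols [25, 26].

```python
# Target sheared fragment ranges (kb) stated in the text
TARGET_RANGE_KB = {"LI": (12.0, 22.0), "ULI": (9.0, 11.0)}


def within_target(library_type: str, modal_size_kb: float) -> bool:
    """True if the modal fragment size falls in the target range."""
    low, high = TARGET_RANGE_KB[library_type]
    return low <= modal_size_kb <= high


def shear_check(library_type: str, modal_size_kb: float) -> str:
    """Classify a sheared sample relative to its target range.

    The labels are illustrative; the text only states that incompletely
    sheared samples receive a second shearing attempt.
    """
    low, high = TARGET_RANGE_KB[library_type]
    if modal_size_kb > high:
        return "above range"  # candidate for a second shearing attempt
    if modal_size_kb < low:
        return "below range"
    return "in range"


if __name__ == "__main__":
    print(within_target("LI", 18.0))
    print(shear_check("LI", 30.0))
    print(shear_check("ULI", 10.0))
```

In practice the whole trace, not a single number, informs the decision, so this check is best read as a first-pass filter.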
Fragmented DNA clean up

After fragmentation with either method, the DNA is cleaned and concentrated again using SPRI beads, either manually [10] or automated on the Kingfisher APEX [11]. This process both purifies the DNA and removes shorter fragments. Following this, the DNA is evaluated using Qubit fluorometry, Lunatic spectrophotometry and FemtoPulse electrophoresis. Samples meeting the criteria for LI or ULI sizes and yields progress through to PacBio library preparation. The QC results inform a decision-making process, resulting in an output of either a ‘Pass’ or ‘Fail’ post-shearing. The pass rates for shearing are relatively high, from 89% for chordates to 67% for arthropods (calculated from data in [4]). The ULI path provides a route for low yield samples, and also for those that fail because their short fragment length is not ideal for LI. Protist samples exemplify this, with a pass rate of 20% in LI fragmentation and 54% in the ULI method (calculated from data in [4]). DNA samples that fail in both shearing options would require another DNA extraction event.

Since the introduction of a SPRI after DNA extraction and before fragmentation, there has been an increase in the fragmentation pass rate. Samples that would previously have failed at this point are now removed earlier in the process, meaning that less time is spent processing samples that are not suitable for sequencing.

Long Read Sequencing

Ultimately, the success or failure of a HMW DNA extraction for the purposes of long read sequencing is judged by sequencing yield, read quality and read length.
For the PacBio platforms, raw sequencing reads from circular library molecules that contain multiple reads across both strands of the insert DNA are automatically error corrected to generate circular consensus sequencing (CCS) output reads of high per-base quality. On the Sequel IIe platform, a CCS yield over 20 Gb is considered good, while a yield of 10-20 Gb is not ideal but still potentially adequate depending on genome size. Finally, a yield below 10 Gb is considered poor and a target for improvement. The Revio platform is designed to yield three times this output, and performance is judged in line with this. With the aim of producing 12.5x coverage per haplotype, library multiplexing on the Revio is advised unless work is taking place on large genome organisms (e.g. > 2 Gb) to avoid overproduction of data. If a sample is sequencing well but has not reached the required coverage, a ‘top-up’ of data from the same genetic individual is needed. Recent work in this area has shown that the longevity of both LI and ULI libraries is greater than had been anticipated, with examples of both surviving storage at -70°C for over 9 months before performing equally well on a second run. Where no library or DNA remains, the DNA voucher (a 10 µL aliquot taken from all extractions) can be valuable for ULI prep. Small diploid genomes (<0.5 Gb) can reach 25x coverage from low cell yield (i.e. <10 Gb), but this poor performance remains a target for improvement. As the data are collected and accumulated, trends in lower CCS yield for different taxonomic groups become indicators of R&D need. The data yield required for successful assembly is based on the genome size of the target organism within a sample. However, as samples are collected from wild environments other species are often present within a sample (e.g. the microbiome, pathogens and parasites). The bioinformatics pipelines in place to process data are capable of filtering out data that derives from these “cobionts”. 
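The Sequel IIe yield bands and the coverage target described above can be expressed as a short calculation. This is a minimal sketch under stated assumptions: coverage is taken as CCS yield divided by haploid genome size, a diploid target of 25x is derived from 12.5x per haplotype, and the function names and band labels are illustrative.

```python
def classify_sequel_iie_yield(ccs_yield_gb: float) -> str:
    """Band a Sequel IIe CCS yield using the thresholds in the text."""
    if ccs_yield_gb > 20:
        return "good"
    if ccs_yield_gb >= 10:
        return "potentially adequate"  # depends on genome size
    return "poor"


def coverage(ccs_yield_gb: float, genome_size_gb: float) -> float:
    """Approximate fold coverage, assuming yield / haploid genome size."""
    return ccs_yield_gb / genome_size_gb


def top_up_needed_gb(ccs_yield_gb: float, genome_size_gb: float,
                     coverage_per_haplotype: float = 12.5,
                     ploidy: int = 2) -> float:
    """Additional yield (Gb) needed to reach the coverage target.

    Assumes the target is coverage_per_haplotype * ploidy fold of the
    haploid genome size; returns 0.0 if the target is already met.
    """
    required = genome_size_gb * coverage_per_haplotype * ploidy
    return max(0.0, required - ccs_yield_gb)


if __name__ == "__main__":
    print(classify_sequel_iie_yield(24.0))
    print(coverage(10.0, 0.4))
    print(top_up_needed_gb(18.0, 1.0))
```

A 0.4 Gb diploid genome therefore reaches roughly 25x from a 10 Gb cell, consistent with the small-genome case described in the text.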
In many cases the reads from non-target organisms are sufficient to generate cobiont genome assemblies as a by-product of the attempts to sequence the target species. However, occasionally non-target species can be present in such abundance as to prevent the sequencing of the target species or make it too costly to continue sequencing to achieve required coverage for the target species. This scenario may arise from cultured species that require the presence of other organisms to grow, for example protists that feed on bacterial species. These cases present a challenge for reference genome pipelines and purification of the sample upstream of DNA extraction is recommended.

The standardised workflow described results in a range of CCS yields that varies for each taxonomic group for both ULI and LI submissions (Figure 7). We compared data production for all libraries run as one species’ library per cell on both PacBio platforms, Sequel IIe and Revio. To present standardised results, data from multiplexed samples were not included. Comparing submission types, the average yield of runs on the Revio instrument ranged between 48 and 69 Gb for LI libraries, and from 65 to 74 Gb for ULI submissions. On the Sequel instruments, the yield ranged between 17 and 25 Gb for LI submissions, and 19 to 24 Gb for ULI submissions. Arthropod species had a fairly consistent yield regardless of instrument or library preparation technique. However, fungi had more variable yields on the Sequel IIe platform, with a low average yield of 17 Gb with the LI library prep method but a much improved average yield of 23 Gb when ULI libraries were sequenced. Our experience with sequencing fungi on the Revio is limited but in line with the Sequel IIe, with approximately three-fold higher yields for LI libraries (50 Gb). Most fungal species are now directed to the ULI library pipeline because of low DNA yields. Overall, ULI libraries show less variation in yield within each taxonomic group than LI libraries, as would be expected for PCR-amplified DNA when compared with native DNA.

Figure 7. Distribution of CCS yield per taxonomic group.
The distribution of the CCS yields across taxonomic groups is shown for each instrument type (Revio and Sequel) and library type (LI: Low Input; ULI: Ultra Low Input). Each box spans the interquartile range, with the lower and upper edges indicating the 25th and 75th percentiles, respectively. The horizontal line within each box represents the median CCS yield value per taxonomic group. Whiskers extend to the furthest data points within 1.5 times the interquartile range, while data points outside of this range are considered outliers. Data analysed included all sequencing runs of single-specimen (i.e. non-multiplexed) libraries, regardless of extraction method. The label under each subplot refers to n, the number of sequencing runs within each sub-category.

In addition to the CCS yields varying across species, they can also vary within a species. For example, in the case of the newt, Lissotriton vulgaris, which has a large genome (24 Gb), a single library was made from DNA extracted from muscle. This library was run on seven Sequel IIe cells, with CCS yields ranging from 18 to 32 Gb per cell. This variation and unpredictability in yield presents challenges for scaling production. The recent introduction of SPRQ preloading normalisation on the Revio system will, we hope, reduce this unwanted variability. To more accurately target required coverage based on predicted genome size, many libraries are now multiplexed, with 2, 4 or 8 libraries run on one Revio cell in parallel. Ideal plexing in terms of molarity, taxonomy and fragment length is not possible as each specimen varies in its final library insert size profile.
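To illustrate how a plexing level of 2, 4 or 8 might be chosen from a predicted genome size, the sketch below picks the highest plex at which each library's share of a Revio cell still meets the 12.5x per haplotype target. It rests on several assumptions that are not from the text: the cell yield is split equally between libraries, the organism is diploid, and the expected cell yield (60 Gb in the example) is a hypothetical planning value within the range of Revio LI averages reported here.

```python
from typing import Optional

# Plex levels mentioned in the text, tried from highest to lowest
PLEX_LEVELS = (8, 4, 2, 1)


def max_plex(genome_size_gb: float,
             expected_cell_yield_gb: float,
             coverage_per_haplotype: float = 12.5,
             ploidy: int = 2) -> Optional[int]:
    """Highest plex level that still meets the coverage target.

    Assumes an equal split of the cell yield between libraries.
    Returns None when even a whole cell is not expected to suffice.
    """
    required_gb = genome_size_gb * coverage_per_haplotype * ploidy
    for plex in PLEX_LEVELS:
        if expected_cell_yield_gb / plex >= required_gb:
            return plex
    return None


if __name__ == "__main__":
    # Hypothetical genome sizes (Gb) with an assumed 60 Gb cell yield
    for size in (0.25, 0.5, 1.0, 2.0, 3.0):
        print(size, max_plex(size, expected_cell_yield_gb=60.0))
```

Because real libraries do not split a cell evenly, a production scheme would need margins on top of this estimate.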
Hi-C Library prep and sequencing

A tissue aliquot to be used for Hi-C library prep is created for each species during the sample preparation process. The guidelines for the amount and disruption of each sample type are shown in Table 1. We use the Arima Genomics (Carlsbad, CA, US) Hi-C v2 kit. Three distinct fixation protocols are used depending on the taxon group. For animals, we follow the Arima high coverage kit recommendation for animal tissue, which involves fixation by 2% formaldehyde for 20 minutes in TC buffer (Arima). For plant and algal samples, we follow the Arima high coverage recommendation for mammalian cell lines, which involves nuclei isolation using the Qiagen Qproteome Cell Compartment kit followed by fixation with 2% formaldehyde for 10 minutes in 1x PBS buffer. Finally, for sponges (Porifera) that have been prepared via the “squeeze” method [27] to create a cell pellet, we carry out fixation with 2% formaldehyde for 10 minutes in 1x PBS buffer.

Hi-C is performed according to the manufacturer’s recommendations, except that the number of PCR cycles used in Illumina library amplification is directed by the DNA concentration after adapter ligation and streptavidin enrichment, as measured using the Qubit dsDNA high sensitivity kit (Thermo Fisher Scientific, UK), rather than by qPCR as in the Arima QC2 procedure. The following PCR cycle guidelines are used, based on the DNA concentration in the post streptavidin enrichment quantification: >8 ng/µL, 8 cycles; >2 ng/µL, 10 cycles; >0.5 ng/µL, 12 cycles; >0.1 ng/µL, 14 cycles; lower concentrations, 16 cycles.
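The PCR cycle guidelines above amount to a threshold lookup on the post-enrichment Qubit concentration. A minimal sketch follows; the function name is illustrative, and the treatment of values exactly on a threshold follows the strict ">" wording of the guidelines.

```python
def hic_pcr_cycles(conc_ng_per_ul: float) -> int:
    """Return the number of PCR cycles for a Hi-C library.

    The input is the DNA concentration (ng/µL) measured after
    streptavidin enrichment. Thresholds are strict, so a value exactly
    on a boundary falls into the next (higher cycle) band.
    """
    if conc_ng_per_ul < 0:
        raise ValueError("concentration cannot be negative")
    if conc_ng_per_ul > 8:
        return 8
    if conc_ng_per_ul > 2:
        return 10
    if conc_ng_per_ul > 0.5:
        return 12
    if conc_ng_per_ul > 0.1:
        return 14
    return 16


if __name__ == "__main__":
    # Hypothetical Qubit readings in ng/µL
    for reading in (9.3, 2.5, 0.7, 0.2, 0.05):
        print(reading, hic_pcr_cycles(reading))
```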
Libraries are sequenced using Illumina (San Diego, CA, US) short read technology on the NovaSeq X, as 150 bp paired-end reads on the 25B flow cell. Libraries are multiplexed with the aim of achieving 25x coverage per haplotype of the genome for each sample. Grouping together samples with genome sizes of 4 Gb can be useful for achieving desired plexing levels.

RNA extraction

For all species we also extract and sequence mRNA to provide data for gene annotation. These data would ideally be produced from several different tissue types for each species, as this is most beneficial for gene annotation, but this is often not possible due to small organism size or the restricted number of tissues collected. Originally, a manual TRIzol method [13] was applied that achieved a high success rate from an extremely wide range of samples, typically using 25 mg of tissue. Tissue prepared by either the cryoPREP or powermasher can be used as input for this method, and the resulting yields were consistently and significantly greater than the requirements for short read RNAseq (Illumina, San Diego, CA). After isolation, any DNA remaining in the samples was removed using Turbo DNase (Thermo Fisher Scientific, UK) and RNA was checked for quality and quantity using the Qubit RNA Broad Range Assay kit (Thermo Fisher Scientific, UK) and the Nanodrop. This extraction method is ideal for a small number of samples, but the ergonomic issues and use of hazardous substances are prohibitive for scaling up. We therefore switched to the mirVana (Thermo Fisher Scientific, UK) bead-based extraction protocol [14] and reduced the amount of tissue input from 25 to 15 mg.
All taxon groups score near 100% extraction success, with the exception of protists at a 92% pass rate (calculated from data in [4]). A pass means a total RNA yield over the 100 ng input requirement for our standard library prep and sequencing process, in which Poly(A) RNA-Seq libraries are constructed using the NEB Ultra II RNA Library Prep kit, following the manufacturer’s instructions, and sequenced on the Illumina NovaSeq X instrument. For samples that fail, a different individual, different tissue type(s) and/or increased input amounts can be used in order to increase the RNA yield or quality obtained. Ultimately, RNA extraction is highly dependent on the quality of the sample material provided, and many failed extractions originate from samples not preserved in the ideal way. RNA extracted from several different organisms and tissue types using this method has also been successful for long read RNA sequencing with the Kinnex (Pacific Biosciences, Menlo Park, CA) methodology.

Taxonomic specific considerations

Arthropods

Small arthropod species are often preserved as whole individuals, requiring several individuals to complete the data required for an assembly (one for long read, one for Hi-C, one for RNAseq). Larger arthropods are partitioned into different tubes, e.g. head, thorax, and abdomen each in separate tubes. Arthropods have an extraction pass rate of almost 85% across the 2374 arthropod species reported on here, representing 1575 genera, 453 families, and 52 orders (Figure 8).

Figure 8. Arthropod DNA extraction success metrics by order. The bar chart summarises the DNA extraction success per species across Arthropod orders.
The results are categorised as: Pass – DNA sufficient for sequencing achieved; Pass ULI - DNA sufficient for sequencing with PacBio Ultra Low Input achieved; Pooling - two or more DNA extractions were pooled to meet QC threshold; Fail - extractions have failed to provide DNA of sufficient quality or quantity to proceed. The results represent the best DNA extraction outcome per species, determined using the hierarchy: Pass > Pass ULI > Pooling > Fail. The number inside each bar represents the percentage of species that have passed extraction within the orders, including Pass, Pass ULI and Pooling categories. To account for the wide range in species counts, a logarithmic scale is used, and orders with fewer than five species are excluded from this visualisation but are available in the supplementary material (Figure S1).

Whilst arthropods usually perform well at extraction, they are not without challenges. One challenge is the disruption of small organisms with chitinous exoskeletons, such as Amphipoda, where 71% of 24 species have failed extraction (Figure 8). Initial work to apply the bead beating homogenisation method for these samples looks promising. Another challenge is small body size, which makes low HMW DNA yield the most common failure. If the genome size is appropriate, these samples can be successful with ULI. However, ULI does not work well on its own for small organisms with large genomes. Jumping spiders (Salticidae) are an example of such a group, with small bodies, typically 10-15 mg but ranging down to 3 mg, and large genomes of up to 10 Gb. A modified extraction protocol [28] with reduced volumes of buffers was successful in improving DNA yields per specimen.
Isopods are another example of species with typically small body size (<1 cm) and large genomes (e.g. Oniscus asellus 8.4 Gb). When tissue is restricted due to the organism size, and the DNA sequences poorly, as observed for isopods, reaching sufficient coverage is challenging. Currently, this challenge is being addressed through combining LI and ULI library types. This strategy minimises the impact of amplification biases present in the ULI data, as regions of drop out are likely to be compensated by presence in the LI data. The libraries produced from the ULI approach tend to sequence very well, as the DNA has been amplified. The abundance of DNA in relation to any inhibitors present is also changed in favour of high sequencing yields. Arthropod DNA tends to perform well in fragmentation processes as shown by the high pass rates (Figure 9).

Figure 9. Arthropod fragmentation success by order. The bar chart summarises the DNA fragmentation success per species across Arthropod orders, submitted via the LI (left) or ULI (right) submission types. Species progressed under both submission types are included in both bars. The results represent the best DNA fragmentation outcome per species and submission type, determined using the hierarchy: Pass > Fail. The number inside each bar represents the percentage of species that have passed extraction within the orders.
To account for the wide range in species counts, a logarithmic scale is used, and orders with fewer than five species are excluded from this visualisation but are available in the supplementary material (Figure S2).

Plants

Plant samples are typically fresh leaf material that has been collected in relative abundance, supported by the use of 7.6 mL tubes. Tissue availability is usually not a limiting factor for plants, other than for particular taxonomic groups such as the Bryophytes. We use the ‘Plant MagAttract’ protocol [29] routinely for DNA extraction from all plant species (Figure 1). It is efficient at extracting HMW DNA from a wide range of species to an extent adequate for long-read sequencing. Plant samples that fail to produce sequenceable HMW DNA from the Plant MagAttract extraction protocol are processed through the Plant Organic Extraction (POE) protocol [30]. Species extracted with the Plant MagAttract v.4 protocol can result in a poor DNA profile that is significantly improved when the same species is extracted with the POE protocol (Figure 10). The POE protocol is mid-throughput and requires more time and expertise in the laboratory, and for this reason it is employed only as a second attempt for recalcitrant species. Work is underway to identify, prior to extraction, which species would most benefit from proceeding directly to the POE extraction method.

Figure 10. Overlaid FemtoPulse molecular weight profiles of two DNA extractions from Thymus drucei (Lamiales; wild thyme) with two different DNA extraction protocols.
The trace from the Plant MagAttract extract from 67 mg tissue shows primarily a wide LMW peak at 5.8 kb, whereas the POE protocol extract from 65 mg tissue shows a strong peak at 110 kb with a slight smear down to 1.8 kb and a further strong peak in the ultra HMW zone (>200 kb).

We have processed 998 species covering 63 orders, 166 families and 564 genera within the Plant taxonomic group, including 927 vascular (Tracheophyta and Streptophyta) and 71 non-vascular (Bryophyta and Marchantiophyta) species. High success rates have been observed across the plant orders, with 91% of all species extracted having passed through to subsequent processes. However, many of the successfully extracted plant species have failed at later stages including fragmentation, library preparation, or sequencing. Plant species that have failed twice at any of these points have been selected and processed through the POE protocol. The switch to use of the POE protocol is clearly of benefit to some groups, such as the Saxifragales, where the pass rate changes from 56% with MagAttract to 100% with POE (Figure 11). Fragmentation of DNA extracted from plant material is usually not problematic, with a success rate of 82%, calculated using unique species regardless of the submission protocol. For species for which both protocols have been used, the LI and ULI submission protocols have had a success rate of 80% and 93% respectively (calculated from data in [4]).

Bryophytes have been challenging due to their low tissue availability, as an individual is often <15 mg, whereas the usual input for plant DNA extraction methods is 50 mg.
Modification of protocols to minimise tissue loss and maximise DNA recovery has been successful, coupled with the ULI library prep method as the genome sizes are often <1 Gb.

A number of plant species remain challenging, with neither MagAttract nor POE DNA extraction protocols providing the required DNA yield or quality for long read sequencing. A pre-lysis hypertonic sorbitol wash has been developed [31] to remove interfering chemical contaminants present within the cytosol of plant specimens prior to lysis. Sorbitol is an osmotically active sugar alcohol capable of ‘drawing out’ the cytosol of plant tissue homogenates without disrupting the nuclear membrane. When a sorbitol wash is successful, a previously recalcitrant sample’s lysate should be free of viscosity, browning and other unfavourable characteristics. Initial results have shown that this protocol can significantly improve the quality of DNA extractions, and the CCS yield.

Figure 11. Plant MagAttract and Plant Organic Extraction (POE) DNA extraction success metrics by order. The bar chart summarises the DNA extraction success per species across Plant orders, extracted via MagAttract or Plant Organic Extraction protocols. Species extracted with both protocols are included in both bars. Species extracted with protocols other than Plant MagAttract or Plant Organic Extraction are not represented.
The results are categorised as: Pass – DNA sufficient for sequencing achieved; Pass ULI - DNA sufficient for sequencing with ultra low input achieved; Pooling - two DNA extractions were pooled to meet QC threshold; Fail - extractions have failed to provide sufficient DNA to proceed. The results represent the best DNA extraction outcome per species and extraction protocol, determined using the hierarchy: Pass > Pass ULI > Pooling > Fail. The number inside each bar represents the percentage of species that have passed extraction within the orders, including Pass, Pooling and Pass ULI categories. To account for the wide range in species counts, a logarithmic scale is used, and orders with fewer than five species are excluded from this visualisation but are available in the supplementary material (Figure S3).

Fungi (including Lichens)

Most fungal samples received had been cultured from samples collected in the field. The samples arrived as cell pellets with low tissue mass and size, presenting challenges for DNA extraction. Mycelium samples can also be challenging due to the low density of nuclei in the tissue. DNA extractions for these fungi have been of consistently low yield, and often of low quality in terms of fragment length, resulting in a high ‘Fail’ rate (Figure 12). On occasions when high yields have been achieved, the sequencing of these samples has been poor. Given the small genome size of fungi (typically ~40 Mb) and the relatively low yields achieved, the ULI library prep method has been the standard option for fungi samples.
The ideal amount of DNA to start this process is 100 ng, although samples with >25 ng of DNA within the ULI fragment range are progressed. The optimised automated Plant MagAttract protocol (v.4) [29] has provided increased DNA yield and an improved FemtoPulse profile and is therefore now the protocol of choice. Post-extraction, the majority of fungi samples are directed toward the g-TUBE fragmentation method [26]; this is an efficient and effective process, resulting in a high pass rate for unique fungi species of 79% (calculated from data in [4]). The amplification in the ULI library generation process is being utilised here to aid sequencing, rather than to account for a very low DNA input amount, as native fungal DNA often produces poor sequencing yields. To optimise the amplification process for this purpose, we are currently exploring a reduction in the number of PCR cycles and trialling different enzymes.

Figure 12. Fungi DNA extraction success metrics by order. The bar chart summarises the DNA extraction success per species across Fungi orders. The results are categorised as: Pass – DNA sufficient for sequencing achieved; Pass ULI - DNA sufficient for sequencing with ultra low input achieved; Pooling - two DNA extractions were pooled to meet QC threshold; Fail - extractions have failed to provide sufficient DNA to proceed. The results represent the best DNA extraction outcome per species, determined using the hierarchy: Pass > Pass ULI > Pooling > Fail. The number inside each bar represents the percentage of species that have passed extraction in any way, not those that failed. To account for the wide range in species counts, a logarithmic scale is used, and orders with fewer than five species are excluded from this visualisation but are available in the supplementary material (Figure S4).

Chordates

The routine processing of chordates is highly efficient, resulting in a 96% pass rate for species at DNA extraction (Figure 6). The DNA extraction status for orders within the group reveals that this success is general, with no clear trends (Figure 13). Chordates were processed at an 89% pass rate (calculated from data in [4]) at DNA fragmentation, when results for both protocols (g-TUBE and Megaruptor) are combined at the species level.

The collection of chordate samples is legally and ethically challenging, and due to this, a significant number of samples from chordate species are provided from specimens that have been found dead, or as small tissues collected from live individuals. The use of preservative solutions in lieu of snap-freezing is common for chordate samples. To ensure proper fixation it is recommended that a 1:10 ratio of tissue:fixative volumes is used.
Ear punches and punch biopsies are a very useful form of non-lethal sample collection for chordate species; although these are not the most successful tissues for DNA extraction, disruption via the cryoPREP helped maximise DNA yield and quality. Fish and bird blood are amongst the best performing tissues for HMW DNA extraction and are processed using the Nanobind HMW DNA extraction - nucleated blood protocol [32]. This manual protocol requires inputs ranging from 5-25 µl of nucleated blood, flash frozen or stored in ethanol at -80°C, from birds, fish or amphibians, and yields around 10-40 µg of HMW DNA. An automated version of this protocol [33] permits high throughput extraction of nucleated blood samples.

Samples collected at necropsy often contain only degraded DNA. In these situations, it is possible to extract and then perform a stringent 0.45X SPRI clean-up to remove any remaining RNA or LMW DNA and, if the fragment size profile is already degraded, to progress directly to library preparation and sequencing, bypassing shearing.

Figure 13. Chordate DNA extraction success metrics by order. The bar chart summarises the DNA extraction success per species across Chordata orders. The results are categorised as: Pass - DNA sufficient for sequencing achieved; Pass ULI - DNA sufficient for sequencing with ultra low input achieved; Pooling - two DNA extractions were pooled to meet QC threshold; Fail - extractions have failed to provide sufficient DNA to proceed. The results represent the best DNA extraction outcome per species, determined using the hierarchy: Pass > Pass ULI > Pooling > Fail.
The number inside each bar represents the percentage of species that have passed extraction in any way, not those that failed. To account for the wide range in species counts, orders with fewer than five species are excluded from this visualisation but are available in the supplementary material (Figure S5).

Protists

This taxonomic group poses a unique challenge due to the diversity of the species it contains, from microalgae to dinoflagellates, the nature of cell walls and exoskeletons, and the relative size of both individuals and genomes [34]. Protist samples have typically been provided as cell pellets from cultured strains. Because different species require different culture conditions, these pellets vary widely in mass and in cell number per mg. This diversity makes it hard to standardise input amounts for DNA extraction. Although not yet fully optimised, the currently preferred process begins with a cell pellet of 50 mg, disrupted with the cryoPREP. DNA is extracted using the Plant MagAttract v4 extraction protocol [29,35]. The use of the cryoPREP provides increased yields compared to power mashed samples, and the adoption of Plant MagAttract v4 lysis also increases yield due to the better lysis of cell wall structures present in many protists and microalgae. These measures have resulted in an overall success rate in extraction for protists of 83% (Figure 6) with an uneven distribution between orders (Figure 14).

Figure 14. Protist DNA extraction success metrics by order. The bar chart summarises the DNA extraction success per species across Protist orders.
The results are categorised as: Pass - DNA sufficient for sequencing achieved; Pass ULI - DNA sufficient for sequencing with ultra low input achieved; Pooling - two DNA extractions were pooled to meet QC threshold; Fail - extractions have failed to provide sufficient DNA to proceed. The results represent the best DNA extraction outcome per species, determined using the hierarchy: Pass > Pass ULI > Pooling > Fail. The number inside each bar represents the percentage of species that have passed extraction in any way, not those that failed. To account for the wide range in species counts, orders with fewer than five species are excluded from this visualisation but are available in the supplementary material (Figure S6).

When species are identified by sample providers as being cultured in axenic or low bacteria conditions, and have a predicted genome size below 1 Gb, we are generally able to assemble genomes with ULI sequencing and Hi-C data. Protist samples predominantly progress toward the ULI route, achieving a success rate of 83%. In contrast, we have lower success rates (57%) with LI shearing protocols (calculated from data in [4]). Overall, our combined success rate is 74%, with future work aiming towards improvement of the ULI pipeline. The unusual genome structure of some protist species, for example ciliates, provides an extra challenge for fragmentation and size selection. Chromosomes are present in the range of 5 kb to 20 kb and these would be removed using current size selection protocols; future methods to efficiently sequence these fragments may include fractionation of DNA extracts.
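As a rough consistency check on the route-level rates quoted above, the sketch below back-calculates what share of protist species would need to follow the ULI route for rates of 83% (ULI) and 57% (LI) to pool to 74%. It assumes the combined figure is a simple species-weighted average of the two routes, which the text does not state, and the function name is hypothetical.

```python
def implied_route_share(rate_a: float, rate_b: float, combined: float) -> float:
    """Share of cases on route A needed for a weighted mean of the two
    route rates to equal the combined rate (assumes simple pooling)."""
    if rate_a == rate_b:
        raise ValueError("route rates must differ to solve for a share")
    share = (combined - rate_b) / (rate_a - rate_b)
    if not 0.0 <= share <= 1.0:
        raise ValueError("combined rate lies outside the two route rates")
    return share


# Rates quoted in the text for protists: ULI 83%, LI 57%, combined 74%.
uli_share = implied_route_share(0.83, 0.57, 0.74)
print(f"Implied ULI share under the pooling assumption: {uli_share:.0%}")
```

Under that assumption roughly two thirds of species would take the ULI route, which is in line with the statement that protist samples predominantly progress toward ULI.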
Many protists feed on bacteria, or require their presence for growth, and for this reason samples can be a mixture of protist and bacterial cells in culture. Sequencing yields for protist ULI samples may be very good, achieving over 24 Gb per Sequel IIe cell. However, up to 99% of these reads can originate from co-cultivated bacteria within the sample rather than the target protist. Working with collectors to reduce this bacterial load is therefore fundamental to the progression of protist samples. Future work in this area will focus on assessment of dual extraction protocols, aiming to extract easily lysed organisms and remove this DNA in a first pass, followed by a stronger chemical or physical cell lysis and DNA extraction for the remaining sample.

Other Metazoa and Macroalgae

The polyphyletic group of “other metazoa and macroalgae” contains a multitude of different phyla, predominantly a mix of marine and terrestrial invertebrates, but also including macroalgae. This grouping is largely based on the focus of the species collectors, and their access to species whilst sampling in marine environments. The large diversity of the species within this grouping provides many challenges and opportunities for new developments. Samples within this group are homogenised using either cryoPREP or PowerMash, based on the weight of the tissue available as described in the standard guidelines [16], and then subjected to the automated MagAttract extraction process [24]. Species are matched with ideal extraction protocols with an overall pass rate of 79% (Figure 6). This is relatively low compared with other taxonomic groups and is not spread evenly: for example, Mollusca show a high success rate of 87%, whereas Platyhelminthes only achieve 29% (Figure 15).
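The species-level pass rates quoted above and in the figures follow the best-outcome rule given in the captions (Pass > Pass ULI > Pooling > Fail). A minimal sketch of that rule follows, assuming extraction records arrive as (group, species, outcome) tuples. The record layout, function names and example rows are illustrative assumptions and are not taken from the data freeze [4].

```python
from collections import defaultdict

# Outcome hierarchy from the figure captions, best first.
HIERARCHY = ["Pass", "Pass ULI", "Pooling", "Fail"]
RANK = {outcome: i for i, outcome in enumerate(HIERARCHY)}


def best_outcome_per_species(records):
    """Map (group, species) -> best outcome across all extraction attempts."""
    best = {}
    for group, species, outcome in records:
        key = (group, species)
        if key not in best or RANK[outcome] < RANK[best[key]]:
            best[key] = outcome
    return best


def pass_rate_by_group(records):
    """Percentage of species per group whose best outcome is not 'Fail'."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for (group, _species), outcome in best_outcome_per_species(records).items():
        total[group] += 1
        if outcome != "Fail":
            passed[group] += 1
    return {group: 100.0 * passed[group] / total[group] for group in total}


# Made-up records for illustration only.
example = [
    ("Mollusca", "species A", "Fail"),
    ("Mollusca", "species A", "Pass ULI"),
    ("Mollusca", "species B", "Pass"),
    ("Platyhelminthes", "species C", "Fail"),
    ("Platyhelminthes", "species D", "Pooling"),
]
rates = pass_rate_by_group(example)
print(rates)
```

Counting each species once by its best outcome means a species that failed on a first attempt but later passed via ULI or pooling is still reported as a pass.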
The results of fragmentation processes for the orders contained in other metazoa show a pass rate of 74% (calculated from data in [4]). ULI is a useful option when fragmentation results
are poor as many of the fragmentation failures are associated with a low yield or a poor profile. ULI libraries dominate for the majority of orders, particularly Cnidaria where 54 species have been processed via LI and 108 via ULI library prep. Mollusca are the exception to this trend, with 186 species processed for LI and only 66 for ULI.

Figure 15. Other metazoa and macroalgae extraction success metrics by taxon group. The bar chart summarises the DNA extraction success per species across other metazoa and macroalgae taxon groups. The results are categorised as: Pass - DNA sufficient for sequencing achieved; Pass ULI - DNA sufficient for sequencing with ultra low input achieved; Pooling - two DNA extractions were pooled to meet QC threshold; Fail - extractions have failed to provide sufficient DNA to proceed. The results represent the best DNA extraction outcome per species, determined using the hierarchy: Pass > Pass ULI > Pooling > Fail. The number inside each bar represents the percentage of species that have passed extraction in any way, not those that failed. To account for the wide range in species counts, a logarithmic scale is used, and taxon groups with fewer than five species are excluded from this visualisation but are available in the supplementary material (Figure S7).

Molluscs

While many molluscs pass via the routine MagAttract protocol [20,24], the Nanobind [36] method
is used as the second option for those that fail (Figure 1). An example of this is Colus gracilis (Gastropoda; Graceful whelk), which failed consistently for DNA quality when extracted using the MagAttract protocol [24] (Figure 16). However, when processed using the Nanobind protocol [36], DNA with a high molecular weight peak and a profile suitable for sequencing was obtained. The Nanobind extraction also increased the overall yield tenfold. This result may be confounded by a difference in the tissue preparation for these protocols, as the MagAttract samples were disrupted in the cryoPREP (Covaris, Woburn, MA) whilst the Nanobind samples were finely diced with a scalpel as per the protocol.

Figure 16. FemtoPulse profiles of the mollusc Colus gracilis (Gastropoda) DNA extracts following different protocols. The samples extracted using the automated MagAttract protocol (Yellow and Red) yielded only LMW DNA and are not suitable for progression. The results from the Nanobind protocol (Black) show a significant improvement, both in the abundance of HMW DNA and in the absence of LMW DNA. After fragmentation and clean up of the Nanobind extracted DNA, the resulting peak fragment size of 18 kb (Blue) was ideal for progression to library prep.

Cnidaria

Cnidaria samples have also proved challenging, with corals causing difficulties during sample homogenisation, and jellyfish yielding low quality and quantity DNA following routine DNA extraction. One of the big challenges of extracting DNA from corals has been in disrupting hard stony corals into a fine powder that facilitates extraction.
The deployment of the FastPrep-96™ has enabled faster and more complete disruption of coral tissue via a scalable method using 4 ml polycarbonate vials and a single 6 mm zirconium oxide grinding bead, and otherwise following the plant bead beating protocol [19]. This approach to sample disruption has improved the MagAttract extraction success of hard corals, resulting in an increase in samples that yielded DNA suitable for LI or ULI sequencing. The bead-beaten coral samples have also been successfully used for Hi-C cross linking and subsequent library preparation, completing the data set required for reference level genome assembly.

For salps and jellyfish, a new extraction protocol was developed, using the recommended lysis steps of the Omega Bio-Tek E.Z.N.A. Mollusc and Insect DNA kit (Item: D3373-00S from Omega Bio-Tek, Norcross, GA) combined with the SpeedBead-based extraction method
used in the POE protocol, which generated a higher quality and quantity of DNA. This protocol required inputs of 100-200 mg of fresh frozen tissue; lower input amounts and ethanol preserved tissues could also be used, although the resulting DNA yield may be lower. The DNA extracted using the Modified Omega Bio-Tek E.Z.N.A. protocol [37] was suitable for either ULI or LI sequencing, enabling the reference level assembly of multiple salp and jellyfish genomes.

Porifera

Initially, processing through the routine protocols of cryoPREP [18] and MagAttract v2 [38] yielded a significant portion of LMW DNA within the extract. Research into homogenisation methods
that have been used identified the “squeeze” method [27], which aims to maintain the integrity of the sponge cells whilst removing them from their skeletons (siliceous and calcareous spicules embedded in collagenous protein matrices). Samples of Eunapius fragilis (a freshwater demosponge) extracted with and without “squeezing” showed significantly increased yields of high molecular weight DNA in the squeezed sample. The cells separated via the squeeze method have also been successfully used to generate Hi-C data, so all Porifera are now processed using the squeeze method.

Macroalgae

Initial work with macroalgal samples (Chlorophyta, Ochrophyta and Rhodophyta) began with tissue disruption via the cryoPREP [18] followed by the POE protocol [30], yielding DNA that was of both poor quality and quantity. The samples were characterised by their tendency to become very viscous upon cell lysis, forming a gel-like substance in the tube during extraction which significantly hindered further processing. This is due to the large polysaccharide content of the algae. The typical approach for macroalgae with a genome size <1 Gb is therefore the ULI library prep method after the POE extraction method, with a lower tissue input of 25 mg in order to reduce the amount of contaminants in the sample.
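The first-pass routes described in the taxon sections above can be gathered into a simple lookup. The sketch below is an illustrative summary only: the dictionary keys, field names and the `first_pass_route` helper are hypothetical, and the entries paraphrase the text rather than reproduce any laboratory decision system.

```python
# Illustrative lookup of first-pass routes, paraphrased from the text.
# Keys and field names are hypothetical; None means "not stated in this section".
FIRST_PASS_ROUTES = {
    "fungi": {
        "disruption": None,
        "extraction": "Automated Plant MagAttract v.4 [29]",
        "fallback": None,
    },
    "nucleated blood": {
        "disruption": None,
        "extraction": "Nanobind nucleated blood [32,33]",
        "fallback": None,
    },
    "protist": {
        "disruption": "cryoPREP",
        "extraction": "Plant MagAttract v4 [29,35]",
        "fallback": None,
    },
    "mollusc": {
        "disruption": "cryoPREP or PowerMash, by tissue weight [16]",
        "extraction": "MagAttract [20,24]",
        "fallback": "Mollusc Nanobind [36]",
    },
    "hard coral": {
        "disruption": "FastPrep-96 bead beating [19]",
        "extraction": "MagAttract",
        "fallback": None,
    },
    "salp or jellyfish": {
        "disruption": None,
        "extraction": "Modified Omega Bio-Tek E.Z.N.A. [37]",
        "fallback": None,
    },
    "porifera": {
        "disruption": "squeeze method [27]",
        "extraction": None,
        "fallback": None,
    },
    "macroalgae": {
        "disruption": "cryoPREP [18]",
        "extraction": "POE with 25 mg input, then ULI library prep (genome <1 Gb)",
        "fallback": None,
    },
}


def first_pass_route(sample_type: str) -> dict:
    """Return the recorded first-pass route for a sample type (case-insensitive)."""
    key = sample_type.strip().lower()
    if key not in FIRST_PASS_ROUTES:
        raise KeyError(f"no route recorded for sample type: {sample_type!r}")
    return FIRST_PASS_ROUTES[key]


print(first_pass_route("Mollusc")["fallback"])
```

A flat table like this is easy to extend as protocols change, but it deliberately leaves out the sample-by-sample judgement (tissue weight, preservation, genome size) that the text describes.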

Conclusion

The Sanger Tree of Life programme has scaled reference genome assembly production and has released over 2000 chromosomally-resolved reference genome assemblies as of February 2025. We aim to further increase genome production year on year, and standardisation, refinement and streamlining of laboratory processes have been fundamental to our continual improvements. The homogenisation methods, extraction protocols, and shearing processes discussed here are enabling genome assemblies from a great diversity of species.

In addition to releasing the data freeze used to produce the summary statistics presented here [4], and in order to further assist others working in the field, we have also made the raw data from the Tree of Life laboratory work available via a searchable online ‘Portal’ at links.tol.sanger.ac.uk/datasets/tol-lab-data. This Portal is continuously updated with the work underway and thus provides access to laboratory information as soon as the work has been completed. We hope this will be useful both for examining the details behind the summaries offered in this paper and for exploring the protocols used on future samples. For example, if a researcher is working on a challenging species that is a close relative of a species that has come through the Tree of Life, the Portal could be explored to understand which HMW DNA extraction protocols worked or did not work, and thus save time in testing a variety of approaches.
Alternatively, where a researcher has access to multiple tissue types for work, the Portal may provide information as to how related species and tissue types have performed in extraction and downstream sequencing, informing decision making.

Our experience shows that building high quality reference genome assemblies is achievable for the majority of species that have been collected alive and preserved using best practice (snap freezing in most cases), and have a suitable tissue availability-to-genome size ratio. Challenges remain in certain taxonomic areas, especially for species with large genomes and small body sizes. The new Ampli-Fi option from PacBio requires only 1 ng of sheared DNA to provide data for up to 3 Gb genome size and may help overcome some of these challenges. Best practice is to avoid amplification whenever possible, and even here, the requirements for input DNA amounts are regularly decreasing, with a recent four-fold decrease in the amount of DNA needed for LI PacBio. The Sanger Tree of Life core lab biobanks all DNA aliquots, including those that did not meet the quality and yield required to progress to sequencing at the time of their extraction. With these recent advances, we will now return to biobanked DNA extracts that previously did not meet required yields, as sequencing these to sufficient coverage may now be achievable.

For species with picogram level DNA content, phi29 replicase amplification can be used on single meiofaunal organisms to generate long-insert library DNA. Picogram input Multimodal Sequencing (PiMmS) [23] delivers a PacBio or ONT long-read compatible amplified DNA sample and full length cDNA from a single specimen. This has proven successful for a number of species [39], and work will continue to standardise and ramp up the use of this type of method.
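To give a sense of scale for the input amounts mentioned above (1 ng of DNA for a genome of up to 3 Gb, and picogram-level DNA content), the sketch below converts a DNA mass into an approximate number of haploid genome copies. It relies on the widely used conversion of about 978 Mb of double-stranded DNA per picogram; the function names are hypothetical and the result is an order-of-magnitude estimate, not a figure from the paper.

```python
MB_PER_PG = 978.0  # approx. megabases of double-stranded DNA per picogram


def genome_mass_pg(genome_size_gb: float) -> float:
    """Approximate mass in pg of one haploid genome copy."""
    return genome_size_gb * 1000.0 / MB_PER_PG


def haploid_copies(dna_mass_ng: float, genome_size_gb: float) -> float:
    """Approximate number of haploid genome copies in a DNA sample."""
    return dna_mass_ng * 1000.0 / genome_mass_pg(genome_size_gb)


copies = haploid_copies(dna_mass_ng=1.0, genome_size_gb=3.0)
print(f"One 3 Gb haploid genome: {genome_mass_pg(3.0):.2f} pg")
print(f"1 ng of DNA from a 3 Gb genome: about {copies:.0f} haploid copies")
```

On this estimate 1 ng holds only a few hundred copies of a 3 Gb genome, which helps explain why inputs at the nanogram and picogram scale depend on amplification.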
Future work will continue to focus on the species that fail at different stages of the process, to develop and implement methods for the processing of smaller samples with as little amplification as possible, and to explore the merits of different sequencing technologies. We will continue to share our protocols and findings as soon as possible in the hope that global biodiversity genomics efforts might benefit.

Acknowledgements

All authors as well as the laboratory work discussed above were funded by the Wellcome Sanger Institute Quinquennial Review award 2021-2026 to the Wellcome Sanger Institute (220540/Z/20/A). In addition, the majority of genome production for species among the first 2000 discussed here was supported by Wellcome through the Darwin Tree of Life Discretionary Award (218328) and by the Gordon and Betty Moore Foundation through the Aquatic Symbiosis Genomics Project (Grant ID: GBMF8897, https://doi.org/10.37807/GBMF8897). We thank the many hundreds of people who collected and identified species on behalf of the Darwin Tree of Life and Aquatic Symbiosis Genomics Projects, and the many colleagues in these projects who shared their best methods with us. We also thank the staff of the Wellcome Sanger Institute Scientific Operations teams who contributed to extractions, and conducted library preparation and sequencing.

References

1. Lewin HA, Richards S, Lieberman Aiden E, Allende ML, Archibald JM, Bálint M, et al. The Earth BioGenome Project 2020: Starting the clock. Proc Natl Acad Sci U S A. Proceedings of the National Academy of Sciences; 2022;119:e2115635118.
2. Darwin Tree of Life Project Consortium. Sequence locally, think globally: The Darwin Tree of Life Project. Proc Natl Acad Sci U S A. Proceedings of the National Academy of Sciences; 2022;119:e2115642118.
3. McKenna V, Archibald JM, Beinart R, Dawson MN, Hentschel U, Keeling PJ, Lopez JV, Martín-Durán JM, Petersen JM, Sigwart JD, Simakov O, Sutherland KR, Sweet M, Talbot NJ, Thompson AW, Bender S, Harrison PW, Rajan J, Cochrane G, Berriman M, Lawniczak MKN, Blaxter M. The Aquatic Symbiosis Genomics Project: probing the evolution of symbiosis across the Tree of Life [version 2; peer review: 1 approved, 1 approved with reservations]. https://wellcomeopenresearch.org/articles/6-254 Accessed 2025 Feb 21.
4. Howard C, Denton A, Jackson B, Bates A, Jay J, Yatsenko H, et al. Supplementary data for: “On the path to reference genomes for all biodiversity: lessons learned and laboratory protocols created in the Sanger Tree of Life core laboratory over the first 2000 genomes.” Zenodo.
5. Bring structure to your research. protocols.io. https://www.protocols.io/ Accessed 2024 Oct 2.
6. Tree of Life at the Wellcome Sanger Institute - research workspace on protocols.io. https://www.protocols.io/workspaces/wellcome-sanger-institute13 Accessed 2025 Mar 13.
7. The R Project for Statistical Computing. https://www.r-project.org/ Accessed 2024 Dec 12.
8. Tableau: Business intelligence and analytics software. Tableau. https://www.tableau.com/en-gb Accessed 2025 Feb 24.
9. Challis R, Kumar S, Sotero-Caio C, Brown M, Blaxter M. Genomes on a Tree (GoaT): A versatile, scalable search engine for genomic and sequencing project metadata across the eukaryotic tree of life. Wellcome Open Res. F1000 Research Ltd; 2023;8:24.
10. Strickland M, Cornwell C, Howard C. Sanger Tree of Life Fragmented DNA clean up: Manual SPRI. protocols.io. 2023; doi: 10.17504/protocols.io.kxygx3y1dg8j/v1.
11. Oatley G, Sampaio F, Howard C. Sanger Tree of Life Fragmented DNA clean up: Automated SPRI. protocols.io. 2023; doi: 10.17504/protocols.io.q26g7p1wkgwz/v1.
12. DeAngelis MM, Wang DG, Hawkins TL. Solid-phase reversible immobilization for the isolation of PCR products. Nucleic Acids Res. Oxford University Press (OUP); 1995;23:4742–3.
13. do Amaral RJV, Cornwell C, Howard C. Sanger Tree of Life RNA Extraction: Manual TRIzol™. protocols.io. 2023; doi: 10.17504/protocols.io.yxmvm334nl3p/v1.
14. do Amaral RJV, Bates AAB, Denton A, Yatsenko H, Jay J, Howard C. Sanger Tree of Life RNA Extraction: Automated MagMax™ mirVana. protocols.io. 2023; doi: 10.17504/protocols.io.6qpvr36n3vmk/v1.
15. Report on sample collection and processing standards. Earth BioGenome Project. https://www.earthbiogenome.org/sample-collection-processing-standards Accessed 2024 Oct 16.
16. Jay J, Yatsenko H, Narváez-Gómez JP, Mbye H, Morra M, Strickland M, et al. Sanger Tree of Life Sample Preparation: Triage and Dissection. protocols.io. 2023; doi: 10.17504/protocols.io.x54v9prmqg3e/v1.
17. Denton A, Oatley G, Cornwell C, Quail M, Howard C. Sanger Tree of Life Sample Homogenisation: PowerMash. protocols.io. 2023; doi: 10.17504/protocols.io.5qpvo3r19v4o/v1.
18. Narváez-Gómez JP, Mbye H, Oatley G, Strickland M, Park N, Howard C. Sanger Tree of Life Sample Homogenisation: Covaris cryoPREP® Automated Dry Pulverizer. protocols.io. 2023; doi: 10.17504/protocols.io.eq2lyjp5qlx9/v2.
19. Jackson B, Howard C. Sanger Tree of Life Sample Homogenisation: Cryogenic bead beating of plants with FastPrep-96. protocols.io. 2023; doi: 10.17504/protocols.io.rm7vzxk38gx1/v1.
20. Strickland M, Moll R, Cornwell C, Smith M, Howard C. Sanger Tree of Life HMW DNA Extraction: Manual MagAttract. protocols.io. 2023; doi: 10.17504/protocols.io.6qpvr33novmk/v1.
21. Sheerin E, Sampaio F, Oatley G, Todorovic M, Strickland M, do Amaral RJV, et al. Sanger Tree of Life HMW DNA Extraction: Automated MagAttract v.1. protocols.io. 2023; doi: 10.17504/protocols.io.x54v9p2z1g3e/v1.
22. Todorovic M, Howard C. Sanger Tree of Life HMW DNA Extraction: Manual Plant MagAttract v.1. protocols.io. 2023; doi: 10.17504/protocols.io.n92ldmmx9l5b/v1.
23. Laumer C. Picogram input multimodal sequencing (PiMmS). protocols.io. 2023; doi: 10.17504/protocols.io.rm7vzywy5lx1/v1.
24. Oatley G, Denton A, Howard C. Sanger Tree of Life HMW DNA Extraction: Automated MagAttract v.2. protocols.io. 2023; doi: 10.17504/protocols.io.kxygx3y4dg8j/v1.
25. Bates AAB, Clayton-Lucey I, Howard C. Sanger Tree of Life HMW DNA Fragmentation: Diagenode Megaruptor®3 for LI PacBio. protocols.io. 2023; doi: 10.17504/protocols.io.81wgbxzq3lpk/v1.
26. Oatley G, Sampaio F, Kitchin L, do Amaral RJV, Howard C. Sanger Tree of Life HMW DNA Fragmentation: Covaris g-TUBE for ULI PacBio. protocols.io. 2023; doi: 10.17504/protocols.io.q26g7pm81gwz/v1.
27. Lopez J. “Squeeze” enrichment of intact cells (eukaryotic and prokaryotic) from marine sponge tissues prior to rou. protocols.io. 2022.
28. Denton A, Thomas A, Howard C. Sanger Tree of Life HMW DNA Extraction: Automated MagAttract for small arthropods v1. protocols.io.
29. Jackson B, Howard C. Sanger Tree of Life HMW DNA Extraction: Automated Plant MagAttract v.4. protocols.io. 2023; doi: 10.17504/protocols.io.8epv5xrd5g1b/v1.
30. Jackson B, Howard C. Sanger Tree of Life HMW DNA Extraction: Plant Organic HMW gDNA Extraction (POE). protocols.io. 2023; doi: 10.17504/protocols.io.3byl4qq4zvo5/v1.
31. Jackson B, Howard C. Sanger Tree of Life HMW DNA Extraction: Hypertonic Washing of Plant Tissue Homogenates. protocols.io. 2024; doi: 10.17504/protocols.io.yxmvm9n36l3p/v1.
32. Denton A, Oatley G, Biosciences P, Howard C. Sanger Tree of Life HMW DNA Extraction: Manual Nucleated Blood Nanobind®. protocols.io. 2023; doi: 10.17504/protocols.io.5jyl8p2w8g2w/v1.
33. Biosciences P, Bates A, Denton A, Howard C. Sanger Tree of Life HMW DNA Extraction: Automated nucleated blood Nanobind® v1. protocols.io.
34. LaJeunesse TC, Lambert G, Andersen RA, Coffroth MA, Galbraith DW. Symbiodinium (Pyrrhophyta) genome sizes (DNA content) are smallest among dinoflagellates. J Phycol. Wiley; 2005;41:880–6.
35. Jackson B, Howard C. Sanger Tree of Life HMW DNA Extraction: Manual Plant MagAttract v.4. protocols.io. 2023; doi: 10.17504/protocols.io.261ged5k7v47/v1.
36. Biosciences P, Bates AAB, Howard C. Sanger Tree of Life HMW DNA Extraction: Manual Mollusc Nanobind®. protocols.io. 2023; doi: 10.17504/protocols.io.14egn36nyl5d/v1.
37. Denton A, Howard C. Sanger Tree of Life HMW DNA Extraction: Modified Omega Bio-Tek E.Z.N.A.® v1. protocols.io.
38. Todorovic M, Oatley G, Howard C. Sanger Tree of Life HMW DNA Extraction: Automated Plant MagAttract v.2. protocols.io. 2023; doi: 10.17504/protocols.io.36wgq3n13lk5/v1.
39. Stevens L, Martínez-Ugalde I, King E, Wagah M, Absolon D, Bancroft R, et al. Ancient diversity in host-parasite interaction genes in a model parasitic nematode. Nat Commun. Nature Publishing Group; 2023;14:7776.
