De novo assembly of plasmodium interspersed repeat (pir) genes from Plasmodium vivax RNAseq data suggests geographic conservation of sub-family transcription

doi:10.21203/rs.3.rs-5822769/v1

2025 · doi:10.21203/rs.3.rs-5822769/v1

preprint OA: closed

Research Article

Timothy S. Little, Deirdre A. Cunningham, George K. Christophides, and 2 more

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-5822769/v1

This work is licensed under a CC BY 4.0 License.

Status: Published. Journal publication published 29 May, 2025 (BMC Genomics).

Abstract

Background: The plasmodium interspersed repeats (pir) multigene family is found across malaria parasite genomes, first discovered in the human-infecting species Plasmodium vivax, where they were initially named the virs.
Their function remains unknown, although studies have suggested a role in virulence of the asexual blood stages. Sub-families of the P. vivax pir/virs have been identified and are found in isolates from across the world; however, their transcription at different localities and in different stages of the life cycle has not been quantified. Multiple transcriptomic studies of the parasite have been conducted, but many map the pir reads to existing reference genomes (as part of standard bioinformatic practice), which may miss members of the multigene family due to its inherent variability. This obscures our understanding of how the pir sub-families in P. vivax may be contributing to human/vector infection.

Results: To overcome the issue of hidden pir diversity from utilising a reference genome, we employed de novo transcriptome assembly to construct the pir ‘reference’ of different parasite isolates from published and novel RNAseq datasets. For this purpose, a pipeline was written in Nextflow, and first tested on data from the rodent-infecting P. c. chabaudi parasite to ascertain its efficacy on a sample with a full, genome-based set of pir gene sequences. The pipeline assembled hundreds of pirs from the studies included. By performing BLAST sequence identity comparisons with reference genome pirs (including P. vivax and related species) we found a clustered network of transcripts which corresponded well with prior sub-family annotations, albeit requiring some updated nomenclature. Mapping the RNAseq datasets to the de novo transcriptome references revealed that the transcription of these updated pir gene sub-families is generally consistent across the different geographical regions. From this transcriptional quantification, a time course of mosquito bloodmeals (after feeding on an infected patient) highlighted the first evidence of ookinete stage pir transcription in a human-infective malaria parasite.
Conclusions: De novo transcriptome assembly is a valuable tool for understanding highly variable multigene families from Plasmodium spp., and with pipeline software it can be applied more easily and at scale. Despite a global distribution, P. vivax has a conserved pir sub-family structure, both in terms of genome copy number and transcription. We suggest that this indicates important roles of the distinct sub-families, or a genetic mechanism maintaining their preservation. Furthermore, a burst of pir transcription in the mosquito stages of development is the first indication of ookinete pir expression for a human-infective malaria parasite, suggesting a role for the gene family at a new stage of the lifecycle.

Keywords: Malaria, Vivax, Transcriptomics, Pir, Multigene

Introduction

Plasmodium vivax is the most widely distributed species of Plasmodium that infects humans, causing recurrent malaria. Although globally the proportion of malaria cases attributable to P. vivax has been decreasing since 2000, P. vivax still accounts for approximately 50-80% of all human malaria cases in the Americas and most of Asia [1]. It has traditionally been thought that P. vivax was not a major cause of malaria in sub-Saharan Africa; however, increasing evidence of seropositivity in African nations has led some to suggest that P. vivax is more endemic in the continent than is commonly believed [2–4]. Given its prevalence around the world, P. vivax is a major focus of public health research.

A feature of the genome of all Plasmodium species is the presence of large multigene families [5]. One of the most expansive of these is the Plasmodium interspersed repeats (pirs) family, which is present across the Plasmodium lineage [6]. This includes human-infective parasites such as P. vivax, P. malariae and P. ovale, simian-infective parasites such as P. cynomolgi, and the rodent-infecting Plasmodium species [6–8].
Sometimes the gene family is named differently depending on the species, such as the virs in P. vivax or cirs in P. c. chabaudi, but they are all members of the pir gene family and are hereafter referred to only as pirs. However, pir genes are not found in species within the subgenus Laverania (such as P. falciparum). Unlike the var genes of P. falciparum, which are known to play a role in sequestration in host endothelium [9] and contribute to severe pathology [10], relatively little is known about the exact function or binding partners of pir genes. However, we have previously shown that they are associated with virulence and establishment of chronic infection in rodent-infecting species [11, 12], and others have demonstrated that the surface PIR protein is involved in infected red blood cell sequestration [13, 14].

P. ovale, P. malariae and P. yoelii possess over 1000 pir members in their genomes, although for other species the copy number can be much lower, such as the 134 pirs identified in the rodent malaria parasite P. berghei [15–18]. The P. vivax P01 strain reference genome contains 1216 members of the pir gene family [19], and a recent genome assembly of Thai-origin W1 contains as many as 1145 predicted pir genes [20], thus constituting around a sixth of the total number of predicted genes for this organism. Using sequence clustering and phylogenomics, one can further divide this multigene family into sub-families, one clade of sub-families specific to the rodent-infective species, and another specific to simian/human-infective species [16]. A prominent exception to this is a relatively highly conserved pir sequence postulated to be the ancestral gene, which is found across the genomes of the rodent-infective and simian-infective clades of Plasmodium sp. [21]. From P.
vivax genomes sequenced across different geographic regions, it has been shown that there are differences not just in the pir repertoire of these isolates, but also in the proportions of pir sub-family members around the world [22, 23]. The dedication of a large part of the P. vivax genome to a single multigene family suggests an important role, and understanding this is of high importance.

Although experimental P. chabaudi and P. berghei infections in mice provide ready access to RNA for studying the transcriptional patterns of the pir genes, human malarias, such as P. vivax, remain more challenging to investigate. Among the malaria parasite species, P. vivax genomes are particularly diverse between isolates, and this diversity is heavily concentrated within multigene families such as the pirs [24]. This presents an obstacle for transcriptomic studies, as pirs may be missed when mapping divergent isolates to a generic reference genome. One approach to obtain a more comprehensive overview of the number and variability of pir genes in P. vivax is to use de novo transcriptomic assembly on the available P. vivax RNAseq data, a method in which the sequenced reads are used for assembly of the RNA being transcribed, without need of a reference genome.

De novo transcriptome assembly has previously been performed for Plasmodium species with missing or inadequate reference genomes, including the assembly of pir gene transcripts. For example, novel pir genes were identified in a P. yoelii nigeriensis de novo transcriptomic assembly [25]. Similarly, P. ovale and a limited number of P. vivax studies [26–28] assembled novel transcripts which were annotated as pir-like. All these studies demonstrated marked differences between the pir repertoires of the reference genomes and the de novo transcriptomes, suggesting that, without the new assemblies, some pirs would have been completely missed.
Using de novo transcriptome assembly permits the identification of novel members of highly variable multigene families and assessment of their levels of expression during different life stages and environments, ultimately enabling deeper investigation into their functions.

Here, we have analysed the transcription of pirs in published datasets of P. vivax RNAseq (see Table 1), using de novo transcriptome assembly to unlock a greater repertoire of the genes than those already existing in the reference genomes. We determined whether the pre-existing pir sub-family definitions accurately describe clusters of the de novo-generated genes. Once annotated, we questioned whether distinct sub-families are differently expressed across the parasite lifecycle and whether they vary between geographical locations. We compare the pir expression patterns between this human malaria parasite and rodent malaria species, which could suggest that the pir family has a similar function in divergent species.

Results

De novo transcriptome assembly generates full-length pir genes from samples of Plasmodium chabaudi blood stages

In order to assemble the de novo transcripts of P. vivax pir genes it was first necessary to determine the most effective assembly method. For this, three different assemblers were run on published P. chabaudi chabaudi AS RNAseq reads [29]. Their outputs were then combined to identify the best transcripts using EvidentialGene [30, 31] (Figure 1), assessing the outputs for both the number of known pir genes recovered and the number of conserved Plasmodium genes found among the transcripts (Benchmarking Universal Single-Copy Orthologs, BUSCO, a score of assembly quality [32]). There is no agreed threshold for a ‘good’ BUSCO score, so we aimed to get the highest scores that the data and software could produce.
Among the individual tools, rnaSPAdes [33] performed the best; however, superior results were achieved by combining the outputs of the three assemblers (see Supplementary Figure 1). EvidentialGene was chosen as the preferred method of combining the assemblies, compared to the simple concatenation of the three assemblies and performing duplicate removal. We then checked whether there was a threshold of transcription for successful pir assembly by comparing the TPM expression of each gene to the quality of the best transcripts identified for the assemblies (see Supplementary Figure 2). This demonstrated that even a small amount of transcription is enough for a pir to be constructed by the program, although quality improves the more a gene is expressed, as expected. Only a few pir genes were transcribed but not assembled to any degree, and none of these was expressed above 100 TPM. Since EvidentialGene achieved similar quality levels to simple assembly concatenation/de-duplication, but returned a smaller total number of transcripts, we suggest that it would be the better method to use for identifying unknown pirs with higher accuracy.

De novo assembly of Plasmodium vivax pir genes

Fourteen transcriptome datasets were available from P. vivax samples, from multiple geographical sources and life-cycle stages (Table 1). The de novo transcriptomes of P. vivax covered a range of BUSCO quality scores and numbers of pir genes. Using the assemblies from P. vivax blood stages, which were generally of high BUSCO quality (>= 50%), we identified transcription of up to 400 pir genes per isolate (see Figure 2). It was notable that the number of pirs detected does not plateau, even in the highest quality assemblies, so these assemblies were not reaching the total number of pirs expressed in these isolates.
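Several of the comparisons above are expressed in TPM (transcripts per million). As a reminder of what that unit measures, the following is a minimal Python sketch of the standard TPM calculation from read counts and transcript lengths. The text does not state which quantification tool produced the reported TPM values, so the function name `tpm` and the toy counts and lengths are purely illustrative assumptions and not data from the study.

```python
def tpm(counts, lengths_bp):
    """Convert raw read counts to TPM, given transcript lengths in base pairs.

    Standard definition: reads per kilobase (RPK) for each transcript,
    rescaled so that all values in the sample sum to one million.
    """
    rpk = {t: counts[t] / (lengths_bp[t] / 1000.0) for t in counts}
    scale = sum(rpk.values()) / 1e6
    if scale == 0:
        # No reads at all: every transcript has zero expression.
        return {t: 0.0 for t in counts}
    return {t: value / scale for t, value in rpk.items()}


# Hypothetical toy values (NOT from the study): two pir transcripts and
# one highly expressed non-pir transcript.
counts = {"pir_a": 10, "pir_b": 200, "other": 5000}
lengths = {"pir_a": 1000, "pir_b": 2000, "other": 1000}
values = tpm(counts, lengths)
for name, value in values.items():
    print(f"{name}\t{value:.1f}")
```

Because TPM is normalised within each sample, a threshold such as the 100 TPM figure mentioned above is relative to everything else sequenced in that sample rather than an absolute transcript count.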
Some assemblies, particularly from RNA of liver-stage and many mosquito-stage samples, showed both poor BUSCO quality and a low total number of pir mRNA transcripts, highlighting the difficulty of obtaining enough parasite RNA for next-generation sequencing from samples containing only a small proportion of parasite material (see Figure 2A). However, there were assemblies from multiple sporozoite samples with acceptable BUSCO quality but still with no pir genes detected, suggesting that pir genes were not transcribed in sporozoites (see Figure 2B). The only mosquito stage samples which showed evidence of pir expression were the blood-meal samples of ‘MosqPeru23’, which are discussed in more detail at the end of the Results. From the asexual stages of P. vivax, hundreds of pir transcripts were successfully identified, demonstrating the feasibility of the de novo method to extract pir transcripts from the RNAseq reads themselves.

De novo-assembled pir genes can be grouped into sub-families with pirs from P. vivax reference genomes and related species

The sub-families of the Plasmodium vivax pir genes were first defined by [8], and substantially updated by [7]. These sub-families have since been found throughout the pir repertoires of the P. vivax reference genomes P01 and W1 [19, 20], as well as other P. vivax genome assemblies, and even show some overlap with the pirs of other species [16]. We sought to determine whether these sub-families describe all of the sequence diversity present in the de novo P. vivax pir transcripts. To this end a BLAST similarity network [34] was constructed using the de novo sequences, the pirs of the Sal1, P01 and W1 P. vivax reference genomes, and the pirs of the reference genomes for the closely related species P. knowlesi, P. vivax-like, P. ovale, P. malariae, P. cynomolgi, P. coatneyi, and P. brasilianum [16, 35–40]. A highly interconnected network was produced; however, clusters could still be resolved (see Figure 4).
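To make the idea of a BLAST similarity network concrete, the sketch below builds an undirected graph from BLAST tabular output (the first three columns of `-outfmt 6` are query ID, subject ID, and percent identity) and groups sequences by connected components. This is only an illustration under stated assumptions: the identity cut-off of 70%, the function names, and the toy hit lines are hypothetical, and connected components are a much cruder grouping than the dedicated graph-clustering step used in the study.

```python
def build_network(blast_lines, min_identity=70.0):
    """Build an undirected similarity graph from BLAST tabular lines.

    Only the first three tab-separated columns are read:
    query ID, subject ID, percent identity.
    The min_identity cut-off is an illustrative assumption.
    """
    graph = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject, identity = fields[0], fields[1], float(fields[2])
        # Register both sequences as nodes even if no edge is kept.
        graph.setdefault(query, set())
        graph.setdefault(subject, set())
        if query != subject and identity >= min_identity:
            graph[query].add(subject)
            graph[subject].add(query)
    return graph


def connected_components(graph):
    """Return groups of mutually reachable sequences, largest first."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            members.add(node)
            stack.extend(graph[node] - seen)
        components.append(members)
    return sorted(components, key=len, reverse=True)


# Hypothetical toy hits (NOT real sequence IDs or results).
hits = [
    "denovo_1\tref_a\t92.5\t300",
    "denovo_2\tref_a\t88.0\t280",
    "denovo_3\tref_b\t95.0\t310",
    "denovo_4\tref_b\t40.0\t100",  # below the cut-off, so no edge
]
clusters = connected_components(build_network(hits))
print([sorted(c) for c in clusters])
```

In a highly interconnected network such as the one described above, plain connected components would tend to merge most sequences into one group, which is why a flow-based or modularity-based clustering method is the more appropriate choice in practice.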
The clustering algorithm MCL [41] was used to identify sub-family clusters in the pir gene similarity network (consisting of reference genome, de novo, and non-vivax pirs), and these were compared to the previously defined sub-families from P. vivax Sal1. Overall, the existing sub-family designations aligned well with the MCL clustering (see Figure 3). However, some clusters were not ascribed to any existing sub-families (e.g., cluster 10), and other existing sub-families were split between different groups (for example, sub-family E was split between clusters 1, 3, 7, 9, 15 and 19). Hence, an updated nomenclature was required. We propose simply naming the groups in order of size, appending the Sal1 name where members of the old sub-family were part of the cluster (new names shown as part of the BLAST network in Figure 4). As an example, the largest cluster, which has many E sub-family sequences from the Sal1 genome, was termed 1_E by this method. The fourth largest cluster, which overlaps with C sequences from Sal1, was called 4_C. This 4_C grouping contains the highly conserved ancestral pir sequence, which has orthologs across all species with canonical pir members in their genomes and tends to be particularly highly transcribed. The largest cluster without any assigned Sal1 sequence (although it did contain unassigned Sal1 sequences) was cluster 10, demonstrating further how well the old definitions cover newer assemblies of pirs.

The P. vivax sub-families show overlap with the pir repertoires of other species (see Supplementary Figure 3), as previously observed [7, 42]. The smallest amount of cross-species sub-family sharing was with P. coatneyi and P. knowlesi, with only the 2_I (the largest sub-family in both P. coatneyi and P. knowlesi), 4_C (including the ancestral sequence), and 8_I families being shared between them and P. vivax (see Figure 5). More P. vivax sub-families were shared with P. malariae and P.
brasilianum; however, almost all sub-families were shared between P. vivax, P. cynomolgi, P. ovale, and P. vivax-like. The currently available P. vivax-like genome, in contrast to every other species included, does not have representatives of group 8_I. Notably, the sub-families 21_B, 17, 22, 23, and 25 were found in most of the P. vivax reference genomes but not among the de novo pirs. It is unlikely that the pipeline is failing to capture these sub-families from the expression data, as running simulated RNAseq data of the reference genome pirs through the software did not suggest that these sub-families were systematically missed (see Supplementary Figure 4). This supports the assertion that these groups were expressed at low levels (if at all) and would not be found in RNAseq data.

Using the larger repertoire of pir sequences obtained by de novo assembly we have refined the sub-family definitions for pirs from P. vivax and related species. Overall, the clusters match the previously defined reference sub-families with minor changes.

Transcription of the de novo pir genes across geographies

The new sub-family definitions were used to evaluate whether groups of P. vivax pir genes were differently transcribed across isolates from distant geographical locations. For this, the RNAseq quantification (alignment and feature counting of the original reads to the de novo assembly) was compared. Broadly speaking, the alignment and quantification of the de novo pir transcription matched the results from the original studies, such as the pattern of transcription across the asexual cycle in Zhu et al., 2016 [28] (AsexThai16 - see Supplementary Figure 6), and the relative similarity of transcription between parasites obtained from two different species of splenectomised hosts in Gunalan et al., 2019 [43] (AsexSal119 - see Supplementary Figure 7).
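The size-ordered naming scheme proposed for the sub-family clusters (rank by cluster size, then append the Sal1 sub-family letter where one is represented, giving names such as 1_E or 4_C, and a bare number such as 10 otherwise) can be sketched in a few lines of Python. The text does not say how a cluster containing several different Sal1 sub-families would be labelled; picking the most frequent letter, with alphabetical tie-breaking, is an assumption made here for illustration only. The function name and the toy sequence IDs are likewise hypothetical.

```python
from collections import Counter


def name_clusters(clusters, sal1_subfamily):
    """Name clusters by size rank, suffixed with a Sal1 sub-family letter.

    clusters: iterable of sets of sequence IDs.
    sal1_subfamily: dict mapping Sal1 sequence IDs to their old
        sub-family letter (sequences without an assignment are absent).
    Assumption: when several Sal1 letters occur in one cluster, the most
    frequent is used, with ties broken alphabetically.
    """
    ranked = sorted(clusters, key=len, reverse=True)
    names = {}
    for rank, members in enumerate(ranked, start=1):
        labels = Counter(
            sal1_subfamily[m] for m in members if m in sal1_subfamily
        )
        if labels:
            top = min(labels, key=lambda letter: (-labels[letter], letter))
            names[f"{rank}_{top}"] = members
        else:
            # No annotated Sal1 member: the cluster keeps a bare number.
            names[str(rank)] = members
    return names


# Hypothetical toy input (NOT real sequence IDs or assignments).
toy_clusters = [
    {"sal1_x", "denovo_1", "denovo_2"},
    {"denovo_3", "denovo_4"},
    {"sal1_y", "sal1_z", "denovo_5", "denovo_6"},
]
toy_sal1 = {"sal1_x": "C", "sal1_y": "E", "sal1_z": "E"}
named = name_clusters(toy_clusters, toy_sal1)
print(list(named))
```

One property of this scheme worth noting is that the numeric part of a name depends on the full set of sequences clustered, so adding new isolates could change the ranks even when the underlying groups are stable.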
In Figure 6 we show the gene expression (TPM proportions) of the pir sub-families in each isolate (restricted to samples with higher pir numbers, for the sake of visualisation), and how this relates to BUSCO score and geographical origin. Most pir sub-families were present across these samples, and the larger groups always have representatives in the de novo transcriptomes. At this level the main distinguisher between the samples was the number of pir sub-families present, as many of those with lower BUSCO scores have pir sub-family gaps. The proportion of pir numbers demonstrates a similar pattern (see Supplementary Figure 5). Overall, the largest sub-family, 1_E, tends to dominate P. vivax expression profiles, followed by 4_C (the sub-family that includes the ancestral pir gene) and 9_E. The largest sub-family not defined in the reference Sal1 strain (‘new’ sub-families), cluster 10, showed very little transcription. Other newly defined sub-families, 15 and 19, were transcribed more highly throughout the datasets, but still at a low level compared to the most highly expressed clusters.

There were few clear associations between pir sub-families and geographical locations. Hierarchical clustering separated most of the Peruvian transcriptomes (yellow rows in Figure 6) into a distinct group, with a higher proportion of the sub-families 5_G, 2_I, and 8_I. However, this was not fully consistent across the Peruvian samples, and these sub-families were highly expressed in some Cambodian samples. Indeed, a negative binomial model, considering BUSCO score, sub-family membership, geographic origin, and stages of the life cycle, demonstrated that there was no statistically significant linear relationship between Peruvian origin and any pir sub-family. The only geographical associations that arose from these statistical tests (adjusted p < 0.05) were for sub-families transcribed at low levels, and the effect size was so small that the relevance was questionable.
Hence, we concluded that transcription of sub-families was consistent across P. vivax isolates from disparate geographical regions.

De novo pir assemblies suggest pir transcription at the ookinete stage

The analysis of most mosquito samples indicated that there were few or no pir genes transcribed in this part of the parasite lifecycle. The few mosquito stages from which we assembled pir transcripts (the MosqPeru23 samples) were taken from mosquito midgut bloodmeals at different times after feeding on infected Peruvian patients. From this experiment we included three different isolates (originating from three individual patients), and these three de novo assemblies all had >50% Complete BUSCOs and included >20 predicted pirs. We investigated these samples in more detail by analysing how pir TPM changed across the time-course of the experiment. Given that the mosquito bloodmeal would initially include surviving asexual stages, the presence of pir transcripts could be explained as a holdover from these cells instead of the parasite’s specific mosquito stages (beginning with gametes). If the asexual stages were the explanation for this signal, then we would expect the gene expression to dominate earlier post-feed and then diminish over time. Instead, pir transcription was minimal during the initial time points, but dominated at 22 and 26 hours post-bloodmeal, although the signal had vanished by 7 days (Figure 7). This was compatible with the time of ookinete development of P. vivax [44], providing evidence for the presence of pir transcripts in the mosquito stages of this species. Few of these mosquito-specific pirs had 99% BLAST matches to any reference genome pirs, hence mapping to the reference genomes may have missed these signals. Overall, the TPM expression was reasonably low (<40 TPM) but the peak was consistent across each of the three separate isolates (see Figure 7 and Supplementary Figure 8).
An alternative explanation for this observed timing of pir transcription could be that the later time points simply have a higher number of total reads, with the only pirs of the overall assembly being extracted from these reads and missed from lower-read samples. The total counts of reads aligned to the de novo transcriptomes, however, were not necessarily higher in the 22/26h time points compared to others (indeed, for isolate P2 the total number of aligned reads was lower for 22h than it was at the 1h or 6h time points), so this is unlikely to be the explanation. The highest expression was concentrated within 9_E pirs, so future research could be conducted to ascertain whether this sub-family has a role in parasite-mosquito interactions.

Discussion

The natural diversity of P. vivax and its pir multigene family makes these loci particularly difficult to study. Gene expression studies, for example, are frustrated by the reference genomes for P. vivax possessing only some of the same pir sequences as wild isolates. To circumvent this problem and allow the study of pir gene expression from a diverse collection of P. vivax isolates, we employed de novo transcriptome assemblers to find the pir sequences from the RNAseq data themselves. A bespoke Nextflow pipeline [45], which employed the Trinity [46] and SPAdes assemblers [33, 47] and combined them using EvidentialGene [30, 31], was successful in generating P. c. chabaudi pir transcripts, and was therefore used in this study. Based on the reference genomes, we would expect an upper ceiling of ~1000 pirs from the assembly of each isolate, although the highest number of recoverable transcripts will be lower as some of these pirs will not have enough coverage in the read data. With our pipeline, we found up to ~400 pir transcripts expressed by any one isolate, and thousands in total across the data, covering a range of life cycle stages, geographical sources, and pir sub-families.
From sequence similarity networks of these de novo transcripts, we refined the pir sub-family definitions and demonstrated that they are generally transcribed in similar patterns in isolates worldwide. Gene expression of the pirs across the life cycle re-affirmed a burst of transcription in the mosquito stages of development previously seen in P. berghei, suggesting a role for this gene family in the vector of human-infecting parasites.

Using the BUSCO [32] score to quantify the number of Plasmodium sp.-conserved transcripts in each assembly (as a metric of overall quality), it was clear that the number of pir transcripts being produced was related strongly to the completeness of the overall de novo transcriptome. The numbers of transcribed pirs showed no evidence of plateauing even with high-quality assemblies, so it is likely that the total P. vivax expressed pir repertoire was underestimated. Hundreds of transcribed pirs may not be assembled by this method due to limitations of the data and tools, and/or because the missing pirs themselves were transcribed at too low a level to be picked up. Pirs could be missed by the hidden Markov model or be excluded by the model coverage threshold. The number of pir sub-families found to be expressed in each assembly is also dependent on the sample quality and sequencing depth, meaning that some families may be absent due to the lack of detected reads even if they are biologically still present. This presents a limitation for concluding whether sub-families are present in given life-cycle stages or geographical regions.

Comparison of the transcriptome assembly tools threw up some further surprises. Even though rnaSPAdes was developed specifically for transcriptomes, its sister implementation SPAdes (using the option developed for single-cell genome assembly) gave similar results.
A comparison between many assemblers by Holzer & Marz, 2019, observed the comparable and sometimes even superior performance of SPAdes (single-cell) across multiple metrics, suggesting that the single-cell-specific algorithm can improve transcript assembly in certain contexts. Trinity is often the default choice for bioinformaticians, including in some of the original studies included in this analysis [27, 28]; however, it was much less effective at assembling pir genes than the SPAdes tools. Trinity also produced the lowest BUSCO scores, although the decline in assembly quality was small, while the decline in pir transcript recovery was stark. Since pirs are generally transcribed at low levels, the scant reads from these transcripts may have been removed by Trinity’s initial k-mer ranking and filtering algorithm.

A future execution of this workflow could include more de novo transcriptome assemblers to potentially expand and improve upon the repertoire of pirs detected. Each assembler has its own strengths and weaknesses and can work better for certain species than for others [48–55], so one stands to gain further transcriptional insight by including more tools like Bridger [56], Trans-ABySS [57], and SOAPdenovo-trans [58], among others. EvidentialGene worked well to create a meta-assembly from the outputs of the three individual assemblers, proving the usefulness of its algorithms in such a pipeline. This software is not as fully documented as many other bioinformatic tools, and it would benefit from more expansive explanations.

Much of our understanding of putative pir functions and transcriptional kinetics comes from rodent-infecting parasites. Consistent evidence from P. berghei, across multiple stages of the life cycle and different experiments, shows some expression in the mosquito (especially at the oocyst stage) and increasing expression upon entering the mammalian host [18, 29, 59, 60].
Here we found, from one experiment (MosqPeru23), that around 22 hours after P. vivax-infected bloodmeal uptake by the mosquito there is a small but consistent surge in pir transcripts, coinciding with the conversion of zygotes to ookinetes. A signal from the 9_E sub-family was notably strong at this time-point, suggesting that its members have a role in the mosquito. To our knowledge this is the first indication that ookinetes of a human-infective Plasmodium sp. express pir genes, following on from the first mosquito-stage pir transcription identified in the P. berghei malaria cell atlas [60]. Together these data suggest that some pir genes may have a role in the mosquito vector.

We were not able to gain insight into the pir transcriptomes of certain life-cycle stages, such as liver stages, due to a lack of appropriate data. Gural et al., 2018 [61], had to contend with a massive abundance of host mRNA compared to parasite mRNA, and so they employed hybrid capture sequences to enrich for the P. vivax transcripts. The capture sequences were based on the P01 reference genome, so it is unlikely that the sequencing results would have contained novel pir sequences for extraction. For the murine-infective parasite P. berghei, the malaria cell atlas [60] has shown trophozoite-like pir transcription in the liver stages, and [59] showed that PIR proteins are expressed in the late liver stages, demonstrating that the genes may also play a role in the exo-erythrocytic stages of other malaria parasites and represent an important avenue of future work.

Pir genes of both P. berghei and P. chabaudi demonstrate especially high transcription during the asexual blood stages; however, P. berghei shows greater sexual dimorphism of pir transcription (with male gametocytes enriched for pir expression) [62]. P.
chabaudi infections in mice particularly demonstrate the importance of the gene family in the rodent asexual stages, providing evidence of an association between pir transcription and the virulence of infection or chronic recrudescence [11, 12].

We can make comparisons between rodent pir transcription and the P. vivax life-cycle stages that gave good quality assemblies. The data of [28] (AsexThai16), which sampled across the intra-erythrocytic development cycle (IDC) of P. vivax, suggest that although pir transcription is unimodal over the cycle, the timing of this peak within the asexual cycle is different for P. vivax and P. c. chabaudi [29], a result observable in the original paper and verified here (Supplementary Figure 6). For P. c. chabaudi the peak of pir transcription occurs around the time of schizont differentiation, while for P. vivax the peak is around schizont bursting and ring-stage formation. Microarray comparisons between multiple Plasmodium sp. using 1-1 orthologs found that the genes with the most variation in timing across the genus were enriched for those transcribed primarily during the early ring and early schizont stages, so it may be common for IDC genes to show altered patterns of expression [63]. The interpretation of the P. vivax IDC data is complicated by the fact that they were generated from in vitro culture, which is especially problematic at the 48h time-points when toxicity of the culture could be causing artefacts (P. vivax cannot be cultured long-term [64–66]). Investigations from natural infections are impeded by the asynchronous nature of the P. vivax IDC. Single-cell RNAseq provides the best opportunity to gain access to the individual stages of the IDC. Sequencing of sexual stages would also be of interest, since the mouse-infective species P. berghei and P. c. chabaudi show contrasting patterns of pir transcription between sexes; both show enriched expression in males, although only P. c.
chabaudi has a female gametocyte-specific subset too [62]. Although transcription of pirs is conserved across malaria species, the exact timing may differ between simian- and mouse-infective clades. We used a BLAST similarity network to define P. vivax sub-families, showing that the groupings annotated before high-throughput sequencing still broadly describe the community structure of gene sequences found in isolates obtained worldwide. This was already suggested by existing reference genome assemblies, which originated from different countries but all had overlapping sequences. Nonetheless, our similarity network suggested that some definitions needed updating, such as the largest sub-family E, which is better described as multiple sub-families. Clusters on the network (corresponding to sub-families) of note include the largest sub-family 1_E, which tends to dominate P. vivax expression profiles, and 4_C, the sub-family that includes the ancestral pir gene for each isolate [21]. It is curious that the ancestral pir is part of a greater sub-family instead of on a lone lineage, as is observed in phylogenies of the pirs of the murine-infective Plasmodium spp. Similar to other species, the ancestral pir is often highly transcribed; however, many de novo assemblies did not include the sequence. This could be because the assembly software did not ‘find’ this transcript among the RNAseq reads. Only two of the P. c. chabaudi de novo assemblies did not give rise to >95% identity BLAST matches of the ancestral pir, and these were both from the outputs of individual assembler tools (all the combined assemblies assembled accurate full-length ancestral transcripts). Transcription of the ancestral pir may be relatively lower in P. vivax compared to other species, including P. c. chabaudi [29]. If the ancestral pir is downregulated in P. 
vivax compared to the rodent malaria parasites, this could be due to functional redundancy from the other similar members of the 4_C vivax pir sub-family. Other pre-defined sub-families showed persistently low transcription/presence in the assemblies, including cluster 11 (K), cluster 12 (J), and cluster 21 (B). These pirs are likely unexpressed and may only be found through genome sequencing. Since this study used RNAseq data, we tested whether the distribution of pir sub-families across geographical localities held true for pirs that are actually transcribed, and indeed it did. Even when statistical tests were conducted to test for differences in pir sub-family transcription across the world, only small associations of questionable relevance were found. Evidence suggests that, with a few exceptions, the genotypes of wild P. vivax isolates tend to cluster by their geographic source [67]. A particularly recent founder effect can be found in the American populations of P. vivax, which likely derive from European colonization [67, 68]. There does not appear to be any reflection of these geographically restricted clades in the data presented here. If we assume that the spread of P. vivax strains through human movement has not overwritten regional distinctions, this result suggests that the sub-families of the pirs are relatively unchanged since the last common ancestor of P. vivax, and hence either they have a purpose that the parasite needs to preserve, or not enough time has passed for them to significantly diverge. Estimates of the timing of the most recent common ancestor of P. vivax range from around 50,000 to 300,000 years ago [24, 69–72], a relatively short evolutionary time. However, the genetic variation of multigene families like the pirs should lead to accelerated change and loss/gain of sub-family copy numbers or transcriptional rates. 
Expression differences of individual pir genes between samples from different continents have been observed before when aligning to the P01 genome pirs [73]; however, we suggest that this does not extend to the overall sub-families themselves. New P. vivax genomes/transcriptomes from different regions of the world continue to be sequenced, offering opportunities to incorporate them into the pipeline and challenge this conclusion [73–75]. Some isolates of interest may be found from patients on the China-Myanmar border in the upper Mekong, which have previously been shown to have a particularly large C sub-family [76]. Only a single Papua New Guinean (PNG) sample was available for this study; however, it showed a distinct profile and was an outlier in the relationship between BUSCO score and pir number (having a high abundance of pirs despite a relatively incomplete assembly). Analysis of orthologue diversity across P. vivax isolates from multiple countries demonstrated that PNG parasites were particularly diverse compared to those from other regions [67], so it is plausible that this variation is reflected in multigene families too. Deep RNA-sequencing and de novo assembly of PNG-sourced P. vivax transcriptomes could shed light on whether the pirs of these pathogens are unique to the country. De novo transcriptomics can be employed across diverse P. vivax RNAseq datasets in order to observe the transcription of pir sequences missing from the reference genomes. This method unveiled a refined sub-family structure and demonstrated that these sub-families are expressed in dispersed parasite isolates around the world, suggesting that they have functions that lead to their retention. Although pirs were previously thought to have no role in mosquito stages, evidence now exists from both P. berghei and P. vivax that there is indeed pir gene expression in the vector, opening new opportunities to understand this gene family. 
Materials and Methods

Experimental information from Carlos et al., unpublished (MosqPeru23)
Female Anopheles darlingi mosquitoes were fed on three patients (P2, P5, and P6) from Iquitos, Peru, with active P. vivax infections as determined from blood smears. At each timepoint post-bloodmeal and for each patient isolate, 30–45 midguts were dissected and the RNA processed for Illumina NextSeq 500 sequencing.

Plasmodium spp. data download
P. vivax datasets were downloaded using the FetchNGS pipeline v1.5 available from nf-core using Nextflow v22.10.3 and Singularity v3.6.4. Default settings were used except for ‘--force_sratools_download’ to use SRA-toolkit for the download, and an additional custom config file ( https://github.com/timslittle/Thesis_Github_221023/blob/main/custom.config ) supplied via ‘-c’ to permit the SRA-toolkit function ‘prefetch’ to use larger files than default. The configuration profile used for all Nextflow pipelines was ‘Singularity’ and a locally maintained Crick profile ( https://github.com/nf-core/configs/blob/master/docs/crick.md ). The exceptions to this download route were the Carlos et al. (manuscript in preparation) samples and the P. c. chabaudi test samples, which were downloaded directly from lab storage, and the [77] samples, which were downloaded from cloud storage (kindly provided by the authors). At this point files were concatenated together if they constituted the same isolate of P. vivax, to maximise the amount of information for the transcriptome assemblers but minimise the erroneous assembly of transcripts from stitching reads of different sources. Where the data source was unclear from the original publication, the samples were kept separate for assembly, such as for [78]. See Supplementary File 1 for the full list of files concatenated together and the rationale for this. 
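The isolate-level concatenation described above can be illustrated with a short Python sketch. This is not the authors' code (the pipeline performs the concatenation itself): the function names and the run-to-isolate mapping are hypothetical. It assumes gzip-compressed FASTQ input and relies on the fact that gzip files joined byte-for-byte form a valid multi-member gzip stream. Runs whose source isolate is unknown are kept separate, as in the text.

```python
import shutil
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Optional


def group_runs_by_isolate(run_to_isolate: Dict[str, Optional[str]]) -> Dict[str, List[str]]:
    """Group run accessions by isolate; runs with no known isolate stay on their own."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for run, isolate in sorted(run_to_isolate.items()):
        # Hypothetical naming: an unknown source gets its own single-run group.
        key = isolate if isolate is not None else f"unassigned_{run}"
        groups[key].append(run)
    return dict(groups)


def concatenate_fastq_gz(parts: List[Path], out_path: Path) -> None:
    """Join gzip-compressed FASTQ files byte-for-byte into one multi-member gzip file."""
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as handle:
                shutil.copyfileobj(handle, out)
```

For paired-end libraries, the R1 and R2 files would need to be concatenated in the same run order so that read pairs stay aligned between the two outputs.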
De novo transcriptome assembly pipeline with Nextflow
The pipeline for de novo assembly construction, ‘transcript_corral’ (development version 1, commit: 6b93401, https://github.com/timslittle/nf-core-transcriptcorral/tree/6b93401496098d2759e875a7611cf0eb1b4268c4 ), was written with Nextflow v22.10.3 and nf-core/tools v2.7.dev0, based on an early design and concept by Matthew McGowan. The pipeline was submitted with the parameters: ‘-profile singularity,crick --skip_trimming false --remove_ribo_rna true --filter_genome [/path/to/genome_to_filter.fa] --assemble_trinity true --assemble_spades_sc true --assemble_spades_rna true --use_evigene true --hmmsearch_hmmfile [/path/to/Pfam-Plasmodium_Vir.hmm] --busco_lineage 'plasmodium_odb10' --salmon_alignment true --salmon_gtf false’. In summary, the pipeline begins by concatenating files that require concatenation (see above), running Trim_Galore! v0.6.7 (which uses cutadapt v3.4) to trim adapter sequences, and running FastQC (v0.11.9) for quality analysis. Reads within the RNAseq datasets that align to potential contaminant genomes were then removed using HISAT2 v2.1.0 with parameters ‘-q -x --un-conc-gz’, with the last parameter saving the sequences which do not align concordantly to the reference [79]. The contaminant genome was a concatenation of the Mus musculus genome (due to the mouse fibroblasts in the [61] system), the Homo sapiens genome GRCh38 [80], the mosquito Anopheles dirus WRAIR2 AdirW1 genome (due to the mosquito-sourced samples of [81]) [82, 83] and Anopheles darlingi AdarC3 (due to the Carlos et al., unpublished, samples) [84], as well as the primate Saimiri boliviensis SaiBol1 genome [85] and Aotus nancymaae GCA_000952055.2 genome [86] (the monkeys used in [43]). The HISAT2 alignment removal was performed with the relevant parameters altered for the single-end and paired-end stranded P. 
vivax samples: for the stranded libraries of Kim et al., 2017 and 2019 [27, 87], HISAT2 was run with ‘--rna-strandness FR’; for the single-end libraries of Muller et al., 2019 [88], and Gural et al., 2018 [61], HISAT2 was run with ‘--un-gz’ instead of ‘--un-conc-gz’. rRNA reads were subsequently removed using SortMeRNA v4.3.4 [89]. These output sequences were then used for assembly using sc-spades and rna-spades (SPAdes v3.15.4 with options ‘--sc’ and ‘--rna’ respectively), and Trinity v2.13.2. The assemblies were combined, and redundant sequences removed, using EvidentialGene version 22may07, which assesses similar sequences for the best representative transcript/peptide [30, 31]. In brief, CDS sequences are identified and scored, perfect duplicates and fragments are removed, alternative transcripts are identified using 98% identity BLAST alignments of coding sequence, and a primary transcript is finally identified. To ascertain the quality of the final meta-assemblies they were run through BUSCO v5.4.3 with ‘plasmodium_odb10’ as the reference, to see how many expected, conserved Plasmodium genes were recovered [32, 90]. HMMer v3.3.2 was used to identify the sequences which resemble the Plasmodium_Vir Pfam model (PF05795) [91], and matches with an E-value of less than 1e-3 were extracted [92]. Transcriptional quantification of the samples was used both to detect which of the proposed pir transcripts have RNAseq reads mapping back to them and to evaluate how the pirs are being transcribed. All the original biological replicates (after Trim_Galore!, HISAT2, and SortMeRNA processing) were aligned to their corresponding assemblies using the transcript-aware aligner Salmon v1.9.0 [93]. The parameters ‘-p 8 --validateMappings -l A’ were used for all alignments, with ‘-l SR’ specified instead of ‘-l A’ for the stranded libraries. 
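The two filters applied to candidate pir transcripts in this section (an HMMer E-value below 1e-3, and at least 1 TPM in at least one sample from the Salmon quantification) can be sketched in Python as follows. This is an illustrative sketch rather than the pipeline's implementation. It assumes that the HMMer results are available in the tabular format written by `hmmsearch --tblout`, that TPM values have already been read into a dictionary per sample, and that ORF identifiers match the transcript identifiers quantified by Salmon. The function names are hypothetical.

```python
from typing import Dict, Iterable, List

E_VALUE_CUTOFF = 1e-3  # HMMer threshold stated in the Methods
MIN_TPM = 1.0          # transcription threshold stated in the Methods


def best_evalues(tblout_lines: Iterable[str]) -> Dict[str, float]:
    """Best full-sequence E-value per target, parsed from hmmsearch --tblout text."""
    best: Dict[str, float] = {}
    for line in tblout_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/footer comments
        fields = line.split()
        # Column 1 is the target name; column 5 is the full-sequence E-value.
        target, evalue = fields[0], float(fields[4])
        best[target] = min(evalue, best.get(target, evalue))
    return best


def transcribed_pir_candidates(
    evalues: Dict[str, float],
    tpm_by_sample: Dict[str, Dict[str, float]],
) -> List[str]:
    """Targets passing the E-value cutoff with >= MIN_TPM in at least one sample."""
    hits = {target for target, evalue in evalues.items() if evalue < E_VALUE_CUTOFF}
    expressed = {
        target
        for tpms in tpm_by_sample.values()
        for target, tpm in tpms.items()
        if tpm >= MIN_TPM
    }
    return sorted(hits & expressed)
```

A candidate is kept only if it passes both filters, so a strong HMM match with no mapped reads, or a well-expressed transcript with a weak HMM match, is dropped.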
Salmon calculates TPM automatically, so these results were used directly to filter the HMMer ORF outputs for the PIR sequences with evidence of transcription (greater than or equal to 1 TPM in at least one sample). The pipeline was run in separate batches to reduce load on computational resources at any given time. To check whether the pipeline failed to assemble/identify any given sub-families of pir genes, simulated RNAseq reads were generated based on the Sal1, P01, and W1 reference genomes (excluding duplicate sequences) using Rsubread v2.0.1 [94] simReads with ‘paired.end = TRUE’ and all relative TPMs set to 1 (ensuring that all transcripts are ‘expressed’ at the same level, controlling for transcript length). The simulated RNAseq fastq files were run through the pipeline as described above, and the output pir transcripts were compared to the reference genome pirs using tBLASTn v2.9.0 (‘-evalue 1e-3 -max_target_seqs 500 -outfmt "6 std qcovs qcovhsp slen nident"’). For the mosquito blood-meal RNAseq results of MosqPeru23 (Carlos et al., manuscript in preparation), the 18h samples were excluded from the final figure because this timepoint consistently showed globally distinct transcriptional profiles with zero pir expression. The reason for this is unclear; it may be due to biological signal (e.g. a transition time for the parasite between zygote and ookinete forms) or a technical artefact.

P. c. chabaudi AS benchmarking of the de novo transcriptome method to find pir genes
The 24h asexual cycle samples from Little and Cunningham et al., 2021 [29], were used to test this method for finding pir genes in RNAseq datasets. The Nextflow pipeline was run for each assembler separately, then with all assemblers together to produce a meta-assembly. To remove duplicates from these assemblies they were processed by CD-HIT v4.8.1. 
To compare the use of EvidentialGene for meta-assembly construction, the meta-assembly was also run through EvidentialGene instead of CD-HIT. Instead of running HMMer to find pir sequences, the peptide outputs of the pipeline were matched to the known PIR peptides from PlasmoDB v61 using BLAST v2.9.0 with parameters ‘-evalue 1e-3 -num_threads 6 -max_target_seqs 500 -outfmt "6 std qcovs qcovhsp slen nident"’.

Creating networks of PIRs and sub-family assignment
To create networks of the PIR ORFs and work out which sub-family they belong to, and how well existing definitions describe these PIRs, we downloaded the already published PIR sequences from P. vivax strains P01, Sal1, and W1, and P. ovale, P. vivax-like, P. malariae, P. brasilianum, P. knowlesi, P. coatneyi, and P. cynomolgi, using a PlasmoDB v62 search for “PIR protein” or “VIR protein” in ‘Product Description’, as well as a Pfam search for PF05795: Plasmodium_Vir and PF06022: Cir_Bir_Yir Variant antigen (adding the results of both the text and Pfam searches together), then finally filtering out pseudogenes. These ‘known’ PIRs were combined with the de novo PIRs and BLASTp v2.9.0 was performed between this dataset and itself (‘-evalue 1e-3 -num_threads 6 -max_target_seqs 500 -outfmt "6 std qcovs qcovhsp slen nident"’). MCL clustering was performed with varying inflation values (influencing the number of clusters) of 1.2, 1.4, 1.8, 2, 2.5, 3, 4 and 6 (mcxload parameters: ‘--stream-mirror --stream-neg-log10 -stream-tf 'ceil(200)'’; mcl and mcxdump parameters all default) [41]. As reported in the results, the best compromise between minimal total number of clusters and minimal mixing of Sal1 sub-family sequences was found with an MCL inflation value of 1.4. From these MCL clusters sub-families were assigned based on the majority Sal1 sub-family sequence in the cluster. 
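The majority-vote assignment of sub-families to MCL clusters can be sketched in Python as follows. This is a minimal illustration and not the authors' code: the function name, the input structures (one list of sequence identifiers per cluster, plus a lookup of annotated Sal1 sequences), and the example identifiers are assumptions. Clusters without any annotated Sal1 member receive the label ‘New’, following the rule given in the Methods. The source does not say how ties were resolved, so here a tie simply goes to the sub-family encountered first.

```python
from collections import Counter
from typing import Dict, List, Sequence


def assign_subfamilies(
    clusters: Sequence[Sequence[str]],
    sal1_subfamily: Dict[str, str],
) -> List[str]:
    """Label each cluster with the most common Sal1 sub-family among its members."""
    labels: List[str] = []
    for members in clusters:
        # Only Sal1 sequences with a published sub-family annotation get a vote.
        votes = Counter(sal1_subfamily[m] for m in members if m in sal1_subfamily)
        if votes:
            labels.append(votes.most_common(1)[0][0])
        else:
            labels.append("New")  # no annotated Sal1 sequence in this cluster
    return labels


# Hypothetical identifiers, for illustration only.
example_clusters = [
    ["Sal1_a", "Sal1_b", "Sal1_c", "denovo_1"],
    ["denovo_2", "denovo_3"],
]
example_annotation = {"Sal1_a": "E", "Sal1_b": "E", "Sal1_c": "C"}
print(assign_subfamilies(example_clusters, example_annotation))
```

De novo sequences and unannotated reference sequences never vote; they inherit the label of whichever cluster MCL placed them in.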
If a cluster had no assigned Sal1 sub-family sequence (note that there may be unassigned Sal1 sequences present which were not identified as belonging to any sub-family in the original study) then it was marked as a ‘New’ sub-family. These were then numbered if the assigned name was not unique. The network was visualised using Gephi [95] and the OpenOrd layout algorithm [96] with 25% Liquid stage, 100% Expansion stage, 15% Cooldown stage and no Crunch or Simmer stage of the simulated annealing process, 0.8 edge cut, 7 threads, 750 iterations, 0.2 fixed time and a random seed of -9, followed by briefly running the Noverlap algorithm to reduce overlapping nodes with speed set to 3, ratio 0.5 and margin 5.

Statistical analysis and figure production in R
Data analysis was conducted using stringr [97], dplyr [98] and data.table [99]; figures were drawn using the ggplot2 [100], ComplexHeatmap [101], circlize [102] and viridis [103] packages in R v3.6.2 [104]. The negative binomial model was fitted using the R function ‘glm.nb’ (from the MASS package, which is distributed with R) with the formula ‘TPM ~ sub-family * country * lifecycle_stage + BUSCO_score’.

Abbreviations
pirs: Plasmodium interspersed repeats
HMM: Hidden Markov Model
RNAseq: High-throughput RNA sequencing
BUSCO: Benchmarking Universal Single-Copy Orthologs
TPM: Transcripts-per-Million
IDC: Intra-erythrocytic Development Cycle

Declarations

Code availability
The code used in generating figures and performing analysis is available at https://github.com/timslittle/pirDeNovo_2025.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Funding
This work was supported by the Francis Crick Institute which receives its core funding from the UK Medical Research Council (CC2079), Cancer Research UK (CC2079) and the Wellcome Trust (CC2079); JL was a Wellcome Trust Senior Investigator (grant reference WT101777MA); TSL was the recipient of an Imperial College/Francis Crick PhD stipend. 
Authors' contributions
Conceptualisation: TSL, DAC, GKC, AJR, JL. Analysis: TSL. Supervision: GKC, AJR, JL. Writing (review & editing): TSL, DAC, AJR, JL.

Data Availability
Most datasets generated in this study were obtained from public databases. The MosqPeru23 datasets are available from the European Nucleotide Archive at PRJEB29445.

Acknowledgements
The authors are grateful to the original scientists who published the studies re-analysed here, as well as Jayme Souza Neto, Bianca Cechetto Carlos, and Dina Vlachou for permitting us to use the MosqPeru23 data prior to release and publication. Additionally, the authors thank Dr. George Young (MRC Laboratory of Medical Sciences, formerly of The Francis Crick Institute), for his help with bioinformatic analysis.

Competing interests
None declared.

References
WHO. World Malaria Report 2021. Geneva; 2021. Baird JK. African Plasmodium vivax malaria improbably rare or benign. Trends Parasitol. 2022;0. Culleton R, Ndounga M, Zeyrek FY, Coban C, Casimiro PN, Takeo S, et al. Evidence for the Transmission of Plasmodium vivax in the Republic of the Congo, West Central Africa. J Infect Dis. 2009;200:1465–9. Kho S, Qotrunnada L, Leonardo L, Andries B, Wardani PAI, Fricot A, et al. Hidden Biomass of Intact Malaria Parasites in the Human Spleen. New England Journal of Medicine. 2021;384:2067–9. Reid AJ. Large, rapidly evolving gene families are at the forefront of host–parasite interactions in Apicomplexa. Parasitology. 2015;142:S57–70. Janssen CS, Phillips RS, Turner MR, Barret MP. Plasmodium interspersed repeats: The major multigene superfamily of malaria parasites. Nucleic Acids Res. 2004;32:5712–20. Lopez FJ, Bernabeu M, Fernandez-Becerra C, del Portillo HA. A new computational approach redefines the subtelomeric vir superfamily of Plasmodium vivax. BMC Genomics. 2013;14:8. del Portillo HA, Fernandez-Becerra C, Bowman S, Oliver K, Preuss M, Sanchez CP, et al. 
A superfamily of variant genes encoded in the subtelomeric region of Plasmodium vivax . Nature. 2001;410:839–42. Su X, Heatwole VM, Wertheimer SP, Guinet F, Herrfeldt JA, Peterson DS, et al. The large diverse gene family var encodes proteins involved in cytoadherence and antigenic variation of Plasmodium falciparum -infected erythrocytes. Cell. 1995;82:89–100. Fried M, Duffy PE. Adherence of Plasmodium falciparum to Chondroitin Sulfate A in the Human Placenta. Science (1979). 1996;272:1502–4. Spence PJ, Jarra W, Lévy P, Reid AJ, Chappell L, Brugat T, et al. Vector transmission regulates immune control of Plasmodium virulence. Nature. 2013;498:228–31. Brugat T, Reid AJ, Lin J, Cunningham D, Tumwine I, Kushinga G, et al. Antibody-independent mechanisms regulate the establishment of chronic Plasmodium infection. Nat Microbiol. 2017;2:16276. Bernabeu M, Lopez FJ, Ferrer M, Martin-Jaular L, Razaname A, Corradin G, et al. Functional analysis of Plasmodium vivax VIR proteins reveals different subcellular localizations and cytoadherence to the ICAM-1 endothelial receptor. Cell Microbiol. 2012;14:386–400. Fernandez-Becerra C, Bernabeu M, Castellanos A, Correa BR, Obadia T, Ramirez M, et al. Plasmodium vivax spleen-dependent genes encode antigens associated with cytoadhesion and clinical protection. Proceedings of the National Academy of Sciences. 2020;:201920596. Ansari HR, Templeton TJ, Subudhi AK, Ramaprasad A, Tang J, Lu F, et al. Genome-scale comparison of expanded gene families in Plasmodium ovale wallikeri and Plasmodium ovale curtisi with Plasmodium malariae and with other Plasmodium species. Int J Parasitol. 2016;46:685–96. Rutledge GG, Böhme U, Sanders M, Reid AJ, Cotton JA, Maiga-Ascofare O, et al. Plasmodium malariae and P. ovale genomes provide insights into malaria parasite evolution. Nature. 2017;542:101–4. Carlton JM, Angiuoli S V., Suh BB, Kooij TW, Pertea M, Silva JC, et al. 
Genome sequence and comparative analysis of the model rodent malaria parasite Plasmodium yoelii yoelii . Nature. 2002;419:512–9. Otto TD, Böhme U, Jackson AP, Hunt M, Franke-Fayard B, Hoeijmakers WAM, et al. A comprehensive evaluation of rodent malaria parasite genomes and gene expression. BMC Biol. 2014;12:86. Auburn S, Böhme U, Steinbiss S, Trimarsanto H, Hostetler J, Sanders M, et al. A new Plasmodium vivax reference sequence with improved assembly of the subtelomeres reveals an abundance of pir genes. Wellcome Open Res. 2016;1:4. Minassian AM, Themistocleous Y, Silk SE, Barrett JR, Kemp A, Quinkert D, et al. Controlled human malaria infection with a clone of Plasmodium vivax with high-quality genome assembly. JCI Insight. 2021;6. Frech C, Chen N. Variant surface antigens of malaria parasites: functional and evolutionary insights from comparative gene family classification and analysis. BMC Genomics. 2013;14:427. Merino EF, Fernandez-Becerra C, Durham AM, Ferreira JE, Tumilasci VF, d’Arc-Neves J, et al. Multi-character population study of the vir subtelomeric multigene superfamily of Plasmodium vivax , a major human malaria parasite. Mol Biochem Parasitol. 2006;149:10–6. Chen S-BB, Wang Y, Kassegne K, Xu B, Shen H-MM, Chen J-HH. Whole-genome sequencing of a Plasmodium vivax clinical isolate exhibits geographical characteristics and high genetic variation in China-Myanmar border area. BMC Genomics. 2017;18:131. Neafsey DE, Galinsky K, Jiang RHY, Young L, Sykes SM, Saif S, et al. The malaria parasite Plasmodium vivax exhibits greater genetic diversity than Plasmodium falciparum . Nat Genet. 2012;44:1046–50. Zhang C, Oguz C, Huse S, Xia L, Wu J, Peng YC, et al. Genome sequence, transcriptome, and annotation of rodent malaria parasite Plasmodium yoelii nigeriensis N67. BMC Genomics. 2021;22:1–12. Brashear AM, Roobsoong W, Siddiqui FA, Nguitragool W, Sattabongkot J, López-Uribe MM, et al. 
A glance of the blood stage transcriptome of a Southeast Asian Plasmodium ovale isolate. PLoS Negl Trop Dis. 2019;13:e0007850. Kim A, Popovici J, Vantaux A, Samreth R, Bin S, Kim S, et al. Characterization of P. vivax blood stage transcriptomes from field isolates reveals similarities among infections and complex gene isoforms. Sci Rep. 2017;7:7761. Zhu L, Mok S, Imwong M, Jaidee A, Russell B, Nosten F, et al. New insights into the Plasmodium vivax transcriptome using RNA-Seq. Sci Rep. 2016;6:20498. Little TS, Cunningham DA, Vandomme A, Lopez CT, Amis S, Alder C, et al. Analysis of pir gene expression across the Plasmodium life cycle. Malar J. 2021;20:1–14. Gilbert D. Accurate & complete gene construction with EvidentialGene. F1000Res. 2016;5. Gilbert DG. Genes of the pig, Sus scrofa , reconstructed with EvidentialGene. PeerJ. 2019;7:e6374. Simão FA, Waterhouse RM, Ioannidis P, Kriventseva E V., Zdobnov EM. BUSCO: Assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31:3210–2. Bushmanova E, Antipov D, Lapidus A, Prjibelski AD. rnaSPAdes: a de novo transcriptome assembler and its application to RNA-Seq data. Gigascience. 2019;8:1–13. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST:a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–402. Pain A, Böhme U, Berry AE, Mungall K, Finn RD, Jackson AP, et al. The genome of the simian and human malaria parasite Plasmodium knowlesi . Nature. 2008;455:799–803. Bajic M, Ravishankar S, Sheth M, Rowe LA, Pacheco MA, Patel DS, et al. The first complete genome of the simian malaria parasite Plasmodium brasilianum. Scientific Reports 2022 12:1. 2022;12:1–13. Pasini EM, Böhme U, Rutledge GG, Voorberg-Van der Wel A, Sanders M, Berriman M, et al. An improved Plasmodium cynomolgi genome assembly reveals an unexpected methyltransferase gene expansion. Wellcome Open Res. 2017;2 May:42. 
Tachibana S-I, Sullivan SA, Kawai S, Nakamura S, Kim HR, Goto N, et al. Plasmodium cynomolgi genome sequences provide insight into Plasmodium vivax and the monkey malaria clade. Nat Genet. 2012;44:1051–5. Chien J-T, Pakala SB, Geraldo JA, Lapp SA, Humphrey JC, Barnwell JW, et al. High-Quality Genome Assembly and Annotation for Plasmodium coatneyi, Generated Using Single-Molecule Real-Time PacBio Technology. Genome Announc. 2016;4. Gilabert A, Otto TD, Rutledge GG, Franzon B, Ollomo B, Arnathau C, et al. Plasmodium vivax-like genome sequences shed new insights into Plasmodium vivax biology and evolution. PLoS Biol. 2018;16:e2006035. Van Dongen S, Abreu-Goodger C. Using MCL to Extract Clusters from Networks. Methods in Molecular Biology. 2012;804:281–95. Ansari HR, Templeton TJ, Subudhi AK, Ramaprasad A, Tang J, Lu F, et al. Genome-scale comparison of expanded gene families in Plasmodium ovale wallikeri and Plasmodium ovale curtisi with Plasmodium malariae and with other Plasmodium species. Int J Parasitol. 2016;46:685–96. Gunalan K, Sá JM, Moraes Barros RR, Anzick SL, Caleon RL, Mershon JP, et al. Transcriptome profiling of Plasmodium vivax in Saimiri monkeys identifies potential ligands for invasion. Proc Natl Acad Sci U S A. 2019;116. Zollner GE, Ponsa N, Garman GW, Poudel S, Bell JA, Sattabongkot J, et al. Population dynamics of sporogony for Plasmodium vivax parasites from western Thailand developing within three species of colonized Anopheles mosquitoes. Malar J. 2006;5:1–17. Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nature Biotechnology. 2017;35:316–9. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011;29:644–52. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. 
SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology. 2012;19:455–77. Mamrot J, Legaie R, Ellery SJ, Wilson T, Seemann T, Powell DR, et al. De novo transcriptome assembly for the spiny mouse ( Acomys cahirinus ). Scientific Reports 2017 7:1. 2017;7:1–15. Chopra R, Burow G, Farmer A, Mudge J, Simpson CE, Burow MD. Comparisons of de novo transcriptome assemblers in diploid and polyploid species using peanut ( Arachis spp. ) RNA-Seq data. PLoS One. 2014;9:e115055. Amin S, Prentis PJ, Gilding EK, Pavasovic A. Assembly and annotation of a non-model gastropod ( Nerita melanotragus ) transcriptome: A comparison of de novo assemblers. BMC Res Notes. 2014;7:1–8. Francis WR, Christianson LM, Kiko R, Powers ML, Shaner NC, D Haddock SH. A comparison across non-model animals suggests an optimal sequencing depth for de novo transcriptome assembly. BMC Genomics. 2013;14:1–12. Madritsch S, Burg A, Sehr EM. Comparing de novo transcriptome assembly tools in di- and autotetraploid non-model plant species. BMC Bioinformatics. 2021;22:1–17. Yahav T, Privman E. A comparative analysis of methods for de novo assembly of hymenopteran genomes using either haploid or diploid samples. Sci Rep. 2019;9:1–10. Rana SB, Zadlock FJ, Zhang Z, Murphy WR, Bentivegna CS. Comparison of de novo transcriptome assemblers and k-mer strategies using the killifish, Fundulus heteroclitus . PLoS One. 2016;11:e0153104. Hölzer M, Marz M. De novo transcriptome assembly: A comprehensive cross-species comparison of short-read RNA-Seq assemblers. Gigascience. 2019;8:1–16. Chang Z, Li G, Liu J, Zhang Y, Ashby C, Liu D, et al. Bridger: A new framework for de novo transcriptome assembly using RNA-seq data. Genome Biol. 2015;16:1–10. Birol I, Jackman SD, Nielsen CB, Qian JQ, Varhol R, Stazyk G, et al. De novo transcriptome assembly with ABySS. Bioinformatics. 2009;25:2872–7. Xie Y, Wu G, Tang J, Luo R, Patterson J, Liu S, et al. 
SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads. Bioinformatics. 2014;30:1660–6.
Pasini EM, Braks JA, Fonager J, Klop O, Aime E, Spaccapelo R, et al. Proteomic and Genetic Analyses Demonstrate that Plasmodium berghei Blood Stages Export a Large and Diverse Repertoire of Proteins. Mol Cell Proteomics. 2013;12:426–48.
Howick VM, Russell AJC, Andrews T, Heaton H, Reid AJ, Natarajan K, et al. The Malaria Cell Atlas: Single parasite transcriptomes across the complete Plasmodium life cycle. Science. 2019;365:eaaw2619.
Gural N, Mancio-Silva L, Miller AB, Galstian A, Butty VL, Levine SS, et al. In Vitro Culture, Drug Sensitivity, and Transcriptome of Plasmodium vivax Hypnozoites. Cell Host Microbe. 2018;23:395-406.e4.
Cunningham DA, Reid AJ, Hosking C, Deroost K, Tumwine-Downey I, Sanders M, et al. Identification of gametocyte-associated pir genes in the rodent malaria parasite, Plasmodium chabaudi chabaudi AS. BMC Res Notes. 2023;16.
Hoo R, Zhu L, Amaladoss A, Mok S, Natalang O, Lapp SA, et al. Integrated analysis of the Plasmodium species transcriptome. EBioMedicine. 2016;7:255–66.
Bass CC, Johns FM. The cultivation of malarial plasmodia (Plasmodium vivax and Plasmodium falciparum) in vitro. J Exp Med. 1912;16:567–79.
Noulin F, Borlon C, Van Den Abbeele J, D’Alessandro U, Erhart A. 1912-2012: A century of research on Plasmodium vivax in vitro culture. Trends Parasitol. 2013;29:286–94.
Thomson JG, Thomson D, Fantham HB. The cultivation of one generation of benign tertian malarial parasites (Plasmodium vivax) in vitro, by Bass’s method. Ann Trop Med Parasitol. 1913;7:153–64.
Hupalo DN, Luo Z, Melnikov A, Sutton PL, Rogov P, Escalante A, et al. Population genomics studies identify signatures of global dispersal and drug resistance in Plasmodium vivax. Nat Genet. 2016;48:953–8.
Imwong M, Nair S, Pukrittayakamee S, Sudimack D, Williams JT, Mayxay M, et al. Contrasting genetic structure in Plasmodium vivax populations from Asia and South America. Int J Parasitol. 2007;37:1013–22.
Mu J, Joy DA, Duan J, Huang Y, Carlton J, Walker J, et al. Host Switch Leads to Emergence of Plasmodium vivax Malaria in Humans. Mol Biol Evol. 2005;22:1686–93.
Daron J, Boissière A, Boundenga L, Ngoubangoye B, Houze S, Arnathau C, et al. Population genomic evidence of Plasmodium vivax Southeast Asian origin. Sci Adv. 2021;7.
Daron J, Boissière A, Boundenga L, Ngoubangoye B, Houze S, Arnathau C, et al. Population genomic evidence of Plasmodium vivax Southeast Asian origin. Sci Adv. 2021;7.
Prajapati SK, Joshi H, Carlton JM, Rizvi MA. Neutral Polymorphisms in Putative Housekeeping Genes and Tandem Repeats Unravels the Population Genetics and Evolutionary History of Plasmodium vivax in India. PLoS Negl Trop Dis. 2013;7:e2425.
Kepple D, Ford CT, Williams J, Abagero B, Li S, Popovici J, et al. Comparative transcriptomics reveal differential gene expression among Plasmodium vivax geographical isolates and implications on erythrocyte invasion mechanisms. PLoS Negl Trop Dis. 2024;18:e0011926.
De Meulenaere K, Cuypers B, Gamboa D, Laukens K, Rosanas-Urgell A. A new Plasmodium vivax reference genome for South American isolates. BMC Genomics. 2023;24:1–14.
Callejas-Hernández F, Nikulkova M, Adamski N, Yan G, Yewhalaw D, Carlton JM. Assembled genome of an Ethiopian Plasmodium vivax isolate generated using GridION long-read technology. Microbiol Resour Announc. 2024;13:e00590-24.
Brashear AM, Huckaby AC, Fan Q, Dillard LJ, Hu Y, Li Y, et al. New Plasmodium vivax Genomes From the China-Myanmar Border. Front Microbiol. 2020;11:1930.
Bourgard C, Lopes SCP, Lacerda MVG, Albrecht L, Costa FTM. A suitable RNA preparation methodology for whole transcriptome shotgun sequencing harvested from Plasmodium vivax-infected patients. Sci Rep. 2021;11:5089.
Roth A, Adapa SR, Zhang M, Liao X, Saxena V, Goffe R, et al. Unraveling the Plasmodium vivax sporozoite transcriptional journey from mosquito vector to human host. Sci Rep. 2018;8:12183.
Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat Methods. 2015;12:357–60.
Schneider VA, Graves-Lindsay T, Howe K, Bouk N, Chen HC, Kitts PA, et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 2017;27:849–64.
Boonkaew T, Mongkol W, Prasert S, Paochan P, Yoneda S, Nguitragool W, et al. Transcriptome analysis of Anopheles dirus and Plasmodium vivax at ookinete and oocyst stages. Acta Trop. 2020;207:105502.
Amos B, Aurrecoechea C, Barba M, Barreto A, Basenko EY, Bażant W, et al. VEuPathDB: the eukaryotic pathogen, vector and host bioinformatics resource center. Nucleic Acids Res. 2022;50:D898–911.
Neafsey DE, Waterhouse RM, Abai MR, Aganezov SS, Alekseyev MA, Allen JE, et al. Mosquito genomics. Highly evolvable malaria vectors: the genomes of 16 Anopheles mosquitoes. Science. 2015;347.
Marinotti O, Cerqueira GC, de Almeida LGP, Ferro MIT, da Silva Loreto EL, Zaha A, et al. The genome of Anopheles darlingi, the main neotropical malaria vector. Nucleic Acids Res. 2013;41:7387–400.
Chiou KL, Pozzi L, Lynch Alfaro JW, di Fiore A. Pleistocene diversification of living squirrel monkeys (Saimiri spp.) inferred from complete mitochondrial genome sequences. Mol Phylogenet Evol. 2011;59:736–45.
Babb PL, Fernandez-Duque E, Baiduc CA, Gagneux P, Evans S, Schurr TG. mtDNA diversity in Azara’s owl monkeys (Aotus azarai azarai) of the Argentinean Chaco. Am J Phys Anthropol. 2011;146:209–24.
Kim A, Popovici J, Menard D, Serre D. Plasmodium vivax transcriptomes reveal stage-specific chloroquine response and differential regulation of male and female gametocytes. Nat Commun. 2019;10:371.
Muller I, Jex AR, Kappe SHI, Mikolajczak SA, Sattabongkot J, Patrapuvich R, et al. Transcriptome and histone epigenome of Plasmodium vivax salivary-gland sporozoites point to tight regulatory control and mechanisms for liver-stage differentiation in relapsing malaria. Int J Parasitol. 2019;49:501–13.
Kopylova E, Noé L, Touzet H. SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data. Bioinformatics. 2012;28:3211–7.
Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Mol Biol Evol. 2021;38:4647–54.
Mistry J, Chuguransky S, Williams L, Qureshi M, Salazar GA, Sonnhammer ELL, et al. Pfam: The protein families database in 2021. Nucleic Acids Res. 2021;49:D412–9.
Eddy SR. Accelerated Profile HMM Searches. PLoS Comput Biol. 2011;7:e1002195.
Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods. 2017;14:417–9.
Liao Y, Smyth GK, Shi W. The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads. Nucleic Acids Res. 2019;47:e47.
Bastian M, Heymann S, Jacomy M. Gephi: An Open Source Software for Exploring and Manipulating Networks. 2009.
Martin S, Brown WM, Klavans R, Boyack KW. OpenOrd: an open-source toolbox for large graph layout. In: Visualization and Data Analysis 2011. SPIE; 2011. p. 786806.
Wickham H. stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0. 2019.
Wickham H, François R, Henry L, Müller K. dplyr: A Grammar of Data Manipulation. R package version 0.8.1. 2019. https://cran.r-project.org/package=dplyr.
Dowle M, Srinivasan A. data.table: Extension of `data.frame`. R package version. 2019.
Wickham H. ggplot2: Elegant Graphics for Data Analysis. New York: Springer-Verlag; 2016.
Gu Z, Eils R, Schlesner M. Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics. 2016;32:2847–9.
Gu Z, Gu L, Eils R, Schlesner M, Brors B. Circlize implements and enhances circular visualization in R. Bioinformatics. 2014;30:2811–2.
Garnier S. viridis: Default Color Maps from “matplotlib”. R package version 0.5.1. 2018.
R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. 2018. https://www.r-project.org/.
Cheng CW, Jongwutiwes S, Putaporntip C, Jackson AP. Clinical expression and antigenic profiles of a Plasmodium vivax vaccine candidate: merozoite surface protein 7 (PvMSP-7). Malar J. 2019;18:197.
Collins WE, Contacos PG, Krotoski WA, Howard WA. Transmission of four Central American strains of Plasmodium vivax from monkey to man. J Parasitol. 1972;58:332–5.
Siegel SV, Chappell L, Hostetler JB, Amaratunga C, Suon S, Böhme U, et al. Analysis of Plasmodium vivax schizont transcriptomes from field isolates reveals heterogeneity of expression of genes involved in host-parasite interactions. Sci Rep. 2020;10:16667.
Rangel GW, Clark MA, Kanjee U, Goldberg JM, MacInnis B, José Menezes M, et al. Plasmodium vivax transcriptional profiling of low input cryopreserved isolates through the intraerythrocytic development cycle. PLoS Negl Trop Dis. 2020;14:e0008104.
De Meulenaere K, Prajapati SK, Villasis E, Cuypers B, Kattenberg JH, Kasian B, et al. Band 3–mediated Plasmodium vivax invasion is associated with transcriptional variation in PvTRAg genes. Front Cell Infect Microbiol. 2022;12.
Table 1

Reference | Life cycle stages | Strains used | Experiment code
Zhu et al., 2016 [28] | IDC (different lengths of time in ex vivo culture) | NW Thai patient isolates (Mae Sot) | AsexThai16
Kim et al., 2017 [27] | Mixed blood stages (and one sporozoite) | Ratanakiri, Cambodia | MixdCamb17 (same authors as MixdCamb19)
Gural et al., 2018 [61] | Mixed liver stages and ‘hypnozoite-enriched’ (in vitro human organelle system) | Thai patient isolates | LivrThai18
Roth et al., 2018 [78] | Sporozoites | Myanmar border, Thailand | SpztThai18
Cheng et al., 2019 [105] | Asexual blood stages | 5 N Thai and 5 S Thai patient isolates | AsexThai19
Muller et al., 2019 [88] | Sporozoites | Thai patient isolates (Tak/Ubon) | SpztThai19
Kim et al., 2019 [87] | Mixed blood stages (with differential counts) | 26 Ratanakiri, Cambodia patient isolates | MixdCamb19 (same authors as MixdCamb17)
Gunalan et al., 2019 [43] | Asexual blood stages from two species of monkey | P. vivax Sal I (El Salvador origin [106]) | AsexSal119
Boonkaew et al., 2020 [81] | Ookinete (18 h) and oocyst (7 days) | Thai patient isolates (Ubon and Yala) | MosqThai20
Siegel et al., 2020 [107] | Schizonts | Cambodia patient isolates | SchzMixd20
Rangel et al., 2020 [108] | IDC (4, 20, 36, 44 and 72 hours post-thaw) | Acre, Brazil | AsexBraz20
Bourgard et al., 2021 [77] | Asexual blood stages | Manaus, Brazil | AsexBraz21
de Meulenaere et al., 2022 [109] | Schizonts | Iquitos, Peru, and Madang, Papua New Guinea | SchzMixd22
Carlos et al., in preparation | Mosquito blood-meal/midguts (1 h, 6 h, 22 h, 26 h, and 7 d) | Iquitos, Peru | MosqPeru23

Table 1 – Summary of the publications from which P. vivax RNAseq data were taken and used for de novo assembly. The study reference, life cycle stages/conditions extracted, geographic source of the isolates and experiment code used in this paper are all included. A descriptive short identifier was given to each experiment/dataset, stating the life cycle stages covered (e.g. ‘Asex’ for ‘Asexual stages’, ‘Mixd’ if multiple stages are included), the country of origin (e.g. ‘Camb’ for ‘Cambodia’, ‘Mixd’ if multiple geographical sources are involved), and the year of publication. Note that the ‘Asexual’ stages have not been explicitly depleted of sexual stages such as gametocytes or sexual-lineage schizonts; however, these should be a minor contaminant if present.

Additional Declarations
No competing interests reported.

Supplementary Files
SupplementaryMateriallegends.docx
AdditionalFile1samples.xlsx
AdditionalFile2deNovoPirs.fasta
AdditionalFile3pirClustersSubfam.csv
AdditionalFile4pirBLAST99IDmatches.csv
SuppFig1chabStatsAsset2.pdf
SuppFig2chabscovhspvtpmAsset1.pdf
SuppFig3netSpecAsset6.pdf
SuppFig4pirRefsimulatedBLASTMatches.pdf
SuppFig5vvxSubfamPropheatmaporderCluster.pdf
SuppFig6Zhu16Asset1.pdf
SuppFig7AsexSal119.pdf
SuppFig8MosqPeru23P6.pdf

Status: Published
Journal Publication published 29 May, 2025 (published version in BMC Genomics)
Version 1 posted
Editorial decision: Revision requested 29 Apr, 2025
Reviews received at journal 28 Apr, 2025
Reviews received at journal 27 Apr, 2025
Reviewers agreed at journal 14 Apr, 2025
Reviewers agreed at journal 14 Apr, 2025
Reviewers invited by journal 14 Apr, 2025
Submission checks completed at journal 14 Apr, 2025
First submitted to journal 11 Apr, 2025
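The experiment codes listed in Table 1 (for example AsexThai16 or SchzMixd22) combine a life-cycle stage, a geographic origin and a two-digit publication year. As a minimal sketch of how such codes could be split back into their parts, the Python snippet below assumes that every code is exactly four characters of stage, four characters of origin and two digits of year. That fixed-width layout, the function name `parse_experiment_code` and the expansions in the two lookup tables are read off the codes in Table 1 for illustration only; they are not a key provided by the authors.

```python
import re
from typing import NamedTuple


class ExperimentCode(NamedTuple):
    stage: str
    origin: str
    year: int


# Abbreviations as they appear in the Table 1 codes. The expansions are this
# sketch's reading of the table, not an official key from the paper.
STAGES = {
    "Asex": "asexual blood stages",
    "Mixd": "mixed stages",
    "Livr": "liver stages",
    "Spzt": "sporozoites",
    "Mosq": "mosquito stages",
    "Schz": "schizonts",
}
ORIGINS = {
    "Thai": "Thailand",
    "Camb": "Cambodia",
    "Braz": "Brazil",
    "Peru": "Peru",
    "Sal1": "Sal I reference strain",
    "Mixd": "multiple sources",
}

# Assumed layout: 4 letters (stage) + 4 letters/digits (origin) + 2 digits (year).
CODE_PATTERN = re.compile(
    r"(?P<stage>[A-Za-z]{4})(?P<origin>[A-Za-z0-9]{4})(?P<year>\d{2})"
)


def parse_experiment_code(code: str) -> ExperimentCode:
    """Split a Table 1 style experiment code into stage, origin and year."""
    match = CODE_PATTERN.fullmatch(code)
    if match is None or match["stage"] not in STAGES or match["origin"] not in ORIGINS:
        raise ValueError(f"unrecognised experiment code: {code!r}")
    # All codes in Table 1 refer to years from 2016 onwards, so 20xx is assumed.
    return ExperimentCode(
        STAGES[match["stage"]],
        ORIGINS[match["origin"]],
        2000 + int(match["year"]),
    )


if __name__ == "__main__":
    for code in ("AsexThai16", "AsexSal119", "SchzMixd22"):
        print(code, parse_experiment_code(code))
```

Sample-level identifiers such as SchzMixd22_Sch16 carry an extra suffix after the experiment code; under the same assumptions, the part before the underscore can be passed to the parser.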
© Research Square 2026 | ISSN 2693-5015 (online) {"props":{"pageProps":{"initialData":{"identity":"rs-5822769","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":442835652,"identity":"fbfd1079-fb51-4c8f-9b15-aa903eb1bd53","order_by":0,"name":"Timothy S. Little","email":"","orcid":"","institution":"The Francis Crick Institute","correspondingAuthor":false,"prefix":"","firstName":"Timothy","middleName":"S.","lastName":"Little","suffix":""},{"id":442835654,"identity":"478aa2ff-21b1-4ff7-b6f0-51bfadab7f11","order_by":1,"name":"Deirdre A. Cunningham","email":"","orcid":"","institution":"The Francis Crick Institute","correspondingAuthor":false,"prefix":"","firstName":"Deirdre","middleName":"A.","lastName":"Cunningham","suffix":""},{"id":442835656,"identity":"ca06324f-c40c-4cb1-a7cc-c5c5e6882be1","order_by":2,"name":"George K. 
Christophides","email":"","orcid":"","institution":"Imperial College London","correspondingAuthor":false,"prefix":"","firstName":"George","middleName":"K.","lastName":"Christophides","suffix":""},{"id":442835658,"identity":"36f0f004-7eae-4811-aa1e-4621db1b0362","order_by":3,"name":"Adam James Reid","email":"","orcid":"","institution":"University of Cambridge","correspondingAuthor":false,"prefix":"","firstName":"Adam","middleName":"James","lastName":"Reid","suffix":""},{"id":442835660,"identity":"55455362-ec07-4641-9e3b-ca4c07cfcd3f","order_by":4,"name":"Jean Langhorne","email":"","orcid":"","institution":"The Francis Crick Institute","correspondingAuthor":true,"prefix":"","firstName":"Jean","middleName":"","lastName":"Langhorne","suffix":""}],"badges":[],"createdAt":"2025-01-13 22:08:14","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-5822769/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-5822769/v1","draftVersion":[],"editorialEvents":[{"content":"https://doi.org/10.1186/s12864-025-11752-1","type":"published","date":"2025-05-29T15:57:09+00:00"}],"editorialNote":"","failedWorkflow":false,"files":[{"id":80789467,"identity":"c0849498-20ba-4760-9e96-7831c58fb76a","added_by":"auto","created_at":"2025-04-17 06:35:11","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":489804,"visible":true,"origin":"","legend":"\u003cp\u003eWorkflow diagram of the pipeline used for generating the \u003cem\u003ede 
novo \u003c/em\u003etranscriptome assemblies, predicting \u003cem\u003epir\u003c/em\u003e transcripts, and then obtaining expression data in TPM. Software used for each step is shown in brackets. Individual samples were concatenated together for the assembly process if they originate from the same source isolate (orange central box), however they were then aligned separately to the merged assembly (right-hand curved arrow) to see how expression changed by condition. Drawn using LucidPlot.\u003c/p\u003e","description":"","filename":"Fig1Flowchart.png","url":"https://assets-eu.researchsquare.com/files/rs-5822769/v1/579426a79ad3fe2bf9b77df9.png"},{"id":80789470,"identity":"a17f4359-38bb-46d0-8890-70c159d13f11","added_by":"auto","created_at":"2025-04-17 06:35:11","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":383776,"visible":true,"origin":"","legend":"\u003cp\u003eThe number of \u003cem\u003eP. vivax\u003c/em\u003e \u003cem\u003epir\u003c/em\u003es found in each assembly compared to the percentage of complete BUSCOs (including duplicated). Colour of the points denotes the experimental origin and shape denotes the stage of the lifecycle that the parasites were isolated from. The top graph corresponds to the assemblies from blood samples containing asexual stages, and the bottom graph to those from infected mosquito samples. 
Full length \u003cem\u003epir \u003c/em\u003etranscripts were identified by covering ≧ 75% of the \u003cem\u003ePlasmodium\u003c/em\u003e_\u003cem\u003evir \u003c/em\u003eMarkov model (Pfam: PF05795) and were verified by being transcribed at ≧ 1TPM in at least one sample used to produce that assembly.\u003c/p\u003e","description":"","filename":"Fig2BUSCOvNumPirAsset3.png","url":"https://assets-eu.researchsquare.com/files/rs-5822769/v1/b97ca58cf90fd6d9e39f7c85.png"},{"id":80789468,"identity":"d69bd312-707b-458b-a7a7-7a36c797ad34","added_by":"auto","created_at":"2025-04-17 06:35:11","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":1828756,"visible":true,"origin":"","legend":"\u003cp\u003eA. The number of \u003cem\u003eP. vivax\u003c/em\u003e Sal1 sub-family member \u003cem\u003epir\u003c/em\u003es in each cluster of the network, sorted by size. Cluster ID is named from the largest (Cluster 1) to the smallest (in this graph, Cluster 25). B. The \u003cem\u003epir\u003c/em\u003e sequence similarity network, with the Sal1 reference sub-family members and the \u003cem\u003ede novo\u003c/em\u003e \u003cem\u003epir\u003c/em\u003es coloured as shown in the legend in the figure. White nodes include the other \u003cem\u003eP. 
vivax\u003c/em\u003e reference \u003cem\u003epir\u003c/em\u003es (from P01 and W1) and other species.\u003c/p\u003e","description":"","filename":"Fig3ClustersSal1NetworkAsset8.png","url":"https://assets-eu.researchsquare.com/files/rs-5822769/v1/af049f9c962102276f48d6a6.png"},{"id":80790921,"identity":"6f8c3241-e184-4700-9124-346ffedac000","added_by":"auto","created_at":"2025-04-17 06:43:11","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":1405648,"visible":true,"origin":"","legend":"\u003cp\u003eThe BLAST similarity network of \u003cem\u003epir\u003c/em\u003es, where the nodes are coloured by MCL clustering (with the largest ten clusters annotated in the legend.\u003c/p\u003e","description":"","filename":"Fig4SubfamAsset5.png","url":"https://assets-eu.researchsquare.com/files/rs-5822769/v1/892a98d521a087941e06fab8.png"},{"id":80789472,"identity":"e8cec597-ac2e-4626-a24c-6268def06476","added_by":"auto","created_at":"2025-04-17 06:35:11","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":151593,"visible":true,"origin":"","legend":"\u003cp\u003eThe proportion of numbers of \u003cem\u003epir\u003c/em\u003es from each sub-family among the different genomes and this study’s \u003cem\u003ede novo \u003c/em\u003etranscriptome. Each axis is ordered according to hierarchical clustering. The species/strain names are abbreviated as follows: \u003cem\u003eP. coatneyi\u003c/em\u003e, coat; \u003cem\u003eP. knowlesi\u003c/em\u003e, knowl; \u003cem\u003eP. brasilianum\u003c/em\u003e, brasl; \u003cem\u003eP. malariae\u003c/em\u003e, malar; \u003cem\u003eP. ovale\u003c/em\u003e, ovale; \u003cem\u003eP. vivax\u003c/em\u003e P01, vvxP01; \u003cem\u003eP. cynomolgi\u003c/em\u003e, cyno; \u003cem\u003eP. vivax\u003c/em\u003e W1, vvxW1; \u003cem\u003eP. vivax de novo\u003c/em\u003e \u003cem\u003epir\u003c/em\u003e transcripts from this study, de_novo; \u003cem\u003eP. 
vivax-like\u003c/em\u003e, vvxlike; \u003cem\u003eP. vivax\u003c/em\u003e Sal1, vvxSal1. Note that the TPM colour scheme is skewed to lower values to prevent a few higher values dominating the heatmaps.\u003c/p\u003e","description":"","filename":"Fig5vvxSubfamPropheatmapSpecies1.png","url":"https://assets-eu.researchsquare.com/files/rs-5822769/v1/0d2eeb6fc20330be0c6c4be4.png"},{"id":80791619,"identity":"521029d2-4725-4d37-b3db-7eb2bf82f1fe","added_by":"auto","created_at":"2025-04-17 06:51:11","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":147650,"visible":true,"origin":"","legend":"\u003cp\u003eThe expression of sub-families shown as the proportion of \u003cem\u003epir\u003c/em\u003e TPM from each sub-family among the samples of each assembly, alongside the corresponding BUSCO scores and geographic source. Each axis is ordered according to hierarchical clustering. Note that the TPM colour scheme is skewed to lower values to show the distinctions at smaller proportions. 
This figure only shows samples with \u0026gt;50% BUSCO score, since many sub-families are missing from lower quality assemblies, although the SchzMixd22_Sch16 isolate (Schizont sample 16, from Papua New Guinea) was still included as it had a high \u003cem\u003epir\u003c/em\u003e number despite a relatively low BUSCO score.\u003c/p\u003e","description":"","filename":"Fig6vvxSubfamTPMheatmaporderCluster1.png","url":"https://assets-eu.researchsquare.com/files/rs-5822769/v1/80438a6a4eb7a66d9d4f8e88.png"},{"id":80789476,"identity":"b4a80a6f-163c-4c96-a9dc-8beb8ae67fea","added_by":"auto","created_at":"2025-04-17 06:35:11","extension":"png","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":207599,"visible":true,"origin":"","legend":"\u003cp\u003eThe expression (‘TPM pir’) of \u003cem\u003ede novo\u003c/em\u003eassembled \u003cem\u003epir\u003c/em\u003e transcripts separated by sub-family across the timepoints (time after mosquito bloodmeal) of patient isolates from MosqPeru23. Right-hand label shows whether a \u003cem\u003epir\u003c/em\u003e transcript has \u0026gt;=99% identity to the ancestral \u003cem\u003epir\u003c/em\u003e gene (in this case only one \u003cem\u003epir\u003c/em\u003e transcript - from patient isolate P5 -assembled matched the ancestral copy to this degree). The heatmap box at the bottom of each heatmap, labelled ‘Ortho’, shows the transcription of selected orthologue transcripts that were the best matches of the conserved BUSCO orthologue, as named on the right-hand side of the heatmap body. The TPM colour scheme is separate for this heatmap and is labelled ‘TPM Ortho’. 
The gene names have been shortened as follows: PSOP1 (putative secreted ookinete protein 1); P25 and P28 (25kDa and 28kDa ookinete surface proteins); CelTOS (Cell-traversal protein for ookinetes and sporozoites); SOAP (secreted \u003cstrong\u003eookinete\u003c/strong\u003e adhesive protein).\u003c/p\u003e","description":"","filename":"Fig7mosquitoPirAsset3.png","url":"https://assets-eu.researchsquare.com/files/rs-5822769/v1/e288182582c0129ee4659dfd.png"},{"id":83782817,"identity":"cd2e4754-f557-492b-a9a4-55d499e7dd19","added_by":"auto","created_at":"2025-06-02 16:06:55","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":5262186,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-5822769/v1/2d7e5319-07a3-446d-b8b3-11c36d581093.pdf"},{"id":80790919,"identity":"5711dd10-4b32-4d92-a2c9-e84d08c41625","added_by":"auto","created_at":"2025-04-17 06:43:11","extension":"docx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":864150,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryMateriallegends.docx","url":"https://assets-eu.researchsquare.com/files/rs-5822769/v1/b988ad0b55ec677c6178df81.docx"},{"id":80790920,"identity":"c1fdae11-e437-46c6-9716-1be8362457e8","added_by":"auto","created_at":"2025-04-17 06:43:11","extension":"xlsx","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":43044,"visible":true,"origin":"","legend":"","description":"","filename":"AdditionalFile1samples.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-5822769/v1/014e9683fb02ff2455296933.xlsx"},{"id":80791620,"identity":"a3517817-7f26-4ed8-a2f7-a24a6c25417f","added_by":"auto","created_at":"2025-04-17 
06:51:11","extension":"fasta","order_by":3,"title":"","display":"","copyAsset":false,"role":"supplement","size":2993458,"visible":true,"origin":"","legend":"","description":"","filename":"AdditionalFile2deNovoPirs.fasta","url":"https://assets-eu.researchsquare.com/files/rs-5822769/v1/20b9de1bd5b8cfb493284352.fasta"},{"id":80789488,"identity":"3bd89ee9-b3d2-497a-819d-ce80c5a18e00","added_by":"auto","created_at":"2025-04-17 06:35:12","extension":"csv","order_by":4,"title":"","display":"","copyAsset":false,"role":"supplement","size":2832229,"visible":true,"origin":"","legend":"","description":"","filename":"AdditionalFile3pirClustersSubfam.csv","url":"https://assets-eu.researchsquare.com/files/rs-5822769/v1/9c86c80d941fc4e5c976e8f1.csv"},{"id":80789477,"identity":"0b5133d7-2ab9-4a56-b432-5b6d9e262931","added_by":"auto","created_at":"2025-04-17 06:35:11","extension":"csv","order_by":5,"title":"","display":"","copyAsset":false,"role":"supplement","size":405012,"visible":true,"origin":"","legend":"","description":"","filename":"AdditionalFile4pirBLAST99IDmatches.csv","url":"https://assets-eu.researchsquare.com/files/rs-5822769/v1/dd676a4f6cabd3ccf26962e4.csv"},{"id":80789479,"identity":"4382a139-2f66-4461-8a46-02bc43e3881c","added_by":"auto","created_at":"2025-04-17 06:35:11","extension":"pdf","order_by":6,"title":"","display":"","copyAsset":false,"role":"supplement","size":1014223,"visible":true,"origin":"","legend":"","description":"","filename":"SuppFig1chabStatsAsset2.pdf","url":"https://assets-eu.researchsquare.com/files/rs-5822769/v1/dda4852e328f8be11760742b.pdf"},{"id":80789482,"identity":"12a646e5-9ef8-4a34-8492-f60a02e39ee3","added_by":"auto","created_at":"2025-04-17 
06:35:12","extension":"pdf","order_by":7,"title":"","display":"","copyAsset":false,"role":"supplement","size":142181,"visible":true,"origin":"","legend":"","description":"","filename":"SuppFig2chabscovhspvtpmAsset1.pdf","url":"https://assets-eu.researchsquare.com/files/rs-5822769/v1/8fceae4708785f2bbb6ec8e5.pdf"},{"id":80789489,"identity":"db31167a-993c-4450-99e1-305f0bb5aa48","added_by":"auto","created_at":"2025-04-17 06:35:12","extension":"pdf","order_by":8,"title":"","display":"","copyAsset":false,"role":"supplement","size":3300748,"visible":true,"origin":"","legend":"","description":"","filename":"SuppFig3netSpecAsset6.pdf","url":"https://assets-eu.researchsquare.com/files/rs-5822769/v1/287dd039e5ea29b11aefd5ad.pdf"},{"id":80790924,"identity":"ae658c0f-2826-44a4-897d-026f4009943a","added_by":"auto","created_at":"2025-04-17 06:43:11","extension":"pdf","order_by":9,"title":"","display":"","copyAsset":false,"role":"supplement","size":6920,"visible":true,"origin":"","legend":"","description":"","filename":"SuppFig4pirRefsimulatedBLASTMatches.pdf","url":"https://assets-eu.researchsquare.com/files/rs-5822769/v1/bf8799bc68aaceda6ef185e9.pdf"},{"id":80790928,"identity":"6003c076-f860-47d7-94b2-ca471708c8ad","added_by":"auto","created_at":"2025-04-17 06:43:12","extension":"pdf","order_by":10,"title":"","display":"","copyAsset":false,"role":"supplement","size":14559,"visible":true,"origin":"","legend":"","description":"","filename":"SuppFig5vvxSubfamPropheatmaporderCluster.pdf","url":"https://assets-eu.researchsquare.com/files/rs-5822769/v1/591db0b4dd5c3b135d2ab01c.pdf"},{"id":80789484,"identity":"4331027e-52a7-4996-8d99-34a2d2c3cfe8","added_by":"auto","created_at":"2025-04-17 
06:35:12","extension":"pdf","order_by":11,"title":"","display":"","copyAsset":false,"role":"supplement","size":1020965,"visible":true,"origin":"","legend":"","description":"","filename":"SuppFig6Zhu16Asset1.pdf","url":"https://assets-eu.researchsquare.com/files/rs-5822769/v1/eb58bd360559ef20640787a3.pdf"},{"id":80790926,"identity":"290a1620-e9ab-48ce-a57c-3a8d12129883","added_by":"auto","created_at":"2025-04-17 06:43:11","extension":"pdf","order_by":12,"title":"","display":"","copyAsset":false,"role":"supplement","size":11231,"visible":true,"origin":"","legend":"","description":"","filename":"SuppFig7AsexSal119.pdf","url":"https://assets-eu.researchsquare.com/files/rs-5822769/v1/f803f669ad27ed8a7ccc4d0c.pdf"},{"id":80789483,"identity":"f228f1e4-233d-44dc-8fa2-0b7c0a468e23","added_by":"auto","created_at":"2025-04-17 06:35:12","extension":"pdf","order_by":13,"title":"","display":"","copyAsset":false,"role":"supplement","size":8704,"visible":true,"origin":"","legend":"","description":"","filename":"SuppFig8MosqPeru23P6.pdf","url":"https://assets-eu.researchsquare.com/files/rs-5822769/v1/8df3a980440e68b065307ec2.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"De novo assembly of plasmodium interspersed repeat (pir) genes from Plasmodium vivax RNAseq data suggests geographic conservation of sub-family transcription","fulltext":[{"header":"Introduction","content":"\u003cp\u003e\u003cem\u003ePlasmodium vivax\u0026nbsp;\u003c/em\u003eis the most widely distributed species of \u003cem\u003ePlasmodium\u0026nbsp;\u003c/em\u003ethat infects humans, causing recurrent malaria. Although globally the proportion of malaria cases attributable to \u003cem\u003eP. vivax\u0026nbsp;\u003c/em\u003ehas been decreasing since 2000, \u003cem\u003eP. vivax\u003c/em\u003e still accounts for approximately 50-80% of all human malaria cases in the Americas and most of Asia [1]. It has traditionally been thought that \u003cem\u003eP. 
vivax\u003c/em\u003e was not a major cause of malaria in sub-Saharan Africa; however, increasing evidence of seropositivity in African nations has led some to suggest that \u003cem\u003eP. vivax\u0026nbsp;\u003c/em\u003ein the continent is more endemic than is commonly assumed\u0026nbsp;[2\u0026ndash;4]. Given its prevalence around the world, \u003cem\u003eP. vivax\u003c/em\u003e is a major focus of public health research.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eA feature of the genome of all \u003cem\u003ePlasmodium\u003c/em\u003e species is the presence of large multigene families [5].\u0026nbsp;One of the most expansive of these is the \u003cem\u003ePlasmodium interspersed repeats\u0026nbsp;\u003c/em\u003e(\u003cem\u003epir\u003c/em\u003es) family, which is present across the \u003cem\u003ePlasmodium\u0026nbsp;\u003c/em\u003elineage [6]. This includes human-infective parasites such as \u003cem\u003eP. vivax\u003c/em\u003e, \u003cem\u003eP. malariae\u0026nbsp;\u003c/em\u003eand \u003cem\u003eP. ovale\u003c/em\u003e, simian-infective parasites such as \u003cem\u003eP. cynomolgi,\u0026nbsp;\u003c/em\u003eand the rodent-infecting \u003cem\u003ePlasmodium\u003c/em\u003e species [6\u0026ndash;8]. The gene family is sometimes named differently depending on the species, such as the \u003cem\u003evir\u003c/em\u003es in \u003cem\u003eP. vivax\u003c/em\u003e or \u003cem\u003ecir\u003c/em\u003es in \u003cem\u003eP. c. chabaudi\u003c/em\u003e, but they are all members of the \u003cem\u003epir\u003c/em\u003e gene family and are hereafter referred to only as \u003cem\u003epir\u003c/em\u003es.\u003cem\u003e\u0026nbsp;\u003c/em\u003eHowever, \u003cem\u003epir\u003c/em\u003e genes are not found in species within the subgenus Laverania (such as \u003cem\u003eP. falciparum\u003c/em\u003e). Unlike the \u003cem\u003evar\u003c/em\u003e genes of \u003cem\u003eP. 
falciparum\u003c/em\u003e, which are known to play a role in sequestration in host endothelium [9] and contribute to severe pathology [10], relatively little is known about the exact function or binding partners of \u003cem\u003epir\u003c/em\u003e genes. However, we have previously shown that they are associated with virulence and establishment of chronic infection in rodent-infecting species [11, 12], and others have demonstrated that the surface PIR protein is involved in infected red blood cell sequestration [13, 14].\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eP. ovale, P. malariae\u0026nbsp;\u003c/em\u003eand \u003cem\u003eP. yoelii\u003c/em\u003e possess over 1000 \u003cem\u003epir\u003c/em\u003e members in their genomes, although for other species the copy number can be much lower, such as the 134 \u003cem\u003epir\u003c/em\u003es identified in the rodent malaria parasite \u003cem\u003eP\u003c/em\u003e. \u003cem\u003eberghei\u003c/em\u003e [15\u0026ndash;18]\u003cem\u003e.\u0026nbsp;\u003c/em\u003eThe \u003cem\u003eP. vivax\u0026nbsp;\u003c/em\u003eP01 strain reference genome contains 1216 members of the \u003cem\u003epir\u003c/em\u003e gene family [19], and a recent genome assembly of Thai-origin W1 contains as many as 1145 predicted \u003cem\u003epir\u003c/em\u003e genes [20], thus constituting around a sixth of the total number of predicted genes for this organism. Using sequence clustering and phylogenomics, one can further divide this multigene family into sub-families: one clade of sub-families is specific to the rodent-infective species, and another is specific to simian/human-infective species [16]. A prominent exception to this is a relatively highly conserved \u003cem\u003epir\u003c/em\u003e sequence postulated to be the ancestral gene, which is found across the genomes of the rodent-infective and simian-infective clades of \u003cem\u003ePlasmodium\u003c/em\u003e spp. [21]. From \u003cem\u003eP. 
vivax\u0026nbsp;\u003c/em\u003egenomes sequenced across different geographic regions, it has been shown that there are differences not just in the \u003cem\u003epir\u003c/em\u003e repertoire of these isolates, but also in the proportions of \u003cem\u003epir\u003c/em\u003e sub-family members around the world [22, 23]. The dedication of a large part of the \u003cem\u003eP. vivax\u003c/em\u003e genome to a single multigene family suggests an important role, and understanding that role is a priority.\u003c/p\u003e\n\u003cp\u003eAlthough experimental \u003cem\u003eP. chabaudi\u0026nbsp;\u003c/em\u003eand \u003cem\u003eP. berghei\u003c/em\u003e infections in mice provide ready access to RNA for studying the transcriptional patterns of the \u003cem\u003epir\u003c/em\u003e genes, human malarias, such as \u003cem\u003eP. vivax\u003c/em\u003e, remain more challenging to investigate. Among the malaria parasite species, \u003cem\u003eP. vivax\u003c/em\u003e genomes are particularly diverse between isolates, and this diversity is heavily concentrated within multigene families such as the \u003cem\u003epir\u003c/em\u003es [24]. This presents an obstacle for transcriptomic studies, as \u003cem\u003epir\u003c/em\u003es\u0026nbsp;may be missed when mapping reads from divergent isolates to a generic reference genome.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eOne approach to obtain a more comprehensive overview of the number and variability of \u003cem\u003epir\u003c/em\u003e genes in \u003cem\u003eP. vivax\u003c/em\u003e is to use \u003cem\u003ede novo\u003c/em\u003e transcriptomic assembly on the available \u003cem\u003eP. vivax\u0026nbsp;\u003c/em\u003eRNAseq data, a method in which the sequenced reads are assembled directly into the transcribed RNA sequences, without the need for a reference genome. 
\u003cem\u003eDe novo\u003c/em\u003e transcriptome assembly has previously been performed for \u003cem\u003ePlasmodium\u003c/em\u003e species with missing or inadequate reference genomes, including the assembly of \u003cem\u003epir\u003c/em\u003e gene transcripts. For example, novel \u003cem\u003epir\u003c/em\u003e genes were identified in a \u003cem\u003eP. yoelii nigeriensis\u003c/em\u003e de novo transcriptomic assembly [25].\u0026nbsp;Similarly, \u003cem\u003eP. ovale\u0026nbsp;\u003c/em\u003eand a limited number of \u003cem\u003eP. vivax\u0026nbsp;\u003c/em\u003estudies [26\u0026ndash;28]\u0026nbsp;assembled\u003cem\u003e\u0026nbsp;\u003c/em\u003enovel transcripts which were annotated as \u003cem\u003epir\u003c/em\u003e-like. All these studies demonstrated marked differences between the \u003cem\u003epir\u003c/em\u003e repertoires of the reference genomes and the \u003cem\u003ede novo\u003c/em\u003e transcriptomes, suggesting that, without the new assemblies, some \u003cem\u003epir\u003c/em\u003es would have been completely missed. Using \u003cem\u003ede novo\u003c/em\u003e transcriptome assembly permits the identification of novel members of highly variable multigene families and assessment of their levels of expression during different life stages and environments, ultimately enabling deeper investigation into their functions.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eHere, we have analysed the transcription of \u003cem\u003epir\u003c/em\u003es in published datasets of \u003cem\u003eP. vivax\u0026nbsp;\u003c/em\u003eRNAseq (see Table 1), using \u003cem\u003ede novo\u003c/em\u003e transcriptome assembly to unlock a greater repertoire of the genes than those already existing in the reference genomes. We determined whether the pre-existing \u003cem\u003epir\u003c/em\u003e sub-family definitions accurately describe clusters of the \u003cem\u003ede novo-\u003c/em\u003egenerated genes. 
Once annotated, we asked whether distinct sub-families are differentially expressed across the parasite lifecycle and whether they vary between geographical locations. We compare the \u003cem\u003epir\u003c/em\u003e expression patterns between this human malaria parasite and rodent malaria species, which could suggest that the \u003cem\u003epir\u0026nbsp;\u003c/em\u003efamily has similar function between divergent species.\u003c/p\u003e"},{"header":"Results","content":"\u003ch2\u003e\u003cem\u003eDe novo\u003c/em\u003e transcriptome assembly generates full-length \u003cem\u003epir\u003c/em\u003e genes from samples of \u003cem\u003ePlasmodium chabaudi\u003c/em\u003e blood stages\u003c/h2\u003e\n\u003cp\u003eIn order to assemble the \u003cem\u003ede novo\u003c/em\u003e transcripts of \u003cem\u003eP. vivax pir\u003c/em\u003e genes it was first necessary to determine the most effective assembly method. For this, three different assemblers were run on published \u003cem\u003eP. chabaudi chabaudi\u0026nbsp;\u003c/em\u003eAS RNAseq reads [29]. The resulting assemblies were then combined to identify the best transcripts using EvidentialGene [30, 31] (Figure 1), assessing the outputs for both the number of known \u003cem\u003epir\u003c/em\u003e genes recovered and the number of conserved \u003cem\u003ePlasmodium\u003c/em\u003e genes found among the transcripts (Benchmarking Universal Single-Copy Orthologs, BUSCO, a score of assembly quality [32]). There is no agreed threshold for a \u0026lsquo;good\u0026rsquo; BUSCO score, so we aimed to get the highest scores that the data and software could produce. Among the individual tools, rnaSPAdes [33] performed the best; however, superior results were achieved by combining the outputs of the three assemblers (see\u0026nbsp;Supplementary Figure 1). 
EvidentialGene was chosen as the preferred method of combining the assemblies, compared to the simple concatenation of the three assemblies and performing duplicate removal. We then checked whether there was a threshold of transcription for successful \u003cem\u003epir\u003c/em\u003e assembly by comparing the TPM expression of each gene to the quality of the best transcripts identified for the assemblies (see\u0026nbsp;Supplementary Figure 2). This demonstrated that even a small amount of transcription is enough for a \u003cem\u003epir\u003c/em\u003e to be constructed by the program, although quality is enhanced the more a gene is expressed, as expected. Only a few \u003cem\u003epir\u003c/em\u003e genes are both transcribed and not assembled to any degree, and these still are only found to be expressed at no higher than 100 TPM. Since EvidentialGene accomplished similar quality levels to simple assembly concatenation/de-duplication, but returned a smaller total number of transcripts, we suggest that it would be the better method to use for identifying unknown \u003cem\u003epir\u003c/em\u003es with higher accuracy.\u003cem\u003e\u0026nbsp;\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eDe novo\u0026nbsp;\u003c/em\u003eassembly of\u003cem\u003e\u0026nbsp;Plasmodium vivax pir\u0026nbsp;\u003c/em\u003egenes\u003cem\u003e\u0026nbsp;\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eFourteen transcriptome datasets were available from \u003cem\u003eP. vivax\u0026nbsp;\u003c/em\u003esamples, from multiple geographical sources and life-cycle stages (Table 1). The \u003cem\u003ede novo\u003c/em\u003e transcriptomes of \u003cem\u003eP. vivax\u003c/em\u003e covered a range of BUSCO quality scores and numbers of \u003cem\u003epir\u003c/em\u003e genes. 
Using the assemblies from \u003cem\u003eP. vivax\u003c/em\u003e blood stages, which were generally of high BUSCO quality (\u0026gt;= 50%), we identified transcription of up to 400 \u003cem\u003epir\u003c/em\u003e genes per isolate (see\u0026nbsp;Figure 2). It was notable that the number of \u003cem\u003epir\u003c/em\u003es detected did not plateau, even in the highest quality assemblies, so these assemblies probably did not capture the total number of \u003cem\u003epir\u003c/em\u003es expressed in these isolates.\u003c/p\u003e\n\u003cp\u003eSome assemblies, particularly from RNA of liver-stage and many mosquito-stage samples, showed both poor BUSCO quality and a low total number of \u003cem\u003epir\u003c/em\u003e mRNA transcripts, highlighting the difficulty of obtaining enough parasite RNA for next-generation sequencing from samples containing only a small proportion of parasite material (see\u0026nbsp;Figure 2A). However, there were assemblies from multiple sporozoite samples with acceptable BUSCO quality but still with no \u003cem\u003epir\u0026nbsp;\u003c/em\u003egenes detected, suggesting that \u003cem\u003epir\u003c/em\u003e genes were not transcribed in sporozoites (see\u0026nbsp;Figure 2B). The only mosquito-stage samples which showed evidence of \u003cem\u003epir\u0026nbsp;\u003c/em\u003eexpression were the blood-meal samples of \u0026lsquo;MosqPeru23\u0026rsquo;, which are discussed in more detail at the end of the results.\u003c/p\u003e\n\u003cp\u003eFrom the asexual stages of \u003cem\u003eP. 
vivax\u003c/em\u003e, hundreds of \u003cem\u003epir\u003c/em\u003e transcripts were successfully identified, demonstrating the feasibility of the \u003cem\u003ede novo\u003c/em\u003e method to extract \u003cem\u003epir\u003c/em\u003e transcripts from the RNAseq reads themselves.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eDe novo\u003c/em\u003e\u003cem\u003e-\u003c/em\u003eassembled\u003cem\u003e\u0026nbsp;\u003c/em\u003e\u003cem\u003epir\u003c/em\u003e genes can be grouped into sub-families with \u003cem\u003epir\u003c/em\u003es from \u003cem\u003eP. vivax\u003c/em\u003e reference genomes and related species.\u003c/p\u003e\n\u003cp\u003eThe sub-families of the \u003cem\u003ePlasmodium vivax pir\u003c/em\u003e genes were first defined by [8], and substantially updated by [7]. These sub-families have since been found throughout the \u003cem\u003epir\u003c/em\u003e repertoires of the \u003cem\u003eP. vivax\u003c/em\u003e reference genomes P01 and W1 [19, 20],\u0026nbsp;as well as in other \u003cem\u003eP. vivax\u003c/em\u003e genome assemblies, and even show some overlap with the \u003cem\u003epir\u003c/em\u003es of other species [16]. We sought to determine whether these sub-families describe all of the sequence diversity present in the \u003cem\u003ede novo P. vivax pir\u0026nbsp;\u003c/em\u003etranscripts. To this end a BLAST similarity network [34] was constructed using the \u003cem\u003ede novo\u0026nbsp;\u003c/em\u003esequences, the \u003cem\u003epir\u003c/em\u003es of the Sal1, P01 and W1 \u003cem\u003eP. vivax\u003c/em\u003e reference genomes, and the \u003cem\u003epir\u003c/em\u003es of the reference genomes for the closely related species \u003cem\u003eP. knowlesi\u003c/em\u003e, \u003cem\u003eP. vivax-like\u003c/em\u003e, \u003cem\u003eP. ovale, P. malariae, P. cynomolgi,\u003c/em\u003e \u003cem\u003eP. coatneyi\u003c/em\u003e, and \u003cem\u003eP. brasilianum\u003c/em\u003e [16, 35\u0026ndash;40]. 
A highly interconnected network was produced; however, clusters could still be resolved (see Figure 4).\u003c/p\u003e\n\u003cp\u003eThe clustering algorithm MCL [41] was used to identify sub-family clusters in the \u003cem\u003epir\u003c/em\u003e gene similarity network (consisting of reference genome, \u003cem\u003ede novo\u003c/em\u003e, and non-\u003cem\u003evivax\u003c/em\u003e \u003cem\u003epir\u003c/em\u003es), and these were compared to the previously defined sub-families from \u003cem\u003eP. vivax\u003c/em\u003e Sal1. Overall, the existing sub-family designations aligned well with the MCL clustering (see Figure 3). However, some clusters were not ascribed to any existing sub-families (e.g., cluster 10), and other existing sub-families were split between different groups (for example, sub-family E was split between clusters 1, 3, 7, 9, 15 and 19). Hence, an updated nomenclature was required. We propose simply naming the groups in order of size, appending the Sal1 name where members of the old sub-family were part of the cluster (new names shown as part of the BLAST network in Figure 4). As an example, the largest cluster, which has many E sub-family sequences from the Sal1 genome, was termed 1_E by this method. The fourth largest cluster, which overlaps with C sequences from Sal1, was called 4_C. This 4_C grouping contains the highly conserved ancestral \u003cem\u003epir\u003c/em\u003e sequence, which has orthologs across all species with canonical \u003cem\u003epir\u003c/em\u003e members in their genomes and tends to be particularly highly transcribed. The largest cluster without any assigned Sal1 sequence (although it did contain unassigned Sal1 sequences) was cluster 10, demonstrating further how well the old definitions cover newer assemblies of \u003cem\u003epir\u003c/em\u003es.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eThe \u003cem\u003eP. 
vivax\u003c/em\u003e sub-families show overlap with the \u003cem\u003epir\u003c/em\u003e repertoires of other species (see\u0026nbsp;Supplementary Figure 3), as previously observed [7, 42]. The smallest amount of cross-species sub-family sharing was with \u003cem\u003eP. coatneyi\u0026nbsp;\u003c/em\u003eand \u003cem\u003eP. knowlesi\u003c/em\u003e, with only the 2_I (the largest sub-family in both\u003cem\u003e\u0026nbsp;P. coatneyi\u0026nbsp;\u003c/em\u003eand \u003cem\u003eP. knowlesi\u003c/em\u003e), 4_C (including the ancestral sequence), and 8_I families being shared between them and \u003cem\u003eP. vivax\u003c/em\u003e (see\u0026nbsp;Figure 5). More \u003cem\u003eP. vivax\u003c/em\u003e sub-families were shared with \u003cem\u003eP. malariae\u0026nbsp;\u003c/em\u003eand \u003cem\u003eP. brasilianum\u003c/em\u003e; however, almost all sub-families were shared between \u003cem\u003eP. vivax\u003c/em\u003e, \u003cem\u003eP. cynomolgi, P. ovale,\u003c/em\u003e and \u003cem\u003eP. vivax-like\u003c/em\u003e. The currently available \u003cem\u003eP. vivax-like\u003c/em\u003e genome, in contrast to that of every other species included, does not have representatives of group 8_I. Notably, the sub-families 21_B, 17, 22, 23, and 25 were found in most of the \u003cem\u003eP. vivax\u003c/em\u003e reference genomes but not among the \u003cem\u003ede novo pir\u003c/em\u003es. It is unlikely that the pipeline is failing to capture these sub-families from the expression data, as running simulated RNAseq data of the reference genome \u003cem\u003epir\u003c/em\u003es through the software did not suggest that these sub-families were systematically missed (see\u0026nbsp;Supplementary Figure 4). 
This supports the assertion that these groups were expressed at low levels (if at all) and would not be found in RNAseq data.\u003c/p\u003e\n\u003cp\u003eUsing the larger repertoire of \u003cem\u003epir\u003c/em\u003e sequences obtained by \u003cem\u003ede novo\u0026nbsp;\u003c/em\u003eassembly we have refined the sub-family definitions for \u003cem\u003epir\u003c/em\u003es from \u003cem\u003eP. vivax\u003c/em\u003e and related species. Overall, the clusters match the previously defined reference sub-families with minor changes.\u003c/p\u003e\n\u003ch2\u003eTranscription of the \u003cem\u003ede novo\u003c/em\u003e \u003cem\u003epir\u003c/em\u003e genes across geographies\u0026nbsp;\u003c/h2\u003e\n\u003cp\u003eThe new sub-family definitions were used to evaluate whether groups of \u003cem\u003eP. vivax\u003c/em\u003e \u003cem\u003epir\u003c/em\u003e genes were differentially transcribed across isolates from distant geographical locations. For this, the RNAseq quantification (alignment and feature counting of the original reads to the \u003cem\u003ede novo\u0026nbsp;\u003c/em\u003eassembly) was compared between isolates. Broadly speaking, the alignment and quantification of the \u003cem\u003ede novo\u003c/em\u003e \u003cem\u003epir\u003c/em\u003e transcription matched the results from the original studies, such as the pattern of transcription across the asexual cycle in Zhu et al., 2016 [28] (AsexThai16 - see\u0026nbsp;Supplementary Figure 6), and the relative similarity of transcription between parasites obtained from two different species of splenectomised hosts in Gunalan et al., 2019 [43] (AsexSal119 - see\u0026nbsp;Supplementary Figure 7). In\u0026nbsp;Figure 6\u0026nbsp;we show the gene expression (TPM proportions) of the \u003cem\u003epir\u003c/em\u003e sub-families in each isolate (for samples with higher \u003cem\u003epir\u003c/em\u003e numbers, for the sake of visualisation), and how this relates to BUSCO score and geographical origin. 
Most \u003cem\u003epir\u003c/em\u003e sub-families were present across these samples, and the larger groups always had representatives in the \u003cem\u003ede novo\u003c/em\u003e transcriptomes. At this level the main distinguisher between the samples was the number of \u003cem\u003epir\u003c/em\u003e sub-families present, as many samples with lower BUSCO scores had \u003cem\u003epir\u003c/em\u003e sub-family gaps. The proportion of \u003cem\u003epir\u003c/em\u003e numbers demonstrates a similar pattern (see\u0026nbsp;Supplementary Figure 5). Overall, the largest sub-family, 1_E, tends to dominate \u003cem\u003eP. vivax\u003c/em\u003e expression profiles, followed by 4_C (the sub-family that includes the ancestral \u003cem\u003epir\u003c/em\u003e gene) and 9_E. The largest sub-family not defined in the reference Sal1 strain (\u0026lsquo;new\u0026rsquo; sub-families), cluster 10, did not appear to be transcribed very much at all. Other newly defined sub-families, 15 and 19, were transcribed more highly throughout the datasets, but still at a low level compared to the most highly expressed clusters.\u003c/p\u003e\n\u003cp\u003eThere were few clear associations between \u003cem\u003epir\u003c/em\u003e sub-families and geographical locations. Hierarchical clustering separated most of the Peruvian transcriptomes (yellow rows in Figure 6) into a distinct group, with a higher proportion of the sub-families 5_G, 2_I, and 8_I. However, this was not fully consistent across the Peruvian samples, and these sub-families were also highly expressed in some Cambodian samples. Indeed, a negative binomial model, considering BUSCO score, sub-family membership, geographic origin, and stages of the life cycle, demonstrated that there was no statistically significant linear relationship between Peruvian origin and any \u003cem\u003epir\u003c/em\u003e sub-family. 
The only geographical associations that arose from these statistical tests (adjusted p \u0026lt; 0.05) were for sub-families transcribed at low levels, and the effect size was so small that the relevance was questionable. Hence, we concluded that transcription of sub-families was consistent across \u003cem\u003eP. vivax\u003c/em\u003e isolates from disparate geographical regions.\u0026nbsp;\u003c/p\u003e\n\u003ch2\u003e\u003cem\u003eDe novo pir\u003c/em\u003e assemblies suggest \u003cem\u003epir\u003c/em\u003e transcription at the ookinete stage\u0026nbsp;\u003c/h2\u003e\n\u003cp\u003eThe analysis of most mosquito samples indicated that there were few or no \u003cem\u003epir\u003c/em\u003e genes transcribed in this part of the parasite lifecycle. The few mosquito stages from which we assembled \u003cem\u003epir\u003c/em\u003e transcripts (the MosqPeru23 samples) were taken from mosquito midgut bloodmeals at different times after feeding on infected Peruvian patients. From this experiment we included three different isolates (originating from three individual patients) and these three \u003cem\u003ede novo\u003c/em\u003e assemblies all had \u0026gt;50% Complete BUSCOs and included \u0026gt;20 predicted \u003cem\u003epir\u003c/em\u003es. We investigated these samples in more detail by analysing how \u003cem\u003epir\u003c/em\u003e TPM changed across the time-course of the experiment. Given that the mosquito bloodmeal would initially include surviving asexual stages, the presence of \u003cem\u003epir\u003c/em\u003e transcripts could be explained as a holdover from these cells instead of the parasite\u0026rsquo;s specific mosquito stages (beginning with gametes). If the asexual stages were the explanation for this signal, then we would expect the gene expression to dominate earlier post-feed and then diminish over time. 
Instead, \u003cem\u003epir\u003c/em\u003e transcription was minimal during the initial time points, but peaked at 22 and 26 hours post-bloodmeal, although the signal had vanished by 7 days (Figure 7). This was compatible with the time of ookinete development of \u003cem\u003eP. vivax\u003c/em\u003e [44], providing evidence for the presence of \u003cem\u003epir\u003c/em\u003e transcripts in the mosquito stages of this species. Few of these mosquito-specific \u003cem\u003epir\u003c/em\u003es had 99% BLAST matches to any reference genome \u003cem\u003epir\u003c/em\u003es, hence mapping to the reference genomes may have missed these signals. Overall, the TPM expression was reasonably low (\u0026lt;40 TPM) but the peak was consistent across each of the three separate isolates (see Figure 7 and\u0026nbsp;Supplementary Figure 8). An alternative explanation for this observed timing of \u003cem\u003epir\u003c/em\u003e transcription could be that the later time points simply have a higher number of total reads, so that the \u003cem\u003epir\u003c/em\u003es of the overall assembly are extracted only from these reads and missed in lower-read samples. The total counts of reads aligned to the \u003cem\u003ede novo\u003c/em\u003e transcriptomes, however, were not necessarily higher in the 22/26h time points compared to others (indeed, for isolate P2 the total number of aligned reads was lower for 22h than it was at the 1h or 6h time points), so this is unlikely to be the explanation. The highest expression was concentrated within 9_E \u003cem\u003epir\u003c/em\u003es, so future research could be conducted to ascertain whether there is a role of this sub-family in parasite-mosquito interactions.\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eThe natural diversity of \u003cem\u003eP. vivax\u003c/em\u003e and its \u003cem\u003epir\u003c/em\u003e multigene family make these loci particularly difficult to study. 
Gene expression studies, for example, are frustrated by the reference genomes for \u003cem\u003eP. vivax\u003c/em\u003e only possessing some of the same \u003cem\u003epir\u003c/em\u003e sequences as wild isolates. To circumvent this problem and allow the study of \u003cem\u003epir\u003c/em\u003e gene expression from a diverse collection of \u003cem\u003eP. vivax\u003c/em\u003e isolates, we employed \u003cem\u003ede novo\u003c/em\u003e transcriptome assemblers to find the \u003cem\u003epir\u003c/em\u003e sequences from the RNAseq data themselves. A bespoke Nextflow pipeline [45] which employed the Trinity [46] and SPAdes assemblers [33, 47], and combined their outputs using EvidentialGene [30, 31], was successful in generating \u003cem\u003eP. c. chabaudi\u003c/em\u003e \u003cem\u003epir\u003c/em\u003e transcripts, and was therefore used in this study. Based on the reference genomes, we would expect an upper ceiling of ~1000 \u003cem\u003epir\u003c/em\u003es from the assembly of each isolate, although the highest number of recoverable transcripts will be lower as some of these \u003cem\u003epir\u003c/em\u003es will not have enough coverage in the read data. With our pipeline, we found up to ~400 \u003cem\u003epir\u003c/em\u003e transcripts expressed by any one isolate, and thousands in total across the data, covering a range of life cycle stages, geographical sources, and \u003cem\u003epir\u003c/em\u003e sub-families. From sequence similarity networks of these \u003cem\u003ede novo\u003c/em\u003e transcripts, we refined the \u003cem\u003epir\u003c/em\u003e sub-family definitions and demonstrated that they are generally transcribed in similar patterns in isolates worldwide. Gene expression of the \u003cem\u003epir\u003c/em\u003es across the life cycle re-affirmed a burst of transcription in the mosquito stages of development previously seen in \u003cem\u003eP. 
berghei\u003c/em\u003e, suggesting a role for this gene family in the vector of human-infecting parasites.\u003c/p\u003e\n\u003cp\u003eUsing the BUSCO [32] score to quantify the number of \u003cem\u003ePlasmodium sp.\u003c/em\u003e-conserved transcripts in each assembly (as a metric of overall quality), it was clear that the number of \u003cem\u003epir\u003c/em\u003e transcripts being produced was strongly related to the completeness of the overall \u003cem\u003ede novo\u003c/em\u003e transcriptome. The numbers of transcribed \u003cem\u003epir\u003c/em\u003es showed no evidence of plateauing even with high-quality assemblies, so it is likely that the total \u003cem\u003eP. vivax\u003c/em\u003e expressed \u003cem\u003epir\u003c/em\u003e repertoire was underestimated. Hundreds of transcribed \u003cem\u003epir\u003c/em\u003es may not be assembled by this method because of limitations of the data and tools, and/or because the missing \u003cem\u003epir\u003c/em\u003es themselves were transcribed at too low a level to be picked up. \u003cem\u003ePir\u003c/em\u003es could also be missed by the hidden Markov model or be excluded by the model coverage threshold. The number of \u003cem\u003epir\u003c/em\u003e sub-families found to be expressed in each assembly is also dependent on the sample quality and sequencing depth, meaning that some families may be absent due to the lack of detected reads even if they are biologically still present. This presents a limitation for concluding whether sub-families are present in given life-cycle stages or geographical regions.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eComparison of the transcriptome assembly tools threw up some further surprises. Even though rnaSPAdes was developed specifically for transcriptomes, its sister implementation SPAdes (using the option developed for single-cell genome assembly) gave similar results. 
A comparison by Holzer \u0026amp; Marz, 2019, between many assemblers, observed the comparable and sometimes even superior performance of SPAdes (single-cell) across multiple metrics, so the single-cell specific algorithm does improve transcript assembly in certain contexts. Trinity is often the default choice for bioinformaticians, including in some of the original studies included in this analysis [27, 28], however it was much less effective for the assembling of \u003cem\u003epir\u003c/em\u003e genes than the SPAdes tools. Trinity also produced the lowest BUSCO scores, however the decline in assembly quality was small, while the decline in \u003cem\u003epir\u0026nbsp;\u003c/em\u003etranscript recovery was stark. Since \u003cem\u003epir\u003c/em\u003es are generally transcribed at low levels, the scant reads from these transcripts may have been removed by Trinity\u0026rsquo;s initial k-mer ranking and filtering algorithm. A future execution of this workflow could include more \u003cem\u003ede novo\u003c/em\u003e transcriptome assemblers to potentially expand and improve upon the repertoire of \u003cem\u003epir\u003c/em\u003es detected. Each assembler has its own strengths and weaknesses and can work better for certain species over others [48\u0026ndash;55], so one stands to gain further transcriptional insight by including more tools like Bridger [56], Trans-ABySS [57], and SOAPdenovo-trans [58], among others. EvidentialGene worked well to create a meta-assembly from the outputs of the three individual assemblers, proving the usefulness of its algorithms in such a pipeline. This software is not as fully documented as many other pieces of bioinformatic programming, and it would benefit from more expansive explanations.\u003c/p\u003e\n\u003cp\u003eMuch of our understanding of putative \u003cem\u003epir\u003c/em\u003e functions and transcriptional kinetics comes from rodent-infecting parasites. Consistent evidence from \u003cem\u003eP. 
berghei,\u003c/em\u003e across multiple stages of the life cycle and different experiments, shows some expression in the mosquito (especially at the oocyst stage) and increasing expression upon entering the mammalian host [18, 29, 59, 60]. Here we found, from one experiment (MosqPeru23), that around 22 hours after \u003cem\u003eP. vivax\u003c/em\u003e-infected bloodmeal uptake by the mosquito there is a small but consistent surge in \u003cem\u003epir\u003c/em\u003e transcripts, coinciding with the conversion of zygotes to ookinetes. A signal from the 9_E sub-family was notably strong at this time-point, suggesting that its members may have a role in the mosquito. To our knowledge this is the first indication that ookinetes of a human-infective \u003cem\u003ePlasmodium\u003c/em\u003e species express \u003cem\u003epir\u003c/em\u003e genes, following on from the first mosquito-stage \u003cem\u003epir\u003c/em\u003e transcription identified in the \u003cem\u003eP. berghei\u003c/em\u003e malaria cell atlas [60]. Together these data suggest that some \u003cem\u003epir\u003c/em\u003e genes may have a role in the mosquito vector.\u003c/p\u003e\n\u003cp\u003eWe were not able to gain insight into the \u003cem\u003epir\u003c/em\u003e transcriptomes of certain life-cycle stages, such as liver stages, due to a lack of appropriate data. Gural et al., 2018 [61], had to contend with a massive abundance of host mRNA compared to parasite mRNA, and so they employed hybrid capture sequences to enrich for the \u003cem\u003eP. vivax\u003c/em\u003e transcripts. The capture sequences were based on the P01 reference genome, so it is unlikely that the sequencing results would have contained novel \u003cem\u003epir\u003c/em\u003e sequences for extraction. For the murine-infective parasite \u003cem\u003eP. 
berghei\u003c/em\u003e the malaria cell atlas [60] has shown trophozoite-like \u003cem\u003epir\u003c/em\u003e transcription in the liver stages, and [59] have shown that PIR proteins are expressed in the late liver stages, suggesting that the genes may also play a role in the exo-erythrocytic stages of other malaria parasites and represent an important avenue of future work.\u003c/p\u003e\n\u003cp\u003e\u003cem\u003ePir\u003c/em\u003e genes of both \u003cem\u003eP. berghei\u003c/em\u003e and \u003cem\u003eP. chabaudi\u003c/em\u003e demonstrate especially high transcription during the asexual blood stages; however, \u003cem\u003eP. berghei\u003c/em\u003e shows greater sexual dimorphism of \u003cem\u003epir\u003c/em\u003e transcription (with male gametocytes enriched for \u003cem\u003epir\u003c/em\u003e expression) [62]. \u003cem\u003eP. chabaudi\u003c/em\u003e infections in mice particularly demonstrate the importance of the gene family in the rodent asexual stages, providing evidence of an association between \u003cem\u003epir\u003c/em\u003e transcription and the virulence of infection or chronic recrudescence [11, 12]. We can make comparisons between rodent \u003cem\u003epir\u003c/em\u003e transcription and the \u003cem\u003eP. vivax\u003c/em\u003e life-cycle stages that gave good quality assemblies. The data of [28] (AsexThai16), which sampled across the intra-erythrocytic development cycle (IDC) of \u003cem\u003eP. vivax\u003c/em\u003e, suggest that although \u003cem\u003epir\u003c/em\u003e transcription is unimodal over the cycle, the timing of this peak within the asexual cycle is different for \u003cem\u003eP. vivax\u003c/em\u003e and \u003cem\u003eP. c. chabaudi\u003c/em\u003e [29]; a result observable in the original paper and verified here (Supplementary Figure 6). For \u003cem\u003eP. c. 
chabaudi\u0026nbsp;\u003c/em\u003ethe peak of \u003cem\u003epir\u0026nbsp;\u003c/em\u003etranscription occurs around the time of schizont differentiation, while for \u003cem\u003eP. vivax\u003c/em\u003e the peak is around schizont bursting and ring-stage formation. Microarray comparisons between multiple \u003cem\u003ePlasmodium sp.\u0026nbsp;\u003c/em\u003eusing 1-1 orthologs found that the genes with the most variation in timing across the genus were enriched for those transcribed primarily during the early ring and early schizont stages, so it may be common for IDC genes to show altered patterns of expression [63]. The interpretation of the \u003cem\u003eP. vivax\u003c/em\u003e IDC data is complicated by the fact that they were generated from \u003cem\u003ein vitro\u003c/em\u003e culture, which is especially problematic at the 48h time-points when toxicity of the culture could be causing artefacts (\u003cem\u003eP. vivax\u003c/em\u003e cannot be cultured long-term [64\u0026ndash;66]). Investigations from natural infections are impeded by the asynchronous nature of the \u003cem\u003eP. vivax\u003c/em\u003e IDC. Single-cell RNAseq provides the best opportunity to gain access to the individual stages of the IDC. Sequencing of sexual stages would also be of interest, since the mouse-infective species \u003cem\u003eP. berghei\u003c/em\u003e and \u003cem\u003eP. c. chabaudi\u003c/em\u003e show contrasting patterns of \u003cem\u003epir\u0026nbsp;\u003c/em\u003etranscription between sexes; both show enriched expression in males, although only \u003cem\u003eP. c. chabaudi\u003c/em\u003e has a female gametocyte-specific subset too [62]. Although \u003cem\u003epir\u003c/em\u003e transcription is conserved across malaria species, the exact timing may differ between simian- and mouse-infective clades.\u003c/p\u003e\n\u003cp\u003eWe used a BLAST similarity network to define \u003cem\u003eP. 
vivax\u003c/em\u003e sub-families, showing that the groupings previously annotated before high-throughput sequencing broadly still describe the community structure of gene sequences found in isolates obtained worldwide. This was already suggested by existing reference genome assemblies, which originated from different countries but all had overlapping sequences. Nonetheless, our similarity network suggested that some definitions needed updating, such as the largest sub-family E which is better described as multiple sub-families. Clusters on the network (corresponding to sub-families) of note include the largest sub-family 1_E, which tends to dominate \u003cem\u003eP. vivax\u003c/em\u003e expression profiles, and 4_C, the sub-family that includes the ancestral \u003cem\u003epir\u003c/em\u003e gene for each isolate [21]. It is curious that the ancestral \u003cem\u003epir\u003c/em\u003e is part of a greater sub-family instead of on a lone lineage, as is observed in phylogenies of the murine-infective \u003cem\u003ePlasmodium sp.\u003c/em\u003e \u003cem\u003epir\u003c/em\u003es. Similar to other species the ancestral \u003cem\u003epir\u003c/em\u003e is often highly transcribed, however many \u003cem\u003ede novo\u0026nbsp;\u003c/em\u003eassemblies did not include the sequence. This could be because the assembly software did not \u0026lsquo;find\u0026rsquo; this transcript among the RNAseq reads. Only two of the \u003cem\u003eP. c. chabaudi\u003c/em\u003e \u003cem\u003ede novo\u003c/em\u003e assemblies did not give rise to \u0026gt;95% identity BLAST matches of the ancestral \u003cem\u003epir\u003c/em\u003e, and these were both from the outputs of individual assembler tools (all the combined assemblies assembled accurate full-length ancestral transcripts). Transcription of the ancestral \u003cem\u003epir\u003c/em\u003e may be relatively lower in \u003cem\u003eP. vivax\u003c/em\u003e compared to other species, including \u003cem\u003eP. c. 
chabaudi\u0026nbsp;\u003c/em\u003e[29]. If the ancestral \u003cem\u003epir\u003c/em\u003e is downregulated in \u003cem\u003eP. vivax\u003c/em\u003e compared to the rodent malaria parasites, this could be due to functional redundancy from the other similar members of the 4_C \u003cem\u003evivax\u003c/em\u003e \u003cem\u003epir\u003c/em\u003e sub-family. Other pre-defined sub-families showed persistently low transcription/presence in the assemblies, including cluster 11 (K), cluster 12 (J), and cluster 21 (B). These \u003cem\u003epir\u003c/em\u003es are likely un-expressed and may only be found through genome sequencing.\u003c/p\u003e\n\u003cp\u003eSince this study used RNAseq data, we tested whether the distribution of \u003cem\u003epir\u0026nbsp;\u003c/em\u003esub-families across geographical localities held true for \u003cem\u003epir\u003c/em\u003es that are actually transcribed, and indeed it did. Even when statistical tests were conducted to test for differences in \u003cem\u003epir\u003c/em\u003e sub-family transcription across the world, only small associations of questionable relevance were found. Evidence suggests that, with a few exceptions, the genotypes of wild \u003cem\u003eP. vivax\u003c/em\u003e isolates tend to cluster by their geographic source [67]. A particularly recent founder effect can be found in the American populations of \u003cem\u003eP. vivax\u003c/em\u003e, which likely derive from European colonization\u0026nbsp;[67, 68]. There does not appear to be any reflection of these geographically restricted clades in the data presented here. If we assume that the spread of \u003cem\u003eP. vivax\u003c/em\u003e strains through human movement has not overwritten regional distinctions, this result suggests that the sub-families of the \u003cem\u003epir\u003c/em\u003es are relatively unchanged since the last common ancestor of \u003cem\u003eP. 
vivax\u003c/em\u003e, and hence either they have a purpose that the parasite needs to preserve, or not enough time has passed for them to significantly diverge. Estimates of the timing of the most recent common ancestor of \u003cem\u003eP. vivax\u003c/em\u003e range from around 50,000 to 300,000 years ago [24, 69\u0026ndash;72], a relatively short evolutionary time. However, the genetic variation of multigene families like the \u003cem\u003epir\u003c/em\u003es should lead to accelerated change and loss/gain of sub-family copy numbers or transcriptional rates. Expression differences of individual \u003cem\u003epir\u003c/em\u003e genes between samples from different continents have been observed before when aligning to the P01 genome \u003cem\u003epir\u003c/em\u003es [73]; however, we suggest that this does not extend to the overall sub-families themselves. New \u003cem\u003eP. vivax\u003c/em\u003e genomes/transcriptomes from different regions of the world continue to be sequenced, offering opportunities to incorporate them into the pipeline and challenge this conclusion [73\u0026ndash;75]. Some isolates of interest may be found from patients on the China-Myanmar border in the upper Mekong, which have previously been shown to have a particularly large C sub-family [76]. Only a single Papua New Guinean sample was available for this study; however, it showed a distinct profile and was an outlier in the relationship between BUSCO score and \u003cem\u003epir\u003c/em\u003e number (having a high abundance of \u003cem\u003epir\u003c/em\u003es despite a relatively incomplete assembly). Analysis of orthologue diversity across \u003cem\u003eP. vivax\u003c/em\u003e isolates from multiple countries demonstrated that PNG parasites were particularly diverse compared to those from other regions [67], so it is plausible that this variation is reflected in multigene families too. 
Deep RNA-sequencing and \u003cem\u003ede novo\u003c/em\u003e assembly of PNG-sourced \u003cem\u003eP. vivax\u003c/em\u003e transcriptomes could shed light on whether the \u003cem\u003epir\u003c/em\u003es of these pathogens are unique to the country.\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eDe novo\u0026nbsp;\u003c/em\u003etranscriptomics can be employed across diverse \u003cem\u003eP. vivax\u003c/em\u003e RNAseq datasets in order to observe the transcription of \u003cem\u003epir\u003c/em\u003e sequences missing from the reference genomes. This method unveiled a refined sub-family structure and demonstrated that these sub-families are expressed in dispersed parasite isolates around the world, suggesting that they have functions that lead to their retention. Although \u003cem\u003epir\u003c/em\u003es were previously thought to have no role in mosquito stages, evidence now exists from both \u003cem\u003eP. berghei\u003c/em\u003e and \u003cem\u003eP. vivax\u003c/em\u003e that there is indeed \u003cem\u003epir\u003c/em\u003e gene expression in the vector, opening new opportunities to understand this gene family.\u003c/p\u003e"},{"header":"Materials and Methods","content":"\u003cp\u003eExperimental information from Carlos et al., unpublished (MosqPeru23)\u003c/p\u003e \u003cp\u003eFemale \u003cem\u003eAn. darlingi\u003c/em\u003e mosquitoes were fed on three patients (P2, P5, and P6) from Iquitos, Peru, with active \u003cem\u003eP. vivax\u003c/em\u003e infections as determined from blood smears. At each timepoint post-bloodmeal and for each patient isolate, 30\u0026ndash;45 midguts were dissected and the RNA processed for Illumina NextSeq 500 sequencing.\u003c/p\u003e \u003cp\u003e \u003cem\u003ePlasmodium spp.\u003c/em\u003e data download\u003c/p\u003e \u003cp\u003e \u003cem\u003eP. vivax\u003c/em\u003e datasets were downloaded using the FetchNGS pipeline v1.5 available from nf-core using Nextflow v22.10.3 and Singularity v3.6.4. 
Default settings were used except for \u0026lsquo;--force_sratools_download\u0026rsquo; to use SRA-toolkit for the download, and an additional custom config file (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/timslittle/Thesis_Github_221023/blob/main/custom.config\u003c/span\u003e\u003cspan address=\"https://github.com/timslittle/Thesis_Github_221023/blob/main/custom.config\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e) supplied via \u0026lsquo;-c\u0026rsquo; to permit the SRA-toolkit function \u0026lsquo;prefetch\u0026rsquo; to use larger files than default. The configuration profile used for all Nextflow pipelines was \u0026lsquo;Singularity\u0026rsquo; and a locally maintained Crick profile (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/nf-core/configs/blob/master/docs/crick.md\u003c/span\u003e\u003cspan address=\"https://github.com/nf-core/configs/blob/master/docs/crick.md\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e). The exceptions were the Carlos et al. (manuscript in preparation) samples and the \u003cem\u003eP. c. chabaudi\u003c/em\u003e test samples, which were downloaded directly from lab storage, and the [\u003cspan citationid=\"CR77\" class=\"CitationRef\"\u003e77\u003c/span\u003e] samples, which were downloaded from cloud storage (kindly provided by the authors).\u003c/p\u003e \u003cp\u003eAt this point files were concatenated together if they constituted the same isolate of \u003cem\u003eP. vivax\u003c/em\u003e, to maximise the amount of information for the transcriptome assemblers, but minimise the erroneous assembly of transcripts from stitching reads of different sources. Where data source was unclear from the original publication, the samples were kept separate for assembly, such as for [\u003cspan citationid=\"CR78\" class=\"CitationRef\"\u003e78\u003c/span\u003e]. 
See Supp File 1 for the full list of files concatenated together and the rationale for this.\u003c/p\u003e \u003cp\u003e \u003cem\u003eDe novo\u003c/em\u003e transcriptome assembly pipeline with Nextflow\u003c/p\u003e \u003cp\u003eThe pipeline for \u003cem\u003ede novo\u003c/em\u003e assembly construction, \u0026lsquo;transcript_corral\u0026rsquo; (development version 1, commit: 6b93401, \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/timslittle/nf-core-transcriptcorral/tree/6b93401496098d2759e875a7611cf0eb1b4268c4\u003c/span\u003e\u003cspan address=\"https://github.com/timslittle/nf-core-transcriptcorral/tree/6b93401496098d2759e875a7611cf0eb1b4268c4\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan type=\"Underline\" class=\"Underline\" name=\"Emphasis\"\u003e)\u003c/span\u003e was written with Nextflow v22.10.3 and nf-core/tools v2.7.dev0, based on an early design and concept by Matthew McGowan. The pipeline was submitted with the parameters: \u0026lsquo;-profile singularity,crick --skip_trimming false --remove_ribo_rna true --filter_genome [/path/to/genome_to_filter.fa] --assemble_trinity true --assemble_spades_sc true --assemble_spades_rna true --use_evigene true --hmmsearch_hmmfile [/path/to/Pfam-Plasmodium_Vir.hmm] --busco_lineage 'plasmodium_odb10' --salmon_alignment true --salmon_gtf false\u0026rsquo;. In summary, the pipeline begins by concatenating files that require concatenation (see above), running Trim_Galore! v0.6.7 (which uses cutadapt v3.4) to trim adapter sequences, and running FastQC (v0.11.9) for quality analysis. Reads within the RNAseq datasets that align to potential contaminant genomes were then removed using HISAT2 v2.1.0 with parameters \u0026lsquo;-q -x --un-conc-gz\u0026rsquo;, with the last parameter saving the sequences which do not align concordantly to the reference [\u003cspan citationid=\"CR79\" class=\"CitationRef\"\u003e79\u003c/span\u003e]. 
The contaminant genome was a concatenation of the \u003cem\u003eMus musculus\u003c/em\u003e genome (due to the mouse fibroblasts in the [\u003cspan citationid=\"CR61\" class=\"CitationRef\"\u003e61\u003c/span\u003e] system), the \u003cem\u003eHomo sapiens\u003c/em\u003e genome GRCh38 [\u003cspan citationid=\"CR80\" class=\"CitationRef\"\u003e80\u003c/span\u003e], the mosquito \u003cem\u003eAnopheles dirus\u003c/em\u003e WRAIR2 AdirW1 genome (due to the mosquito-sourced samples of [\u003cspan citationid=\"CR81\" class=\"CitationRef\"\u003e81\u003c/span\u003e]) [\u003cspan citationid=\"CR82\" class=\"CitationRef\"\u003e82\u003c/span\u003e, \u003cspan citationid=\"CR83\" class=\"CitationRef\"\u003e83\u003c/span\u003e] and \u003cem\u003eAnopheles darlingi\u003c/em\u003e AdarC3 (due to the Carlos et al., unpublished, samples) [\u003cspan citationid=\"CR84\" class=\"CitationRef\"\u003e84\u003c/span\u003e], as well as the primate \u003cem\u003eSaimiri boliviensis\u003c/em\u003e SaiBol1 genome [\u003cspan citationid=\"CR85\" class=\"CitationRef\"\u003e85\u003c/span\u003e] and \u003cem\u003eAotus nancymaae\u003c/em\u003e GCA_000952055.2 genome [\u003cspan citationid=\"CR86\" class=\"CitationRef\"\u003e86\u003c/span\u003e] (the monkeys used in [\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e]). The HISAT2 alignment removal was performed with the relevant parameters altered for the single-end and paired-end stranded \u003cem\u003eP. 
vivax\u003c/em\u003e samples: for the stranded libraries of Kim et al., 2017 and 2019 [\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e, \u003cspan citationid=\"CR87\" class=\"CitationRef\"\u003e87\u003c/span\u003e] samples HISAT2 was performed with \u0026lsquo;--rna-strandness FR\u0026rsquo;; for the single-end libraries of Muller et al., 2019 [\u003cspan citationid=\"CR88\" class=\"CitationRef\"\u003e88\u003c/span\u003e], and Gural et al., 2018 [\u003cspan citationid=\"CR61\" class=\"CitationRef\"\u003e61\u003c/span\u003e], HISAT2 was performed with \u0026lsquo;--un-gz\u0026rsquo; instead of \u0026lsquo;--un-conc-gz\u0026rsquo;. rRNA reads were subsequently removed using sortmerna v4.3.4 [\u003cspan citationid=\"CR89\" class=\"CitationRef\"\u003e89\u003c/span\u003e]. These output sequences were then used for assembly using sc-spades and rna-spades (SPAdes v3.15.4 with options \u0026lsquo;--sc\u0026rsquo; and \u0026lsquo;--rna\u0026rsquo; respectively), and Trinity v2.13.2. The assemblies were combined, and redundant sequences removed, using EvidentialGene version 22may07, which assesses similar sequences for the most optimal representative transcript/peptide [\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e, \u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e]. In brief, CDS sequences are identified and scored, perfect duplicates and fragments are removed, alternative transcripts identified using 98% identity BLAST alignments of coding sequence, and a primary transcript is finally identified. 
To ascertain the quality of the final meta-assemblies they were run through BUSCO v5.4.3 with \u0026lsquo;plasmodium_odb10\u0026rsquo; as the reference, to see how many expected, conserved \u003cem\u003ePlasmodium\u003c/em\u003e genes are recovered [\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e, \u003cspan citationid=\"CR90\" class=\"CitationRef\"\u003e90\u003c/span\u003e]. HMMer v3.3.2 was used to identify the sequences which resemble the Plasmodium_Vir Pfam model (PF05795) [\u003cspan citationid=\"CR91\" class=\"CitationRef\"\u003e91\u003c/span\u003e], then matches with an E-value of less than 1e-3 were extracted [\u003cspan citationid=\"CR92\" class=\"CitationRef\"\u003e92\u003c/span\u003e]. Transcriptional quantification of the samples was used both to detect which of the proposed \u003cem\u003epir\u003c/em\u003e transcripts have RNAseq reads mapping back to them, and to evaluate how the \u003cem\u003epir\u003c/em\u003es are being transcribed. All the original biological replicates (after Trim_Galore!, HISAT2, and sortmerna processing) were aligned to their corresponding assemblies using the transcript-aware aligner Salmon v1.9.0 [\u003cspan citationid=\"CR93\" class=\"CitationRef\"\u003e93\u003c/span\u003e]. The parameters \u0026lsquo;-p 8 --validateMappings -l A\u0026rsquo; were used for all alignments, with \u0026lsquo;-l SR\u0026rsquo; specified instead of \u0026lsquo;-l A\u0026rsquo; for the stranded libraries. Salmon calculates TPM automatically, so these results were used directly to filter the HMMer ORF outputs for the PIR sequences with evidence of transcription (greater than or equal to 1 TPM in at least one sample). 
The pipeline was run in separate batches to reduce load on computational resources at any given time.\u003c/p\u003e \u003cp\u003eTo check whether the pipeline failed to assemble/identify any given sub-families of \u003cem\u003epir\u003c/em\u003e genes, simulated RNAseq reads were generated based on the Sal1, P01, and W1 reference genomes (excluding duplicate sequences) using Rsubread v2.0.1 [\u003cspan citationid=\"CR94\" class=\"CitationRef\"\u003e94\u003c/span\u003e] simReads with \u0026lsquo;paired.end\u0026thinsp;=\u0026thinsp;TRUE\u0026rsquo; and all relative TPMs set to 1 (ensuring that all transcripts are \u0026lsquo;expressed\u0026rsquo; at the same level, controlling for transcript length). The simulated RNAseq fastq files were run through the pipeline as described above, and the output \u003cem\u003epir\u003c/em\u003e transcripts were compared to the reference genome \u003cem\u003epir\u003c/em\u003es using tBLASTn v2.9.0 (\u0026lsquo;-evalue 1e-3 -max_target_seqs 500 -outfmt \"6 std qcovs qcovhsp slen nident\"\u0026rsquo;).\u003c/p\u003e \u003cp\u003eFor the mosquito blood-meal RNAseq results of MosqPeru23 (Carlos et al., manuscript in preparation), the 18h samples were excluded from the final figure because this timepoint consistently showed globally distinct transcriptional profiles with zero \u003cem\u003epir\u003c/em\u003e expression. The reason for this is unclear; it may reflect a biological signal (e.g. a transition time for the parasite between zygote and ookinete forms) or a technical artefact.\u003c/p\u003e \u003cp\u003e \u003cem\u003eP. c. 
chabaudi\u003c/em\u003e AS benchmarking of the \u003cem\u003ede novo\u003c/em\u003e transcriptome method to find \u003cem\u003epir\u003c/em\u003e genes\u003c/p\u003e \u003cp\u003eThe 24h asexual cycle samples from Little and Cunningham et al., 2021 [\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e] were used to test this method for finding \u003cem\u003epir\u003c/em\u003e genes in RNAseq datasets. The Nextflow pipeline was run for each assembler separately, then with all assemblers together to produce a meta-assembly. To remove duplicates from these assemblies they were processed by CD-HIT v4.8.1. To compare the use of EvidentialGene for meta-assembly construction, the meta-assembly was also run through EvidentialGene instead of CD-HIT. Instead of running HMMer and finding \u003cem\u003epir\u003c/em\u003e sequences, the peptide outputs of the pipeline were matched to the known PIR peptides from PlasmoDB v61 using BLAST v2.9.0 with parameters \u0026lsquo;-evalue 1e-3 -num_threads 6 -max_target_seqs 500 -outfmt \"6 std qcovs qcovhsp slen nident\"\u0026rsquo;.\u003c/p\u003e \u003cp\u003eCreating networks of PIRs and sub-family assignment\u003c/p\u003e \u003cp\u003eTo create networks of the PIR ORFs and work out which sub-family they belong to/how well existing definitions describe these PIRs, we downloaded the already published PIR sequences from \u003cem\u003eP. vivax\u003c/em\u003e strains P01, Sal1, and W1, and \u003cem\u003eP. ovale\u003c/em\u003e, \u003cem\u003eP. vivax-like\u003c/em\u003e, \u003cem\u003eP. malariae, P. brasilianum, P. knowlesi, P. coatneyi\u003c/em\u003e, and \u003cem\u003eP. 
cynomolgi\u003c/em\u003e, using a PlasmoDB v62 search for \u0026ldquo;PIR protein\u0026rdquo; or \u0026ldquo;VIR protein\u0026rdquo; in \u0026lsquo;Product Description\u0026rsquo;, as well as a Pfam search for PF05795: Plasmodium_Vir and PF06022: Cir_Bir_Yir Variant antigen (adding the results of both the text and Pfam search together), then finally filtering out pseudogenes. These \u0026lsquo;known\u0026rsquo; PIRs were combined with the \u003cem\u003ede novo\u003c/em\u003e PIRs and BLASTp v2.9.0 was performed between this dataset and itself (\u0026lsquo;-evalue 1e-3 -num_threads 6 -max_target_seqs 500 -outfmt \"6 std qcovs qcovhsp slen nident\"\u0026rsquo;).\u003c/p\u003e \u003cp\u003eMCL clustering was performed with varying inflation values (influencing the number of clusters) of 1.2, 1.4, 1.8, 2, 2.5, 3, 4 and 6 (mcxload parameters: \u0026lsquo;--stream-mirror --stream-neg-log10 -stream-tf 'ceil(200)'\u0026rsquo;; mcl and mcxdump parameters all default) [\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e]. As reported in the results, the best compromise between minimal total number of clusters and minimal mixing of Sal1 sub-family sequences was found with an MCL inflation value of 1.4. From these MCL clusters sub-families were assigned based on the majority Sal1 sub-family sequence in the cluster. If a cluster had no assigned Sal1 sub-family sequence (note that there may be unassigned Sal1 sequences present that were not identified as belonging to any sub-family in the original study) then it was marked as a \u0026lsquo;New\u0026rsquo; sub-family. 
These were then numbered if the assigned name was not unique.\u003c/p\u003e \u003cp\u003eThe network was visualised using Gephi [\u003cspan citationid=\"CR95\" class=\"CitationRef\"\u003e95\u003c/span\u003e] and the OpenOrd layout algorithm [\u003cspan citationid=\"CR96\" class=\"CitationRef\"\u003e96\u003c/span\u003e] with 25% Liquid stage, 100% Expansion stage, 15% Cooldown stage and no Crunch or Simmer stage of the simulated annealing process, 0.8 edge cut, 7 threads, 750 iterations, 0.2 fixed time and a random seed of -9, followed by briefly running the Noverlap algorithm to reduce overlapping nodes with speed set to 3, ratio 0.5 and margin 5.\u003c/p\u003e \u003cp\u003eStatistical analysis and figure production in R\u003c/p\u003e \u003cp\u003eData analysis was conducted using stringr [\u003cspan citationid=\"CR97\" class=\"CitationRef\"\u003e97\u003c/span\u003e], dplyr [\u003cspan citationid=\"CR98\" class=\"CitationRef\"\u003e98\u003c/span\u003e] and data.table [\u003cspan citationid=\"CR99\" class=\"CitationRef\"\u003e99\u003c/span\u003e]; figures were drawn using the ggplot2 [\u003cspan citationid=\"CR100\" class=\"CitationRef\"\u003e100\u003c/span\u003e], ComplexHeatmap [\u003cspan citationid=\"CR101\" class=\"CitationRef\"\u003e101\u003c/span\u003e], circlize [\u003cspan citationid=\"CR102\" class=\"CitationRef\"\u003e102\u003c/span\u003e] and viridis [\u003cspan citationid=\"CR103\" class=\"CitationRef\"\u003e103\u003c/span\u003e] packages in R v3.6.2 [\u003cspan citationid=\"CR104\" class=\"CitationRef\"\u003e104\u003c/span\u003e]. 
The negative binomial model was fitted using the R function \u0026lsquo;glm.nb\u0026rsquo; (provided by the MASS package) with the formula \u0026lsquo;TPM\u0026thinsp;~\u0026thinsp;sub-family * country * lifecycle_stage\u0026thinsp;+\u0026thinsp;BUSCO_score\u0026rsquo;.\u003c/p\u003e"},{"header":"Abbreviations","content":"\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 50%;\"\u003e\n \u003cp\u003e\u003cem\u003epir\u003c/em\u003es\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50%;\"\u003e\n \u003cp\u003e\u003cem\u003ePlasmodium interspersed repeats\u003c/em\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 50%;\"\u003e\n \u003cp\u003eHMM\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50%;\"\u003e\n \u003cp\u003eHidden Markov Model\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 50%;\"\u003e\n \u003cp\u003eRNAseq\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50%;\"\u003e\n \u003cp\u003eHigh-throughput RNA sequencing\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 50%;\"\u003e\n \u003cp\u003eBUSCO\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50%;\"\u003e\n \u003cp\u003eBenchmarking Universal Single-Copy Orthologs\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 50%;\"\u003e\n \u003cp\u003eTPM\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50%;\"\u003e\n \u003cp\u003eTranscripts-per-Million\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 50%;\"\u003e\n \u003cp\u003eIDC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50%;\"\u003e\n 
\u003cp\u003eIntra-erythrocytic Developmental Cycle\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eCode availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe code used in generating figures and performing analysis is available at https://github.com/timslittle/pirDeNovo_2025.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEthics approval and consent to participate\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent for publication\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis work was supported by the Francis Crick Institute which receives its core funding from the UK Medical Research Council (CC2079), Cancer Research UK (CC2079) and the Wellcome Trust (CC2079); JL was a Wellcome Trust Senior Investigator (grant reference WT101777MA); TSL was the recipient of an Imperial College/Francis Crick PhD stipend.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors\u0026apos; contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eConceptualisation: TSL, DAC, GKC, AJR, JL. Analysis: TSL. Supervision: GKC, AJR, JL. Writing\u0026mdash;review \u0026amp; editing: TSL, DAC, AJR, JL.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData Availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eMost datasets analysed in this study were obtained from public databases. 
The MosqPeru23 datasets are available from the European Nucleotide Archive at PRJEB29445.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors are grateful to the original scientists who published the studies re-analysed here, as well as Jayme Souza Neto, Bianca Cechetto Carlos, and Dina Vlachou for permitting us to use the MosqPeru23 data prior to release and publication. Additionally, the authors thank Dr. George Young (MRC Laboratory of Medical Sciences, formerly of The Francis Crick Institute), for his help with bioinformatic analysis.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNone declared.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eWHO. World Malaria Report 2021. Geneva; 2021.\u003c/li\u003e\n\u003cli\u003eBaird JK. African \u003cem\u003ePlasmodium vivax\u003c/em\u003e malaria improbably rare or benign. Trends Parasitol. 2022;0.\u003c/li\u003e\n\u003cli\u003eCulleton R, Ndounga M, Zeyrek FY, Coban C, Casimiro PN, Takeo S, et al. Evidence for the Transmission of \u003cem\u003ePlasmodium vivax\u003c/em\u003e in the Republic of the Congo, West Central Africa. J Infect Dis. 2009;200:1465\u0026ndash;9.\u003c/li\u003e\n\u003cli\u003eKho S, Qotrunnada L, Leonardo L, Andries B, Wardani PAI, Fricot A, et al. Hidden Biomass of Intact Malaria Parasites in the Human Spleen. New England Journal of Medicine. 2021;384:2067\u0026ndash;9.\u003c/li\u003e\n\u003cli\u003eReid AJ. Large, rapidly evolving gene families are at the forefront of host\u0026ndash;parasite interactions in Apicomplexa. Parasitology. 2015;142:S57\u0026ndash;70.\u003c/li\u003e\n\u003cli\u003eJanssen CS, Phillips RS, Turner MR, Barret MP. \u003cem\u003ePlasmodium interspersed repeats\u003c/em\u003e: The major multigene superfamily of malaria parasites. Nucleic Acids Res. 
2004;32:5712\u0026ndash;20.\u003c/li\u003e\n\u003cli\u003eLopez FJ, Bernabeu M, Fernandez-Becerra C, del Portillo HA. A new computational approach redefines the subtelomeric \u003cem\u003evir\u003c/em\u003e superfamily of \u003cem\u003ePlasmodium vivax\u003c/em\u003e. BMC Genomics. 2013;14:8.\u003c/li\u003e\n\u003cli\u003edel Portillo HA, Fernandez-Becerra C, Bowman S, Oliver K, Preuss M, Sanchez CP, et al. A superfamily of variant genes encoded in the subtelomeric region of \u003cem\u003ePlasmodium vivax\u003c/em\u003e. Nature. 2001;410:839\u0026ndash;42.\u003c/li\u003e\n\u003cli\u003eSu X, Heatwole VM, Wertheimer SP, Guinet F, Herrfeldt JA, Peterson DS, et al. The large diverse gene family \u003cem\u003evar\u003c/em\u003e encodes proteins involved in cytoadherence and antigenic variation of \u003cem\u003ePlasmodium falciparum\u003c/em\u003e-infected erythrocytes. Cell. 1995;82:89\u0026ndash;100.\u003c/li\u003e\n\u003cli\u003eFried M, Duffy PE. Adherence of \u003cem\u003ePlasmodium falciparum\u003c/em\u003e to Chondroitin Sulfate A in the Human Placenta. Science. 1996;272:1502\u0026ndash;4.\u003c/li\u003e\n\u003cli\u003eSpence PJ, Jarra W, L\u0026eacute;vy P, Reid AJ, Chappell L, Brugat T, et al. Vector transmission regulates immune control of \u003cem\u003ePlasmodium\u003c/em\u003e virulence. Nature. 2013;498:228\u0026ndash;31.\u003c/li\u003e\n\u003cli\u003eBrugat T, Reid AJ, Lin J, Cunningham D, Tumwine I, Kushinga G, et al. Antibody-independent mechanisms regulate the establishment of chronic \u003cem\u003ePlasmodium\u003c/em\u003e infection. Nat Microbiol. 2017;2:16276.\u003c/li\u003e\n\u003cli\u003eBernabeu M, Lopez FJ, Ferrer M, Martin-Jaular L, Razaname A, Corradin G, et al. Functional analysis of \u003cem\u003ePlasmodium vivax\u003c/em\u003e VIR proteins reveals different subcellular localizations and cytoadherence to the ICAM-1 endothelial receptor. Cell Microbiol. 
2012;14:386\u0026ndash;400.\u003c/li\u003e\n\u003cli\u003eFernandez-Becerra C, Bernabeu M, Castellanos A, Correa BR, Obadia T, Ramirez M, et al. \u003cem\u003ePlasmodium vivax\u003c/em\u003e spleen-dependent genes encode antigens associated with cytoadhesion and clinical protection. Proceedings of the National Academy of Sciences. 2020;:201920596.\u003c/li\u003e\n\u003cli\u003eAnsari HR, Templeton TJ, Subudhi AK, Ramaprasad A, Tang J, Lu F, et al. Genome-scale comparison of expanded gene families in \u003cem\u003ePlasmodium ovale wallikeri\u003c/em\u003e and \u003cem\u003ePlasmodium ovale curtisi\u003c/em\u003e with \u003cem\u003ePlasmodium malariae\u003c/em\u003e and with other Plasmodium species. Int J Parasitol. 2016;46:685\u0026ndash;96.\u003c/li\u003e\n\u003cli\u003eRutledge GG, B\u0026ouml;hme U, Sanders M, Reid AJ, Cotton JA, Maiga-Ascofare O, et al. \u003cem\u003ePlasmodium malariae\u003c/em\u003e and \u003cem\u003eP. ovale\u003c/em\u003e genomes provide insights into malaria parasite evolution. Nature. 2017;542:101\u0026ndash;4.\u003c/li\u003e\n\u003cli\u003eCarlton JM, Angiuoli S V., Suh BB, Kooij TW, Pertea M, Silva JC, et al. Genome sequence and comparative analysis of the model rodent malaria parasite \u003cem\u003ePlasmodium yoelii yoelii\u003c/em\u003e. Nature. 2002;419:512\u0026ndash;9.\u003c/li\u003e\n\u003cli\u003eOtto TD, B\u0026ouml;hme U, Jackson AP, Hunt M, Franke-Fayard B, Hoeijmakers WAM, et al. A comprehensive evaluation of rodent malaria parasite genomes and gene expression. BMC Biol. 2014;12:86.\u003c/li\u003e\n\u003cli\u003eAuburn S, B\u0026ouml;hme U, Steinbiss S, Trimarsanto H, Hostetler J, Sanders M, et al. A new \u003cem\u003ePlasmodium vivax\u003c/em\u003e reference sequence with improved assembly of the subtelomeres reveals an abundance of \u003cem\u003epir\u003c/em\u003e genes. Wellcome Open Res. 2016;1:4.\u003c/li\u003e\n\u003cli\u003eMinassian AM, Themistocleous Y, Silk SE, Barrett JR, Kemp A, Quinkert D, et al. 
Controlled human malaria infection with a clone of \u003cem\u003ePlasmodium vivax\u003c/em\u003e with high-quality genome assembly. JCI Insight. 2021;6.\u003c/li\u003e\n\u003cli\u003eFrech C, Chen N. Variant surface antigens of malaria parasites: functional and evolutionary insights from comparative gene family classification and analysis. BMC Genomics. 2013;14:427.\u003c/li\u003e\n\u003cli\u003eMerino EF, Fernandez-Becerra C, Durham AM, Ferreira JE, Tumilasci VF, d\u0026rsquo;Arc-Neves J, et al. Multi-character population study of the \u003cem\u003evir\u003c/em\u003e subtelomeric multigene superfamily of \u003cem\u003ePlasmodium vivax\u003c/em\u003e, a major human malaria parasite. Mol Biochem Parasitol. 2006;149:10\u0026ndash;6.\u003c/li\u003e\n\u003cli\u003eChen S-BB, Wang Y, Kassegne K, Xu B, Shen H-MM, Chen J-HH. Whole-genome sequencing of a \u003cem\u003ePlasmodium vivax\u003c/em\u003e clinical isolate exhibits geographical characteristics and high genetic variation in China-Myanmar border area. BMC Genomics. 2017;18:131.\u003c/li\u003e\n\u003cli\u003eNeafsey DE, Galinsky K, Jiang RHY, Young L, Sykes SM, Saif S, et al. The malaria parasite \u003cem\u003ePlasmodium vivax\u003c/em\u003e exhibits greater genetic diversity than \u003cem\u003ePlasmodium falciparum\u003c/em\u003e. Nat Genet. 2012;44:1046\u0026ndash;50.\u003c/li\u003e\n\u003cli\u003eZhang C, Oguz C, Huse S, Xia L, Wu J, Peng YC, et al. Genome sequence, transcriptome, and annotation of rodent malaria parasite \u003cem\u003ePlasmodium yoelii nigeriensis\u003c/em\u003e N67. BMC Genomics. 2021;22:1\u0026ndash;12.\u003c/li\u003e\n\u003cli\u003eBrashear AM, Roobsoong W, Siddiqui FA, Nguitragool W, Sattabongkot J, L\u0026oacute;pez-Uribe MM, et al. A glance of the blood stage transcriptome of a Southeast Asian \u003cem\u003ePlasmodium ovale\u003c/em\u003e isolate. PLoS Negl Trop Dis. 2019;13:e0007850.\u003c/li\u003e\n\u003cli\u003eKim A, Popovici J, Vantaux A, Samreth R, Bin S, Kim S, et al. 
Characterization of \u003cem\u003eP. vivax\u003c/em\u003e blood stage transcriptomes from field isolates reveals similarities among infections and complex gene isoforms. Sci Rep. 2017;7:7761.\u003c/li\u003e\n\u003cli\u003eZhu L, Mok S, Imwong M, Jaidee A, Russell B, Nosten F, et al. New insights into the \u003cem\u003ePlasmodium vivax\u003c/em\u003e transcriptome using RNA-Seq. Sci Rep. 2016;6:20498.\u003c/li\u003e\n\u003cli\u003eLittle TS, Cunningham DA, Vandomme A, Lopez CT, Amis S, Alder C, et al. Analysis of \u003cem\u003epir\u003c/em\u003e gene expression across the \u003cem\u003ePlasmodium\u003c/em\u003e life cycle. Malar J. 2021;20:1\u0026ndash;14.\u003c/li\u003e\n\u003cli\u003eGilbert D. Accurate \u0026amp;amp; complete gene construction with EvidentialGene. F1000Res. 2016;5.\u003c/li\u003e\n\u003cli\u003eGilbert DG. Genes of the pig, \u003cem\u003eSus scrofa\u003c/em\u003e, reconstructed with EvidentialGene. PeerJ. 2019;7:e6374.\u003c/li\u003e\n\u003cli\u003eSim\u0026atilde;o FA, Waterhouse RM, Ioannidis P, Kriventseva E V., Zdobnov EM. BUSCO: Assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31:3210\u0026ndash;2.\u003c/li\u003e\n\u003cli\u003eBushmanova E, Antipov D, Lapidus A, Prjibelski AD. rnaSPAdes: a \u003cem\u003ede novo\u003c/em\u003e transcriptome assembler and its application to RNA-Seq data. Gigascience. 2019;8:1\u0026ndash;13.\u003c/li\u003e\n\u003cli\u003eAltschul SF, Madden TL, Sch\u0026auml;ffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST:a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389\u0026ndash;402.\u003c/li\u003e\n\u003cli\u003ePain A, B\u0026ouml;hme U, Berry AE, Mungall K, Finn RD, Jackson AP, et al. The genome of the simian and human malaria parasite \u003cem\u003ePlasmodium knowlesi\u003c/em\u003e. Nature. 
2008;455:799\u0026ndash;803.\u003c/li\u003e\n\u003cli\u003eBajic M, Ravishankar S, Sheth M, Rowe LA, Pacheco MA, Patel DS, et al. The first complete genome of the simian malaria parasite Plasmodium brasilianum. Scientific Reports 2022 12:1. 2022;12:1\u0026ndash;13.\u003c/li\u003e\n\u003cli\u003ePasini EM, B\u0026ouml;hme U, Rutledge GG, Voorberg-Van der Wel A, Sanders M, Berriman M, et al. An improved \u003cem\u003ePlasmodium cynomolgi\u003c/em\u003e genome assembly reveals an unexpected methyltransferase gene expansion. Wellcome Open Res. 2017;2 May:42.\u003c/li\u003e\n\u003cli\u003eTachibana S-I, Sullivan SA, Kawai S, Nakamura S, Kim HR, Goto N, et al. Plasmodium cynomolgi genome sequences provide insight into Plasmodium vivax and the monkey malaria clade. Nat Genet. 2012;44:1051\u0026ndash;5.\u003c/li\u003e\n\u003cli\u003eChien J-T, Pakala SB, Geraldo JA, Lapp SA, Humphrey JC, Barnwell JW, et al. High-Quality Genome Assembly and Annotation for Plasmodium coatneyi, Generated Using Single-Molecule Real-Time PacBio Technology. Genome Announc. 2016;4.\u003c/li\u003e\n\u003cli\u003eGilabert A, Otto TD, Rutledge GG, Franzon B, Ollomo B, Arnathau C, et al. Plasmodium vivax-like genome sequences shed new insights into Plasmodium vivax biology and evolution. PLoS Biol. 2018;16:e2006035.\u003c/li\u003e\n\u003cli\u003eVan Dongen S, Abreu-Goodger C. Using MCL to Extract Clusters from Networks. Methods in Molecular Biology. 2012;804:281\u0026ndash;95.\u003c/li\u003e\n\u003cli\u003eAnsari HR, Templeton TJ, Subudhi AK, Ramaprasad A, Tang J, Lu F, et al. Genome-scale comparison of expanded gene families in \u003cem\u003ePlasmodium ovale wallikeri\u003c/em\u003e and \u003cem\u003ePlasmodium ovale curtisi\u003c/em\u003e with \u003cem\u003ePlasmodium malariae\u003c/em\u003e and with other Plasmodium species. Int J Parasitol. 2016;46:685\u0026ndash;96.\u003c/li\u003e\n\u003cli\u003eGunalan K, S\u0026aacute; JM, Moraes Barros RR, Anzick SL, Caleon RL, Mershon JP, et al. 
Transcriptome profiling of \u003cem\u003ePlasmodium vivax\u003c/em\u003e in \u003cem\u003eSaimiri\u003c/em\u003e monkeys identifies potential ligands for invasion. Proc Natl Acad Sci U S A. 2019;116.\u003c/li\u003e\n\u003cli\u003eZollner GE, Ponsa N, Garman GW, Poudel S, Bell JA, Sattabongkot J, et al. Population dynamics of sporogony for Plasmodium vivax parasites from western Thailand developing within three species of colonized Anopheles mosquitoes. Malar J. 2006;5:1\u0026ndash;17.\u003c/li\u003e\n\u003cli\u003eDi Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nature Biotechnology. 2017;35:316\u0026ndash;9.\u003c/li\u003e\n\u003cli\u003eGrabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011;29:644\u0026ndash;52.\u003c/li\u003e\n\u003cli\u003eBankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology. 2012;19:455\u0026ndash;77.\u003c/li\u003e\n\u003cli\u003eMamrot J, Legaie R, Ellery SJ, Wilson T, Seemann T, Powell DR, et al. \u003cem\u003eDe novo\u003c/em\u003e transcriptome assembly for the spiny mouse (\u003cem\u003eAcomys cahirinus\u003c/em\u003e). Scientific Reports 2017 7:1. 2017;7:1\u0026ndash;15.\u003c/li\u003e\n\u003cli\u003eChopra R, Burow G, Farmer A, Mudge J, Simpson CE, Burow MD. Comparisons of \u003cem\u003ede novo\u003c/em\u003e transcriptome assemblers in diploid and polyploid species using peanut (\u003cem\u003eArachis spp.\u003c/em\u003e) RNA-Seq data. PLoS One. 2014;9:e115055.\u003c/li\u003e\n\u003cli\u003eAmin S, Prentis PJ, Gilding EK, Pavasovic A. 
Assembly and annotation of a non-model gastropod (\u003cem\u003eNerita melanotragus\u003c/em\u003e) transcriptome: A comparison of \u003cem\u003ede novo\u003c/em\u003e assemblers. BMC Res Notes. 2014;7:1\u0026ndash;8.\u003c/li\u003e\n\u003cli\u003eFrancis WR, Christianson LM, Kiko R, Powers ML, Shaner NC, D Haddock SH. A comparison across non-model animals suggests an optimal sequencing depth for \u003cem\u003ede novo\u003c/em\u003e transcriptome assembly. BMC Genomics. 2013;14:1\u0026ndash;12.\u003c/li\u003e\n\u003cli\u003eMadritsch S, Burg A, Sehr EM. Comparing \u003cem\u003ede novo\u003c/em\u003e transcriptome assembly tools in di- and autotetraploid non-model plant species. BMC Bioinformatics. 2021;22:1\u0026ndash;17.\u003c/li\u003e\n\u003cli\u003eYahav T, Privman E. A comparative analysis of methods for \u003cem\u003ede novo\u003c/em\u003e assembly of hymenopteran genomes using either haploid or diploid samples. Sci Rep. 2019;9:1\u0026ndash;10.\u003c/li\u003e\n\u003cli\u003eRana SB, Zadlock FJ, Zhang Z, Murphy WR, Bentivegna CS. Comparison of \u003cem\u003ede novo\u003c/em\u003e transcriptome assemblers and k-mer strategies using the killifish, \u003cem\u003eFundulus heteroclitus\u003c/em\u003e. PLoS One. 2016;11:e0153104.\u003c/li\u003e\n\u003cli\u003eH\u0026ouml;lzer M, Marz M. \u003cem\u003eDe novo\u003c/em\u003e transcriptome assembly: A comprehensive cross-species comparison of short-read RNA-Seq assemblers. Gigascience. 2019;8:1\u0026ndash;16.\u003c/li\u003e\n\u003cli\u003eChang Z, Li G, Liu J, Zhang Y, Ashby C, Liu D, et al. Bridger: A new framework for \u003cem\u003ede novo\u003c/em\u003e transcriptome assembly using RNA-seq data. Genome Biol. 2015;16:1\u0026ndash;10.\u003c/li\u003e\n\u003cli\u003eBirol I, Jackman SD, Nielsen CB, Qian JQ, Varhol R, Stazyk G, et al. De novo transcriptome assembly with ABySS. Bioinformatics. 2009;25:2872\u0026ndash;7.\u003c/li\u003e\n\u003cli\u003eXie Y, Wu G, Tang J, Luo R, Patterson J, Liu S, et al. 
SOAPdenovo-Trans: \u003cem\u003ede novo\u003c/em\u003e transcriptome assembly with short RNA-Seq reads. Bioinformatics. 2014;30:1660\u0026ndash;6.\u003c/li\u003e\n\u003cli\u003ePasini EM, Braks JA, Fonager J, Klop O, Aime E, Spaccapelo R, et al. Proteomic and Genetic Analyses Demonstrate that \u003cem\u003ePlasmodium berghei\u003c/em\u003e Blood Stages Export a Large and Diverse Repertoire of Proteins. Molecular \u0026amp; Cellular Proteomics. 2013;12:426\u0026ndash;48.\u003c/li\u003e\n\u003cli\u003eHowick VM, Russell AJC, Andrews T, Heaton H, Reid AJ, Natarajan K, et al. The Malaria Cell Atlas: Single parasite transcriptomes across the complete \u003cem\u003ePlasmodium\u003c/em\u003e life cycle. Science (1979). 2019;365:eaaw2619.\u003c/li\u003e\n\u003cli\u003eGural N, Mancio-Silva L, Miller AB, Galstian A, Butty VL, Levine SS, et al. In Vitro Culture, Drug Sensitivity, and Transcriptome of \u003cem\u003ePlasmodium vivax\u003c/em\u003e Hypnozoites. Cell Host Microbe. 2018;23:395-406.e4.\u003c/li\u003e\n\u003cli\u003eCunningham DA, Reid AJ, Hosking C, Deroost K, Tumwine-Downey I, Sanders M, et al. Identification of gametocyte-associated pir genes in the rodent malaria parasite, Plasmodium chabaudi chabaudi AS. BMC Res Notes. 2023;16.\u003c/li\u003e\n\u003cli\u003eHoo R, Zhu L, Amaladoss A, Mok S, Natalang O, Lapp SA, et al. Integrated analysis of the \u003cem\u003ePlasmodium\u003c/em\u003e species transcriptome. EBioMedicine. 2016;7:255\u0026ndash;66.\u003c/li\u003e\n\u003cli\u003eBass CC, Johns FM. THE CULTIVATION OF MALARIAL PLASMODIA (PLASMODIUM VIVAX AND PLASMODIUM FALCIPARUM) IN VITRO. Journal of Experimental Medicine. 1912;16:567\u0026ndash;79.\u003c/li\u003e\n\u003cli\u003eNoulin F, Borlon C, Van Den Abbeele J, D\u0026rsquo;Alessandro U, Erhart A. 1912-2012: A century of research on Plasmodium vivax in vitro culture. Trends Parasitol. 2013;29:286\u0026ndash;94.\u003c/li\u003e\n\u003cli\u003eThomson JG, Thomson D, Fantham HB. 
The cultivation of one generation of benign tertian malarial parasites (plasmodium vivax) in vitro, by bass\u0026rsquo;s method. Ann Trop Med Parasitol. 1913;7:153\u0026ndash;64.\u003c/li\u003e\n\u003cli\u003eHupalo DN, Luo Z, Melnikov A, Sutton PL, Rogov P, Escalante A, et al. Population genomics studies identify signatures of global dispersal and drug resistance in Plasmodium vivax. Nat Genet. 2016;48:953\u0026ndash;8.\u003c/li\u003e\n\u003cli\u003eImwong M, Nair S, Pukrittayakamee S, Sudimack D, Williams JT, Mayxay M, et al. Contrasting genetic structure in Plasmodium vivax populations from Asia and South America. Int J Parasitol. 2007;37:1013\u0026ndash;22.\u003c/li\u003e\n\u003cli\u003eMu J, Joy DA, Duan J, Huang Y, Carlton J, Walker J, et al. Host Switch Leads to Emergence of Plasmodium vivax Malaria in Humans. Mol Biol Evol. 2005;22:1686\u0026ndash;93.\u003c/li\u003e\n\u003cli\u003eDaron J, Boissi\u0026egrave;re A, Boundenga L, Ngoubangoye B, Houze S, Arnathau C, et al. Population genomic evidence of Plasmodium vivax Southeast Asian origin. Sci Adv. 2021;7.\u003c/li\u003e\n\u003cli\u003eDaron J, Boissi\u0026egrave;re A, Boundenga L, Ngoubangoye B, Houze S, Arnathau C, et al. Population genomic evidence of Plasmodium vivax Southeast Asian origin. Sci Adv. 2021;7.\u003c/li\u003e\n\u003cli\u003ePrajapati SK, Joshi H, Carlton JM, Rizvi MA. Neutral Polymorphisms in Putative Housekeeping Genes and Tandem Repeats Unravels the Population Genetics and Evolutionary History of Plasmodium vivax in India. PLoS Negl Trop Dis. 2013;7:e2425.\u003c/li\u003e\n\u003cli\u003eKepple D, Ford CT, Williams J, Abagero B, Li S, Popovici J, et al. Comparative transcriptomics reveal differential gene expression among Plasmodium vivax geographical isolates and implications on erythrocyte invasion mechanisms. PLoS Negl Trop Dis. 2024;18:e0011926.\u003c/li\u003e\n\u003cli\u003eDe Meulenaere K, Cuypers B, Gamboa D, Laukens K, Rosanas-Urgell A. 
A new Plasmodium vivax reference genome for South American isolates. BMC Genomics. 2023;24:1\u0026ndash;14.\u003c/li\u003e\n\u003cli\u003eCallejas-Hern\u0026aacute;ndez F, Nikulkova M, Adamski N, Yan G, Yewhalaw D, Carlton JM. Assembled genome of an Ethiopian Plasmodium vivax isolate generated using GridION long-read technology. Microbiol Resour Announc. 2024;13:e00590-24.\u003c/li\u003e\n\u003cli\u003eBrashear AM, Huckaby AC, Fan Q, Dillard LJ, Hu Y, Li Y, et al. New \u003cem\u003ePlasmodium vivax\u003c/em\u003e Genomes From the China-Myanmar Border. Front Microbiol. 2020;0:1930.\u003c/li\u003e\n\u003cli\u003eBourgard C, Lopes SCP, Lacerda MVG, Albrecht L, Costa FTM. A suitable RNA preparation methodology for whole transcriptome shotgun sequencing harvested from \u003cem\u003ePlasmodium vivax\u003c/em\u003e-infected patients. Sci Rep. 2021;11:5089.\u003c/li\u003e\n\u003cli\u003eRoth A, Adapa SR, Zhang M, Liao X, Saxena V, Goffe R, et al. Unraveling the \u003cem\u003ePlasmodium vivax\u003c/em\u003e sporozoite transcriptional journey from mosquito vector to human host. Sci Rep. 2018;8:12183.\u003c/li\u003e\n\u003cli\u003eKim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat Methods. 2015;12:357\u0026ndash;60.\u003c/li\u003e\n\u003cli\u003eSchneider VA, Graves-Lindsay T, Howe K, Bouk N, Chen HC, Kitts PA, et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 2017;27:849\u0026ndash;64.\u003c/li\u003e\n\u003cli\u003eBoonkaew T, Mongkol W, Prasert S, Paochan P, Yoneda S, Nguitragool W, et al. Transcriptome analysis of \u003cem\u003eAnopheles dirus\u003c/em\u003e and \u003cem\u003ePlasmodium vivax\u003c/em\u003e at ookinete and oocyst stages. Acta Trop. 2020;207:105502.\u003c/li\u003e\n\u003cli\u003eAmos B, Aurrecoechea C, Barba M, Barreto A, Basenko EY, Bażant W, et al. 
VEuPathDB: the eukaryotic pathogen, vector and host bioinformatics resource center. Nucleic Acids Res. 2022;50:D898\u0026ndash;911.\u003c/li\u003e\n\u003cli\u003eNeafsey DE, Waterhouse RM, Abai MR, Aganezov SS, Alekseyev MA, Allen JE, et al. Mosquito genomics. Highly evolvable malaria vectors: the genomes of 16 Anopheles mosquitoes. Science (1979). 2015;347.\u003c/li\u003e\n\u003cli\u003eMarinotti O, Cerqueira GC, de Almeida LGP, Ferro MIT, da Silva Loreto EL, Zaha A, et al. The genome of Anopheles darlingi, the main neotropical malaria vector. Nucleic Acids Res. 2013;41:7387\u0026ndash;400.\u003c/li\u003e\n\u003cli\u003eChiou KL, Pozzi L, Lynch Alfaro JW, di Fiore A. Pleistocene diversification of living squirrel monkeys (\u003cem\u003eSaimiri spp.\u003c/em\u003e) inferred from complete mitochondrial genome sequences. Mol Phylogenet Evol. 2011;59:736\u0026ndash;45.\u003c/li\u003e\n\u003cli\u003eBabb PL, Fernandez-Duque E, Baiduc CA, Gagneux P, Evans S, Schurr TG. mtDNA diversity in azara\u0026rsquo;s owl monkeys (\u003cem\u003eAotus azarai azarai\u003c/em\u003e) of the Argentinean Chaco. Am J Phys Anthropol. 2011;146:209\u0026ndash;24.\u003c/li\u003e\n\u003cli\u003eKim A, Popovici J, Menard D, Serre D. \u003cem\u003ePlasmodium vivax\u003c/em\u003e transcriptomes reveal stage-specific chloroquine response and differential regulation of male and female gametocytes. Nat Commun. 2019;10:371.\u003c/li\u003e\n\u003cli\u003eMuller I, Jex AR, Kappe SHI, Mikolajczak SA, Sattabongkot J, Patrapuvich R, et al. Transcriptome and histone epigenome of \u003cem\u003ePlasmodium vivax\u003c/em\u003e salivary-gland sporozoites point to tight regulatory control and mechanisms for liver-stage differentiation in relapsing malaria. Int J Parasitol. 2019;49:501\u0026ndash;13.\u003c/li\u003e\n\u003cli\u003eKopylova E, No\u0026eacute; L, Touzet H. SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data. Bioinformatics. 
2012;28:3211\u0026ndash;7.\u003c/li\u003e\n\u003cli\u003eManni M, Berkeley MR, Seppey M, Sim\u0026atilde;o FA, Zdobnov EM. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Mol Biol Evol. 2021;38:4647\u0026ndash;54.\u003c/li\u003e\n\u003cli\u003eMistry J, Chuguransky S, Williams L, Qureshi M, Salazar GA, Sonnhammer ELL, et al. Pfam: The protein families database in 2021. Nucleic Acids Res. 2021;49:D412\u0026ndash;9.\u003c/li\u003e\n\u003cli\u003eEddy SR. Accelerated Profile HMM Searches. PLoS Comput Biol. 2011;7:1002195.\u003c/li\u003e\n\u003cli\u003ePatro R, Duggal G, Love MI, Irizarry RA, Kingsford C. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods. 2017;14:417\u0026ndash;9.\u003c/li\u003e\n\u003cli\u003eLiao Y, Smyth GK, Shi W. The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads. Nucleic Acids Res. 2019;47:e47\u0026ndash;e47.\u003c/li\u003e\n\u003cli\u003eBastian M, Heymann S, Jacomy M. Gephi: An Open Source Software for Exploring and Manipulating Networks. 2009.\u003c/li\u003e\n\u003cli\u003eMartin S, Brown WM, Klavans R, Boyack KW. OpenOrd: an open-source toolbox for large graph layout. In: Visualization and Data Analysis 2011. SPIE; 2011. p. 786806.\u003c/li\u003e\n\u003cli\u003eWickham H. stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0. 2019.\u003c/li\u003e\n\u003cli\u003eWickham H, Romain F, Henry L, Kirill M. dplyr: A Grammar of Data Manipulation. R package version 0.8.1. 2019. https://cran.r-project.org/package=dplyr.\u003c/li\u003e\n\u003cli\u003eDowle M, Srinivasan A. data.table: Extension of `data.frame`. R package version. 2019.\u003c/li\u003e\n\u003cli\u003eWickham H. ggplot2: Elegant Graphics for Data Analysis. 
New York: Springer-Verlag; 2016.\u003c/li\u003e\n\u003cli\u003eGu Z, Eils R, Schlesner M. Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics. 2016;32:2847\u0026ndash;9.\u003c/li\u003e\n\u003cli\u003eGu Z, Gu L, Eils R, Schlesner M, Brors B. Circlize implements and enhances circular visualization in R. Bioinformatics. 2014;30:2811\u0026ndash;2.\u003c/li\u003e\n\u003cli\u003eGarnier S. viridis: Default Color Maps from \u0026ldquo;matplotlib\u0026rdquo;. R package version 0.5.1. 2018.\u003c/li\u003e\n\u003cli\u003eR Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. 2018. https://www.r-project.org/.\u003c/li\u003e\n\u003cli\u003eCheng CW, Jongwutiwes S, Putaporntip C, Jackson AP. Clinical expression and antigenic profiles of a Plasmodium vivax vaccine candidate: merozoite surface protein 7 (PvMSP-7). Malar J. 2019;18:197.\u003c/li\u003e\n\u003cli\u003eCollins WE, Contacos PG, Krotoski WA, Howard WA. Transmission of four Central American strains of Plasmodium vivax from monkey to man. J Parasitol. 1972;58:332\u0026ndash;5.\u003c/li\u003e\n\u003cli\u003eSiegel S v., Chappell L, Hostetler JB, Amaratunga C, Suon S, B\u0026ouml;hme U, et al. Analysis of Plasmodium vivax schizont transcriptomes from field isolates reveals heterogeneity of expression of genes involved in host-parasite interactions. Sci Rep. 2020;10:16667.\u003c/li\u003e\n\u003cli\u003eRangel GW, Clark MA, Kanjee U, Goldberg JM, MacInnis B, Jos\u0026eacute; Menezes M, et al. Plasmodium vivax transcriptional profiling of low input cryopreserved isolates through the intraerythrocytic development cycle. PLoS Negl Trop Dis. 2020;14:e0008104.\u003c/li\u003e\n\u003cli\u003eDe Meulenaere K, Prajapati SK, Villasis E, Cuypers B, Kattenberg JH, Kasian B, et al. Band 3\u0026ndash;mediated Plasmodium vivax invasion is associated with transcriptional variation in PvTRAg genes. 
Front Cell Infect Microbiol. 2022;12.\u003c/li\u003e\n\u003c/ol\u003e"},{"header":"Table 1","content":"\u003ctable border=\"0\" cellspacing=\"0\" cellpadding=\"0\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u003cstrong\u003eReference\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u003cstrong\u003eLife cycle stages\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u003cstrong\u003eStrains used\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u003cstrong\u003eExperiment code\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eZhu et al., 2016 [28]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eIDC (different lengths of time in \u003cem\u003eex vivo\u003c/em\u003e culture)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eNW Thai patient isolates (Mae Sot)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eAsexThai16\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eKim et al., 2017 [27]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eMixed blood stages (and one sporozoite)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eRatanakiri, Cambodia\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eMixdCamb17\u003c/p\u003e\n \u003cp\u003e(Same authors as MixdCamb19)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eGural et al., 2018 [61]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eMixed liver stages and \u0026lsquo;hypnozoite-enriched\u0026rsquo; (\u003cem\u003ein vitro\u0026nbsp;\u003c/em\u003ehuman organelle system)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eThai patient 
isolates\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eLivrThai18\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eRoth et al., 2018 [78]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eSporozoites\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eMyanmar border, Thailand\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eSpztThai18\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eCheng et al., 2019 [105]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eAsexual blood stages\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e5 N Thai and 5 S Thai patient isolates\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eAsexThai19\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eMuller et al., 2019 [88]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eSporozoites\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eThai patient isolates (Tak/Ubon)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eSpztThai19\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eKim et al., 2019 [87]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eMixed blood stages (with differential counts)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e26 Ratanakiri, Cambodia patient isolates\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eMixdCamb19 (Same authors as MixdCamb17)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eGunalan et al., 2019 [43]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eAsexual blood stages from two species of monkey\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cem\u003eP. 
vivax\u003c/em\u003e Sal I (El Salvador origin [106])\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eAsexSal119\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eBoonkaew et al., 2020 [81]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eOokinete (18h) and Oocyst (7 days)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eThai patient isolates (Ubon and Yala)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eMosqThai20\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eSiegel et al., 2020 [107]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eSchizonts\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eCambodia patient isolates\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eSchzMixd20\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eRangel et al., 2020 [108]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eIDC (4, 20, 36, 44 and 72 hours post-thaw)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eAcre, Brazil\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eAsexBraz20\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eBourgard et al., 2021 [77]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eAsexual blood stages\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eManaus, Brazil\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eAsexBraz21\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003ede Meulenaere et al., 2022 [109]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eSchizonts\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eIquitos, Peru, and Madang, Papua New Guinea\u003c/p\u003e\n \u003c/td\u003e\n
\u003ctd\u003e\n \u003cp\u003eSchzMixd22\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eCarlos et al., in preparation\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eMosquito blood-meal/midguts (1h, 6h, 22h, 26h, and 7d)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eIquitos, Peru\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eMosqPeru23\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"4\"\u003e\n \u003cp\u003eTable 1 \u0026ndash; Summary of the publications from which \u003cem\u003eP. vivax\u003c/em\u003e RNAseq data were taken and used for \u003cem\u003ede novo\u0026nbsp;\u003c/em\u003eassembly. The study reference, life cycle stages/conditions extracted, geographic source of the isolates and experiment code used in this paper are all included. A descriptive short identifier was given to each experiment/dataset, stating the life cycle stages covered (e.g. \u0026lsquo;Asex\u0026rsquo; for \u0026lsquo;Asexual stages\u0026rsquo;, \u0026lsquo;Mixd\u0026rsquo; if multiple stages are included), the country of origin (e.g. \u0026lsquo;Camb\u0026rsquo; for \u0026lsquo;Cambodia\u0026rsquo;, \u0026lsquo;Mixd\u0026rsquo; if multiple geographical sources are involved), and year of publication. 
Note that the \u0026lsquo;Asexual\u0026rsquo; stages have not been explicitly depleted for sexual stages such as gametocytes or sexual-lineage schizonts, however these should be a minor contaminant if present.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"bmc-genomics","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"gics","sideBox":"Learn more about [BMC Genomics](http://bmcgenomics.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/gics","title":"BMC Genomics","twitterHandle":"#BMCGenomics","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"Malaria, Vivax, Transcriptomics, Pir, Multigene","lastPublishedDoi":"10.21203/rs.3.rs-5822769/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-5822769/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003e\u003cstrong\u003eBackground\u003c/strong\u003e:\u003c/p\u003e\n\u003cp\u003eThe \u003cem\u003eplasmodium interspersed repeats\u003c/em\u003e (\u003cem\u003epir\u003c/em\u003e) multigene family is found across malaria parasite genomes, first discovered in the human-infecting species \u003cem\u003ePlasmodium vivax\u003c/em\u003e, where they were initially named the \u003cem\u003evir\u003c/em\u003es. 
Their function remains unknown, although studies have suggested a role in virulence of the asexual blood stages. Sub-families of the \u003cem\u003eP. vivax pir/vir\u003c/em\u003es have been identified, and are found in isolates from across the world; however, their transcription at different localities and in different stages of the life cycle has not been quantified. Multiple transcriptomic studies of the parasite have been conducted, but many map the \u003cem\u003epir\u003c/em\u003e reads to existing reference genomes (as part of standard bioinformatic practice), which may miss members of the multigene family due to its inherent variability. This obscures our understanding of how the \u003cem\u003epir\u003c/em\u003e sub-families in \u003cem\u003eP. vivax\u003c/em\u003e may be contributing to human/vector infection.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eResults:\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo overcome the issue of hidden \u003cem\u003epir\u003c/em\u003e diversity from utilising a reference genome, we employed \u003cem\u003ede novo\u003c/em\u003e transcriptome assembly to construct the \u003cem\u003epir\u003c/em\u003e ‘reference’ of different parasite isolates from published and novel RNAseq datasets. For this purpose, a pipeline was written in Nextflow, and first tested on data from the rodent-infecting \u003cem\u003eP. c. chabaudi\u003c/em\u003e parasite to ascertain its efficacy on a sample with a full, genome-based set of \u003cem\u003epir\u003c/em\u003e gene sequences. The pipeline assembled hundreds of \u003cem\u003epir\u003c/em\u003es from the studies included. By performing BLAST sequence identity comparisons with reference genome \u003cem\u003epir\u003c/em\u003es (including \u003cem\u003eP. vivax\u003c/em\u003e and related species) we found a clustered network of transcripts which corresponded well with prior sub-family annotations, albeit requiring some updated nomenclature. 
Mapping the RNAseq datasets to the \u003cem\u003ede novo\u003c/em\u003e transcriptome references revealed that the transcription of these updated \u003cem\u003epir\u003c/em\u003e gene sub-families is generally consistent across the different geographical regions. From this transcriptional quantification, a time course of mosquito bloodmeals (after feeding on an infected patient) highlighted the first evidence of ookinete stage \u003cem\u003epir\u003c/em\u003e transcription in a human-infective malaria parasite.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConclusions:\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eDe novo\u003c/em\u003e transcriptome assembly is a valuable tool for understanding highly variable multigene families from \u003cem\u003ePlasmodium spp\u003c/em\u003e., and with pipeline software it can be applied more easily and at scale. Despite a global distribution, \u003cem\u003eP. vivax\u003c/em\u003e has a conserved \u003cem\u003epir\u003c/em\u003e sub-family structure, both in terms of genome copy number and transcription. We suggest that this indicates important roles of the distinct sub-families, or a genetic mechanism maintaining their preservation. 
Furthermore, a burst of \u003cem\u003epir\u003c/em\u003e transcription in the mosquito stages of development is the first glint of ookinete \u003cem\u003epir\u003c/em\u003e expression for a human-infective malaria parasite, suggesting a role for the gene family at a new stage of the lifecycle.\u003c/p\u003e","manuscriptTitle":"De novo assembly of plasmodium interspersed repeat (pir) genes from Plasmodium vivax RNAseq data suggests geographic conservation of sub-family transcription","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-04-17 06:35:06","doi":"10.21203/rs.3.rs-5822769/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2025-04-29T05:48:26+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-04-29T03:00:36+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-04-28T03:11:31+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"139424037635415687856567625944452145676","date":"2025-04-14T16:24:10+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"286869195955398016393028046490365464179","date":"2025-04-14T14:19:53+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2025-04-14T09:02:20+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2025-04-14T05:10:53+00:00","index":"","fulltext":""},{"type":"submitted","content":"BMC Genomics","date":"2025-04-11T22:33:31+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"bmc-genomics","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"gics","sideBox":"Learn more about [BMC Genomics](http://bmcgenomics.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/gics","title":"BMC 
Genomics","twitterHandle":"#BMCGenomics","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"7b3ac790-5299-41de-aab9-c47e6c870ad6","owner":[],"postedDate":"April 17th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[],"tags":[],"updatedAt":"2025-06-02T15:59:48+00:00","versionOfRecord":{"articleIdentity":"rs-5822769","link":"https://doi.org/10.1186/s12864-025-11752-1","journal":{"identity":"bmc-genomics","isVorOnly":false,"title":"BMC Genomics"},"publishedOn":"2025-05-29 15:57:09","publishedOnDateReadable":"May 29th, 2025"},"versionCreatedAt":"2025-04-17 06:35:06","video":"","vorDoi":"10.1186/s12864-025-11752-1","vorDoiUrl":"https://doi.org/10.1186/s12864-025-11752-1","workflowStages":[]},"version":"v1","identity":"rs-5822769","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-5822769","identity":"rs-5822769","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
