Long-Read Genome Assembly and Gene Model Annotations for the Rodent Malaria Parasite Plasmodium yoelii 17XNL

Abstract

Malaria causes over 200 million infections and over 600 thousand fatalities each year, with most cases attributed to a human-infectious Plasmodium species, Plasmodium falciparum. Many rodent-infectious Plasmodium species, like Plasmodium berghei, Plasmodium chabaudi, and Plasmodium yoelii, have been used as genetically tractable model species that can expedite studies of this pathogen. In particular, P. yoelii is an especially good model for investigating the mosquito and liver stages of parasite development because key attributes closely resemble those of P. falciparum. Because of its importance to malaria research, in 2002 the 17XNL strain of P. yoelii was the first rodent malaria parasite to be sequenced. While sequencing and assembling this genome was a breakthrough effort, the final assembly consisted of >5000 contiguous sequences that impacted the creation of annotated gene models. While other important rodent malaria parasite genomes have been sequenced and annotated since then, including the related P. yoelii 17X strain, the 17XNL strain has not. As a result, genomic data for 17X has become the de facto reference genome for the 17XNL strain while leaving open questions surrounding possible differences between the 17XNL and 17X genomes. In this work, we present a high-quality genome assembly for P. yoelii 17XNL using HiFi PacBio long-read DNA sequencing. In addition, we use Nanopore long-read direct RNA-seq and Illumina short-read sequencing of mixed blood stages to create complete gene models that include not only coding sequences but also alternate transcript isoforms, and 5’ and 3’ UTR designations. A comparison of the 17X and this new 17XNL assembly revealed biologically meaningful differences between the strains due to the presence of coding sequence variants. Taken together, our work provides a new genomic and gene expression framework for studies with this commonly used rodent malaria model species.

Article activity feed

  1. Note: This rebuttal was posted by the corresponding author to Review Commons. Content has not been altered except for formatting.


    Reply to the reviewers

    Thank you for the rapid and favorable reviews of our manuscript entitled “Long-Read Genome Assembly and Gene Model Annotations for the Rodent Malaria Parasite Plasmodium yoelii 17XNL.” We particularly appreciated that both reviewers had substantial, detailed expertise with the sequencing and assembly of Plasmodium genomes, and valued their questions and suggestions to ensure high rigor of our work. We have addressed all of the reviewers’ comments in the revised manuscript, and have provided a point-by-point response to each below.

    Response to Reviewers

    Note: Point-by-point responses are provided in italics below each reviewer comment. Line numbers referenced in our responses refer to their final line position in the Track Changes version of the manuscript.


    Reviewer #1 (Evidence, reproducibility and clarity (Required)):

    The manuscript entitled "Long-Read Genome Assembly and Gene Model annotation for the Rodent Malaria Parasite P. yoelii 17XNL" is a well-written manuscript providing updates and important observations about the genome assembly and annotation of this specific non-lethal isolate. The group overall did a great job showing how the application of newer technologies such as long-read DNA and direct RNA sequencing to generate top-quality genomes to be used as a reference for the community. Here are some comments about the work presented:

    Response: Thank you for your positive feedback and suggestions on how to clarify these findings. We have improved the revised manuscript based on your feedback and suggestions below.

    Major comments:

    • The authors added several pieces of result information throughout the Methods section, making the text repetitive, since the same is also presented in the Results section. Please revise the Methods section to remove results from it.

    Response: We agree and have streamlined both the Results and Methods sections to remove redundancy in these descriptions.

    • Some methods are also redundant in the Result section. For example, in line 141-142, the group describe which DNA extraction kit they used (again this is correctly mentioned in the methods section).

    Response: We agree and have removed minutiae such as these from the Results section. These details remain in the Methods section to ensure reproducibility.

    • Although it is important, the group added a lot of information about method comparisons between base call accuracy and sequencing methods. I agree that having this information in the supplemental material is great, but I would be careful not to focus too much on it, since most of the observations are already well known by the community, and would focus more on the biological relevance of what is being generated with the newly updated genome.

    Response: The advances in base calling algorithms do make substantial improvements to the Nanopore reads. We have only included a short description of this in the main manuscript and feel this is an appropriate amount of context for the typical reader. Those that love these details and want to dig further can find this content in our supplemental information.

    • The group did a great job generating two versions of the genome, and an updated gene annotation set using long-read sequencing. But the major question is, how about alternative splicing? They mention the use of it (line 350) but I don't see any result about how many alternative transcripts were observed, and if they were differentially detected in different life stages of the sets used for the RNA sequencing. This is a very important result to be added since one of the key pieces of information that long-read RNA sequencing brings for Genome annotation.

    Response: We have now expanded this description in the manuscript to note that 866 genes are predicted to have multiple transcript isoforms (Lines 240-241). Moreover, we have now generated a Supplemental Table 4 that lists these isoforms in the revised manuscript. As we have not conducted further validation of this large number of transcript isoforms, we have left the description at this level.
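As an illustration of how a count such as "genes with multiple transcript isoforms" can be derived from a gene model file, here is a minimal sketch that tallies `mRNA` features per parent gene in GFF3-formatted lines. It is not the authors' workflow: the function name, the feature type `mRNA`, and the toy records (`geneA`, `geneB`) are assumptions made for this example, and a real annotation may use other feature types or attribute conventions.

```python
from collections import defaultdict


def genes_with_multiple_isoforms(gff3_lines):
    """Return {gene_id: n_transcripts} for genes with more than one mRNA child."""
    transcripts_per_gene = defaultdict(set)
    for line in gff3_lines:
        # Skip blank lines, directives, and comments.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # GFF3 records have 9 tab-separated columns; column 3 is the feature type.
        if len(fields) != 9 or fields[2] != "mRNA":
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            pair.split("=", 1) for pair in fields[8].split(";") if "=" in pair
        )
        if "ID" in attrs and "Parent" in attrs:
            transcripts_per_gene[attrs["Parent"]].add(attrs["ID"])
    return {g: len(t) for g, t in transcripts_per_gene.items() if len(t) > 1}


# Toy records (hypothetical identifiers, for illustration only).
EXAMPLE_GFF3 = """\
##gff-version 3
chr1\texample\tgene\t100\t900\t.\t+\t.\tID=geneA
chr1\texample\tmRNA\t100\t900\t.\t+\t.\tID=geneA.1;Parent=geneA
chr1\texample\tmRNA\t150\t900\t.\t+\t.\tID=geneA.2;Parent=geneA
chr1\texample\tgene\t1200\t2000\t.\t-\t.\tID=geneB
chr1\texample\tmRNA\t1200\t2000\t.\t-\t.\tID=geneB.1;Parent=geneB
""".splitlines()

multi = genes_with_multiple_isoforms(EXAMPLE_GFF3)
print(multi)  # {'geneA': 2}
```

On a full annotation file, `len(multi)` would give the number of genes with more than one annotated transcript, under the assumptions stated above.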

    • Same observation as above for potential long ncRNAs.

    Response: We agree that lncRNAs are a fascinating aspect of the biology of the parasite, but a proper analysis of this class of RNA is far outside of the scope of this current study. Automatic identification approaches with Nanopore data will likely yield high numbers of false positives, which require manual curation for rigorous annotation. We hope others can use these data to accelerate such studies as well.

    • From what I understand the Hifi run was able to generate a gapless genome assembly and the ONT run did not. What was the final coverage for each? From my experience with P. falciparum genomes, ONT even with the rapid kit was able to generate chromosomal level assemblies if the coverage was >100x (but again, this is not a rule). Add those valuable observations about the depth so the reader can check if other variables in the comparison should be made.

    Response: This is a particularly interesting aspect of not only our datasets, but of other Plasmodium genomes as well. This issue occurs at least in part due to the presence of many repeated elements in the subtelomeric regions. It is important to note that these repeated elements do not resolve into a single haplotype in an assembly due to conflicting information, not due to lack of coverage. For instance, regions may differ by only a few nucleotides that each have significant read support. We are particularly interested in a recent preprint that concludes that P. falciparum harbors extrachromosomal plasmids with these var sequences present (doi.org/10.1101/2023.02.02.526885). If this observation is supported via peer review, this interpretation could also begin to explain our results with P. yoelii 17XNL.
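For readers who want to relate their own runs to the reviewer's ">100x" figure, mean sequencing depth is simply the total number of sequenced bases divided by the genome size. The sketch below shows that arithmetic only. The genome size of 23 Mb and the read lengths are round, illustrative assumptions and are not values reported in this study.

```python
def mean_coverage(read_lengths, genome_size_bp):
    """Mean depth = total sequenced bases / genome size (both in bp)."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return sum(read_lengths) / genome_size_bp


# Illustrative numbers only: 200,000 reads of 11.5 kb against an assumed 23 Mb genome.
example_depth = mean_coverage([11_500] * 200_000, 23_000_000)
print(f"{example_depth:.1f}x")  # 100.0x
```

Note that a high mean depth does not by itself resolve repeated subtelomeric elements, which is the point made in the response above.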

    • Also be sure that the structural comparisons between the genomes are not the ones used after running ragtag.py. If so, there is a high chance of structural bias in the scaffolded contigs.

    Response: We apologize for the confusion. We did not use ragtag for the PacBio assembly, and all structural and variant comparisons were done using the PacBio assembly. However, we did use ragtag for the Nanopore assembly that is included in this study as an additional resource to our community. These data were not used for variant calling though.

    • How Prokka differed from Braker2 for the Mitochondria/API annotation? This needs to be very well described since prokka is made for prokaryotic organisms and not for eukaryotic ones. And Braker2 uses a custom build dataset for training, which I believe contains known information about MIT/API for Plasmodium species.

    Response: We first applied Braker2 to the organellar genomes and identified only 6 genes in the apicoplast genome and only 2 genes in the mitochondrial genome. Due to their prokaryotic origin, we then tested if Prokka could alleviate this issue. To do so, we applied Prokka to the 17X reference genome and found that it detected all of its annotated organellar genes. Therefore, we also applied Prokka to our Py17XNL genome to annotate the genes found on the apicoplast and mitochondrial genomes. As a final validation check, the gene annotations on these two organellar genomes are effectively identical between 17X and 17XNL. This is consistent with the sequencing results and assemblies that show that the apicoplast genome is identical and the mitochondrial genome differs in a single, notable deletion in 17XNL.

    • Figure 5B, what is the peak observed in the mitochondria? What genes? Repeats?

    Response: What appears to be an inward pointed trough actually reflects the deletion of bases in 17XNL compared to the 17X assembly. We have clarified this in the manuscript on Lines 296-297 and in the legend of Figure 5.

    Minor comments:

    • For Oxford nanopore sequencing using the ligation kit, did the group check for potential chimeric reads generated by the protocol?

    Response: We did. We used the adapter trimming software, Porechop, to identify and bin chimeric reads that were eliminated from the dataset. This method is described in the Makefile associated with the manuscript.

    • Check if all species are italicized (for example, line 187 P. yoelii is not)

    Response: We have italicized this instance of P. yoelii and have reviewed the document to search for any other words that should be italicized.

    • In methods add the parameters for minimap2 for the direct RNA alignment

    Response: We would encourage readers to view our Makefile that has all of the commands and parameters used for the bioinformatic work reported here.
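For orientation while consulting the Makefile, the sketch below assembles the preset that the minimap2 documentation suggests for noisy Nanopore direct RNA-seq reads (`-ax splice -uf -k14`). This is the generic documented preset and is not necessarily what the authors ran; the file names `Py17XNL.fasta` and `direct_rna_reads.fastq` are hypothetical placeholders. The command is only built and printed, not executed.

```python
import shlex


def minimap2_direct_rna_command(reference_fasta, reads_fastq):
    """Build a minimap2 command using the documented direct RNA-seq preset.

    -ax splice : spliced alignment, SAM output
    -uf        : consider only the forward transcript strand
    -k14       : shorter k-mer for noisy reads
    """
    return ["minimap2", "-ax", "splice", "-uf", "-k14", reference_fasta, reads_fastq]


# Hypothetical input paths, for illustration only.
cmd = minimap2_direct_rna_command("Py17XNL.fasta", "direct_rna_reads.fastq")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, stdout=...)` would execute it, provided minimap2 is installed and the inputs exist.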

    • For variant calling, I would use a minimum of 10x coverage to make a variant call instead of 5x. Although the calls look well reproducible between all checks, I would be careful mainly with the single bp deletions with such a low threshold.

    Response: Read counts for the called variants were generally greater than 20. Moreover, we took these validations a step further and manually curated these variants using the data from multiple sequencing platforms used in this study to ensure high rigor in making these variant calls. We have further clarified this in the revised manuscript.

    • In some parts of the methods, the authors mentioned slight modifications in some protocols (for example, lines 443 and 454), besides well described in the text, could you highlight what were the modifications in the text? This will facilitate many other researchers to understand why those modifications were needed.

    Response: We have clarified these modifications in the revised manuscript. In short, these modifications consisted of: 1) For the HMW gDNA prep kit, an agitation speed of 1500 rpm was used as opposed to the recommended 2000 rpm due to limitations of our instruments. 2) A slow end over end mixing by hand was preferred over using a vertical rotating mixer as yield was consistently greater with this change. 3) For the RNeasy kit, the lysate was passed through a 20-gauge needle for homogenization of the sample. Instead of an on-column DNaseI treatment, the RNA was treated with DNaseI off of the column to promote complete DNA digestion. 4) A second elution from the RNeasy column was performed in order to improve yield.

    • As mentioned in the major, the data analysis method section needs rework to remove results from the text.

    Response: We have revised the manuscript accordingly.

    • The group mentioned that small contigs not mapping to Py17X were discarded. What are those? Repeats? Contamination?

    Response: These contigs were of mouse origin, as P. yoelii was grown in Swiss webster mice in this work. We have clarified this in the revised manuscript on Lines 183-184.

    Reviewer #1 (Significance (Required)):

    This work generated a strong method and resource for a better genome assessment of P. yoelii for the community. As I mentioned in my comments, some more details about the findings such as alternative splicing and lncRNAs may strengthen the publication even more. I know that comparative analysis between Py17X and XNL is not in the scope here, but more information about it, such as a synteny plot, would be great for the community to understand that they can rely on this new reference genome. I've been working with eukaryotic and prokaryotic genomes for more than a decade and I have a lot of experience with all the methods presented. I believe that the depth generated for the ONT data may be one of the factors for not reaching the chromosomal level for this isolate, since HiFi did. The group did a great job on the method description, and I believe that the community will be very happy to incorporate this genome as one of the references for this organism.

    Response: We are thrilled that you value the data and the rigor of our approaches. We also believed that a direct comparison between 17X and 17XNL strains is critical. Because of this, we provided details of this comparison in Figures 5 and 6, as well as in supplemental files. Because our colleagues often use these strains interchangeably, it is important for our community to know what differences are present between the parental 17X and the cloned 17XNL line. While substantial identity exists between the 17X and 17XNL strains, there are many variants between them, including many that affect genes that are known to have essential functions for the parasite. For this reason and more, we believe the true 17XNL genome assembly will be a preferred reference once it is fully integrated into PlasmoDB.

    Reviewer #2 (Evidence, reproducibility and clarity (Required)):

    The paper has three distinct parts,

    1. Assembly of the P. yoelii yoelii 17XNL
    2. Annotation of the genome and adding UTR regions
    3. Comparing the sequence of 17XNL with 17X.

    Assembly: The authors present a novel assembly for the P. yoelii yoelii 17XNL genome. They used two different approaches, comparing Oxford Nanopore (ONT) long reads + Illumina DNA with PacBio HiFi. None of the approaches generated a telomere-to-telomere assembly, so sequences from the 17X reference were used to fill in the missing sequence.

    Response: Please also see the comment from Reviewer 1 and our response. The presence of many repeated elements in the subtelomeric regions leads to the challenges noted here about a telomere-to-telomere assembly, as well. The presence of these elements means that the sequences do not resolve into a single haplotype in an assembly due to conflicting information, not due to lack of coverage. Because of this, we have chosen to harmonize the selected haplotype at these subtelomeric regions with that of 17X, while still acknowledging and providing the complex data associated with the subtelomeric regions.

    Annotation Next, they generated long reads (ONT) and Illumina RNA-Seq to improve the annotation. Although their annotation is not better than the current P. yoelii 17X reference genome in PlasmoDB, they could predict the UTR regions and alternative splice sites due to the 3' capturing approach and long reads. Having the UTR annotated and potentially having alternative splice sites is useful for the field.

    Response: We agree that the additional gene model annotations for both UTRs and alternative transcript isoforms is a valuable resource to our community. We are working with PlasmoDB currently to make these data readily accessible.

    17XNL - 17X comparison The author compared the 17XNL with the 17X reference. Both genomes were done with PacBio, and it should be noted that P. yoelii has a GC content of probably ~23% with several homopolymer tracts. Further, the 17XNL genotype was obtained from a 17X culture, so the genomes are expected to be very similar, as the authors noted in the introduction. The authors found ~2000 differences; some are in genes, but many are indels, which very well could be sequencing errors. Finally, the authors claim that this genome could become relevant for the community as a new reference to perform analysis. As their genome is so similar to 17X, they have to show that their annotation is at least as good as the current (manually curated) 17X reference genome and that the differences are not due to sequence errors in 17X or 17XNL.

    Response: As we describe below, we have taken multiple steps to inspect the quality of the 17X genome assembly (it is very robust), to call variants between strains, and to validate them using our data across multiple sequencing platforms and via manual curation. Because of this, we view these as true variants between the 17X and 17XNL genomes.
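The reviewer's concern rests on two sequence properties: low GC content and homopolymer tracts, where indel errors tend to cluster. A minimal sketch of how both can be measured on a sequence string follows. The function names, the minimum run length of 5, and the toy sequence are assumptions for illustration and are not part of the manuscript's analysis.

```python
import re


def gc_fraction(seq):
    """Fraction of G/C among unambiguous A/C/G/T bases (0.0 if there are none)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def homopolymer_runs(seq, min_length=5):
    """Return (start, base, length) for each single-base run of at least min_length."""
    # ([ACGT]) captures one base; \1{n,} requires at least n more copies of it.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_length - 1))
    return [
        (m.start(), m.group(1), len(m.group(0)))
        for m in pattern.finditer(seq.upper())
    ]


# Toy AT-rich sequence with a 7-base T run and a 6-base A run.
toy_seq = "A" + "T" * 7 + "AGC" + "A" * 6 + "GC"
print(round(gc_fraction(toy_seq), 3))  # 0.211
print(homopolymer_runs(toy_seq))       # [(1, 'T', 7), (11, 'A', 6)]
```

Checking whether a called indel falls inside one of the reported runs is one way to flag calls that deserve the kind of manual curation described in the response.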

    Major comments Overall I struggle to see the need for a "NEW" P. yoelii reference. It would be good to state how similar these genomes are - they are basically identical. As the 17XNL is curated manually, it would have made more sense to me to start from that one and then generate the UTR annotation and include splice sites. This could be easily loaded into an alternative Web-apollo track and then merged to the current annotation to be useful to the community.

    Response: We chose to generate a new reference assembly for 17XNL because the current one is from 2002, remains in >5000 contigs, has gene identifiers that do not align with other current Plasmodium gene models (e.g., PY00204 vs. PY17X_0502200), and historically has had problematic gene models attributed to individual genes. This clean start ensures that users can know the provenance of the underlying data that created the genome assembly and gene models.

    I wonder if many of the differences the authors found between 17X and the 17XNL reference are true. The authors are correct that some differences between 17X and 17XNL are true. I could not find any evidence of genome polishing with tools like Pilon or ICORN to correct sequencing errors, I wonder if these differences are sequencing errors.

    Response: The PacBio-based assembly received no error correction or polishing. It should be noted that all variants that were called automatically were also manually verified using data from multiple sequencing platforms generated in this study. Moreover, for coding sequences, we imposed a threshold that 80% of all reads at the variant’s location needed to support the variant in order to be considered true. Through these strict thresholds, we eliminated many potential variants that only had support from one sequencing platform. We highlight several variants that were confirmed through multiple datasets in Table 2.
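The 80% read-support rule described above can be expressed as a simple filter. The sketch below is an assumption-laden illustration and not the authors' code: the `VariantCall` record, the variant names, and the read counts are hypothetical, and the `min_depth` default of 5 simply echoes the 5x threshold that Reviewer 1 mentions.

```python
from dataclasses import dataclass


@dataclass
class VariantCall:
    name: str         # hypothetical identifier for the variant
    alt_reads: int    # reads supporting the variant allele
    total_reads: int  # all reads covering the variant position


def passes_support_filter(call, min_fraction=0.80, min_depth=5):
    """Keep a call only if depth is adequate and >= min_fraction of reads support it."""
    if call.total_reads < min_depth:
        return False
    return call.alt_reads / call.total_reads >= min_fraction


# Hypothetical calls, for illustration only.
calls = [
    VariantCall("snv_1", alt_reads=24, total_reads=25),  # 96% support, kept
    VariantCall("del_1", alt_reads=12, total_reads=20),  # 60% support, dropped
    VariantCall("snv_2", alt_reads=4, total_reads=4),    # too shallow, dropped
]
kept = [c.name for c in calls if passes_support_filter(c)]
print(kept)  # ['snv_1']
```

Applying the same filter separately to each sequencing platform and keeping only calls that pass on more than one would mirror the cross-platform confirmation the response describes, though the exact procedure is defined in the manuscript and its Makefile.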

    Did the authors look into the reads of the NCBI - GCA_900002385.2 - assembly? Maybe they could use the underlying Illumina reads if theirs don't have enough coverage. Also, the differences between 17X and 17XNL could be that the reference is wrong. How many pseudo genes did they obtain? Are there more or less than in the current reference?

    To confirm the calls, could you also map the 17XNL reads against the 17X reference and see if they are still true? At the same time, map the 17X Illumina reads to see if the reference is correct at this state. When looking at the alignments, it can be seen that many differences are in low complexity/repetitive regions.

    Response: We analyzed both their raw and assembled data to compare them with our results, and we determined that the 17X data and assembly were robust and that these differences likely reflect true variance between the strains. The 17X reference has 57 pseudogenes that are annotated as pir, fam-a/c, or others. Overall, there were 1057 pir genes annotated in the 17X genome, whereas we annotated 1048 for our Py17XNL genome. There were 302 fam-a/b genes annotated in the 17X genome, whereas we annotated 301 for our Py17XNL genome. As noted above, we confirmed variant calls using data from multiple sequencing platforms in this study as well as through manual curation.

    The authors sequenced their genome with a HiFi PacBio run and also ONT + DNASeq... but why did they not get 16 chromosomes out? For example, the current P. yoelii reference was assembled directly into far fewer pieces than theirs [P. chabaudi assembles into 16 pieces]. Could it be a different read depth or is it the fragment length? Could the authors please comment on that. Also, if there were contigs, why did they fill the sequence with 17X sequence, rather than keeping gaps? So in the end, their sequence is a hybrid of 17X and 17XNL, right?

    Response: Please see our responses above to both Reviewer 1 and 2 regarding the heterogeneity of the subtelomeric regions that indicate that a single haplotype is not readily called. This is not due to insufficient read depth, but rather we believe it reflects something fascinating about Plasmodium genomes in these regions. A recent preprint (doi.org/10.1101/2023.02.02.526885) provides one possible interpretation for this observation.

    Why do you think you had less coverage of CCS reads around the telomere ends? Do you think it is a systematic issue of the PacBio HiFi? Did you see any evidence of Illumina or ONT reads - or could it be that while culturing the telomere ends dropped off?

    Response: See our response above about the challenging nature of the subtelomeric regions of Plasmodium genomes. As above, this is not an issue of coverage per se, but rather of heterogeneous related sequences that are not readily resolved into a single haplotype. In order to minimize the risk of sequencing a genome of a mixture of heterogeneous parasites, we sequenced “Pass 0” parasites received directly from BEI Resources to ensure this genome reflects the established P. yoelii 17XNL clone.

    I realised that the authors used a lot of primary tools. I wonder why they chose that path, as there are several tools to do automatic finishing for long read assemblies: Assemblosis, ARAMIS, MpGAP or ILRA. Especially the last one focuses on Plasmodium genomes. Please comment.

    Response: We initially started our bioinformatic analyses using established tools such as these. Specifically, we first tried Companion and ILRA, but the results were not superior to those we achieved with the workflow we describe in this manuscript, which also provided greater parameter control.

    Also, for the annotation, could it not be better to transfer the manually curated genome annotation with LIFT off or RATT? All these tools are widely used in the generation of reference genomes in the parasitology field. I annotated their sequence with Companion, and although their gene models are good and some of the Companion calls might need improvement, overall, the Companion results look more exact to me.

    Response: Companion was the original tool we used for the generation of gene models. While we found that it performed excellently for a pre-packaged software platform, we found it to be insufficiently customizable, and the results were not sufficiently accurate in our assessment. Additionally, lifting over information always raises the risk of imposing a different perspective on what is truly present. We believe that a high quality, de novo assembly is always preferable, and therefore chose this workflow.

    The code is very well organised, and it was easy to follow. Are you planning to put it on a GitHub repository?

    Response: We appreciate this recognition. We believe clear reporting of the bioinformatics work is critical for rigor and reproducibility. Yes, all of this will also be provided in GitHub to benefit the wider community.

    For the annotation in the attachment, there were two files. I had a look at them and they were quite different. As 17X and this genome are basically identical (

    Response: The two gff files represent either a Nanopore-only or a hybrid Nanopore+Illumina-based model. The latter produced a more comprehensive annotation of gene models, which is what we have proceeded with. However, we provided both in case end users find value in the Nanopore-only annotation, which has a 3’ bias due to the mechanism of how sequencing occurs via this approach.

    We have found meaningful variations in genome sequence that potentially impact biological function (see Discussion). Therefore, we maintain that these genomes are not basically identical and are useful to the malaria research community for these reasons and more.

    It is excellent that the genome is submitted to NCBI. Why are there 18k proteins? Are these the alternative spliced forms?

    Response: We are not certain how this interpretation might have arisen, as we only have reported 7047 potential transcript isoforms to NCBI based upon our data.

    Minor The current Py 17X genome in PlasmoDB is a PacBio assembly (https://plasmodb.org/plasmo/app/record/dataset/TMPTX_pyoeyoelii17X), but not part of the 2014 paper. It was submitted later to NCBI than the paper the authors cite. Also, the current P. berghei PacBio genome is from Fougère et al. PLoS Pathog 2016;12(11):e1005917.

    Response: We have now made a detailed note about the Py17X PacBio dataset in our revised manuscript on Lines 186-187. Mentions of the current P. berghei genome assembly had already cited the Fougère et al. publication.

    I tried to open the supplemental tables, but they were all in pdf rather than excel and split over several pages. Two had missing information, e.g. UTR per gene. From the name of the tables, I had an idea of what they should contain, but for a re-submission, it would be good to have them in the correct format.

    Response: We agree that provision of the PDFs of the supplemental files is not the ideal way to review these analyses. The complete data was also provided in the Excel files provided to Review Commons. We will ensure that the affiliate journal receives the Excel files for completion’s sake.

    To me, the beginning of the results reads a bit like an introduction (the part which sequencing technology to use)

    Response: We agree, and as noted to Reviewer 1 above, we have streamlined this section of the revised manuscript.

    Could you add to the tables: Sequence Coverage of the three technology, how many contigs you had before ordering the contigs and the number of pseudogenes in the annotation?

    Response: This information is now provided in Supplemental Table 3 in the revised manuscript.

    I struggle with the section header line 229-230 that the new sequence is more complete as it is a hybrid assembly with 17X. Alternatively, please explain how the consensus was built.

    Response: We agree and have revised this section header for accuracy.

    The authors correctly state that ONT is great, lines 333ff, but why does it not generate telomere-to-telomere chromosomes in this case? Please discuss.

    Response: Please see our response to this above for remarks made by both Reviewer 1 and 2. We have also added clarifying text in our revised manuscript discussing why this may have occurred.

    Reviewer #2 (Significance (Required)):

    General assessment As mentioned above, I struggle to see this as a strong leap for the malaria community to use this genome, as it is so similar to the current 17X genome, which is manually curated in PlasmoDB.

    Response: We agree that it is important to know how similar the genomes of 17X and the cloned 17XNL strain are. It is perhaps even more important to know what the key differences are as well. In this study, we have asked and answered these questions, and identified 2000+ variants between the strains. We have manually curated several of the variants that impact the expression of essential/important genes, and found that biologically meaningful differences exist (see Discussion). Finally, we have also provided additional information on the gene models of 17XNL, including an experimental definition of UTRs and transcript isoforms. Together, we hold that these data will not only match those currently available for 17X, but will exceed them. We are currently working with PlasmoDB to make these data readily accessible to our community.

    Advance The authors should make the comparison of ONT and PacBio HiFi clearer and discuss why the technologies still don't generate telomere-to-telomere sequences. From the biological side, none of the found differences were related to the different phenotype between 17X and 17XNL.

    Response: We have provided these comparisons and all related data to the reader in this manuscript, as well as through public depositories. Please see above for our responses as to why a true telomere-to-telomere assembly is challenging with Plasmodium parasites, and for a recent preprint that might provide an explanation for this. Finally, the phenotypic differences between 17X and 17XNL are variable, which might reflect differences in individual parasite stocks, as has been historically seen in the spontaneous development of lethality in multiple laboratories. While we do not find that any particular genetic difference correlates with a specific phenotype, these data from the cloned 17XNL parasite available from BEI provide a robust reference with a defined parasite stock.

    Audience: I do agree that adding the UTR sequence will be useful for those working with P. yoelii as a model, or who want to do comparative UTR analysis across species.

    Response: We agree that this additional gene model information will be valuable. We are working with PlasmoDB to make this information readily available and are already integrating it into our ongoing studies.

  2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


    Referee #2

    Evidence, reproducibility and clarity

    The paper has three distinct parts,

    1. Assembly of the P. yoelii yoelii 17XNL
    2. Annotation of the genome and adding UTR regions
    3. Comparing the sequence of 17XNL with 17X.

    Assembly: The authors present a novel assembly for the P. yoelii yoelii 17XNL genome. They used two different approaches, comparing Oxford Nanopore (ONT) long reads + Illumina DNA with PacBio HiFi. None of the approaches generated a telomere-to-telomere assembly, so sequences from the 17X reference were used to fill in the missing sequence.

    Annotation Next, they generated long reads (ONT) and Illumina RNA-Seq to improve the annotation. Although their annotation is not better than the current P. yoelii 17X reference genome in PlasmoDB, they could predict the UTR regions and alternative splice sites due to the 3' capturing approach and long reads. Having the UTR annotated and potentially having alternative splice sites is useful for the field.

    17XNL - 17X comparison: The authors compared the 17XNL with the 17X reference. Both genomes were done with PacBio, and it should be noted that P. yoelii has a GC content of probably ~23% with several homopolymer tracts. Further, the 17XNL genotype was obtained from a 17X culture, so the genomes are expected to be very similar, as the authors noted in the introduction. The authors found ~2000 differences; some are in genes, but many are indels, which could very well be sequencing errors.
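
    The concern about indels in homopolymer tracts can be illustrated with a minimal sketch. It assumes the reference sequence is available as a plain string and that variant positions are 0-based. The function names and the run-length threshold of 5 are hypothetical choices for illustration, not values taken from the manuscript or the review.

```python
def homopolymer_run_length(seq: str, pos: int) -> int:
    """Return the length of the run of identical bases that contains seq[pos] (0-based)."""
    base = seq[pos]
    start = pos
    # Extend the run to the left while the same base continues.
    while start > 0 and seq[start - 1] == base:
        start -= 1
    end = pos
    # Extend the run to the right while the same base continues.
    while end < len(seq) - 1 and seq[end + 1] == base:
        end += 1
    return end - start + 1


def in_homopolymer(seq: str, pos: int, min_run: int = 5) -> bool:
    """Flag a position as lying in a homopolymer tract of at least min_run bases.

    min_run=5 is an assumed, illustrative threshold.
    """
    return homopolymer_run_length(seq.upper(), pos) >= min_run


if __name__ == "__main__":
    reference = "ACGTTTTTTTAGC"  # toy sequence with a run of seven T bases
    print(in_homopolymer(reference, 5))  # position inside the T run
    print(in_homopolymer(reference, 1))  # position outside any long run
```

    Indel calls flagged this way are not necessarily errors, but they are the ones that would benefit most from confirmation with an independent read set.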

    Finally, the authors claim that this genome could become relevant for the community as a new reference for analyses. As their genome is so similar to 17X, they have to show that their annotation is at least as good as the current (manually curated) 17X reference genome and that the differences are not due to sequencing errors in 17X or 17XNL.

    Major comments

    Overall, I struggle to see the need for a "NEW" P. yoelii reference. It would be good to state how similar these genomes are - they are basically identical. As the 17X is curated manually, it would have made more sense to me to start from that one and then generate the UTR annotation and include splice sites. This could be easily loaded into an alternative Web-apollo track and then merged into the current annotation to be useful to the community.

    I wonder whether many of the differences the authors found between 17X and the 17XNL reference are true. The authors are correct that some differences between 17X and 17XNL are true. However, I could not find any evidence of genome polishing with tools like Pilon or iCORN to correct sequencing errors, so I wonder whether these differences are sequencing errors. Did the authors look into the reads of the NCBI - GCA_900002385.2 - assembly? Maybe they could use the underlying Illumina reads if theirs don't have enough coverage. Also, the differences between 17X and 17XNL could be because the reference is wrong. How many pseudogenes did they obtain? Are there more or fewer than in the current reference?

    To confirm the calls, could you also map the 17XNL reads against the 17X reference and see whether they still hold? At the same time, map the 17X Illumina reads to see whether the reference is correct at these positions. When looking at the alignments, it can be seen that many differences are in low-complexity/repetitive regions. The authors sequenced their genome with a PacBio HiFi run and also with ONT + DNA-seq, but why did they not get 16 chromosomes out? For example, the current P. yoelii reference was assembled directly into far fewer pieces than theirs [P. chabaudi assembles into 16 pieces]. Could it be a different read depth, or is it the fragment length? Could the authors please comment on that? Also, if there were contigs, why did they fill the sequence with 17X sequence rather than keeping gaps? So in the end, their sequence is a hybrid of 17X and 17XNL, right?

    Why do you think you had less coverage of CCS reads around the telomere ends? Do you think it is a systematic issue with PacBio HiFi? Did you see any evidence in the Illumina or ONT reads - or could it be that the telomere ends dropped off during culturing?

    I realised that the authors used a lot of primary tools. I wonder why they chose that path, as there are several tools for automatic finishing of long-read assemblies: Assemblosis, ARAMIS, MpGAP or ILRA. The last one in particular focuses on Plasmodium genomes. Please comment.

    Also, for the annotation, would it not be better to transfer the manually curated genome annotation with Liftoff or RATT? All these tools are widely used in the generation of reference genomes in the parasitology field. I annotated their sequence with Companion, and although their gene models are good and some of the Companion calls might need improvement, overall, the Companion results look more exact to me. The code is very well organised, and it was easy to follow. Are you planning to put it in a GitHub repository? For the annotation in the attachment, there were two files. I had a look at them and they were quite different.

    As 17X and this genome are basically identical (<2k variants), would it not be better to transfer the genes from the 17X genome and then add the UTRs (see comment before)? The 17X is manually curated. Table 1 and Figure 4 show that it is far better. I doubt that the community would use this genome if the annotation is not lifted over.

    There are two GFF files in the supplemental material. Which one is better? It is excellent that the genome is submitted to NCBI. Why are there 18k proteins? Are these the alternatively spliced forms?

    Minor

    The current Py 17X genome in PlasmoDB is a PacBio assembly (https://plasmodb.org/plasmo/app/record/dataset/TMPTX_pyoeyoelii17X), but it is not part of the 2014 paper. It was submitted to NCBI later than the paper the authors cite. Also, the current P. berghei PacBio genome is from Fougère et al. PLoS Pathog 2016;12(11):e1005917.

    I tried to open the supplemental tables, but they were all in PDF rather than Excel format and split over several pages. Two had missing information, e.g. UTR per gene. From the names of the tables, I had an idea of what they should contain, but for a resubmission, it would be good to have them in the correct format.

    To me, the beginning of the results reads a bit like an introduction (the part about which sequencing technology to use).

    Could you add to the tables: the sequence coverage of the three technologies, how many contigs you had before ordering the contigs, and the number of pseudogenes in the annotation?

    I struggle with the section header at lines 229-230, which states that the new sequence is more complete, as it is a hybrid assembly with 17X. Alternatively, please explain how the consensus was built.

    The authors correctly state that ONT is great (lines 333ff), but why does it not generate telomere-to-telomere chromosomes in this case? Please discuss.

    Significance

    General assessment: As mentioned above, I struggle to see this genome as a strong leap forward for the malaria community, as it is so similar to the current 17X genome, which is manually curated in PlasmoDB.

    Advance: The authors should make the comparison of ONT and PacBio HiFi clearer and discuss why the technologies still don't generate telomere-to-telomere sequences. From the biological side, none of the differences found were related to the different phenotypes of 17X and 17XNL.

    Audience: I do agree that adding the UTR sequence will be useful for those working with P. yoelii as a model, or who want to do comparative UTR analysis across species.

  3. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.



    Referee #1

    Evidence, reproducibility and clarity

    The manuscript entitled "Long-Read Genome Assembly and Gene Model annotation for the Rodent Malaria Parasite P. yoelii 17XNL" is a well-written manuscript providing updates and important observations about the genome assembly and annotation of this specific non-lethal isolate. Overall, the group did a great job showing how newer technologies such as long-read DNA and direct RNA sequencing can be applied to generate top-quality genomes to be used as a reference for the community. Here are some comments about the work presented:

    Major comments:

    • The authors included several results in the methods section, which makes the text repetitive, since the same information is also presented in the results section. Please revise the methods section to remove results from it.
    • Some methods are also repeated in the results section. For example, in lines 141-142, the group describes which DNA extraction kit they used (again, this is correctly mentioned in the methods section).
    • Although it is important, the group added a lot of information comparing base-call accuracy and sequencing methods. I agree that having this information in the supplemental material is great, but I would be careful not to focus too much on it, since most of the observations are already well known to the community, and would focus more on the biological relevance of what is being generated with the newly updated genome.
    • The group did a great job generating two versions of the genome, and an updated gene annotation set using long-read sequencing. But the major question is, how about alternative splicing? They mention the use of it (line 350), but I don't see any results on how many alternative transcripts were observed, or on whether they were differentially detected in the different life stages of the sets used for the RNA sequencing. This is a very important result to add, since it is one of the key pieces of information that long-read RNA sequencing brings to genome annotation.
    • Same observation as above for potential long ncRNAs.
    • From what I understand, the HiFi run was able to generate a gapless genome assembly and the ONT run was not. What was the final coverage for each? From my experience with P. falciparum genomes, ONT, even with the rapid kit, was able to generate chromosome-level assemblies if the coverage was >100x (but again, this is not a rule). Add those valuable observations about the depth so the reader can check whether other variables should be considered in the comparison.
    • Also be sure that the structural comparisons between the genomes are not the ones used after running ragtag.py. If so, there is a high chance of structural bias in the scaffolded contigs.
    • How did Prokka differ from BRAKER2 for the mitochondrial/apicoplast (MIT/API) annotation? This needs to be very well described, since Prokka is made for prokaryotic organisms and not for eukaryotic ones, and BRAKER2 uses a custom-built dataset for training, which I believe contains known information about MIT/API for Plasmodium species.
    • In Figure 5B, what is the peak observed in the mitochondria? Which genes? Repeats?
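
    The coverage question raised above comes down to simple arithmetic: mean depth is the total number of sequenced bases divided by the genome size. A minimal sketch follows, assuming read lengths have already been extracted from the sequencing output. The function name and the toy numbers are hypothetical and are not values from the manuscript.

```python
from typing import Iterable


def mean_coverage(read_lengths: Iterable[int], genome_size_bp: int) -> float:
    """Estimate mean sequencing depth as total sequenced bases / genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return sum(read_lengths) / genome_size_bp


if __name__ == "__main__":
    # Toy example: three reads over a 600 bp "genome" (illustrative numbers only).
    print(mean_coverage([1000, 2000, 3000], 600))
```

    Reporting this value for each technology would let readers judge whether depth alone could explain the difference in contiguity between the ONT and HiFi assemblies.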

    Minor comments:

    • For Oxford Nanopore sequencing using the ligation kit, did the group check for potential chimeric reads generated by the protocol?
    • Check if all species are italicized (for example, line 187 P. yoelii is not)
    • In the methods, add the parameters used with minimap2 for the direct RNA alignment.
    • For variant calling, I would require a minimum of 10x coverage to make a variant call instead of 5x. Although the calls look reproducible across all checks, I would be careful, mainly with the single-bp deletions, at such a low threshold.
    • In some parts of the methods, the authors mention slight modifications to some protocols (for example, lines 443 and 454). Although these are well described in the text, could you highlight what the modifications were? This will help many other researchers understand why those modifications were needed.
    • As mentioned in the major comments, the data analysis methods section needs rework to remove results from the text.
    • The group mentioned that small contigs not mapping to Py17X were discarded. What are those? Repeats? Contamination?
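
    The suggested change in the coverage threshold for variant calls can be sketched as a simple depth filter. This is a minimal illustration, assuming the variants are stored as VCF text lines with the depth recorded in a `DP=` entry of the INFO column. The function names are hypothetical, and this is not the authors' pipeline.

```python
from typing import Iterable, Iterator, Optional


def parse_depth(info: str) -> Optional[int]:
    """Return the DP value from a VCF INFO string, or None if it is absent."""
    for field in info.split(";"):
        if field.startswith("DP="):
            return int(field[3:])
    return None


def filter_by_depth(vcf_lines: Iterable[str], min_depth: int = 10) -> Iterator[str]:
    """Yield variant records whose INFO/DP is at least min_depth.

    Header lines (starting with '#') and records without a DP entry are skipped.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue  # not a complete VCF record
        depth = parse_depth(cols[7])
        if depth is not None and depth >= min_depth:
            yield line


if __name__ == "__main__":
    records = [
        "##fileformat=VCFv4.2",
        "chr1\t100\t.\tAT\tA\t50\tPASS\tDP=7",
        "chr1\t200\t.\tG\tC\t60\tPASS\tDP=12;AF=1.0",
    ]
    for kept in filter_by_depth(records, min_depth=10):
        print(kept)
```

    Running the same filter with `min_depth=5` and `min_depth=10` and comparing the surviving single-bp deletions would show directly how sensitive the reported differences are to the threshold.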

    Significance

    This work generated a strong method and resource for a better genome assessment of P. yoelii for the community. As I mentioned in my comments, some more details about the findings, such as alternative splicing and lncRNAs, may strengthen the publication even more. I know that comparative analysis between Py17X and 17XNL is not in the scope here, but more information about it, such as a synteny plot, would be great for the community to understand that they can rely on this new reference genome.

    I've been working with eukaryotic and prokaryotic genomes for more than a decade and I have a lot of experience with all the methods presented. I believe that the depth generated for the ONT data may be one of the reasons the chromosome level was not reached for this isolate, since it was reached with HiFi. The group did a great job on the method description, and I believe that the community will be very happy to incorporate this genome as one of the references for this organism.