Widespread of horizontal gene transfer events in eukaryotes

Kun Li
Fazhe Yan
Zhongqu Duan
David L. Adelson
Chaochun Wei

Abstract

Horizontal gene transfer (HGT) is the transfer of genetic material between distantly related organisms. While most genes in prokaryotes can be horizontally transferred, HGT events in eukaryotes are considered rare, particularly in mammals. Here we report the identification of HGT regions (HGTs), which are genomic sequence fragments indicating the occurrence of HGT events, in human, mouse, cow, lizard, frog, zebrafish, fruit fly, nematode, Arabidopsis and yeast. By comparing the genomes of these 10 representative eukaryotes with 1,496 eukaryotic genomes, 16,098 bacteria and 11,695 viruses, we found between 10 and 243 non-redundant HGTs per species, most of which were previously unknown. These HGTs have transformed their host genomes with varying numbers of copies and have impacted hundreds, even thousands, of genes. We list several examples of HGTs and propose some possible routes by which HGT events occurred. Further analysis showed that the majority of the 1,496 eukaryotes with full-length genome sequences also contain HGTs. Our findings reveal that HGT is widespread in eukaryotic genomes and is a ubiquitous driver of genome evolution for eukaryotes.

Arcadia Science
Apr 14, 2023

. For instance, among the 313 non-redundant HGT trees for Homo sapiens, Pan troglodytes was found in 312 of them, therefore the HGT-appearance number NHP between Homo sapiens and Pan troglodytes was 312.

This is still fairly confusing and I'm not sure what it means
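One possible reading of the quoted sentence is that the HGT-appearance number is simply a count of how many of the focal species' HGT trees also contain a given partner species. The sketch below illustrates that reading only; the function name, the representation of a tree as a set of species names, and the toy data are assumptions, not the authors' implementation.

```python
from collections import Counter

def appearance_counts(trees, focal):
    """For each partner species, count the trees of `focal` that contain it.

    `trees` is an iterable of species-name collections, one per
    non-redundant HGT tree (a hypothetical representation).
    """
    counts = Counter()
    for species_in_tree in trees:
        members = set(species_in_tree)
        if focal not in members:
            continue  # not a tree for the focal species
        counts.update(members - {focal})
    return counts

# Toy data: three trees for Homo sapiens (invented for illustration).
trees = [
    {"Homo sapiens", "Pan troglodytes", "Escherichia coli"},
    {"Homo sapiens", "Pan troglodytes"},
    {"Homo sapiens", "Mus musculus"},
]
counts = appearance_counts(trees, "Homo sapiens")
print(counts["Pan troglodytes"])  # 2
```

Under this reading, the value 312 in the quoted text would mean that Pan troglodytes appears in 312 of the 313 trees built for Homo sapiens.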

Arcadia Science
Apr 14, 2023

Widespread of HGTs among eukaryot

Did you do anything to deal with contamination? Contamination is fairly widespread, even in refseq genomes, and might lead to unexpected results.

Arcadia Science
Apr 14, 2023

Functional annotation for genes overlapping with HGTs (see Methods) revealed some significantly enriched Gene Ontology terms (GO terms) (Bonferroni<0.05) for protein-coding genes from mouse, fruit fly and nematode as well as non-coding genes from yeast (Table S11). The significant GO terms for nematode were “hemidesmosome, intermediate filament”, while the significant GO term for mouse was “protein kinase A binding”. HGTs in fruit fly that overlapped with coding genes were enriched for “ATP binding, lipid particle, microtubule associated complex”, etc. HGTs in yeast overlapped with non-coding genes enriched for “retrotransposon nucleocapsid, transposition, RNA-mediated, cytosolic large ribosomal subunit”, etc.

shouldn't this be a part of the results section?

Arcadia Science
Apr 14, 2023

novel

can you explain what you mean by novel? not found by other studies? only found in one genome?

Arcadia Science
Apr 14, 2023

HGTs were clustered using the cd-hit-est program (version 4.6.6)[43] with minimum nucleotide identity set at 80

This might not be low enough to detect orthology if the HGT event is ancient. One recent paper showed similarity drops as low as ~40% https://doi.org/10.1101/2022.08.25.505314

Arcadia Science
Apr 14, 2023

We further evaluated the pipeline with a genome containing simulated HGT regions. Since our HGT identification pipeline has two main steps, sequence composition-based filtering step and genome comparison step. The evaluation was done for the two steps (Figure S3, Table S1). While top 1% fragments were input to the pipeline, 20.6% correct results would be identified after sequence composition-based filtering and 14.3% correct results identified after genome comparison. When the percentage of fragments input was up to 50%, 83.4% and 77.7% correct results were identified after two steps respectively. It can be seen that the precision of prediction was higher than 60% for all cases. This indicated that we may have underestimated the number of HGTs (low recall rate) but majority of the identified HGTs were highly reliable.

This paragraph was a bit confusing to follow but I think I got the gist of it after a few passes through! I'm curious if you thought about controlling for natural variation in 4mer frequency throughout the genome, as some other methods have found that this helps reduce off target predictions (reviewed in https://doi.org/10.1371/journal.pcbi.1004095). It may not be necessary since you do a second step after the initial screen, but I was just curious if that was something you thought about putting in place, and if so, why you decided against it
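Part of what makes the quoted paragraph hard to follow may be that it reports two different measures: the share of simulated regions that were recovered (recall) and the share of predictions that were correct (precision). A minimal sketch of how the two are computed, using invented region identifiers rather than anything from the paper:

```python
def precision_recall(predicted, truth):
    """Return (precision, recall) for two collections of region IDs."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical example: 5 simulated regions, 3 predictions, 2 of them correct.
truth = {"r1", "r2", "r3", "r4", "r5"}
predicted = {"r1", "r2", "x1"}
precision, recall = precision_recall(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A pipeline can therefore miss most true regions (low recall) while still being right about most of what it does report (high precision), which appears to be the situation the quoted text describes.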

Arcadia Science
Apr 14, 2023

non-redundan

Would you be willing to provide a more clear definition of non-redundant here? does this mean there are no paralogs of the gene? or the HGT only occurred in one model org? or only one genome of all of the 824+13 that you investigated?

Arcadia Science
Apr 14, 2023

. The copy number of each HGT was determined from the number of merged HGT copies

Are all of these long read genomes? If not, will this be an unreliable estimate?

Arcadia Science
Apr 14, 2023

1000-bp segments with 200-bp

How did you assess these numbers? In metagenome binning, 1 kb isn't large enough to get confident estimates of tetranucleotide frequency; you often need > 2500 bp.
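The window-size concern can be made concrete with a small sketch of tetranucleotide counting over sliding windows. The quoted fragment is truncated, so reading "200-bp" as the step between window starts is an assumption, as are the function names; this is not the authors' code.

```python
from collections import Counter
from itertools import product

# All 256 possible tetranucleotides over A, C, G, T, in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]

def tetra_freq(seq):
    """Return the 256 tetranucleotide frequencies of `seq` (N-containing 4-mers ignored)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[k] / total for k in KMERS]

def windows(seq, size=1000, step=200):
    """Yield (start, subsequence) for full-length sliding windows.

    size=1000 follows the quoted text; step=200 is an assumed reading.
    """
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]

# Average number of observations per 4-mer bin at two window sizes.
for size in (1000, 2500):
    print(size, round((size - 3) / len(KMERS), 2))
```

A 1000-bp window holds 997 overlapping 4-mers spread over 256 bins, roughly 3.9 per bin, versus roughly 9.8 per bin at 2500 bp, which is one way to see why short windows give noisy frequency estimates.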

Arcadia Science
Apr 14, 2023

er “-e 1e-5”

The E-value can change with the size of the database; how did you account for this?
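To illustrate the dependence on database size: in BLAST-style Karlin-Altschul statistics, the expected number of chance hits for a normalized bit score S' is E = m * n * 2^(-S'), where m and n are the (effective) query and database lengths. The sketch below applies that general formula with invented numbers; whether the tool behind the quoted "-e 1e-5" option computes E-values exactly this way is an assumption.

```python
def expected_hits(bit_score, query_len, db_len):
    """E-value under the Karlin-Altschul model: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)

# The same alignment (bit score 60, 1000-bp query) against two database sizes.
small_db = expected_hits(60, 1000, 1e9)    # about 8.7e-07
large_db = expected_hits(60, 1000, 1e11)   # about 8.7e-05

threshold = 1e-5
print(small_db <= threshold, large_db <= threshold)  # True False
```

The same hit passes a fixed 1e-5 cutoff against the smaller database and fails it against a database 100 times larger, so filtering on bit score, or fixing the effective database size, gives thresholds that are comparable across databases.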

Arcadia Science
Apr 14, 2023

1bp

Is 1 base pair a large enough overlap to be biologically important?
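For reference, a 1-bp criterion corresponds to the loosest possible interval test. The sketch below shows such a test with a configurable minimum overlap; the half-open coordinate convention, the function names, and the 50-bp alternative are illustrative assumptions, not values from the paper.

```python
def overlap_bp(a_start, a_end, b_start, b_end):
    """Length in bp shared by two half-open intervals [start, end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))

def overlaps(a, b, min_bp=1):
    """True if intervals a and b share at least `min_bp` bases."""
    return overlap_bp(*a, *b) >= min_bp

hgt = (100, 200)   # hypothetical HGT region
gene = (199, 300)  # hypothetical gene touching its last base

print(overlaps(hgt, gene))             # True: they share exactly 1 bp
print(overlaps(hgt, gene, min_bp=50))  # False under a stricter cutoff
```

Reporting results at more than one minimum overlap, or as a fraction of gene length, would show how sensitive the gene counts are to this choice.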

Arcadia Science
Apr 14, 2023

4 4 21, 24

Does this mean the 4 you found are different from the ones found in 21 and 24? if so, how do you account for missing the ones found in 21 and 24?

Arcadia Science
Apr 14, 2023

Euclidean distanc

Why did you use Euclidean distance?
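For context, the Euclidean distance between two k-mer frequency vectors is the square root of the summed squared differences across bins. A minimal sketch with a toy four-bin example (the vectors are invented, and how the authors applied the distance is not stated in the quoted fragment):

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Toy example: a skewed window profile versus a uniform genome-wide profile.
window_profile = [0.5, 0.5, 0.0, 0.0]
genome_profile = [0.25, 0.25, 0.25, 0.25]
print(euclidean(window_profile, genome_profile))  # 0.5
```

Because every bin contributes equally regardless of its expected frequency, rare and common k-mers are weighted the same, which is one reason the choice of metric is worth justifying.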

Version published to 10.1101/2022.07.26.501571 on bioRxiv
Jul 28, 2022

