Bacterial Prophage Precision Prediction Solution
Abstract
Prophages are dormant forms of temperate bacteriophages integrated into bacterial chromosomes, playing central roles in microbial evolution, virulence regulation, and metabolic expansion. Accurate delineation of prophage boundaries, particularly the coordinates of the recombination sites attL and attR, is a prerequisite for in vitro phage rebooting, phage therapy development, and probiotic safety assessment.
This solution uses ONT R10.4.1 third-generation long-read sequencing assemblies (finished genomes) as input, building a dual-engine collaborative prediction framework (geNomad + PIDE), integrated with CheckV completeness assessment, PHASTEST functional annotation, and PhiSpy independent validation. It achieves high-precision prophage prediction with att-site coordinate deviation ≤±100 bp, and has been systematically validated on 4 Lactobacillus genomes (13 prophages in total), with 100% PhiSpy validation rate and all PHASTEST completeness ratings as Intact.

1. Background and Introduction
1.1 Definition and Research Significance of Prophages
A prophage is the dormant form of a temperate phage genome after site-specific recombination and integration into the host chromosome. During the lysogenic cycle, phage DNA replicates together with the host genome. When the host is exposed to induction signals such as DNA damage (which triggers the SOS response) or nutritional stress, the prophage excises from the integration site and initiates the lytic cycle, ultimately lysing the host to release mature phage particles. The research significance of prophages is reflected in the following aspects:
(1) Horizontal Gene Transfer (HGT): Auxiliary metabolic genes (AMGs) and virulence genes carried by prophages spread across bacterial communities via lysogenic conversion, representing a major source of bacterial genome diversity.
(2) Virulence Factor Encoding: Key virulence factors such as cholera toxin (CT) and Shiga toxin (Stx) are prophage-encoded; their induced expression directly affects the pathogenicity of bacterial pathogens.
(3) Probiotic Stability: In industrial lactic acid bacteria, spontaneous prophage induction is a significant risk factor for production batch failures and must be systematically assessed during strain development.
(4) Phage Engineering: After precisely locating prophage boundaries (including att sites), the complete phage genome can be amplified for in vitro rebooting of infectious phage particles.
1.2 Integration Mechanism and att Sites
The integration process is mediated by the phage-encoded integrase: the enzyme recognizes the phage attP site (Phage Attachment Site) and the host attB site (Bacterial Attachment Site) and catalyzes strand exchange, generating the flanking attL and attR sites and embedding the phage genome into the host chromosome. Excision requires cooperative action of the integrase and excisionase (Xis) to restore the prophage to a circular molecule that enters the lytic cycle.
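Because strand exchange occurs within a core sequence shared by attP and attB, the resulting attL and attR sites typically carry the same short core as a direct repeat on either side of the prophage. The following is a minimal sketch of how such a shared core could be searched for in two sequence windows taken around the predicted left and right boundaries. The function name, the `min_len` default, and the exact-match requirement are illustrative assumptions, not the method used by any tool named in this document.

```python
def longest_shared_core(left_window: str, right_window: str, min_len: int = 10):
    """Find the longest exact substring shared by two boundary windows.

    Returns (core, left_offset, right_offset) with 0-based offsets into each
    window, or None when the best match is shorter than min_len.
    min_len=10 is an illustrative default, not a published threshold.
    """
    left, right = left_window.upper(), right_window.upper()
    best_len, best_i, best_j = 0, 0, 0
    # Classic longest-common-substring dynamic programming, one row at a time.
    prev = [0] * (len(right) + 1)
    for i in range(1, len(left) + 1):
        curr = [0] * (len(right) + 1)
        for j in range(1, len(right) + 1):
            if left[i - 1] == right[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_i = i - curr[j]
                    best_j = j - curr[j]
        prev = curr
    if best_len < min_len:
        return None
    return left[best_i:best_i + best_len], best_i, best_j
```

The quadratic run time is acceptable for windows of a few hundred base pairs; real att cores may contain mismatches, so an exact search like this one is only a first-pass heuristic.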

Figure 1. Schematic of phage lysogenic/lytic cycle switching and att site recombination
The coordinate accuracy of att sites is critical for in vitro reconstruction. A deviation of even a few dozen base pairs in attL/attR positions can cause integrase recognition failure, rendering rebooting experiments entirely ineffective. Based on systematic benchmarking by Zhou et al. (2025), boundary coordinate errors among current mainstream prediction tools span a wide range (from hundreds to thousands of bp), underscoring the practical necessity of establishing a high-precision prediction pipeline.
1.3 Technical Support of Third-Generation Sequencing for Prophage Prediction
Short-read sequencing (e.g., Illumina PE150) frequently causes assembly breaks (contig breakpoints) at att sites due to short repeat sequences, leading to systematic boundary coordinate drift. The ONT PromethION R10.4.1 platform with SUP (Super Accuracy) basecalling produces high-quality data with mean read length >5 kb and Q20+ base proportion >73%, enabling one-pass acquisition of finished bacterial genomes and fundamentally eliminating assembly break issues. Taking the validation strain Lacticaseibacillus casei as an example, the raw sequencing depth was approximately 365×, yielding an assembly of 2 circular contigs (N50 = 2,940,950 bp; BUSCO ≥99%), providing a high-quality sequence background for accurate prediction.
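For readers unfamiliar with the assembly metric quoted above, the sketch below shows how N50 is conventionally computed from a list of contig lengths. The function name is illustrative, and the second contig length used in the usage note is a hypothetical value, since the source reports only the N50 and the number of contigs.

```python
def n50(contig_lengths):
    """Return the N50: the length of the contig at which the cumulative
    length, summed from the longest contig downward, first reaches half
    of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")
```

For a two-contig assembly made of one chromosome and one much smaller replicon, for example `n50([2940950, 60000])` with a hypothetical 60 kb second contig, the N50 equals the chromosome length.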
1.4 Major Technical Challenges
(1) att Site Boundary Delineation: The attL/attR regions contain short repeat sequences; automated tools generally exhibit systematic over- or under-prediction biases;
(2) Degraded Prophage Identification: Genomic features differ significantly across degradation levels (Intact/Questionable/Incomplete), and judgment criteria are inconsistent across tools;
(3) Homology Limitations in AT-rich Genomes: Lactobacillus GC content ranges from 34% to 50%; sequence-homology-based tools carry a risk of missing novel phages with low similarity to known references;
(4) Insufficient Single-tool Precision: The evaluation by Zhou et al. (2025) shows Base Precision ranging from 0.35 to 0.91 across existing tools, so multi-tool collaborative validation is required.
2. Overview of Prediction Tools and Selection Rationale
2.1 Performance Comparison of Major Tools
Current mainstream prophage prediction tools are based on different core algorithms, each with distinct emphases in boundary accuracy, completeness assessment, and computational efficiency. Base Precision is defined as the proportion of bases in the predicted sequence that belong to the true phage genome (TP bases / [TP bases + FP bases]), and is the core metric for measuring the false-positive base rate in boundary prediction.
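The Base Precision definition above can be illustrated with a short calculation over coordinate intervals. This is a minimal sketch that assumes 0-based, half-open intervals on a single contig; the function names are illustrative and are not taken from the benchmark of Zhou et al. (2025).

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals [start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def base_precision(predicted, truth):
    """TP bases / (TP bases + FP bases) for predicted vs. true phage regions.

    Every predicted base is either TP (inside a true region) or FP, so the
    denominator is simply the total number of predicted bases.
    """
    pred = merge_intervals(predicted)
    true = merge_intervals(truth)
    tp_bases = 0
    for p_start, p_end in pred:
        for t_start, t_end in true:
            tp_bases += max(0, min(p_end, t_end) - max(p_start, t_start))
    predicted_bases = sum(end - start for start, end in pred)
    return tp_bases / predicted_bases if predicted_bases else 0.0
```

A prediction of `[(1000, 3000)]` against a true region of `[(1500, 3000)]` contains 500 false-positive bases at its left edge and gives a Base Precision of 0.75, which shows how boundary over-prediction lowers the metric even when the prophage itself is found.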
Table 1. Comparison of major prophage prediction tools (Base Precision data source: Zhou et al., 2025, Genome Biology)
2.2 Tool Selection Rationale
This solution's tool selection is based on the systematic benchmarking by Zhou et al. (2025) on finished genomes of 38 intestinal bacterial strains (using induction experiment sequencing results as the gold standard), following these principles:
(1) Dual Primary Prediction: In vitro rebooting requires strict att-site coordinate accuracy (deviation ≤±100 bp). PIDE, the tool with the highest Base Precision (0.91), and geNomad, the tool with the best overall metrics (MCC=95.3%), are run in parallel; the two complement each other in algorithmic design and applicable scenarios;
(2) VirSorter2 Exclusion: Zhou et al. (2025) showed VirSorter2 Base Precision of only ~0.35, with excessively high boundary false-positive rates, making it unsuitable for downstream applications requiring precise att coordinates;
(3) CheckV Independent Quality Control: As a quality assessor completely independent of the prediction tools, CheckV quantitatively scores the completeness of candidate sequences by comparing their encoded proteins against reference viral genomes (amino acid identity), effectively filtering low-quality fragments;
(4) PHASTEST Functional Validation: Its standardized completeness classification system (Intact/Questionable/Incomplete) is used for cross-validation of candidate prophages, with functional gene annotation provided for manual review;
(5) PhiSpy Global Validation: As an independent whole-genome scanning tool, it is used to cross-validate all candidate regions, with a required validation rate of 100%.
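The validation rate in item (5) can be thought of as the fraction of candidate regions that are supported by an overlapping region from the independent scan. The sketch below is one possible way to compute it. The document does not state how "validated" is decided, so the overlap rule and the `min_fraction` default are assumptions, as are the function names.

```python
def overlap_fraction(candidate, region):
    """Fraction of the candidate interval [start, end) covered by region."""
    c_start, c_end = candidate
    r_start, r_end = region
    overlap = max(0, min(c_end, r_end) - max(c_start, r_start))
    return overlap / (c_end - c_start)


def validation_rate(candidates, validator_regions, min_fraction=0.5):
    """Share of candidates covered by at least one validator region.

    min_fraction=0.5 is an illustrative assumption, not a documented rule.
    Returns 0.0 when there are no candidates.
    """
    if not candidates:
        return 0.0
    validated = sum(
        1
        for cand in candidates
        if any(overlap_fraction(cand, reg) >= min_fraction
               for reg in validator_regions)
    )
    return validated / len(candidates)
```

A required rate of 100% then corresponds to `validation_rate(...) == 1.0`; all coordinates are assumed to lie on the same contig.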
3. Integrated Analysis Pipeline
3.1 Overall Strategy
This solution employs an analytical framework of "dual-engine parallel prediction + three-tier independent validation + coordinate system integration," using att-site coordinate accuracy (deviation ≤±100 bp) and PhiSpy global validation rate (100%) as quality control thresholds. The overall pipeline comprises five main phases: dual-engine initial prediction, CheckV completeness assessment, PHASTEST functional annotation, PhiSpy independent validation, and dual-scheme coordinate integration with att site refinement.
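To make the coordinate integration step more concrete, the sketch below pairs overlapping predictions from two engines and separates those whose boundaries agree within a tolerance from those that need further review. The pairing rule, the reuse of the ±100 bp deviation threshold as the agreement tolerance, and all names (`Region`, `reconcile`) are assumptions for illustration; the source does not specify the integration logic.

```python
from typing import NamedTuple


class Region(NamedTuple):
    contig: str
    start: int
    end: int


def reconcile(engine_a, engine_b, tolerance=100):
    """Pair overlapping predictions from two engines.

    Returns (concordant, discordant, single):
      concordant: pairs whose start and end both differ by <= tolerance bp
      discordant: overlapping pairs with larger boundary differences
      single:     regions reported by only one engine
    """
    concordant, discordant, single = [], [], []
    used_b = set()
    for a in engine_a:
        partner = None
        for idx, b in enumerate(engine_b):
            if idx in used_b:
                continue
            if a.contig == b.contig and a.start < b.end and b.start < a.end:
                partner = idx
                break
        if partner is None:
            single.append(a)
            continue
        used_b.add(partner)
        b = engine_b[partner]
        if abs(a.start - b.start) <= tolerance and abs(a.end - b.end) <= tolerance:
            concordant.append((a, b))
        else:
            discordant.append((a, b))
    single.extend(b for idx, b in enumerate(engine_b) if idx not in used_b)
    return concordant, discordant, single
```

Under this reading, concordant pairs would correspond to prophages confirmed by both engines, while discordant and single-engine regions would be the ones passed to the downstream validation tiers for a boundary decision.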
3.2 Pipeline Flowchart

Figure 2. Integrated prophage precision prediction pipeline flowchart (dual-engine prediction + three-tier independent validation framework)
3.3 Operational Standards for Each Phase
3.3.1 Input Data Quality Requirements
att Site Refinement: Extend ±100 bp at both ends of all final coordinates to ensure complete inclusion of upstream and downstream regulatory sequences flanking attL and attR sites, providing sufficient att sequence context for primer design.
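The ±100 bp extension described above amounts to simple coordinate arithmetic, with one practical detail: the extended interval must not run past the ends of the contig. A minimal sketch follows, assuming 1-based inclusive coordinates; the function name is illustrative. Clamping at the contig ends is a simplification, since on a circular contig a prophage near the origin would instead need wrap-around handling.

```python
def extend_with_flanks(start, end, contig_length, flank=100):
    """Extend a 1-based inclusive interval by `flank` bp on each side,
    clamped to the contig boundaries [1, contig_length]."""
    if not 1 <= start <= end <= contig_length:
        raise ValueError("coordinates must satisfy 1 <= start <= end <= contig_length")
    return max(1, start - flank), min(contig_length, end + flank)
```

For example, a prophage at 500 to 40,500 on a 2,940,950 bp contig is reported as 400 to 40,600 after extension.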
Table 2. Input Data Quality Requirements

4. Technical Advantages
4.1 Comparison with Standard Annotation Approaches

Figure 3. Comparison of key metrics between the GeneRulor precision prediction pipeline and standard annotation approaches
Table 3. Key metric comparison with standard annotation approaches

4.2 Core Technical Advantages
Table 4. Core Technical Advantages

5. Application Scenarios

Figure 4. Major application scenarios for prophage precision prediction
(1) Synthetic Biology and In Vitro Phage Rebooting: Design specific amplification primers based on precise attL/attR coordinates, obtain complete prophage genome sequences for in vitro assembly and functional rebooting, providing functional materials for phage therapy and endolysin development.
(2) Probiotic Genome Stability and Safety Assessment: Systematically evaluate the completeness and induction risk of prophages carried by industrial lactic acid bacterial strains (e.g., Lacticaseibacillus casei, Lactobacillus gasseri), providing data support for strain development and production quality control.
(3) Clinical Pathogenomics: Precisely annotate prophages encoding virulence factors (stx, ctx, etc.) and antibiotic resistance genes (AMR) in pathogens, revealing mechanisms of horizontal virulence transfer and supporting epidemiological tracing and clinical prevention strategies.
(4) Microbial Ecology and Evolutionary Research: Quantitatively assess horizontal gene transfer (HGT) frequency within microbial communities, construct phage-host coevolution networks, and elucidate the regulatory roles of prophages in microbiome dynamics.
(5) Advanced Annotation Module for Bacterial Finished Genomes: As a standard value-added analysis module for ONT third-generation sequencing bacterial finished genome reports, it provides high-precision prophage prediction in addition to routine genome annotation, enhancing the scientific depth and application value of reports.
6. Validation Example
6.1 Validation Dataset Overview
This solution used finished genomes of 4 Lactobacillus strains (Lacticaseibacillus casei, Lactobacillus gasseri, L. gasseri2, L. murinus), sequenced with ONT R10.4.1, as the validation dataset.
6.2 Prophage Prediction Results
A total of 13 high-quality prophage candidates were identified across the 4 strains, all meeting the quality control thresholds: 100% PhiSpy validation rate, all PHASTEST completeness ratings as Intact, and all CheckV completeness scores at Medium-quality or above (≥50%). Among these, 6 were confirmed by both prediction schemes, and 7 had their final boundaries determined by PHASTEST evaluation.
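The three quality control thresholds listed above can be expressed as a single pass/fail check per candidate. The sketch below assumes each candidate has already been summarized into one record; the `Candidate` fields and the example values in the usage note are hypothetical and do not reproduce the parsers or output formats of CheckV, PHASTEST, or PhiSpy.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    checkv_completeness: float  # CheckV completeness estimate, in percent
    phastest_rating: str        # "Intact", "Questionable", or "Incomplete"
    phispy_validated: bool      # True if supported by the PhiSpy scan


def passes_qc(candidate, min_completeness=50.0):
    """Apply the three thresholds stated in the text: CheckV completeness
    >= 50%, PHASTEST rating Intact, and PhiSpy validation."""
    return (
        candidate.checkv_completeness >= min_completeness
        and candidate.phastest_rating == "Intact"
        and candidate.phispy_validated
    )
```

A hypothetical record such as `Candidate("prophage_1", 87.5, "Intact", True)` passes, whereas the same record rated `"Questionable"` would be held back for manual review.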
Table 5. Summary of prophage prediction results for 4 Lactobacillus strains (illustrative; subject to actual analysis report)

6.3 PHASTEST Prophage Functional Structure

Figure 5. Prophage Functional Structure Diagram
6.4 Quality Control Metrics Achievement
Table 6. Quality Control Metrics Achievement

7. Service Content and Sample Requirements
7.1 Standard Deliverables
(1) High-Precision Coordinate Table: Complete coordinate list for all prophages (including attL/attR ±100 bp flanking coordinates), confidence scores from each tool, and coordinate source documentation;
(2) att Site Sequence Files: Precise attL/attR coordinates and flanking FASTA sequences, with primer design recommendations;
(3) Completeness QC Report: Comprehensive quality control report combining CheckV quantitative scores and PHASTEST standardized classification;
(4) Functional Genomic Maps: PHASTEST genome structure visualization and functional gene annotation summary (integrase/structural proteins/lysin, etc.);
(5) Raw Analysis Data Package: Complete output files from geNomad, PIDE, CheckV, and PhiSpy, ensuring full result reproducibility.
7.2 Service Workflow
Table 7. Service Workflow

7.3 Sample Submission Requirements
Table 8. Sample Submission Requirements

8. References
[1] Camargo AP, Roux S, Schulz F, et al. Identification of mobile genetic elements with geNomad. Nat Biotechnol. 2024;42(8):1303–1312.
[2] Zhou C, Zhang Y, Liu Y, et al. Highly accurate prophage island detection with PIDE. Genome Biology. 2025;26:45.
[3] Nayfach S, Camargo AP, Schulz F, et al. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat Biotechnol. 2021;39:578–585.
[4] Wishart DS, Han S, Saha S, et al. PHASTEST: faster than PHASTER, better than PHAST. Nucleic Acids Res. 2023;51(W1):W443–W450.
[5] Akhter S, Aziz RK, Edwards RA. PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies. Nucleic Acids Res. 2012;40(16):e128.
[6] Sereika M, Kirkegaard RH, Karst SM, et al. Oxford Nanopore R10.4 long-read sequencing enables near-perfect bacterial genomes. Nat Methods. 2022;19:823–826.
[7] Kolmogorov M, Yuan J, Lin Y, et al. Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol. 2019;37:540–546.