Long-read WGS Structural Variation Detection

1. Background

The rapid development of gene editing technologies, particularly tools based on double-strand breaks (DSBs) such as CRISPR/Cas9, has brought revolutionary breakthroughs to gene therapy and cell therapy. However, in addition to the intended edit, DSBs can trigger large-scale genomic structural variation (SV) at unintended off-target sites or near the target locus, including large deletions, insertions, duplications, inversions, and translocations. These rearrangements may inactivate endogenous genes, activate oncogenes, or cause chromosomal instability, posing serious threats to the safety and efficacy of gene editing products [1].

Conventional next-generation sequencing (NGS) is inherently limited in detecting complex SVs by its short read lengths (typically 150-300 bp). Short reads often cannot span the breakpoints of large SVs or of SVs in repetitive regions, which reduces sensitivity for translocations (reported as breakends, BND, in VCF output) and complex rearrangements [2]. A method that can accurately and comprehensively evaluate gene editing-induced SVs is therefore critically important.

Third-generation sequencing (TGS), particularly Oxford Nanopore Technologies (ONT) sequencing, addresses this limitation through its ultra-long read lengths (averaging 16-20 kb) and single-molecule sequencing capability. Such reads can span complex SV breakpoints and resolve the major classes of structural variants, including those in high-repeat regions (such as the HLA locus) and GC-rich regions that are poorly covered by NGS. This protocol is based on Nanopore long-read sequencing and aims to provide a scientifically rigorous and comprehensive SV detection and safety assessment framework for gene editing products.

2. Technical Principles and Detection Workflow

2.1 Nanopore Sequencing Principle

Nanopore sequencing is a single-molecule, real-time sequencing technology based on electrical signal detection. At its core is a nanoscale pore formed by a specialized protein complex embedded in an insulating membrane. Driven by an electric field, single-stranded DNA molecules are threaded through the nanopore by a motor protein at a controlled rate. As the bases (A, T, C, G) pass through the pore, the ionic current changes in a characteristic way that depends on the few nucleotides occupying the pore at that moment. By capturing and decoding these current signals in real time, the base sequence of the DNA molecule can be read directly [3].

Figure 1. Schematic Diagram of Nanopore Sequencing Principle

DNA molecules are unwound by motor proteins and threaded through the nanopore; as different bases translocate, they generate unique ionic current signatures that enable real-time sequence readout.

2.2 Long-read Nanopore SV Detection Workflow

The detection workflow is divided into four major stages: the laboratory stage, data processing stage, SV detection and identification stage, and annotation and assessment stage.

(1) Laboratory Stage: High-quality genomic DNA is extracted from experimental and control samples, quality-checked, and used to construct Nanopore sequencing libraries, followed by long-read sequencing at a depth of ≥30×.

(2) Data Processing Stage: Raw sequencing data are quality-controlled using Filtlong, and high-quality long reads are aligned to the reference genome using minimap2 [4].

(3) SV Detection and Identification Stage: Somatic SVs unique to the experimental group are detected using specialized tools such as Severus in Case-Control mode. Concurrently, potential off-target sites of the sgRNA are predicted using tools such as CRISPRme. sgRNA-dependent SVs are identified by comparing SV coordinates with predicted sites within a distance threshold of <50 bp.

(4) Annotation and Assessment Stage: Identified SVs are functionally annotated using tools such as AnnotSV to assess their impact on gene function and potential clinical risk. Cross-validation with orthogonal experimental data such as GUIDE-seq can be performed when available to improve result reliability.
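The data processing stage described in (2) can be sketched as a short command plan. The Python sketch below only builds the command lines and does not run them. The file names, thread count, and filtering thresholds (minimum length 1 kb, keep the best 90% of bases) are illustrative assumptions rather than this protocol's settings, and the flags should be checked against the installed versions of Filtlong, minimap2, and samtools.

```python
import shlex


def build_qc_and_alignment_commands(
    reads_fastq: str,
    reference_fasta: str,
    sample_name: str,
    threads: int = 16,        # assumption: adjust to the available cores
    min_length: int = 1000,   # assumption: illustrative Filtlong threshold
    keep_percent: int = 90,   # assumption: illustrative Filtlong threshold
) -> list:
    """Return shell command strings for read QC, alignment, and BAM indexing."""
    filtered = f"{sample_name}.filtered.fastq.gz"
    bam = f"{sample_name}.sorted.bam"
    q = shlex.quote  # protect paths that contain spaces or shell characters

    # Step 1: discard short reads and keep the best-scoring fraction of bases.
    filter_cmd = (
        f"filtlong --min_length {min_length} --keep_percent {keep_percent} "
        f"{q(reads_fastq)} | gzip > {q(filtered)}"
    )
    # Step 2: align with the Nanopore preset and coordinate-sort the output.
    align_cmd = (
        f"minimap2 -ax map-ont -t {threads} {q(reference_fasta)} {q(filtered)} "
        f"| samtools sort -@ {threads} -o {q(bam)} -"
    )
    # Step 3: index the sorted BAM for downstream SV calling.
    index_cmd = f"samtools index {q(bam)}"
    return [filter_cmd, align_cmd, index_cmd]


if __name__ == "__main__":
    for cmd in build_qc_and_alignment_commands("edited.fastq.gz", "GRCh38.fa", "edited"):
        print(cmd)
```

The same function would be called once for the edited sample and once for the control, so that both BAM files are produced with identical settings.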

Figure 2. Long-read Nanopore SV Detection Workflow Diagram

2.3 Core Analytical Logic

2.3.1 Structural Variation Detection (Severus)

Severus is an SV detection tool specifically designed for long-read sequencing data, capable of accurately identifying somatic SVs and complex genomic rearrangements [5]. This protocol employs Case-Control mode to precisely identify gene editing-induced somatic SVs by comparing the genomes of experimental and control groups.
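As a rough illustration of the paired comparison, the sketch below assembles a Severus argument list with the edited sample as the target and the unedited sample as the control. The option names (`--target-bam`, `--control-bam`, `--out-dir`, `-t`) follow the Severus command-line help as understood at the time of writing and should be confirmed with `severus --help`; the file names and thread count are assumptions, and any additional options used in this protocol are not shown.

```python
def build_severus_command(
    edited_bam: str,
    control_bam: str,
    out_dir: str,
    threads: int = 16,  # assumption: adjust to the available cores
) -> list:
    """Build an argument list for a paired (edited vs. control) Severus run."""
    return [
        "severus",
        "--target-bam", edited_bam,    # edited (experimental) sample
        "--control-bam", control_bam,  # unedited control sample
        "--out-dir", out_dir,          # directory for Severus results
        "-t", str(threads),
    ]


if __name__ == "__main__":
    cmd = build_severus_command(
        "edited.sorted.bam", "control.sorted.bam", "severus_out"
    )
    print(" ".join(cmd))
    # To execute the run: import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when it is later passed to `subprocess.run`.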

2.3.2 sgRNA-dependency Identification

To distinguish gene editing-induced SVs from randomly occurring background variants, sgRNA-dependency identification is performed. First, tools such as CRISPRme [6] are used to predict potential on-target and off-target sites across the entire genome based on the sgRNA and PAM sequences. The genomic coordinates of the somatic SVs detected by Severus are then compared against the predicted sites, and an SV is classified as sgRNA-dependent when its coordinates lie within the distance threshold (<50 bp) of a predicted site.
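The coordinate comparison can be sketched as a proximity search using the <50 bp threshold given in the workflow. The sketch below is a minimal illustration, not the pipeline's implementation: the `Breakpoint` and `PredictedSite` records, the bin-based index, and the bin width are all assumptions, and parsing of the Severus VCF and the CRISPRme output is left out.

```python
from collections import defaultdict
from typing import NamedTuple

BIN_SIZE = 10_000  # assumption: index bin width in bp, chosen for illustration


class Breakpoint(NamedTuple):
    sv_id: str   # an SV with two breakpoints contributes two records
    chrom: str
    pos: int


class PredictedSite(NamedTuple):
    chrom: str
    start: int
    end: int     # inclusive


def distance_to_site(pos: int, site: PredictedSite) -> int:
    """Distance in bp from a position to a site (0 if inside the site)."""
    if pos < site.start:
        return site.start - pos
    if pos > site.end:
        return pos - site.end
    return 0


def sgrna_dependent_sv_ids(breakpoints, sites, max_distance=50):
    """Return sorted IDs of SVs with a breakpoint < max_distance bp from a site."""
    # Register each site in every bin its padded interval touches, so a
    # breakpoint only needs to inspect the single bin it falls into.
    index = defaultdict(list)
    for site in sites:
        first_bin = (site.start - max_distance) // BIN_SIZE
        last_bin = (site.end + max_distance) // BIN_SIZE
        for b in range(first_bin, last_bin + 1):
            index[(site.chrom, b)].append(site)

    dependent = set()
    for bp in breakpoints:
        for site in index.get((bp.chrom, bp.pos // BIN_SIZE), []):
            if distance_to_site(bp.pos, site) < max_distance:
                dependent.add(bp.sv_id)
                break
    return sorted(dependent)


if __name__ == "__main__":
    sites = [PredictedSite("chr2", 1000, 1022)]
    bps = [
        Breakpoint("sv1", "chr2", 1050),  # 28 bp away: dependent
        Breakpoint("sv2", "chr2", 1072),  # exactly 50 bp away: not dependent
        Breakpoint("sv3", "chr5", 1010),  # different chromosome
        Breakpoint("sv4", "chr2", 960),   # 40 bp upstream: dependent
        Breakpoint("sv4", "chr9", 500),   # second breakend of the same SV
    ]
    print(sgrna_dependent_sv_ids(bps, sites))  # ['sv1', 'sv4']
```

Treating each breakend as its own record means a translocation counts as sgRNA-dependent if either of its two breakpoints is near a predicted site.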

2.3.3 SV Annotation and Risk Assessment

Comprehensive functional annotation and risk assessment of identified sgRNA-dependent SVs is a critical step in evaluating gene editing safety. We use specialized tools such as AnnotSV, integrating multiple authoritative databases:

(1) Gene Functional Annotation: Determine whether the SV affects genic regions (exons, introns, etc.) and whether it causes frameshift mutations or gene fusions.

(2) Cancer Relevance Assessment: Cross-reference affected genes against ONCOGENE and TSG (tumor suppressor gene) databases to evaluate potential tumorigenic risk.

(3) Clinical Phenotype Association: Assess the potential pathogenicity of SVs in conjunction with ACMG guidelines and databases such as ClinVar.
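The three annotation checks above can be illustrated with a small flagging function. This is a hypothetical sketch: the rules, the flag wording, and the two example gene sets are assumptions made for illustration and do not reproduce the logic of AnnotSV, the ACMG guidelines, or any curated database.

```python
# Illustrative gene sets only; a real analysis would load curated
# oncogene and tumor suppressor gene (TSG) lists from a database.
EXAMPLE_ONCOGENES = frozenset({"MYC", "KRAS"})
EXAMPLE_TSGS = frozenset({"TP53", "RB1"})


def flag_sv(sv_type, genes, overlaps_exon,
            oncogenes=EXAMPLE_ONCOGENES, tsgs=EXAMPLE_TSGS):
    """Return review flags for one annotated SV (hypothetical rules)."""
    genes = set(genes)
    flags = []
    # Exonic damage to a tumor suppressor gene may inactivate it.
    if overlaps_exon and genes & tsgs:
        flags.append("exonic hit in tumor suppressor gene")
    # Any involvement of an oncogene is surfaced for manual review.
    if genes & oncogenes:
        flags.append("oncogene involved")
    # Breakends or inversions joining two genes can create a fusion.
    if sv_type in {"BND", "INV"} and len(genes) >= 2:
        flags.append("possible gene fusion")
    return flags


if __name__ == "__main__":
    print(flag_sv("DEL", ["TP53"], overlaps_exon=True))
    print(flag_sv("BND", ["MYC", "IGH"], overlaps_exon=False))
    print(flag_sv("INS", ["ACTB"], overlaps_exon=True))
```

Returning a list of flags rather than a single score keeps every reason for concern visible to the reviewer who makes the final risk call.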

3. Technical Advantages

This protocol combines the technical strengths of long-read sequencing with a professional bioinformatics analysis pipeline, offering the following core advantages:

| Advantage Dimension | Description |
|---|---|
| Ultra-long Read Lengths | Average read length of 16-20 kb with N50 reaching 20-27 kb, enabling reads to span complex structural variants and accurately identify translocations, inversions, and other events. |
| PCR-bias Free | Single-molecule sequencing eliminates the PCR amplification step, avoids amplification bias, preserves original sequence information, and supports accurate SV detection. |
| Complex Variant Detection | Designed specifically for long reads, enabling accurate identification of translocations (BND), complex rearrangements, and other SV types that are difficult to detect by NGS. |
| Coverage of High-repeat Regions | Capable of covering high-repeat regions such as HLA and GC-rich regions, reaching regions that are blind spots for NGS and providing a more complete genomic view. |
| Specialized SV Detection Tools | State-of-the-art SV detection tools such as Severus are used in Case-Control mode to accurately identify somatic SVs and effectively distinguish editing-related variants from background variants. |
| Comprehensive Functional Annotation | Specialized tools such as AnnotSV, integrating authoritative databases including ONCOGENE and TSG, comprehensively evaluate the functional impact and clinical risk of each SV. |

4. Long-read vs. Short-read Sequencing Comparison

Figure 3. Schematic Comparison of Long-read vs. Short-read Sequencing

Figure 3 visually illustrates the core differences between long-read sequencing (TGS) and short-read sequencing (NGS) in SV detection. Long-read sequencing far outperforms NGS in detecting complex SVs by virtue of its ultra-long reads and PCR-free approach.

| Feature | Long-read Sequencing (TGS) | Short-read Sequencing (NGS) |
|---|---|---|
| Read Length | Average 16-20 kb, N50 20-27 kb | 150-300 bp |
| PCR Amplification | Not required | Required |
| SV Detection Capability | Strong (especially complex SVs) | Moderate (simple SVs only) |
| High-repeat Regions | Well covered | Poorly covered |
| Translocation Detection | Accurate | Challenging |
| Primary Application | Complex SV detection, HLA region | SNV/InDel detection |

5. Application Scenarios

(1) Gene Editing Off-target Effect Assessment: Comprehensively and accurately detect all types of structural variants introduced by CRISPR/Cas9 and other gene editing tools, with a particular focus on translocations and complex rearrangements.

(2) HLA Region Editing Safety Assessment: The HLA locus is a prototypically high-repeat, highly polymorphic region; long-read sequencing enables accurate assessment of editing outcomes and safety in this region.

(3) CAR-T and Cell Therapy Product Development and QC: During product development and manufacturing, comprehensive SV detection of gene-edited cells ensures product safety.

(4) Complex Genomic Region Variant Research: Investigate structural variants in high-repeat regions, centromeres, telomeres, and other areas that are poorly accessible by short-read sequencing.

(5) Complementary to Short-read Sequencing: Combines the SNV/InDel detection strength of short-read sequencing with the SV detection power of long-read sequencing for comprehensive genomic variant coverage.

6. Sample Report

6.1 Data Quality Control and Alignment

To visually present overall data quality, this report shows the frequency distribution of sequencing read lengths (top) and base quality scores (bottom) for each sample.

Figure 4. Read Length and Base Quality Score Distribution
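The read length distribution is usually summarized by the mean and the N50, the two statistics quoted for this service. The sketch below shows one common way to compute them; the list of read lengths is made-up example data, not output from this report.

```python
def read_length_n50(lengths):
    """N50: the length L such that reads of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("no read lengths given")
    total_bases = sum(lengths)
    running = 0
    # Walk from the longest read down until half of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total_bases:
            return length


def mean_read_length(lengths):
    """Arithmetic mean read length in bp."""
    if not lengths:
        raise ValueError("no read lengths given")
    return sum(lengths) / len(lengths)


if __name__ == "__main__":
    # Made-up read lengths in bp, for illustration only.
    example = [30_000, 25_000, 20_000, 15_000, 10_000, 5_000, 5_000]
    print(read_length_n50(example))          # 25000
    print(round(mean_read_length(example)))  # 15714
```

Because N50 weights each read by the bases it contributes, it is larger than the mean whenever the distribution has a tail of long reads, which is why a library can report a 16-20 kb average alongside a 20-27 kb N50.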

The statistics for long-read data after quality control and minimap2 alignment to the reference genome are summarized in the table below:

| Samples | Total reads | Mapped reads | Mapping efficiency (%) |
|---|---|---|---|
| Sample | 5644815 | 5644592 | 100 |
| Control | 4782121 | 4781257 | 99.98 |
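The mapping efficiency values in the table follow from dividing mapped reads by total reads, and the read counts also allow a rough depth estimate. In the sketch below, the 18 kb mean read length (midpoint of the 16-20 kb range quoted for this service) and the 3.1 Gb human genome size are assumptions; the actual depth depends on the measured read lengths of each run.

```python
def mapping_rate_percent(total_reads, mapped_reads):
    """Mapped reads as a percentage of total reads, rounded to two decimals."""
    return round(100.0 * mapped_reads / total_reads, 2)


def estimated_depth(total_reads, mean_read_length_bp, genome_size_bp=3.1e9):
    """Approximate fold coverage: total sequenced bases divided by genome size."""
    return total_reads * mean_read_length_bp / genome_size_bp


if __name__ == "__main__":
    # Read counts taken from the table above.
    print(mapping_rate_percent(5_644_815, 5_644_592))  # 100.0 (99.996 before rounding)
    print(mapping_rate_percent(4_782_121, 4_781_257))  # 99.98
    # Assumed 18 kb mean read length and 3.1 Gb genome.
    print(round(estimated_depth(5_644_815, 18_000), 1))
```

Under these assumptions the edited sample lands slightly above the ≥30× depth target, while a shorter true mean read length would lower the estimate proportionally.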

6.2 Key Results

Figure 5. Circos Plot of sgRNA-dependent SV Genomic Distribution

Figure 6. sgRNA-dependent SV Annotation

7. Service Contents and Sample Requirements

7.1 Service Contents

| Service Item | Service Contents |
|---|---|
| Project Consultation | Senior technical experts assist in designing a rigorous experimental plan and defining sample and information requirements. |
| Sample Testing | Standardized sample quality inspection, library construction, and high-depth long-read sequencing (depth ≥30×). |
| Data Analysis | Execution of the complete bioinformatics pipeline including data QC, sequence alignment, SV detection, sgRNA-dependency identification, and functional annotation. |
| Report Delivery | Delivery of a comprehensive PDF report and complete analysis result files within the committed turnaround time (35-40 business days), including: complete detection analysis report; raw sequencing data; SV detection results; sgRNA-dependent SV list; SV annotation results; Circos visualization plots. |
| After-sales Support | Professional report interpretation and ongoing technical consultation. |

7.2 Sample Requirements

| Requirement Category | Item | Specific Requirements |
|---|---|---|
| Sample Submission Requirements | Genomic DNA (gDNA) | Total amount: ≥ 2 µg (Qubit quantification); Concentration: ≥ 50 ng/µL; Purity: OD260/280 = 1.8-2.0 |
| Sample Submission Requirements | Cell Samples | Cell number: ≥ 5 × 10^6 cells; Prepared as cell pellets; Storage: -80 °C or liquid nitrogen |
| Sample Submission Requirements | Tissue Samples | Weight: ≥ 100 mg; Processing: snap-frozen in liquid nitrogen; Storage: -80 °C or liquid nitrogen |
| Required Information | sgRNA Information | Complete 20 nt sgRNA sequence; PAM sequence |
| Required Information | Cas System Information | Specify the Cas protein type used (e.g., SpCas9, SaCas9, etc.) |
| Required Information | Sample Information | Clear sample identifiers; Description of pairing between edited and control groups |
| Turnaround Time | Turnaround Time | 35-40 business days |
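Before shipping, the gDNA submission thresholds listed above can be checked with a few lines of code. The function below is an illustrative sketch with hypothetical names; it encodes only the three gDNA criteria from the table and is not an official acceptance tool.

```python
def check_gdna_submission(total_ug, conc_ng_per_ul, od260_280):
    """Return a list of unmet gDNA requirements (empty if all are met)."""
    problems = []
    if total_ug < 2.0:
        problems.append("total amount below 2 µg")
    if conc_ng_per_ul < 50.0:
        problems.append("concentration below 50 ng/µL")
    if not 1.8 <= od260_280 <= 2.0:
        problems.append("OD260/280 outside 1.8-2.0")
    return problems


if __name__ == "__main__":
    print(check_gdna_submission(3.0, 80.0, 1.85))  # []
    print(check_gdna_submission(1.5, 40.0, 2.10))  # three unmet requirements
```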

8. References

[1] Cosenza, M. R., et al. (2022). Structural Variation in Cancer: Role, Prevalence, and Mechanisms. Annu Rev Genomics Hum Genet.

[2] Ho, S. S., et al. (2020). The current landscape of structural variation detection tools. Briefings in Bioinformatics.

[3] Deamer, D., et al. (2016). Nanopores: a new tool for single-molecule analysis. Chemical Society Reviews.

[4] Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics.

[5] Keskus, A. G., et al. (2025). Severus detects somatic structural variation and complex rearrangements in cancer genomes using long-read sequencing. Nat Biotechnol.

[6] Cancellieri, S., et al. (2023). Human genetic diversity alters off-target outcomes of therapeutic gene editing. Nat Genet.

[7] McLaren, W., et al. (2016). The Ensembl Variant Effect Predictor. Genome Biol.