Detecting copy number variants using Next Generation Sequencing


Copy number variants can have major consequences for our health, so detecting them reliably and robustly is important. Is NGS up to the job?

Copy number variants (CNVs) are responsible for 5-10% of genetic disease, which is why detecting them reliably and robustly is so important in clinical labs. Next Generation Sequencing (NGS) panels have proven extremely useful for clinical applications and can successfully detect many genomic variants, but they often struggle with CNV detection. For many laboratories this means running multiple workflows to perform ancillary tests for CNV identification, adding time and cost to testing. However, by combining expert NGS panel design with state-of-the-art bioinformatic pipelines, it is possible to call CNVs reliably and robustly using a single streamlined solution.

What are copy number variants?

The comparative study of different genomes in the Human Genome Project in 2001 revealed a surprising level of variability between individuals. The findings showed that around 5% of the human genome consists of redundant DNA segments present multiple times in various locations across the genome. These genomic regions became known as interspersed segmental duplications, and their existence proved for the first time that certain areas of the genome had changed in copy number during human evolution.1 These genomic areas are now recognized as being highly susceptible to copy number variation and are frequently studied in relation to human disease.

CNVs contribute to population-level genetic differences and, along with translocations and inversions, are responsible for structural alterations within the human genome. Genetic alterations range in size: the most common source of variation involves only single base pair (bp) changes, whereas larger-scale alterations may involve chromosomal rearrangements.

The term copy number variants refers to intermediate- to large-scale genomic alterations, which are operationally defined as DNA segments larger than 1000 base pairs (1 kb) but typically smaller than 5 megabases (Mb), equivalent to the cytogenetic level of resolution.2

CNVs vary in size and composition and often involve insertion and deletion events affecting genomic regions of 1 kb to 5 Mb. Because of the length of genome that they can alter, they are considered an intermediate- to large-scale source of variation.

CNVs can be classified into two subtypes, copy number polymorphisms (CNPs) and micro-insertions or micro-deletions, depending on the size of the genomic region affected. CNPs are smaller, typically spanning 1-10 kb, and have been associated with genes encoding drug detoxification and immunity proteins. Micro-insertions and micro-deletions affect much larger areas of the genome and can be over one million bp in length. These events have been linked with neurocognitive conditions including autism and schizophrenia.2

The BRCA genes are not the only cancer risk genes; over 100 variants are now recognized to increase cancer risk. Each year the UK National Genomic Test (NGT) Directory is updated to include all new genes that should be targeted through genomic testing services commissioned by the NHS for cancer diagnosis and treatment management. Recent updates to the NGT Directory have added five new genes, REST, DLST, SLC25A11, RNF43 and MDH2, which are associated with inherited cancer syndromes. Often, it is the combination as well as the frequency of variants that contributes to the increase in cancer risk.

How does copy number variation arise?

Both forms of CNV frequently appear close to or within interspersed segmental duplication regions. Although CNVs can arise through various mechanisms, the high proportion of duplicated DNA sequences within the human genome is a major contributing factor.

During meiosis, to initiate homologous recombination, the maternal and paternal chromosomes must align along the metaphase plate. Sequence homology guides the chromosomal alignment process; however, the presence of highly repeated DNA sequences can cause chromosomal misalignment, leading to aberrant recombination events.2 Non-allelic homologous recombination causes gains and losses of DNA segments, leading to copy number variation.3

Detection of CNVs using NGS

Although CNVs are prevalent in the human genome, they can be challenging to detect compared with other sources of genomic variation such as single nucleotide polymorphisms (SNPs). CNV identification and reporting for clinical applications often requires running multiple workflows, such as microarrays or multiplex ligation-dependent probe amplification (MLPA), alongside Sanger sequencing. Using a single streamlined workflow to profile many genes and detect all types of variants simultaneously reduces both time and cost. This can be achieved by combining expert NGS panel design with optimized bioinformatic pipelines, providing a streamlined solution for calling CNVs and eliminating the requirement for any ancillary tests.

The NGS approach of targeted sequencing achieves greater depth of coverage and has lower associated sequencing costs per sample compared with other NGS methods like whole genome sequencing. By using targeted NGS approaches, it is possible to detect multiple variants from various patient samples simultaneously, making it a viable option for clinical use, where there is a need to screen pathogenic variants and identify the genetic basis of health conditions and diseases.

Accurate detection of CNVs from targeted capture NGS data remains challenging, but the ability to robustly identify CNVs of all sizes and accurately distinguish CNV boundaries is becoming increasingly important. The UK National Genomic Test Directory specifies that, for hereditary cancer testing, all genes analysed must be tested for CNVs alongside SNPs and insertion-deletions (indels). In addition, NHS England has recently stated that CNV analysis should be conducted by NGS as the primary technique rather than MLPA.

Robust and accurate CNV calling requires high coverage and deep sequencing, which targeted NGS panels make possible; longer read lengths also contribute to improved CNV accuracy. However, the sensitivity of CNV detection also depends on the quality of the input DNA sample, the library preparation method used and the variant calling strategies implemented within the bioinformatic pipelines.

Various strategies can be implemented for CNV analysis in the bioinformatic pipeline. These include relative depth of coverage (Figure 1), paired-end mapping, single nucleotide polymorphism allele frequency and split read analysis.4 A further approach is the comparison of de novo assemblies against a reference genome assembly.5 To enable detection of CNVs of all sizes and improve detection of CNV boundaries, several of these methods may be implemented within the analysis workflow.


Figure 1: Copy number variant detection via depth of coverage. Figure adapted from A. Nord et al.4
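As a rough illustration of the depth-of-coverage strategy, the Python sketch below normalizes per-target read depths, computes a log2 ratio against a reference sample and flags targets that look like losses or gains. The function names, the pseudocount and the two thresholds are assumptions chosen for illustration only; they are not taken from any specific pipeline, and a clinical workflow would calibrate such values empirically.

```python
import math
from statistics import median


def normalize(depths):
    """Divide per-target depths by the sample median so that samples
    sequenced to different overall depths become comparable."""
    m = median(depths)
    if m <= 0:
        raise ValueError("median depth must be positive")
    return [d / m for d in depths]


def log2_ratios(sample_depths, reference_depths, pseudocount=0.01):
    """Per-target log2(sample / reference) on median-normalized depths.
    The pseudocount (an illustrative value) avoids log of zero."""
    if len(sample_depths) != len(reference_depths):
        raise ValueError("sample and reference must cover the same targets")
    s = normalize(sample_depths)
    r = normalize(reference_depths)
    return [math.log2((a + pseudocount) / (b + pseudocount)) for a, b in zip(s, r)]


def call_targets(ratios, loss_threshold=-0.6, gain_threshold=0.4):
    """Label each target. A heterozygous deletion is expected near
    log2(1/2) = -1 and a single-copy gain near log2(3/2) = 0.58;
    the thresholds here are illustrative assumptions."""
    calls = []
    for x in ratios:
        if x <= loss_threshold:
            calls.append("loss")
        elif x >= gain_threshold:
            calls.append("gain")
        else:
            calls.append("neutral")
    return calls


# Toy data: target 3 has half the expected depth, target 5 has 1.5x.
sample = [100, 100, 50, 100, 150, 100]
reference = [100, 100, 100, 100, 100, 100]
print(call_targets(log2_ratios(sample, reference)))
```

In practice, adjacent targets with consistent ratios would be merged into segments before a CNV is reported, which this sketch does not attempt.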

The clinical relevance of detecting CNVs, particularly for hereditary cancer testing

CNVs can be inherited in the germline (Table 1) or acquired as somatic mutations. Although most individual variants will go unnoticed in our lifetime, some have strong clinical relevance and are associated with disease.4 In breast cancer, somatic copy number alterations are known to play a major role in the development of the disease, but germline CNVs can confer an even greater, inherited risk; this is the case for loss-of-function variants in the breast cancer susceptibility genes BRCA1 and CHEK2.6 BRCA1 contains a relatively high number of Alu retrotransposon elements, which leads to more CNVs relative to SNVs and indels than in other breast cancer susceptibility genes such as BRCA2.7 Screening multiple cancer susceptibility genes and accurately detecting all variants, including CNVs, is therefore critical, not only for initial diagnosis but also for disease management.

Table 1: Syndromes and diseases associated with germline CNVs.

Condition: Associated germline CNV
Williams syndrome: 7q11.2 deletion
Autism: 16p11.2 deletion (plus many others)
Lynch syndrome: MSH2 exon 1-6 deletion (plus many others)
Schizophrenia: 16p11.2 duplication, VIPR2 duplication (plus others)
Charcot-Marie-Tooth: PMP22 duplication
Alpha thalassemia: HBA1 and HBA2 deletions

For clinical variant analysis, the content of the targeted NGS panel should cover key clinically relevant regions that are actionable. For hereditary cancer testing, it may be beneficial to profile genes and variants associated with all hereditary cancer syndromes in order to streamline testing. In this case, the panel content should be clinically enhanced to include not only genes associated with common hereditary cancers such as breast or prostate cancer, but also those associated with rarer syndromes such as phaeochromocytoma and paediatric cancers such as Wilms tumor.

For the diagnosis of hereditary cancer syndromes such as Lynch syndrome, it is important that the NGS workflow can distinguish between functional genes and pseudogenes, such as PMS2 and PMS2CL respectively. Lynch syndrome is an autosomal dominant condition caused by germline mutations in the mismatch repair genes MLH1, MSH2, MSH6 and PMS2.8 Individuals found to have pathogenic variants in these genes are at a much greater risk of developing gastrointestinal cancer and primary malignancies in diverse sites at a young age.9

Although multi-gene panels have proven to be a clinically viable solution, many NGS panels struggle to detect key hereditary CNVs and mosaic forms of CNVs, because this requires exceptional hybridization and capture panel performance, accomplished through high on-target rates and low duplication rates. In conjunction with panel design and performance, the associated bioinformatic pipelines need to provide high precision and recall for all variants without the need for MLPA analysis or additional ancillary tests.

The bioinformatic analysis pipeline should be fully automated and able to integrate with decision support software if required, to enable straightforward reporting of CNVs. The pipelines should detect SNVs, indels and CNVs with high precision and recall across all genes recommended in the UK NGT Directory (Table 2), including the more challenging genes for hereditary cancer testing such as MSH2 and BRCA1. To achieve maximum sensitivity and robust calling of CNVs from single exons to whole genes, increased coverage of key clinical CNVs may be required, along with the implementation of multiple CNV detection strategies and the use of several CNV callers to fully validate results.

Table 2: Recommended genes for screening hereditary cancer types, from the NGT Directory and covered in the GALEAS Hereditary Plus Panel.

Cancer type: Recommended genes
Breast: ATM, BARD1, BRCA1, BRCA2, CDH1, CHEK2, NBN, NF1, PALB2, PTEN, STK11, TP53
Colon: APC, AXIN2, BMPR1A, CHEK2, EPCAM, GREM1, MLH1, MSH2, MSH6, PMS2, MSH3, MUTYH, NTHL1, POLD1, POLE, PTEN, RNF43, SMAD4, STK11, TP53
Renal: BAP1, FH, FLCN, MET, SDHB, VHL
Ovarian: ATM, BARD1, BRCA1, BRCA2, CDH1, CHEK2, NBN, NF1, PALB2, PTEN, STK11, TP53, RAD51C, RAD51D
Prostate: ATM, BRCA1, BRCA2, CHEK2, MLH1, MSH2, MSH6, PALB2
Gastric: CDH1, KIT, PDGFRA, SDHC, SDHD, SDHA
Brain: APC, ATM, MLH1, MSH2, MSH6, PMS2, TP53
Sarcoma: EXT1, EXT2, MTAP, NF1, RECQL4, SQSTM1, TP53
Pediatric: CDKN1C, CTR9, REST, TRIM28, WT1

To improve CNV detection and minimize background noise during CNV calling, a panel of normals built into the bioinformatics software can be used as a reference model. This gives a representation of the expected read depths across the design panel that a single sample can then be compared to. For clinical use, the panel of normals should be acquired using a large cohort of clinically ‘normal’ samples. These samples should be processed using the same library preparation techniques and using the same sequencing workflow as the samples of interest. The data set from the panel of normals will provide a baseline to call CNVs, thereby dramatically improving the accuracy of CNV calling.
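The idea of comparing a single sample against a panel of normals can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the method of any particular software: the function names are hypothetical, the baseline is taken as the per-target median of median-normalized depths, spread is measured with the median absolute deviation (MAD), and the `min_mad` floor is an arbitrary illustrative value. A real panel of normals would use a far larger cohort than the four toy samples shown.

```python
from statistics import median


def normalize(depths):
    """Divide per-target depths by the sample median depth."""
    m = median(depths)
    if m <= 0:
        raise ValueError("median depth must be positive")
    return [d / m for d in depths]


def build_baseline(normals):
    """normals: list of per-sample depth lists, all in the same target order.
    Returns one (median, MAD) pair of normalized depth per target."""
    normed = [normalize(sample) for sample in normals]
    baseline = []
    for target_values in zip(*normed):
        med = median(target_values)
        mad = median([abs(v - med) for v in target_values])
        baseline.append((med, mad))
    return baseline


def robust_z(sample_depths, baseline, min_mad=0.01):
    """Robust z-score of each target against the baseline.
    1.4826 scales the MAD to approximate a standard deviation for
    normally distributed data; min_mad (an assumed floor) prevents
    division by a near-zero spread."""
    s = normalize(sample_depths)
    if len(s) != len(baseline):
        raise ValueError("sample and baseline must cover the same targets")
    return [(v - med) / (1.4826 * max(mad, min_mad)) for v, (med, mad) in zip(s, baseline)]


# Toy panel of normals: four samples, four targets.
normals = [
    [100, 200, 50, 100],
    [110, 210, 55, 105],
    [90, 190, 45, 95],
    [100, 205, 50, 98],
]
baseline = build_baseline(normals)

# Test sample in which the second target has roughly half the expected depth.
z = robust_z([100, 100, 50, 100], baseline)
print([round(x, 1) for x in z])
```

Targets whose score falls far below or above zero are candidates for a loss or gain. Because the baseline captures the depth profile expected for each target under the same library preparation and sequencing workflow, systematic capture biases cancel out instead of appearing as false calls.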

Streamlined workflows for CNV detection

References 

  1. Bailey JA, Gu Z, Clark RA, Reinert K, Samonte RV, Schwartz S, et al. Recent segmental duplications in the human genome. Science. 2002;297(5583):1003-7.
  2. Eichler EE. Copy number variation and human disease. Nat Educ. 2008;1(3):1.
  3. Lupski JR. Genomic disorders: structural features of the genome can lead to DNA rearrangements and human disease traits. Trends in Genetics. 1998;14(10):417-22.
  4. Nord A, Salipante SJ, Pritchard C. Copy Number Variant Detection Using Next-Generation Sequencing. In Clinical Genomics 2015 (pp. 165-187). Academic Press.
  5. Xing Y, Dabney AR, Li X, Wang G, Gill CA, Casola C. SECNVs: a simulator of copy number variants and whole-exome sequences from reference genomes. Frontiers in Genetics. 2020 Feb;11:510268.
  6. Hakkaart C, Pearson JF, Marquart L, Dennis J, Wiggins GA, Barnes DR, et al. Copy number variants as modifiers of breast cancer risk for BRCA1/BRCA2 pathogenic variant carriers. Communications Biology. 2022;5(1):1061.
  7. De Brakeleer S, De Grève J, Lissens W, Teugels E. Systematic Detection of Pathogenic Alu Element Insertions in NGS‐Based Diagnostic Screens: The BRCA1/BRCA2 Example. Human Mutation. 2013;34(5):785-91.
  8. Idos G. Lynch Syndrome. GeneReviews. 2021;1-43.
  9. Poaty H, Bouya LB, Lumaka A, Mongo-Onkouo A, Gassaye D. PMS2 Pathogenic Variant in Lynch Syndrome-Associated Colorectal Cancer with Polyps. Global Medical Genetics. 2023;10(01):001-5.