Trios-based inquiry into de novo copy number variants in the swine genome

Magdalena Frąszczak; Błażej Nowak; Martyna Kaźmierczak; Magda Mielczarek

doi:10.2478/aoas-2025-0120


The pig is one of the most economically important livestock species worldwide (Jiang et al., 2014) and an excellent model organism for research on human health, phenotypes, and disease. Indeed, the anatomy, physiology, and immunology of pigs are comparable to those of humans. Furthermore, the pig genome shows a higher similarity to the human genome than does that of the mouse, which is considered a standard model organism (Pabst, 2020; Walters and Prather, 2013).

Genomes contain many types of deoxyribonucleic acid (DNA) polymorphisms. Such genetic variation manifests at different genomic scales, from single-nucleotide polymorphisms (SNPs) to large-scale structural variants (SVs), and often involves variations in the number of copies of long genomic segments. Copy number variants (CNVs) are a particular subtype of SVs caused by deletions or duplications that range from approximately 50 base pairs (bp) to several megabase pairs (Mb). In recent years, CNVs have most commonly been detected in porcine genomes using array-based comparative genomic hybridisation (aCGH; see Fadista et al., 2008; Li et al., 2012; Wang et al., 2014), SNP arrays (Ramayo-Caldas et al., 2023; Wang et al., 2012; 2013; Xu et al., 2023), or whole genome sequence (WGS) data (Esteve-Codina et al., 2013; Liu et al., 2017; Paudel et al., 2013; Revilla et al., 2017; Wang et al., 2017; Zheng et al., 2020; Qian et al., 2023; Jang et al., 2023).

Since CNVs encompass more nucleotide sequences than SNPs, they have greater potential to impact phenotypic variation and disease susceptibility, for example by altering genes and gene promoter sequences. In pigs, genes related to disease (Long et al., 2016), olfaction, neurological processes (Paudel et al., 2013; 2015), coat colour (Giuffra et al., 2002), fatty acid composition (Revilla et al., 2017), and production performance (Jiang et al., 2014) are reportedly enriched in CNVs. Furthermore, Ramayo-Caldas et al. (2023) found associations between porcine CNVs and the diversity and composition of pig faecal microbiota.

All polymorphisms, including CNVs, are inherited or can arise de novo. Hehir-Kwa et al. (2011) analysed one of the largest cohorts (3,443 individuals) available for studying de novo CNVs using SNP microarrays and observed a significant association between paternal age and de novo CNV mutation rate. Interestingly, a recent large-scale aCGH analysis of 2,323 individuals by Wadhawan et al. (2020) found a significant association between maternal age and de novo CNV mutation rate. In humans, de novo CNVs are linked to multiple neurological diseases such as schizophrenia (Kirov et al., 2012), intellectual disability (Gilissen et al., 2014), autism spectrum disorder (Sanders et al., 2015), and neurodevelopmental disorders (Hamanaka et al., 2022). Arias et al. (2023) studied the de novo formation of CNVs in pigs using a sample of 478 parent-offspring trios, though SNP array genotyping only provided a low-accuracy estimation of CNV breakpoint positions. Meanwhile, Steensma et al. (2023) examined de novo SVs in commercial pig lines based on the WGS data of 37 trios, but did not focus on SV distribution. Instead, they aimed to highlight the potential of livestock breeding programmes to provide a suitable population structure for de novo SV identification and characterisation using ear, hair, and semen samples. However, the Steensma et al. (2023) data are not publicly available, so it is not possible to compare CNVs detected using such a vast dataset with those found in the current study.

Our study aimed to determine the prevalence of de novo formed CNVs in the porcine genome based on WGS data, and characterise the distribution of such polymorphisms across the genome. Moreover, we compared the length and number of de novo CNVs to inherited CNVs and annotated them in a functional context.

Material and methods

Sequenced individuals

The dataset consisted of WGSs, obtained using the Illumina HiSeq 2000, of twelve individuals representing the Polish Large White breed. The sequenced individuals represented two unrelated nuclear families comprising sire–dam–single offspring trios (two trios) or sire–dam–two full sibs (four trios) (Figure 1). All animals were housed in one closed piggery, divided into sectors, with standard environmental, microclimatic, and nutritional conditions. The temperature and humidity in the piggery ranged from 20 – 22°C and 70 – 80%, respectively. The animals were fed a commercial diet and had constant access to water. The datasets generated and analysed during the study are available in the National Center for Biotechnology Information database (Bioproject ID: PRJNA1172736) of the Sequence Read Archive (SRA) repository.

Bioinformatic pipeline

CNV identification from WGS data involved (1) raw data quality control using FastQC (Andrews, 2010) and MultiQC (Ewels et al., 2016), (2) quality-based read trimming using Trimmomatic (Bolger et al., 2014), (3) alignment to the Sscrofa11.1 reference genome with the BWA-MEM software (Li and Durbin, 2009), (4) post-alignment processing with the Picard (http://broadinstitute.github.io/picard) and SAMtools (Li et al., 2009) packages, (5) detection of CNV deletions and duplications with CNVnator (Abyzov et al., 2011) and Pindel (Ye et al., 2009), (6) CNV filtration, and (7) functional analysis with Variant Effect Predictor (McLaren et al., 2016), ShinyGO (Ge et al., 2020), and custom scripts. (1) Default parameters were used for FastQC, which included assessments of per-base sequence quality, average sequence quality, sequence duplication levels, and adapter content, among others. MultiQC was used with default settings to combine FastQC reports into a single HTML report. (2) Raw data cleaning involved trimming reads by applying a four-base sliding window and cutting the read once the average quality in the window dropped below 20 (SLIDINGWINDOW:4:20). Any reads shorter than 60 bp after trimming were removed (MINLEN:60). (3) Alignment to the Sscrofa11.1 reference genome (the assembly version: https://ftp.ensembl.org/pub/release-113/fasta/sus_scrofa/dna/) was performed with default parameters, including a seed length of 19, a mismatch penalty of four, and a gap open penalty of six. The read group ID was attached to every read in the output. (4) BAM files from multiple lanes were first merged using SAMtools merge (i), then sorted by genomic coordinates (ii) with SAMtools sort. Mate-pair information was corrected (iii) using SAMtools fixmate -m, which ensures proper pairing information for downstream analysis.
Genome-wide depth of coverage was calculated (iv) using SAMtools depth -a, which reports coverage at all genomic positions, including those with zero coverage. A custom script was then used to compute the average genome coverage (v) from the SAMtools depth output. Additionally, alignment summary statistics (vi), such as total reads, mapped reads, and properly paired reads, were obtained using SAMtools flagstat. (5) Both CNV detection programmes were used with default parameters, except for CNVnator, for which a window size of 200 bp was chosen. As Abyzov et al. (2011) suggest, a larger bin size should be used for lower genome coverage; the appropriate bin size was determined using the CNVnator eval option. (6) Identified raw CNVs were post-processed by removing (i) variants shorter than 50 bp or longer than 1,000,000 bp, (ii) Pindel calls supported by fewer than three reads, and (iii) variants overlapping with gaps in the reference genome. (iv) The final set of CNVs was based on the consensus output from CNVnator and Pindel. In summary, the CNVnator output was used as a baseline, while the validated variants comprised CNVs that had at least 90% overlap with CNVs detected by Pindel, considering the +/− 100 bp breakpoint accuracy. All steps from this section were described in detail by Mielczarek et al. (2023). In this study, only CNVs located on autosomes were further considered. CNVs detected in offspring that were not present in the parents or any other animals were considered de novo. (7) To explore the potential relationships between de novo CNVs and known pig QTL, the overlap between them was also analysed. Swine QTL mapped to the Sscrofa11.1 genome were downloaded from the Animal QTL database (http://www.animalgenome.org/cgi-bin/QTLdb/SS/index; release 56; Apr 24, 2025) (Hu et al., 2022), which includes 57,041 known QTL representing 406 different base traits and 1,088 variants of these traits.
A custom script was used to find the overlap at the bp level. The Variant Effect Predictor tool was used to annotate de novo CNVs by identifying their locations within or near genes. To explore potential functional implications of gene-level overlaps, we retrieved all affected genes to determine whether any were impacted by more than one large-scale variant. Canonical transcripts overlapping with de novo CNVs were selected for the functional enrichment analysis. The ShinyGO tool was used for exploring enrichment in Gene Ontology (GO) terms representing the molecular function category (Ashburner et al., 2000; The Gene Ontology Consortium et al., 2023) and in pathways defined by the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al., 2017). A false discovery rate (FDR) adjusted p-value cut-off of 0.05, as implemented in ShinyGO, was applied.
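The filtering and consensus steps in (6) can be sketched as follows. This is a minimal illustration, not the authors' script: the `Cnv` record, the function names, and the toy calls are hypothetical, and the 90% criterion is interpreted here as the fraction of the CNVnator call covered by a Pindel call of the same type after padding the Pindel call by 100 bp on each side, which is only one possible reading of the text. The reference-gap filter (iii) is omitted for brevity.

```python
from dataclasses import dataclass

MIN_LEN = 50            # bp, lower length bound from the text
MAX_LEN = 1_000_000     # bp, upper length bound from the text
MIN_PINDEL_READS = 3    # Pindel calls need at least this many supporting reads
MIN_OVERLAP = 0.90      # required covered fraction of the CNVnator call
SLACK = 100             # bp, breakpoint tolerance (assumed to pad the Pindel call)


@dataclass(frozen=True)
class Cnv:
    chrom: str
    start: int          # 1-based, inclusive (assumption)
    end: int            # 1-based, inclusive (assumption)
    kind: str           # "DEL" or "DUP"
    support: int = 0    # supporting reads; only meaningful for Pindel calls

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def length_ok(cnv: Cnv) -> bool:
    """Keep variants between 50 bp and 1,000,000 bp."""
    return MIN_LEN <= cnv.length <= MAX_LEN


def covered_fraction(base: Cnv, other: Cnv, slack: int = SLACK) -> float:
    """Fraction of `base` covered by `other` after padding `other` by `slack` bp."""
    if base.chrom != other.chrom or base.kind != other.kind:
        return 0.0
    lo = max(base.start, other.start - slack)
    hi = min(base.end, other.end + slack)
    return max(0, hi - lo + 1) / base.length


def consensus(cnvnator_calls, pindel_calls):
    """CNVnator calls (baseline) that are confirmed by a filtered Pindel call."""
    pindel_ok = [p for p in pindel_calls
                 if length_ok(p) and p.support >= MIN_PINDEL_READS]
    return [c for c in cnvnator_calls
            if length_ok(c)
            and any(covered_fraction(c, p) >= MIN_OVERLAP for p in pindel_ok)]


# Toy calls (hypothetical coordinates)
cnvnator = [Cnv("1", 1001, 3000, "DEL"),
            Cnv("1", 10001, 12000, "DEL"),
            Cnv("2", 501, 1500, "DUP")]
pindel = [Cnv("1", 1050, 2950, "DEL", support=5),    # confirms the first call
          Cnv("1", 10001, 12000, "DEL", support=2),  # too few supporting reads
          Cnv("2", 900, 1500, "DUP", support=8)]     # covers only about 70%
kept = consensus(cnvnator, pindel)
print(kept)
```

Only the first CNVnator call survives in this toy example: the second loses its only matching Pindel call to the read-support filter, and the third is covered to roughly 70% even with the 100 bp padding.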

CNVs detected only in offspring but shared by full siblings, or by full siblings and their half siblings, were considered putative de novo germline mutations that occurred in one of the parents and were thus subjected to functional annotation. This subset was not included in the gene-level overlap, QTL, or formal statistical analyses, as our study did not focus on de novo events occurring in the parents. However, given the potential biological relevance of these CNVs, we performed a functional analysis of them. It is worth keeping in mind that without CNV detection performed at single-cell or at least single-tissue resolution, it is difficult to differentiate which variants are inherited as germline mutations of the parent and which arise de novo in offspring. The systematic differentiation between these two sources of de novo variation would be an interesting follow-up to our study.
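The distinction between de novo CNVs and putative parental germline events described above can be expressed as a small decision rule. The sketch below is illustrative only: the animal identifiers, the pedigree, and the label names are hypothetical, and it assumes that each CNV has already been reduced to the set of animals carrying it.

```python
from itertools import combinations


def classify_cnv(carriers, pedigree):
    """Classify a CNV by who carries it.

    carriers: set of animal ids in which the CNV was detected.
    pedigree: dict mapping each offspring id to its (sire id, dam id);
              any animal that is not a key is treated as a parent.
    """
    if not carriers:
        raise ValueError("a CNV needs at least one carrier")
    if any(animal not in pedigree for animal in carriers):
        return "present_in_parent"            # inherited or shared with a parent
    if len(carriers) == 1:
        return "de_novo"                      # one offspring, no other animal
    # Several offspring and no parent: require that every pair shares a parent
    related = all(set(pedigree[a]) & set(pedigree[b])
                  for a, b in combinations(sorted(carriers), 2))
    if related:
        return "putative_parental_germline"   # full or half siblings only
    return "shared_by_unrelated_offspring"


# Hypothetical pedigree: o1 and o2 are full sibs, o3 is their half sib
pedigree = {
    "o1": ("sire_A", "dam_A"),
    "o2": ("sire_A", "dam_A"),
    "o3": ("sire_A", "dam_B"),
    "o4": ("sire_B", "dam_C"),
}
for carriers in ({"o1"}, {"o1", "o2"}, {"o1", "o2", "o3"},
                 {"o1", "dam_A"}, {"o1", "o4"}):
    print(sorted(carriers), classify_cnv(carriers, pedigree))
```

The last label is not a category used in the text; it is included only so that the function returns something sensible for every input.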

Statistical modelling of de novo copy number variants

All tests were performed separately on autosomal deletions and duplications. First, to define the subsequent statistical handling of the data (parametric or non-parametric methods), the null hypothesis of a normal distribution of CNV lengths and counts for both inherited and de novo variants was examined using the Kolmogorov-Lilliefors test for normality. The test assumes neither a predefined expected value nor a predefined variance of the normal distribution, and the test statistic is the maximal absolute difference between the empirical and theoretical normal cumulative distribution functions. The p-value was approximated using the Dallal-Wilkinson formula (Dallal and Wilkinson, 1986). Due to a lack of normality, non-parametric methods were applied. For the assessment of similarities in CNV distribution between pairs of individuals, the genome was divided into 1,000 bp regions, following the approach proposed by Jang et al. (2023). Each region was then classified as containing a CNV when at least one CNV overlapped with the region, or as CNV-free when there was no overlap. Based on this classification, the Jaccard similarity coefficients were calculated for each pair of individuals: $J = \frac{S_{11}}{S_{11} + S_{10} + S_{01}},$ where $S_{11}$ denotes the number of regions containing a CNV in both individuals, and $S_{01}$ and $S_{10}$ denote the numbers of regions where only one individual contained a CNV. Furthermore, Kruskal multidimensional scaling (MDS) was performed, based on the values of the Jaccard coefficient calculated for all possible pairs of individuals, to visualise the similarities between all animals. The next step determined CNV regions (CNVRs) by merging CNVs of the same type that wholly or partly overlapped between individuals, as applied in Revilla et al. (2017). Moreover, a CNV unique for an individual was also considered a CNVR.
CNVRs were then used to construct UpSet plots to present the number of regions common for related animals, as well as the de novo variants. Permutation tests evaluated the null hypothesis of no differences in the number of de novo and inherited CNVs using test statistics from the Wilcoxon signed rank test. The uniform distribution of de novo CNVs along the genome was tested using a Pearson goodness-of-fit test: $\chi^{2} = \sum_{i = 1}^{k} \frac{(n_{i} - np)^{2}}{np} \sim \chi_{k - 1}^{2},$ where $n_i$ denotes the distances between consecutive de novo CNVs, $n = 2{,}265{,}774{,}640$ is the total genome length (only autosomal chromosomes), and $p = \frac{1}{d}$, where $d$ is the number of de novo deletions/duplications.
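The binned Jaccard coefficient and the CNVR merging step described above can be illustrated with a short sketch. It assumes 1-based inclusive coordinates and CNVs given as `(chromosome, start, end)` tuples of a single type (deletions or duplications); the function names and the toy intervals are hypothetical, and returning 0.0 when neither animal carries a CNV is a convention chosen here, not one stated in the text.

```python
def cnv_bins(cnvs, bin_size=1000):
    """Set of (chrom, bin index) pairs overlapped by at least one CNV."""
    bins = set()
    for chrom, start, end in cnvs:
        first = (start - 1) // bin_size
        last = (end - 1) // bin_size
        for index in range(first, last + 1):
            bins.add((chrom, index))
    return bins


def jaccard(cnvs_a, cnvs_b, bin_size=1000):
    """J = S11 / (S11 + S10 + S01) over fixed-size genomic regions."""
    bins_a = cnv_bins(cnvs_a, bin_size)
    bins_b = cnv_bins(cnvs_b, bin_size)
    union = bins_a | bins_b              # S11 + S10 + S01 regions
    if not union:
        return 0.0                       # convention for two CNV-free genomes
    return len(bins_a & bins_b) / len(union)


def merge_cnvrs(cnvs):
    """Merge wholly or partly overlapping CNVs into CNV regions (CNVRs)."""
    merged = []
    for chrom, start, end in sorted(cnvs):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)   # extend current region
        else:
            merged.append([chrom, start, end])        # unique CNV is its own CNVR
    return [tuple(region) for region in merged]


# Toy example with two animals
animal_a = [("1", 1, 3000)]        # regions 0, 1, 2 on chromosome 1
animal_b = [("1", 2001, 5000)]     # regions 2, 3, 4 on chromosome 1
similarity = jaccard(animal_a, animal_b)
regions = merge_cnvrs(animal_a + animal_b + [("2", 100, 200)])
print(similarity, regions)
```

In the toy example one region is shared and five regions carry a CNV in at least one animal, so the coefficient is 1/5; the two overlapping calls on chromosome 1 collapse into a single CNVR.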

The null hypothesis of equal lengths of de novo and inherited CNVs was tested using the Mann-Whitney U test: $Z = \frac{U - \frac{km}{2}}{\sqrt{\frac{km(k + m + 1)}{12}}} \sim N(0, 1),$ where $U = \sum_{i = 1}^{k} R_{i} - \frac{k(k + 1)}{2}$, $R_i$ denotes the rank of the length of the i-th de novo variant in the vector of lengths of all CNVs, while $k$ and $m$ are the numbers of de novo and inherited CNVs, respectively. All calculations were performed using the corresponding functions implemented in R packages (ade4, ComplexHeatmap, dplyr, ggplot2, numbers, plotify, UpSetR, tidyverse; R Core Team, 2022).
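The two test statistics defined in this section can be computed directly from their formulas. The sketch below is a plain-Python illustration rather than the R functions used in the study: the function names and toy inputs are hypothetical, tied lengths receive average ranks, no tie correction is applied to the variance, the two-sided p-value uses the normal approximation, and the rank sum uses the conventional offset k(k + 1)/2.

```python
import math


def uniformity_chi2(distances, genome_length, n_variants):
    """Pearson statistic: sum over i of (n_i - n*p)^2 / (n*p), with p = 1/d."""
    expected = genome_length / n_variants          # n * p
    return sum((n_i - expected) ** 2 / expected for n_i in distances)


def mann_whitney_z(de_novo, inherited):
    """Return (U, Z, two-sided p) for lengths of de novo vs inherited CNVs."""
    k, m = len(de_novo), len(inherited)
    pooled = sorted(list(de_novo) + list(inherited))
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2       # mean of ranks i+1 .. j
        i = j
    u = sum(rank_of[x] for x in de_novo) - k * (k + 1) / 2
    z = (u - k * m / 2) / math.sqrt(k * m * (k + m + 1) / 12)
    p_value = math.erfc(abs(z) / math.sqrt(2))     # two-sided, N(0, 1)
    return u, z, p_value


# Toy inputs (hypothetical values, in bp)
chi2 = uniformity_chi2([100, 300, 200], genome_length=800, n_variants=4)
u, z, p_value = mann_whitney_z([1, 2, 3], [4, 5, 6, 7])
print(chi2, u, round(z, 4), round(p_value, 4))
```

In the toy Mann-Whitney example every de novo length is shorter than every inherited length, so U takes its minimum value of zero and Z is negative.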

Results

A comprehensive overview of all copy number variants

Most sequenced reads (between 98.11% and 98.60%, depending on the individual) were aligned to the reference genome, while the percentage of properly paired reads ranged from 95.31% to 96.00%. The resulting genome average coverage ranged from 10x to 19x (Additional file 1). The total number of deletions per individual ranged from 325 to 753. For almost all animals, the highest number of deletions (34 – 81) was located on pig chromosome SSC2, although it is almost half the length (151,935,994 bp) of SSC1 (274,330,532 bp). As expected, the lowest number of deletions for each individual (0 – 8) was always on the shortest (55,982,971 bp) autosome (SSC18) (Additional file 2). The number of duplications per individual ranged from 282 to 444, with the highest number (38 – 80) located on SSC7, while SSC18 contained the lowest number of duplications (0 – 3) (Additional file 3). The length of CNVs varied between 600 bp and 195,000 bp (6,662 ± 13,002 bp) for deletions, and between 1,200 bp and 561,400 bp (10,036 ± 33,438 bp) for duplications. Depending on the individual, this comprised 0.11% to 0.21% of the total autosomal length being deleted, and 0.26% to 0.35% of the autosomal genome being duplicated. The lowest total number of CNVs was detected on SSC18 for deletions (48) and duplications (11), while the highest number of deletions (667) was found on SSC2, covering 0.75% of this chromosome. However, longer deletions occurred on SSC9, resulting in the highest percentage coverage (0.96%) for this chromosome. Most duplications were identified on SSC7, covering 1.75% of its length (Additional file 4), with the total number and length of CNVs depending on the chromosome.

The CNV patterns within nuclear families exhibited a high degree of similarity, suggesting that a considerable proportion of CNVs are inherited. Full siblings demonstrated greater genomic similarity than unrelated individuals in terms of deletions and duplications (Figures 2 and 3). Moreover, the offspring exhibited a higher degree of similarity to the mother than to the father when considering duplications. However, some variation within the family was still observed and was caused by de novo CNVs.

All individuals had 42 deletions and 66 duplications in common, compared to the reference genome. The highest fractions of shared deletions and duplications were determined among full siblings and varied from 50% to 52% and 70% to 78%, respectively. The frequency of CNVs shared with family depended on the animal, and ranged between 84% – 89% for deletions and 90% – 96% for duplications. De novo variants constituted only 2% – 7% of all identified duplications, and between 9% and 13% of all deletions. The number of CNVRs within nuclear families, as well as the number of de novo CNVs, is presented in Figure 4 (a–d). It is worth noting that there were 29 deletions and seven duplications detected in full-sibling genomes, as well as six deletions and two duplications in the genomes of full-siblings and their half-siblings. Since these polymorphisms were not detected in parental genomes, but were defined in siblings, they could still be considered putative de novo germline mutations occurring in one of the parents.

De novo copy number variant survey

De novo CNVs made up only 9% of all CNVs, and the largest share (14%) was located on SSC1. Significantly fewer CNVs were de novo than inherited (p = 0.008 for deletions and 0.006 for duplications). The frequency of de novo deletions was the highest on SSC3 (0.236) and SSC18 (0.32), while the highest frequency of de novo duplications occurred on SSC16 (0.167) and SSC18 (0.333). The lowest frequency of de novo deletions occurred on SSC11 (0.069), with the lowest duplication frequency occurring on SSC5 (0.005). Despite the potentially higher detrimental impact of deletions over duplications, there were more de novo deletions than de novo duplications, regardless of the chromosome (Figure 5 and Additional file 5).

The distribution of de novo CNVs along the genome was non-uniform (p approximated zero for deletions and duplications). The distance between neighbouring de novo deletions varied from 600 bp to 67,171,600 bp, and that between neighbouring de novo duplications from 1,800 bp to 84,799,836 bp. The longest distances were observed on SSC8 for both deletions and duplications. In addition, we observed clusters of deletion variants, with only a few isolated single deletions (see Figure 6 for deletions and Figure 7 for duplications).

In all individuals, de novo deletions were shorter than inherited deletions (Additional file 6), with length varying considerably among animals (600 bp to 70,000 bp), and the median length of de novo deletions was one-third that of the inherited ones. On the other hand, no significant differences were found in the length of de novo and inherited duplications. Tables containing basic statistics for the length of de novo and inherited CNVs can be found in Additional file 7 (divided by individuals) and Additional file 5 (divided by chromosomes).

Functional perspective on de novo copy number variants

In the case of deletions, overlapping CNVs with QTLs were identified on 10 chromosomes for five reproduction traits, including sperm abnormality rate (SSC1 and 12), number of functional sperm (SSC1), teat number (SSC4, 7, 9, 10, 14, and 15), sperm progressive motility (SSC7), and offspring number (SSC13). They were also identified in the longissimus muscle depth (SSC5 and 13), number of ribs (SSC7), and subcutaneous fat thickness (SSC13) production traits, as well as one immune-related trait (interleukin-8 level; SSC15).

In the case of duplications, a relationship was found on nine chromosomes for 14 different traits, including five reproductive traits (number of functional sperm [SSC1 and 4], semen odour [SSC1], boar sexuality score [SSC1], sperm progressive motility [SSC4], and teat number [SSC7, 17, and 18]), six production traits (longissimus muscle area [SSC1], body condition score [SSC1], average daily gain [SSC4], body length [SSC5], number of ribs [SSC7], and meat colour [SSC14]), and three physiological traits (interleukin-12 [SSC13], and blood lipase and immunoglobulin G levels [both SSC15]). Gene-level overlap between de novo CNVs was found for parkin RBR E3 ubiquitin protein ligase (PRKN) (ENSSSCG00000004032; chr1:5465312–6730872) and testis-expressed 14 (TEX14) (ENSSSCG00000017645; chr12:34815476–34912766), which were affected by both deletions and duplications. However, the deletions and duplications did not overlap within the gene in either case. Specifically, for ENSSSCG00000004032, three CNVs were identified within the gene boundaries (one deletion [chr1:5908601–5911400; 0.22% of the gene length] and two duplications [chr1:6264201–6273000; 0.69%, and chr1:5514201–5516400; 0.17%]). For ENSSSCG00000017645, one deletion (chr12:34899601–34900400; 0.82%) and one duplication (chr12:34888201–34899600; 11.71%) were located in adjacent but non-overlapping gene segments. A visualisation of all regions in the genomic context is provided in Additional file 8.
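The percentages of gene length quoted above follow from the listed coordinates. The short check below recomputes them, assuming that all coordinates are 1-based and inclusive (so that a span has length end − start + 1); the function name is hypothetical.

```python
def percent_of_gene(gene, cnv):
    """Percentage of the gene span (start, end) covered by the CNV (start, end)."""
    g_start, g_end = gene
    c_start, c_end = cnv
    overlap = max(0, min(g_end, c_end) - max(g_start, c_start) + 1)
    return 100 * overlap / (g_end - g_start + 1)


PRKN = (5465312, 6730872)       # ENSSSCG00000004032 on chr1
TEX14 = (34815476, 34912766)    # ENSSSCG00000017645 on chr12

prkn_del = percent_of_gene(PRKN, (5908601, 5911400))
prkn_dup1 = percent_of_gene(PRKN, (6264201, 6273000))
prkn_dup2 = percent_of_gene(PRKN, (5514201, 5516400))
tex14_del = percent_of_gene(TEX14, (34899601, 34900400))
tex14_dup = percent_of_gene(TEX14, (34888201, 34899600))

for label, value in [("PRKN deletion", prkn_del),
                     ("PRKN duplication 1", prkn_dup1),
                     ("PRKN duplication 2", prkn_dup2),
                     ("TEX14 deletion", tex14_del),
                     ("TEX14 duplication", tex14_dup)]:
    print(f"{label}: {value:.4f}%")
```

Under this coordinate convention the recomputed values agree with the reported percentages to within 0.01 percentage points.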

For most (127) of the 184 canonical transcripts that overlapped with de novo deletions, the deletion was located in an intron, meaning it had no direct effect on the protein. However, more severe consequences were also identified, including feature truncation (34), transcript ablation (11), and stop lost (11). It is worth noting that not all affected transcripts are well-characterised since they represent novel genes. Nonetheless, severe consequences of de novo deletions were also determined for known genes, including stop codon loss in LOC100515185 acyl-coenzyme A amino acid N-acyltransferase 2 (ENSSSCG00000038171), HGSNAT heparan-alpha-glucosaminide N-acetyltransferase (ENSSSCG00000038960), ELFN1 extracellular leucine-rich repeat and fibronectin type III domain containing 1 (ENSSSCG00000030485), ACTL8 actin-like 8 (ENSSSCG00000024062), and LOC106504900 olfactory receptor 8B3-like (ENSSSCG00000058524). However, no enrichment of GO terms or KEGG pathways was found for de novo deletions.

For de novo duplications affecting 55 canonical transcripts, transcript amplification (24), intron (13), coding sequence variant (12), non-coding transcript exon variant (4), and feature elongation (2) sequence ontologies were identified. The enriched GOs incorporated immunoglobulin receptor binding (GO:0034987; p = 1.7·10⁻²), immune receptor activity (GO:0140375; p = 4.3·10⁻²), antigen binding (GO:0003823; p = 4.3·10⁻²), transmembrane signalling receptor activity (GO:0004888; p = 4.3·10⁻²), and molecular transducer activity (GO:0060089; p = 4.3·10⁻²). Additional file 9 shows GOs with fold enrichment and the number of genes corresponding to the GO term. Furthermore, the KEGG pathway for olfactory transduction was determined (ssc04740; p = 9.8·10⁻³).

Among CNVs that indicate de novo mutations in the parental germline tract (present in siblings and absent in their parents' genomes), 10 deletions and nine duplications overlapped with canonical transcripts. Deletions were mainly located in introns (9). However, one was reported as a feature truncation variant of the LOC100515852 polymeric immunoglobulin receptor-like gene (ENSSSCG00000017235). Considering duplications, transcript amplification (5), intron variants (2), and coding sequence variants (2) were determined according to sequence ontology. Some of the transcripts affected by these duplications correspond to novel genes that are not well characterised, except FCN1 ficolin (collagen/fibrinogen domain containing) 1 (ENSSSCG00000029414; amplification), FCN2 ficolin (collagen/fibrinogen domain containing lectin) 2 (ENSSSCG00000023333; amplification), and CYP4A24 cytochrome P450 family 4 subfamily A member 24 (ENSSSCG00000062158; coding sequence variant).

Discussion

In recent years, extensive research on CNVs has highlighted their important role in population diversity, disease development, and evolution (Pös et al., 2021). It is well-known that CNVs strongly affect phenotypes by changing gene structure, dosage, and regulation (1000 Genomes Project et al., 2011; Geistlinger et al., 2018). The high impact of CNVs is caused by their dimensions, spanning from 50 bp to several Mb, making a single CNV capable of encompassing several genes (Du et al., 2022; Pös et al., 2021). Indeed, there is a corresponding change in gene expression when CNVs occur, with 85% to 95% of CNVs in humans and mice associated with expression changes in the affected genes (Henrichsen et al., 2009; Tang and Amon, 2013).

De novo mutations are a major cause of severe genetic disorders (Acuna-Hidalgo et al., 2016), which explains why the de novo CNVs identified in this study were shorter and less common than inherited ones. Shorter CNVs span smaller regions of the genome, meaning they may have a lower impact on phenotypes or genome stability. Larger inherited deletions may have been retained in the population due to their neutral or beneficial effects, whereas de novo mutations are typically subject to immediate selection pressures due to their potential deleterious impacts (Acuna-Hidalgo et al., 2016). These findings are consistent with our results, with significantly fewer CNVs arising de novo compared to inherited CNVs, and in line with current knowledge (McCarroll et al., 2008; Wen et al., 2022) showing that up to 99% of CNVs are inherited (van Ommen, 2005). The location of mutations is not random across the genome and is determined by multiple factors, including sequence composition and its functional role (Acuna-Hidalgo et al., 2016).

The highest overlap between CNVs and QTLs (for both deletions and duplications) was found for teat number, which is consistent with research showing this trait to be among the 10 with the highest overlap (Reavy et al., 2015; Keel et al., 2019). This may be related to the large number (n = 2,936) of known QTLs for this trait, which results from the fact that selection for increased teat number has been conducted for a long time, and is also necessary because litter size typically exceeds the number of available nipples (Rohrer and Nonneman, 2017; Yang et al., 2023). Although CNVs are primarily the result of replication and recombination events, there are indications that artificial reproductive conditions may indirectly influence their development or perpetuation. In pigs, this may be related to artificial insemination, the most important tool in modern pig breeding, enabling intensive boar selection based on semen quality (Gao et al., 2019; Zhuang et al., 2023). Additionally, Large White breeds are widely used for crossbreeding due to their high reproductive performance and excellent meat production (Zhang et al., 2022). Among these breeds, the FANCM (Fanconi anaemia complementation group M) gene is involved in germ cell development, and mutations can cause male reproductive disorders due to sperm deformation and reduced sperm number and motility (Yin et al., 2019). Our study confirmed this by demonstrating overlap between the CNVs and QTLs related to sperm morphology and physiology. Production traits such as backfat thickness, meat colour, and rib number play a key role in intensive selection in pig breeding. To date, CNVs related to meat colour and backfat thickness have been identified in the pig genome, involving the TGFBR3 (TGF-beta receptor type III) gene (Wang et al., 2015; Zhang et al., 2024). However, rib number appears to be particularly important, especially for pig producers, as a higher number correlates with greater carcass length. 
The QTL region affecting rib number is located on SSC7, encompassing the vertebrae development-associated (VRTN) gene, which corresponds to our results. This gene plays a key role in spine development, and a specific intron insertion (e.g., g.20311_20312ins291) significantly affects rib count and carcass length (Borchers et al., 2004).

The distribution of de novo CNVs along the genome is non-uniform, as was also demonstrated in our study. We observed clusters of variants, especially for deletions, with only a few isolated single deletions. Mutational clusters have been identified previously, and they correspond to multiple de novo mutations in very close vicinity in a single individual (Acuna-Hidalgo et al., 2016; Chan and Gordenin, 2015). Interestingly, the presence of deletions and duplications within the same gene suggests that the gene may be located in a genomic region prone to structural variation. Such regions are often enriched for repetitive elements and segmental duplications, which serve as hotspots for recurrent structural alterations mediated by mechanisms such as non-allelic homologous recombination (NAHR) and related processes (Höps et al., 2024; Lin and Gokcumen, 2019; Paudel et al., 2013; Soto et al., 2023).

Gene-level overlap between de novo CNVs was found for TEX14 and PRKN. Sironen et al. (2011) described the role of TEX14 in spermatogenesis in Yorkshire pigs and highlighted the importance of specific genomic remodelling events as causes for inherited defects. A specific male infertility in Yorkshire pigs, characterised by early meiotic spermatogenic arrest, was linked to a 2 Mb region on SSC12. Sequencing of the candidate gene (TEX14) revealed a 51 bp insertion leading to a premature stop codon. The insertion was likely the result of an original duplication event, followed by recombination and repositioning. This explanation corresponds well with the way structural variations like insertions and duplications arise and evolve in genomes, often involving duplication followed by recombination-mediated rearrangements. Evidence of hotspots in the pig PRKN gene has not been described; however, it is a well-established hotspot for CNVs in humans. The PRKN gene is located in one of several genomic regions of very high deletion frequency (‘hotspots’), where rare deletions are found at frequencies of up to 100-fold higher than the average for the genome as a whole (Toft and Ross, 2010).

CNVs, whether they affect a QTL, a single gene, or the entire chromosome, have been identified as causes of not only diseases and developmental abnormalities, but also as sources of adaptive potential (Tang and Amon, 2013). The latter determines whether an organism can compete for resources and survive changing environmental conditions (Pös et al., 2021). The contribution of environmental factors to the origin of CNVs is still poorly understood. However, research shows that CNVs are enriched for genes associated with environmental factors, i.e., genes that are not critical for the organism's development, but rather facilitate its response to and interaction with an ever-changing environment. This includes, among others, enrichment for immune and inflammatory response genes. According to Tizaoui (2018), infections, chronic inflammation, cellular stress, and free radicals generated by inflammation favour de novo CNV formation. This is in line with our findings for duplications that significantly enriched genes related to immunoglobulin receptor binding, immune receptor activity, and antigen binding. Another example in which de novo duplicated genes tend to encode proteins that interface with the external environment includes those related to olfactory receptors (The Bovine Genome Sequencing and Analysis Consortium et al., 2009). Odours and chemosensory stimuli are detected and identified by olfactory receptors that are crucial for finding food, detecting mates and offspring, recognising territories, and avoiding danger. Olfactory receptor genes are duplicated very widely within mammalian genomes (Chen et al., 2012; Groenen et al., 2012; Moreno-Estrada et al., 2007), suggesting they may be under strong selection (Groenen et al., 2012). Interestingly, pigs have the largest repertoire of functional olfactory receptor genes, indicating their importance for scavenging (Groenen et al., 2012). 
This is consistent with our findings, such as enrichment of the KEGG olfactory transduction pathway, signalling receptors, and molecular transducer activity. The latter describes the transmission of a signal from one side of the membrane to the other to initiate a change in cell activity or state as part of signal transduction, and it encompasses, among others, olfactory receptor activity. No enrichment was detected for de novo deleted genes, which may be explained by the potentially severe consequences of deletions, such as the loss of specific genes or regulatory elements. If the deleted region includes important genes, it can alter normal cellular functions and result in developmental abnormalities (Acuna-Hidalgo et al., 2016; Liu and Bickhart, 2012).

It is important to acknowledge the limitations of this study. We analysed a small sample of 12 individuals, which may limit the detection of rare de novo CNVs and complicate the assessment of population-level relevance. However, our study was based on trio data, which is crucial for the accurate detection of de novo CNVs. The trio design allows true de novo variants to be distinguished from inherited ones, significantly reducing false positives and improving accuracy (Boonin et al., 2025). Moreover, studies investigating de novo CNVs based on WGS data in livestock species are relatively rare, and this study represents a valuable early step in applying such analyses to domestic animals. Regarding environmental effects, unlike studies in which environmental variability (e.g., diet, stress, exposure to mutagens) may confound the interpretation of CNV formation, our use of standardised and controlled conditions minimises such effects, allowing for a clearer interpretation of CNVs in the context of genetic rather than environmental factors (Arlt et al., 2012; Feuk et al., 2006). This study relied on blood DNA sequences; therefore, tissue-specific or somatic mosaic CNVs may have been missed. Nevertheless, blood is one of the most accessible and minimally invasive tissues to obtain, making it one of the most widely used biological materials (Svärd et al., 2025). Blood is also widely accepted for identifying germline de novo CNVs, which are expected to be present across all somatic tissues (Krepischi et al., 2012; Pereira et al., 2024; Stadler et al., 2012). Our approach provides a solid foundation for detecting germline CNVs in livestock. Although no experimental validation was performed, the use of an appropriate experimental design, well-established CNV detection tools, and stringent filtering criteria increases confidence in the computational results.
Furthermore, excluding sex chromosomes from the CNV analysis, which may be seen as another limitation of the study, helped avoid technical and biological issues. Detection of CNVs on the pig sex chromosomes (SSCX and SSCY) is challenging. The high number of repetitive elements on both chromosomes reduces short-read alignment accuracy: reads originating from repeats can map to multiple locations, which lowers mapping quality and increases false positives or negatives in CNV detection (Bickhart and Liu, 2014; Skinner et al., 2016). In addition, hemizygosity of SSCX in males and the haploid nature of SSCY introduce complexities in ploidy normalisation and read depth interpretation, which are problematic for CNV detection tools (Keel et al., 2019). Multiple CNV detection methods, especially those based on read depth, are further limited by reduced precision in defining CNV boundaries on sex chromosomes. These limitations may lead to the exclusion of SSCY from CNV analyses. Despite the aforementioned constraints, our study demonstrates the feasibility of detecting de novo CNVs using trio-based designs in livestock and provides valuable insights that can guide future, larger-scale studies.
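
As an illustration of the trio logic discussed above, the following minimal Python sketch labels an offspring CNV as inherited when a CNV of the same type in either parent overlaps it sufficiently, and as de novo otherwise. It is not the pipeline used in this study: the `CNV` record, the function names, and the 50% reciprocal-overlap threshold are hypothetical choices made only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CNV:
    """Hypothetical CNV record; coordinates are 1-based and inclusive."""
    chrom: str
    start: int
    end: int
    kind: str  # "DEL" (deletion) or "DUP" (duplication)


def reciprocal_overlap(a: CNV, b: CNV) -> float:
    """Return the smaller of the two overlap fractions (0.0 if no match)."""
    if a.chrom != b.chrom or a.kind != b.kind:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    if shared <= 0:
        return 0.0
    len_a = a.end - a.start + 1
    len_b = b.end - b.start + 1
    return min(shared / len_a, shared / len_b)


def classify_offspring_cnvs(offspring, sire, dam, min_overlap=0.5):
    """Split offspring CNVs into inherited and de novo candidates.

    min_overlap is an assumed threshold, not a value taken from the study.
    """
    parental = list(sire) + list(dam)
    result = {"inherited": [], "de_novo": []}
    for cnv in offspring:
        matched = any(
            reciprocal_overlap(cnv, p) >= min_overlap for p in parental
        )
        result["inherited" if matched else "de_novo"].append(cnv)
    return result


# Toy trio with invented coordinates.
offspring = [CNV("12", 100, 199, "DEL"), CNV("2", 1000, 1999, "DUP")]
sire = [CNV("12", 120, 219, "DEL")]
dam = []
calls = classify_offspring_cnvs(offspring, sire, dam)
```

In this toy trio, the deletion on chromosome 12 shares 80% of its length with the sire's deletion and is treated as inherited, whereas the duplication on chromosome 2 has no parental counterpart and is kept as a de novo candidate. A real analysis would also need to account for CNVs missed in the parents, since such false negatives inflate the apparent de novo count.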

To conclude, CNV patterns showed a high degree of similarity within nuclear families, indicating that a significant proportion of CNVs are inherited. However, 9% of all CNVs were due to de novo events and contribute to individual variation. No significant difference in length was recorded between de novo and inherited duplications. However, de novo deletions were shorter than inherited deletions, which has implications for functionally important genomic locations. Despite the potentially greater detrimental impact of deletions compared to duplications, more de novo deletions were retained in the offspring genomes, and their distribution was non-uniform across the genome. The highest CNV-QTL overlap was found for teat number, reflecting strong and long-term selection for this trait. CNV-QTL overlaps were also associated with key reproductive and production traits, suggesting that artificial selection may influence CNV patterns in pigs. In terms of functional impact on genes, CNVs were primarily located in introns. The presence of multiple CNVs within the same gene suggests that the gene may lie in a genomic region prone to structural variation, often associated with repetitive sequences and recombination hotspots. Notably, such overlap was observed in TEX14 and PRKN, with TEX14 linked to male infertility in pigs and PRKN known as a CNV hotspot in the human genome. Despite structural gene changes, no significant enrichment of GO terms or KEGG pathways was identified for them. However, de novo gene duplications occurred predominantly in genes involved in environmental interactions, particularly those associated with immune responses and olfactory receptor mechanisms.
