Genomic prediction using preselected DNA variants from a GWAS with whole-genome sequence data in Holstein–Friesian cattle
Background: Whole-genome sequence data is expected to capture genetic variation more completely than common genotyping panels. Our objective was to compare the proportion of variance explained and the accuracy of genomic prediction by using imputed sequence data or preselected SNPs from a genome-wide association study (GWAS) with imputed whole-genome sequence data.
Methods: Phenotypes were available for 5503 Holstein-Friesian bulls. Genotypes were imputed up to whole-genome sequence (13,789,029 segregating DNA variants) by using run 4 of the 1000 bull genomes project. The program GCTA was used to perform GWAS for protein yield (PY), somatic cell score (SCS) and interval from first to last insemination (IFL). From the GWAS, subsets of variants were selected and genomic relationship matrices (GRM) were used to estimate the variance explained in 2087 validation animals and to evaluate the genomic prediction ability. Finally, two GRM were fitted together in several models to evaluate the effect of selected variants that were in competition with all the other variants.
Results: The GRM based on full sequence data explained only marginally more genetic variation than that based on common SNP panels: for PY, SCS and IFL, genomic heritability improved from 0.81 to 0.83, 0.83 to 0.87 and 0.69 to 0.72, respectively. Sequence data also helped to identify more variants linked to quantitative trait loci and resulted in clearer GWAS peaks across the genome. The proportion of total variance explained by the selected variants combined in a GRM was considerably smaller than that explained by all variants (less than 0.31 for all traits). When selected variants were used, accuracy of genomic predictions decreased and bias increased.
Conclusions: Although 35 to 42 variants were detected that together explained 13 to 19% of the total variance (18 to 23% of the genetic variance) when fitted alone, there was no advantage in using dense sequence information for genomic prediction in the Holstein data used in our study. Detection and selection of variants within a single breed are difficult due to long-range linkage disequilibrium. Stringent selection of variants resulted in more biased genomic predictions, although this might be due to the training population being the same dataset from which the selected variants were identified.
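The abstract above relies on genomic relationship matrices (GRM) built from sets of variants. As a minimal sketch of what such a matrix is, the Python below averages products of standardized allele counts over variants. The function names, the toy genotypes and the 0/1/2 allele-count coding are assumptions made for illustration; this is not GCTA's code, and details such as the treatment of diagonal elements may differ in the software the authors used.

```python
def allele_frequencies(genotypes):
    """Frequency of the counted allele at each variant (genotypes coded 0/1/2)."""
    n_animals = len(genotypes)
    n_variants = len(genotypes[0])
    return [
        sum(row[j] for row in genotypes) / (2 * n_animals)
        for j in range(n_variants)
    ]


def genomic_relationship_matrix(genotypes):
    """GRM as the average over variants of products of standardized genotypes."""
    freqs = allele_frequencies(genotypes)
    # Monomorphic variants have zero variance and cannot be standardized.
    keep = [j for j, p in enumerate(freqs) if 0.0 < p < 1.0]
    if not keep:
        raise ValueError("no segregating variants")
    standardized = [
        [
            (row[j] - 2 * freqs[j]) / (2 * freqs[j] * (1 - freqs[j])) ** 0.5
            for j in keep
        ]
        for row in genotypes
    ]
    n = len(genotypes)
    m = len(keep)
    return [
        [
            sum(a * b for a, b in zip(standardized[i], standardized[k])) / m
            for k in range(n)
        ]
        for i in range(n)
    ]


# Toy data (hypothetical): 4 animals (rows) by 4 variants (columns).
toy = [
    [0, 1, 2, 1],
    [1, 1, 0, 2],
    [2, 0, 1, 1],
    [1, 2, 1, 0],
]
grm = genomic_relationship_matrix(toy)
```

Under these assumptions, a GRM for selected variants and a GRM for all other variants would come from calling the same function on two disjoint sets of columns.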
Mapping QTL influencing gastrointestinal nematode burden in Dutch Holstein-Friesian dairy cattle
BACKGROUND: Parasitic gastroenteritis caused by nematodes is second only to mastitis in terms of health costs to dairy farmers in developed countries. Sustainable control strategies complementing anthelmintics are desired, including selective breeding for enhanced resistance.
RESULTS AND CONCLUSION: To quantify and characterize the genetic contribution to variation in resistance to gastro-intestinal parasites, we measured the heritability of faecal egg and larval counts in the Dutch Holstein-Friesian dairy cattle population. The heritability of faecal egg counts ranged from 7 to 21% and was generally higher than for larval counts. We performed a whole genome scan in 12 paternal half-daughter groups for a total of 768 cows, corresponding to the approximately 10% most and least infected daughters within each family (selective genotyping). Two genome-wide significant QTL were identified in an across-family analysis, on chromosomes 9 and 19 respectively, coinciding with previous findings in orthologous chromosomal regions in sheep. We identified six more suggestive QTL by within-family analysis. An additional 73 informative SNPs were genotyped on chromosome 19, and the resulting high-density map was used in a variance component approach to simultaneously exploit linkage and linkage disequilibrium in an initial, inconclusive attempt to refine the QTL map position.
Genomic prediction across populations, using pre-selected markers and differential weight models
Genomic prediction (GP) in numerically small breeds is limited due to the requirement for a large reference set. Across-breed prediction has not been very successful either. Our objective was to test alternative models for across-breed and multi-breed GP in a small Jersey population, utilizing prior information on marker causality. We used data on 596 Jersey bulls from New Zealand and 5503 Holstein bulls from the Netherlands, all of which had deregressed proofs for stature. Two sets of genotype data were used, one containing 357 potential causal markers identified from a multi-breed meta-GWAS on stature (top markers), while the other contained 48,912 markers on the custom 50k chip, excluding the top markers. We used models in which only one GRM (either top markers, 50k, or top plus 50k markers combined) was fitted, and models in which two GRMs (both the top and 50k) were fitted simultaneously, but with different variance components to weight the GRMs differently. Moreover, we estimated the genetic correlation(s) between the breeds (for each GRM) using a multi-trait GP model, which implicitly weights the contribution of one breed's information to another. Across breeds, we observed low accuracies of GP when the 50k markers were fitted alone (0.06) or when the top markers were added to 50k (0.15). Higher accuracy was obtained when only the top markers were fitted (0.21), whereas the highest accuracy was obtained when fitting 50k and top markers simultaneously as two independent GRMs (0.25). Multi-breed prediction outperformed both within-breed and across-breed prediction, with accuracies ranging from 0.34 to 0.45 and the same trend as in across-breed prediction. Based on our results, the best approach for across-breed and multi-breed GP is to fit models that are able to isolate and differentially weight the most important markers for the trait.
Keywords: Across breed genomic prediction, marker pre-selection, multi-trait model, sequence data
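The model with two GRMs fitted simultaneously can be written out as a worked equation. The form below is the usual two-component mixed model and is offered only as a sketch: the abstract does not give the authors' exact specification, so the symbols, the overall mean as the only fixed effect, and the independent residuals are assumptions made here.

```latex
\begin{aligned}
\mathbf{y} &= \mathbf{1}\mu + \mathbf{g}_{\mathrm{top}} + \mathbf{g}_{50k} + \mathbf{e},\\
\mathbf{g}_{\mathrm{top}} &\sim N\!\left(\mathbf{0},\ \mathbf{G}_{\mathrm{top}}\,\sigma^{2}_{\mathrm{top}}\right),\qquad
\mathbf{g}_{50k} \sim N\!\left(\mathbf{0},\ \mathbf{G}_{50k}\,\sigma^{2}_{50k}\right),\qquad
\mathbf{e} \sim N\!\left(\mathbf{0},\ \mathbf{I}\,\sigma^{2}_{e}\right),\\
\operatorname{Var}(\mathbf{y}) &= \mathbf{G}_{\mathrm{top}}\,\sigma^{2}_{\mathrm{top}} + \mathbf{G}_{50k}\,\sigma^{2}_{50k} + \mathbf{I}\,\sigma^{2}_{e}.
\end{aligned}
```

Here y holds the deregressed proofs, G_top is the GRM built from the top markers and G_50k the GRM built from the 50k markers. Because each GRM has its own variance component, the two marker sets are weighted differently. Pooling all markers into a single GRM instead forces one common variance component on both sets.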
Effects of the number of markers per haplotype and clustering of haplotypes on the accuracy of QTL mapping and prediction of genomic breeding values
The aim of this paper was to compare the effect of haplotype definition on the precision of QTL-mapping and on the accuracy of predicted genomic breeding values. In a multiple QTL model using identity-by-descent (IBD) probabilities between haplotypes, various haplotype definitions were tested, i.e. including 2, 6, 12 or 20 marker alleles and clustering base haplotypes related with an IBD probability of > 0.55, 0.75 or 0.95. Simulated data contained 1100 animals with known genotypes and phenotypes and 1000 animals with known genotypes and unknown phenotypes. Genomes comprising 3 Morgan were simulated and contained 74 polymorphic QTL and 383 polymorphic SNP markers with an average r² value of 0.14 between adjacent markers. The total number of haplotypes decreased by up to 50% when the window size was increased from two to 20 markers and decreased by at least 50% when haplotypes related with an IBD probability of > 0.55 instead of > 0.95 were clustered. An intermediate window size led to more precise QTL mapping. Window size and clustering had a limited effect on the accuracy of predicted total breeding values, ranging from 0.79 to 0.81. Our conclusion is that different optimal window sizes should be used in QTL-mapping versus genome-wide breeding value prediction.
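The average r² between adjacent markers quoted above is a measure of linkage disequilibrium. As a minimal sketch, assuming phased haplotypes with alleles coded 0/1, the Python below computes the standard r² for a pair of loci and averages it over neighbouring markers. The function names and toy haplotypes are hypothetical and are not taken from the paper's simulation.

```python
def ld_r2(alleles_a, alleles_b):
    """Squared correlation between alleles (coded 0/1) at two loci across haplotypes."""
    n = len(alleles_a)
    if n == 0 or n != len(alleles_b):
        raise ValueError("need two equally long, non-empty allele lists")
    p_a = sum(alleles_a) / n
    p_b = sum(alleles_b) / n
    # Frequency of haplotypes carrying allele 1 at both loci.
    p_ab = sum(a * b for a, b in zip(alleles_a, alleles_b)) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / denom


def mean_adjacent_r2(haplotypes):
    """Average r2 over all pairs of neighbouring markers; rows are haplotypes."""
    n_markers = len(haplotypes[0])
    columns = [[h[j] for h in haplotypes] for j in range(n_markers)]
    values = [ld_r2(columns[j], columns[j + 1]) for j in range(n_markers - 1)]
    return sum(values) / len(values)


# Toy data (hypothetical): 4 haplotypes by 3 markers.
toy_haplotypes = [
    [0, 0, 1],
    [0, 0, 0],
    [1, 1, 1],
    [1, 1, 0],
]
toy_mean_r2 = mean_adjacent_r2(toy_haplotypes)
```

In the toy data the first two markers are in complete LD (r² = 1) and the last two are independent (r² = 0), so the average is 0.5.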
Utility of whole-genome sequence data for across-breed genomic prediction
Background: Genomic prediction (GP) across breeds has so far resulted in low accuracies of the predicted genomic breeding values. Our objective was to evaluate whether using whole-genome sequence (WGS) instead of low-density markers can improve GP across breeds, especially when markers are pre-selected from a genome-wide association study (GWAS), and to test our hypothesis that many non-causal markers in WGS data have a diluting effect on accuracy of across-breed prediction.
Methods: Estimated breeding values for stature and bovine high-density (HD) genotypes were available for 595 Jersey bulls from New Zealand, 957 Holstein bulls from New Zealand and 5553 Holstein bulls from the Netherlands. BovineHD genotypes for all bulls were imputed to WGS using Beagle4 and Minimac2. Genomic prediction across the three populations was performed with ASReml4, with each population used as single reference and as single validation sets. In addition to the 50k, HD and WGS, markers that were significantly associated with stature in a large meta-GWAS analysis were selected and used for prediction, resulting in 10 prediction scenarios. Furthermore, we estimated the proportion of genetic variance captured by markers in each scenario.
Results: Across breeds, 50k, HD and WGS markers resulted in very low accuracies of prediction ranging from −0.04 to 0.13. Accuracies were higher in scenarios with pre-selected markers from a meta-GWAS. For example, using only the 133 most significant markers in 133 QTL regions from the meta-GWAS yielded accuracies ranging from 0.08 to 0.23, while 23,125 markers with a −log10(p) higher than 7 resulted in accuracies of up to 0.35. Using WGS data did not significantly improve the proportion of genetic variance captured across breeds compared to scenarios with few but pre-selected markers.
Conclusions: Our results demonstrated that the accuracy of across-breed GP can be improved by using markers that are pre-selected from WGS based on their potential causal effect. We also showed that simply increasing the number of markers up to the WGS level does not increase the accuracy of across-breed prediction, even when markers that are expected to have a causal effect are included.
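One of the pre-selection rules in the abstract above keeps markers with a −log10(p) higher than 7. The Python below is a minimal sketch of that filter, assuming GWAS p-values are held in a dictionary keyed by marker ID. The function name, marker IDs and p-values are hypothetical.

```python
import math


def select_markers(p_values, threshold=7.0):
    """Return IDs of markers whose -log10(p) exceeds the threshold, in input order."""
    selected = []
    for marker_id, p in p_values.items():
        if not 0.0 < p <= 1.0:
            raise ValueError(f"invalid p-value for {marker_id}: {p}")
        if -math.log10(p) > threshold:
            selected.append(marker_id)
    return selected


# Toy p-values (hypothetical); -log10(p) is about 8.5, 6.7, 1.7 and 11.3.
toy_p = {"snp_a": 3e-9, "snp_b": 2e-7, "snp_c": 0.02, "snp_d": 5e-12}
top_markers = select_markers(toy_p)
```

Only `snp_a` and `snp_d` pass the threshold in this toy example. Choosing the single most significant marker per QTL region, the other rule mentioned in the abstract, would additionally need marker positions and region boundaries, which are not sketched here.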
Error rate for imputation from the Illumina BovineSNP50 chip to the Illumina BovineHD chip.
BACKGROUND: Imputation of genotypes from low-density to higher density chips is a cost-effective method to obtain high-density genotypes for many animals, based on genotypes of only a relatively small subset of animals (reference population) on the high-density chip. Several factors influence the accuracy of imputation and our objective was to investigate the effects of the size of the reference population used for imputation and of the imputation method used and its parameters. Imputation of genotypes was carried out from 50 000 (moderate-density) to 777 000 (high-density) SNPs (single nucleotide polymorphisms).
METHODS: The effect of reference population size was studied in two datasets: one with 548 and one with 1289 Holstein animals, genotyped with the Illumina BovineHD chip (777 k SNPs). A third dataset included the 548 animals genotyped with the 777 k SNP chip and 2200 animals genotyped with the Illumina BovineSNP50 chip. In each dataset, 60 animals were chosen as validation animals, for which all high-density genotypes were masked, except for the Illumina BovineSNP50 markers. Imputation was studied in a subset of six chromosomes, using the imputation software programs Beagle and DAGPHASE.
RESULTS: Imputation with DAGPHASE and Beagle resulted in 1.91% and 0.87% allelic imputation error rates in the dataset with 548 high-density genotypes, when scale and shift parameters were 2.0 and 0.1, and 1.0 and 0.0, respectively. When Beagle was used alone, the imputation error rate was 0.67%. If the information obtained by Beagle was subsequently used in DAGPHASE, imputation error rates were slightly higher (0.71%). When 2200 moderate-density genotypes were added and Beagle was used alone, imputation error rates were slightly lower (0.64%). The fewest imputation errors were obtained with Beagle in the reference set with 1289 high-density genotypes (0.41%).
CONCLUSIONS: For imputation of genotypes from the 50 k to the 777 k SNP chip, Beagle gave the lowest allelic imputation error rates. Imputation error rates decreased with increasing size of the reference population. For applications for which computing time is limiting, DAGPHASE using information from Beagle can be considered as an alternative, since it reduces computation time and increases imputation error rates only slightly.
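The abstract reports allelic imputation error rates without defining them. One common definition, assumed here, counts the alleles that differ between the true and the imputed genotype and divides by the total number of alleles compared. The Python below is a minimal sketch under that assumption, with genotypes coded as 0/1/2 allele counts. The function name and toy data are hypothetical.

```python
def allelic_error_rate(true_genotypes, imputed_genotypes):
    """Share of alleles imputed incorrectly; genotypes are allele counts 0/1/2."""
    if len(true_genotypes) != len(imputed_genotypes):
        raise ValueError("different numbers of animals")
    wrong = 0
    total = 0
    for true_row, imputed_row in zip(true_genotypes, imputed_genotypes):
        if len(true_row) != len(imputed_row):
            raise ValueError("different numbers of markers")
        for t, i in zip(true_row, imputed_row):
            # Heterozygote vs homozygote costs one allele, opposite homozygotes two.
            wrong += abs(t - i)
            total += 2
    if total == 0:
        raise ValueError("no genotypes to compare")
    return wrong / total


# Toy validation data (hypothetical): 2 animals by 4 masked markers.
true_toy = [[0, 1, 2, 1], [2, 2, 0, 1]]
imputed_toy = [[0, 1, 1, 1], [2, 0, 0, 1]]
toy_rate = allelic_error_rate(true_toy, imputed_toy)
```

In the toy data, 3 of 16 alleles are wrong, giving a rate of 0.1875. In a validation design like the one described, only the masked high-density markers of the validation animals would enter the comparison.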
Efficient genomic prediction based on whole-genome sequence data using split-and-merge Bayesian variable selection
Background: Use of whole-genome sequence data is expected to increase persistency of genomic prediction across generations and breeds but affects model performance and requires increased computing time. In this study, we investigated whether the split-and-merge Bayesian stochastic search variable selection (BSSVS) model could overcome these issues. BSSVS is performed first on subsets of sequence-based variants and then on a merged dataset containing variants selected in the first step.
Results: We used a dataset that included 4,154,064 variants after editing and de-regressed proofs for 3415 reference and 2138 validation bulls for somatic cell score, protein yield and interval from first to last insemination. In the first step, BSSVS was performed on 106 subsets each containing ~39,189 variants. In the second step, from 1060 up to 472,492 variants, selected from the first step, were included to estimate the accuracy of genomic prediction. Accuracies were at best equal to those achieved with the commonly used Bovine 50k-SNP chip, although the number of variants within a few well-known quantitative trait loci regions was considerably enriched. When variant selection and the final genomic prediction were performed on the same data, predictions were biased. Predictions computed as the average of the predictions computed for each subset achieved the highest accuracies, i.e. 0.5 to 1.1% higher than the accuracies obtained with the 50k-SNP chip, and yielded the least biased predictions. Finally, the accuracy of genomic predictions obtained when all sequence-based variants were included was similar or up to 1.4% lower compared to that based on the average predictions across the subsets. By applying parallelization, the split-and-merge procedure was completed in 5 days, while the standard analysis including all sequence-based variants took more than three months.
Conclusions: The split-and-merge approach splits one large computational task into many much smaller ones, which allows the use of parallel processing and thus efficient genomic prediction based on whole-genome sequence data. The split-and-merge approach did not improve prediction accuracy, probably because we used data on a single breed for which relationships between individuals were high. Nevertheless, the split-and-merge approach may have potential for applications on data from multiple breeds.
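The bookkeeping behind the split-and-merge procedure can be sketched as follows. The Python below partitions variant indices into subsets, keeps the best-scoring variants of each subset, merges the survivors, and averages per-subset predictions. It does not implement BSSVS: the per-variant `scores` stand in for whatever selection statistic the first step produces. The contiguous partition, the function names and the toy values are likewise assumptions made for illustration.

```python
def split_indices(n_variants, n_subsets):
    """Partition 0..n_variants-1 into contiguous subsets whose sizes differ by at most one."""
    base, extra = divmod(n_variants, n_subsets)
    subsets = []
    start = 0
    for s in range(n_subsets):
        size = base + (1 if s < extra else 0)
        subsets.append(range(start, start + size))
        start += size
    return subsets


def split_and_merge(scores, n_subsets, n_keep):
    """Keep the n_keep best-scoring variants of each subset, then merge the survivors."""
    merged = []
    for subset in split_indices(len(scores), n_subsets):
        ranked = sorted(subset, key=lambda j: scores[j], reverse=True)
        merged.extend(sorted(ranked[:n_keep]))
    return merged


def average_predictions(per_subset_predictions):
    """Average each animal's prediction over the subset-specific analyses."""
    n_subsets = len(per_subset_predictions)
    return [sum(values) / n_subsets for values in zip(*per_subset_predictions)]


# Subset sizes for the dimensions quoted in the abstract.
sizes = [len(s) for s in split_indices(4_154_064, 106)]

# Toy example (hypothetical scores and predictions).
toy_merged = split_and_merge([0.1, 0.9, 0.3, 0.8, 0.2, 0.7], n_subsets=2, n_keep=1)
toy_average = average_predictions([[1.0, 2.0], [3.0, 4.0]])
```

With 4,154,064 variants and 106 subsets, this partition gives subsets of 39,189 or 39,190 variants, in line with the ~39,189 quoted in the abstract. Because the subsets are independent, each first-step analysis can run as a separate parallel job, which is where the reported reduction in computing time comes from.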
MOESM4 of Genomic prediction using preselected DNA variants from a GWAS with whole-genome sequence data in Holstein–Friesian cattle
Additional file 4: Figure S4. Manhattan plot for three chromosomes and PY using ISQ variants and the variants excluded from GRMc based on LD within a 2-Mb window on either side of each selected variant (green)
MOESM1 of Genomic prediction using preselected DNA variants from a GWAS with whole-genome sequence data in Holstein–Friesian cattle
Additional file 1: Figure S1. Q–Q plot for PY using the variants from the Bovine 50k (A), BovineHD (B) and the full imputed sequence data (C)
