Evaluation of variant calling tools for large plant genome re-sequencing

Citation

Yao, Z., You, F.M., N'Diaye, A., Knox, R.E., McCartney, C., Hiebert, C.W., Pozniak, C., Xu, W. (2020). Evaluation of variant calling tools for large plant genome re-sequencing. BMC Bioinformatics, [online] 21(1), http://dx.doi.org/10.1186/s12859-020-03704-1

Plain language summary

Discovering single nucleotide polymorphisms (SNPs) in agricultural crop genome sequences is a widely used strategy for developing genetic markers for marker-assisted breeding and other applications. Accurately detecting SNPs in large polyploid crop genomes such as wheat is crucial and challenging. The intent of this study was to evaluate seven SNP variant calling tools on whole exome capture (WEC) re-sequencing data from allohexaploid wheat. Based on concordance and receiver operating characteristic (ROC) analysis, the Samtools/mpileup variant calling tool with BWA-mem mapping of raw sequence reads outperformed the other combinations tested in this study on the wheat WEC sequence data. For the complex wheat genome, we therefore recommend the BWA-mem and Samtools/mpileup pipeline for SNP calling; it should also be a good starting point for other polyploid crop species. There is no need to preprocess the raw read data before mapping onto the reference genome. The recommended SNP filter requires at least 3 reads containing the variant, with an average QUAL of at least 5. This filter can be made more stringent depending on the needs of the specific study. Our study provides practical and comprehensive guidance for more accurate and consistent variant identification, ultimately yielding crop genome variant information for breeding, diversity studies, and germplasm genotyping.
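
The recommended filter could be applied to a VCF file roughly as follows. This is a minimal sketch, not the authors' implementation. It assumes that each record carries a DP4 INFO tag (reference forward, reference reverse, alternate forward, alternate reverse read counts), that the variant-supporting read count is the sum of the two alternate counts, and that "QUAL" refers to the QUAL column of the record. The function names and default thresholds are illustrative.

```python
def alt_read_count(info):
    """Return the alternate-allele read count from a DP4 INFO tag (0 if absent)."""
    for field in info.split(";"):
        if field.startswith("DP4="):
            # Assumed layout: ref-forward, ref-reverse, alt-forward, alt-reverse
            ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in field[4:].split(","))
            return alt_fwd + alt_rev
    return 0


def passes_filter(vcf_line, min_alt_reads=3, min_qual=5.0):
    """Check one tab-separated VCF data line against the two thresholds."""
    cols = vcf_line.rstrip("\n").split("\t")
    qual = cols[5]   # column 6 of a VCF record is QUAL
    info = cols[7]   # column 8 of a VCF record is INFO
    if qual == ".":  # a missing QUAL cannot satisfy the threshold
        return False
    return float(qual) >= min_qual and alt_read_count(info) >= min_alt_reads


def filter_vcf(lines):
    """Yield header lines unchanged and only the data lines that pass the filter."""
    for line in lines:
        if line.startswith("#") or passes_filter(line):
            yield line
```

Raising `min_alt_reads` or `min_qual` gives the more stringent filtering mentioned above.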

Abstract

Background: Discovering single nucleotide polymorphisms (SNPs) in agricultural crop genome sequences is a widely used strategy for developing genetic markers for several applications, including marker-assisted breeding, population diversity studies for eco-geographical adaptation, and genotyping of crop germplasm collections. Accurately detecting SNPs in large polyploid crop genomes such as wheat is crucial and challenging. Several variant calling methods have been developed, but they show low concordance between their variant calls. A gold-standard variant set generated from one human individual sample has been established for evaluating variant calling tools; however, no gold-standard variant set is yet available for wheat. The intent of this study was to evaluate seven SNP variant calling tools (FreeBayes, GATK, Platypus, Samtools/mpileup, SNVer, VarScan, VarDict) with the two most popular mapping tools (BWA-mem and Bowtie2) on whole exome capture (WEC) re-sequencing data from allohexaploid wheat.
Results: We found that the BWA-mem mapping tool had both a higher mapping rate and a higher accuracy rate than Bowtie2. With the same mapping quality (MQ) cutoff, BWA-mem detected more variant bases in mapped reads than Bowtie2. Preprocessing the reads with quality trimming or duplicate removal did not significantly affect the final mapping performance in terms of mapped reads. Based on concordance and receiver operating characteristic (ROC) analysis, the Samtools/mpileup variant calling tool with BWA-mem mapping of raw sequence reads outperformed the other combinations tested, followed by FreeBayes and GATK, in terms of specificity and sensitivity. VarDict and VarScan were the poorest performing variant calling tools on the wheat WEC sequence data.
Conclusion: The BWA-mem and Samtools/mpileup pipeline, with no need to preprocess the raw read data before mapping onto the reference genome, was found to be the optimal choice for SNP calling in re-sequencing of the complex wheat genome. These results also provide useful guidelines for reliable variant identification from deep sequencing of other large polyploid crop genomes.
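
Concordance between the calls of two tools can be illustrated with a short sketch. The exact concordance measure used in the study is not given in this summary, so the code below assumes one common choice: the number of sites called by both tools divided by the number of sites called by either. A site is identified here by chromosome, position, reference allele, and alternate allele; all function names are illustrative.

```python
from itertools import combinations


def site_key(vcf_line):
    """Identify a variant by (chromosome, position, ref allele, alt allele)."""
    chrom, pos, _id, ref, alt = vcf_line.rstrip("\n").split("\t")[:5]
    return (chrom, int(pos), ref, alt)


def load_sites(lines):
    """Collect the variant keys from VCF lines, skipping headers and blank lines."""
    return {site_key(line) for line in lines if line.strip() and not line.startswith("#")}


def pairwise_concordance(call_sets):
    """Map each pair of tool names to shared sites divided by sites in either set."""
    result = {}
    for a, b in combinations(sorted(call_sets), 2):
        union = call_sets[a] | call_sets[b]
        shared = call_sets[a] & call_sets[b]
        # Assumed measure: intersection over union; 0.0 when both sets are empty
        result[(a, b)] = len(shared) / len(union) if union else 0.0
    return result
```

Given one set of sites per tool, for example `{"mpileup": load_sites(vcf_a), "freebayes": load_sites(vcf_b)}`, the function returns one score per pair of tools, with 1.0 meaning identical call sets.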