Assessing host-specificity of Escherichia coli using a supervised learning logic-regression-based analysis of single nucleotide polymorphisms in intergenic regions

Citation

Zhi, S., Li, Q., Yasui, Y., Edge, T., Topp, E., Neumann, N.F. (2015). Assessing host-specificity of Escherichia coli using a supervised learning logic-regression-based analysis of single nucleotide polymorphisms in intergenic regions. Molecular Phylogenetics and Evolution, [online] 92, 72-81. http://dx.doi.org/10.1016/j.ympev.2015.06.007

Abstract

Host specificity in E. coli is widely debated. Herein, we used supervised learning logic-regression-based analysis of intergenic DNA sequence variability in E. coli in an attempt to identify single nucleotide polymorphism (SNP) biomarkers of E. coli that are associated with natural selection and evolution toward host specificity. Seven hundred and eighty strains of E. coli were isolated from 15 different animal hosts. We utilized logic regression for analyzing DNA sequence data of three intergenic regions (flanked by the genes uspC-flhDC, csgBAC-csgDEFG, and asnS-ompF) to identify genetic biomarkers that could potentially discriminate E. coli based on host sources. Across 15 different animal hosts, logic regression successfully discriminated E. coli based on animal host source with relatively high specificity (i.e., among the samples from non-target animal hosts, the proportion that correctly did not have the host-specific marker pattern) and sensitivity (i.e., among the samples from a given animal host, the proportion that correctly had the host-specific marker pattern), even after fivefold cross-validation. Permutation tests confirmed that for most animals, host-specific intergenic biomarkers identified by logic regression in E. coli were significantly associated with animal host source. The highest level of biomarker sensitivity was observed in deer isolates, with 82% of all deer E. coli isolates displaying a unique SNP pattern that was 98% specific to deer. Fifty-three percent of human isolates displayed a unique biomarker pattern that was 98% specific to humans. Twenty-nine percent of cattle isolates displayed a unique biomarker that was 97% specific to cattle. Interestingly, even within a related host group (i.e., Family: Canidae [domestic dogs and coyotes]), highly specific SNP biomarkers (98% and 99% specificity for dogs and coyotes, respectively) were observed, with 21% of dog E. coli isolates displaying a unique dog biomarker and 61% of coyote isolates displaying a unique coyote biomarker. Application of a supervised learning method, such as logic regression, to DNA sequence analysis at certain intergenic regions demonstrates that some E. coli strains may evolve to become host-specific.
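
The abstract defines sensitivity and specificity in terms of whether each isolate carries a host-specific marker pattern. The short Python sketch below illustrates that arithmetic only. It is not the authors' pipeline: the function name, the argument names, and the toy isolate data are hypothetical and chosen for illustration, and the fitting of the marker pattern itself (logic regression, cross-validation, permutation testing) is not shown.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    has_pattern: Sequence[bool],
    is_target_host: Sequence[bool],
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for one host-specific marker pattern.

    has_pattern[i]    -- True if isolate i carries the marker pattern
    is_target_host[i] -- True if isolate i came from the target animal host
    Both argument names are hypothetical, not taken from the paper.
    """
    if len(has_pattern) != len(is_target_host):
        raise ValueError("inputs must have the same length")

    pairs = list(zip(has_pattern, is_target_host))
    true_pos = sum(1 for p, t in pairs if p and t)
    false_neg = sum(1 for p, t in pairs if not p and t)
    true_neg = sum(1 for p, t in pairs if not p and not t)
    false_pos = sum(1 for p, t in pairs if p and not t)

    if true_pos + false_neg == 0 or true_neg + false_pos == 0:
        raise ValueError("need at least one target and one non-target isolate")

    # Sensitivity: among target-host isolates, the fraction with the pattern.
    sensitivity = true_pos / (true_pos + false_neg)
    # Specificity: among non-target isolates, the fraction without the pattern.
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Toy data (invented for illustration, not results from the study):
# 5 target-host isolates, 4 of which carry the pattern, and
# 10 non-target isolates, 1 of which carries the pattern.
toy_has_pattern = [True] * 4 + [False] + [True] + [False] * 9
toy_is_target = [True] * 5 + [False] * 10

sens, spec = sensitivity_specificity(toy_has_pattern, toy_is_target)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Read this way, the deer result in the abstract means that 82% of deer isolates carried the deer pattern (sensitivity), while 98% of isolates from other hosts did not carry it (specificity).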

Publication date

2015-11-01

Author profiles