Pseudogenes and Their Genome-Wide Prediction in Plants

Citation

Xiao J, Sekhwal MK, Li P, Ragupathy R, Cloutier S, Wang X, You FM (2016) Pseudogenes and Their Genome-Wide Prediction in Plants. Int J Mol Sci. 17(12). pii: E1991.

Plain language summary

Pseudogenes are gene copies generated from ancestral functional genes during genome evolution. They accumulate mutations in their coding sequences, such as frameshifts and premature stop codons, that may impair their transcription or translation. Generally, pseudogenes are functionless, but recent evidence demonstrates that some of them have potential roles in regulation. Several bioinformatics tools are available for pseudogene prediction, including PseudoPipe, PSF and Shiu’s pipeline. We compared all three tools for predicting pseudogenes in the Arabidopsis thaliana genome, using 924 known pseudogenes as a test data set. PseudoPipe and Shiu’s pipeline identified ~80% of A. thaliana pseudogenes, while PSF failed to generate adequate results. More advanced bioinformatics tools are still needed to improve the accuracy of pseudogene prediction and genome annotation in plants.

Abstract

Pseudogenes are paralogs generated from ancestral functional genes (parents) during genome evolution that contain critical defects in their sequences, such as a missing promoter, a premature stop codon or frameshift mutations. Generally, pseudogenes are functionless, but recent evidence demonstrates that some of them have potential roles in regulation. The majority of pseudogenes are generated from functional progenitor genes either by gene duplication (duplicated pseudogenes) or by retrotransposition (processed pseudogenes). Pseudogenes are primarily identified by comparison to their parent genes. Bioinformatics tools for pseudogene prediction have been developed, among which PseudoPipe, PSF and Shiu’s pipeline are publicly available. We compared these three tools using the well-annotated Arabidopsis thaliana genome and its 924 known pseudogenes as a test data set. PseudoPipe and Shiu’s pipeline identified ~80% of A. thaliana pseudogenes, of which 94% were shared, while PSF failed to generate adequate results. These results identify a need to improve the accuracy of bioinformatics tools for pseudogene prediction in plant genomes, with the ultimate goal of improving the quality of genome annotation in plants.
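
The sequence defects named above can be illustrated with a minimal Python sketch. It is not the method used by PseudoPipe, PSF or Shiu’s pipeline, which work by comparing candidates to their parent genes. The function name find_disablements is hypothetical, and the sketch assumes the candidate sequence is already in reading frame 0 and treats a length that is not a multiple of three as a crude proxy for a frameshift.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_disablements(cds: str) -> dict:
    """Report simple coding defects in a candidate sequence read in frame 0.

    Hypothetical helper for illustration only; real pipelines detect
    disablements from an alignment to the parent gene or protein.
    """
    seq = cds.upper()
    # Split into complete codons, ignoring any trailing 1-2 bases.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    # A stop codon before the final complete codon is treated as premature.
    premature_stops = [
        index for index, codon in enumerate(codons[:-1]) if codon in STOP_CODONS
    ]
    return {
        "premature_stop_codons": premature_stops,
        # Crude frameshift proxy: an intact coding sequence has whole codons only.
        "length_not_multiple_of_three": len(seq) % 3 != 0,
    }


if __name__ == "__main__":
    # Toy sequences (made up for illustration, not real A. thaliana data).
    print(find_disablements("ATGAAAGGCTAA"))     # intact toy open reading frame
    print(find_disablements("ATGAAATAGGGCTAA"))  # in-frame stop at codon index 2
    print(find_disablements("ATGAAGGCTAA"))      # one base deleted
```

A check of this kind only flags candidates; distinguishing a genuine pseudogene from a sequencing or annotation error still requires comparison to the parent gene.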