Estimation of missing values in a food property database by matrix completion using PCA-based approaches

Citation

Mercier, S., Mondor, M., Marcos, B., Moresoli, C., Villeneuve, S. (2017). Estimation of missing values in a food property database by matrix completion using PCA-based approaches. Chemometrics and Intelligent Laboratory Systems, [online] 166, 37-48. http://dx.doi.org/10.1016/j.chemolab.2017.04.008

Plain language summary

In this work, five matrix completion algorithms were investigated for the estimation of missing values in a food property database: iterative principal component analysis with and without early stopping, trimmed scores regression with and without early stopping, and variational Bayesian principal component analysis. Matrix completion was applied to a food property database (31 properties × 663 observations) developed by meta-analysis for new food product development, a novel application of matrix completion.

Abstract

In this work, five matrix completion algorithms were investigated for the estimation of missing values in a food property database: iterative PCA with (IPCAE) and without (IPCA) early stopping, trimmed scores regression with (TSRE) and without (TSR) early stopping, and variational Bayesian PCA (VBPCA). Matrix completion was applied to a food property database (31 properties × 663 observations) developed by meta-analysis for new food product development, a novel application of matrix completion. In the database, 68.7% of the values were missing. VBPCA and TSRE were the most accurate algorithms and explained on average 42% and 40%, respectively, of the variance of the missing values. Incorporating an early stopping step in the TSR and IPCA algorithms decreased overfitting and significantly improved their accuracy. The accuracy of the missing value estimates varied significantly between properties, and the coefficient of determination for each property with VBPCA ranged from 0.02 to 0.84. The accuracy of the missing value estimates was higher when properties known for only a few observations were included in the database, indicating that the matrix completion algorithms successfully used the additional information that those properties provided to improve the estimation of the other properties in the database. For 17% of the database, the matrix completion algorithms could identify whether the missing value was above or below the average value of the property with a confidence level above 90%, providing additional information for product characterization at no experimental cost.
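
To make the idea of iterative PCA with early stopping more concrete, the following is a minimal sketch in Python with NumPy. It is not the authors' implementation: the abstract does not specify the stopping criterion, so the held-out validation set, the `patience` rule, the function name `ipca_impute_early_stop`, and all parameter defaults are assumptions made for illustration. Missing entries are assumed to be encoded as NaN.

```python
import numpy as np

def ipca_impute_early_stop(X, n_components=2, max_iter=200, tol=1e-6,
                           val_fraction=0.1, patience=5, seed=0):
    """Fill NaNs in X by iterative PCA; stop when a held-out error stops improving.

    Hypothetical helper: the validation scheme and defaults are assumptions.
    """
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    rng = np.random.default_rng(seed)

    # Hide a random subset of the observed entries to monitor overfitting.
    obs_idx = np.argwhere(observed)
    n_val = max(1, int(val_fraction * len(obs_idx)))
    val_idx = obs_idx[rng.choice(len(obs_idx), size=n_val, replace=False)]
    val_r, val_c = val_idx[:, 0], val_idx[:, 1]
    train = observed.copy()
    train[val_r, val_c] = False

    # Initialise every non-training entry with the column mean of the training entries.
    sums = np.where(train, X, 0.0).sum(axis=0)
    counts = np.maximum(train.sum(axis=0), 1)
    filled = np.where(train, X, sums / counts)

    best, best_err, bad_rounds = filled.copy(), np.inf, 0
    k = n_components
    for _ in range(max_iter):
        # Fit a rank-k PCA model to the current completed matrix.
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        recon = (U[:, :k] * s[:k]) @ Vt[:k] + mu

        # Keep training entries, replace everything else with the reconstruction.
        new = np.where(train, X, recon)

        # Early stopping: track the error on the hidden validation entries.
        val_err = np.sqrt(np.mean((recon[val_r, val_c] - X[val_r, val_c]) ** 2))
        if val_err < best_err:
            best, best_err, bad_rounds = new.copy(), val_err, 0
        else:
            bad_rounds += 1
            if bad_rounds >= patience:
                break

        # Ordinary convergence check on the change between iterations.
        converged = np.linalg.norm(new - filled) <= tol * np.linalg.norm(filled)
        filled = new
        if converged:
            break

    # Restore all originally observed values, including the validation entries.
    best[observed] = X[observed]
    return best
```

Without the validation check, the loop reduces to plain iterative PCA run to convergence. The early-stopping variant returns the iterate with the lowest held-out error, which is one common way to limit overfitting when a large share of the matrix is missing.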