+36 1 463 5918 | firepharma@mail.bme.hu

Chemometrics

Data preprocessing

Chemometric models can only be applied if the investigated data are comparable to the calibration data and to each other. However, the data may carry variation caused by differing measurement conditions or other environmental factors, which can degrade the chemometric models. Data preprocessing is used to reduce the effect of these artifacts, and it is one of the most crucial parts of model building. Preprocessing always simplifies the data and can therefore erase valuable information; properly chosen preprocessing reduces the effect of unwanted artifacts while keeping all information related to the quality attributes.

The most commonly used preprocessing steps are baseline correction, normalization, derivatives, smoothing, standardization, mean centering, and standard normal variate (SNV). Some preprocessing steps may be prerequisites for the subsequent models.
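As an illustration of one of these steps, the standard normal variate transform can be sketched in a few lines of numpy. The spectra below are synthetic values chosen only for demonstration:

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: center and scale each spectrum (row)
    to zero mean and unit standard deviation."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std

# Two synthetic spectra differing only in a multiplicative scatter factor
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])
X_snv = snv(X)
# After SNV both rows coincide: the additive and multiplicative
# scatter effects have been removed.
```

Because the second spectrum is a scaled copy of the first, SNV maps both onto the same curve, which is exactly the artifact-removal behavior described above.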

Univariate methods

Classic univariate methods create linear regressions between the explanatory and dependent variables. The value extracted from a spectrum can be a peak intensity, an area under the curve (AUC), or in some cases a ratio between peaks.
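A minimal sketch of such a univariate calibration, using hypothetical peak intensities and reference concentrations (the values are invented for illustration):

```python
import numpy as np

# Hypothetical calibration set: peak intensity vs. reference concentration
intensity = np.array([0.11, 0.19, 0.42, 0.60, 0.81])   # explanatory variable
concentration = np.array([1.0, 2.0, 4.0, 6.0, 8.0])    # dependent variable

# Ordinary least-squares line: concentration = slope * intensity + intercept
slope, intercept = np.polyfit(intensity, concentration, deg=1)

# Predict the concentration of an unknown sample from its peak intensity
unknown_intensity = 0.50
predicted = slope * unknown_intensity + intercept
```

The same pattern applies when the extracted value is an AUC or a peak ratio; only the explanatory variable changes.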

Multivariate methods

Advanced statistical methods use a larger part of, or the whole, dataset. These methods are more reliable, as their dependence on measurement inaccuracies is reduced, but in most cases the relation between the measurement and the quality attribute cannot be extracted directly. Multivariate methods include both qualitative and quantitative techniques.

Qualitative methods compare the match between datasets and can thereby determine the sample's components. Pattern recognition or classification methods generate categories and identify, with a given probability, which category an observation belongs to. The most commonly used methods are principal component analysis (PCA) and k-nearest neighbors (kNN) models.
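A compact sketch of PCA via the singular value decomposition, applied to two hypothetical sample groups (synthetic data; the group means and noise level are assumptions for illustration):

```python
import numpy as np

def pca_scores(X, n_components=2):
    """PCA via SVD of the mean-centered data; returns scores and loadings."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components]
    return scores, loadings

# Two hypothetical sample groups with different spectral profiles
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[1.0, 0.0, 0.0], scale=0.05, size=(5, 3))
group_b = rng.normal(loc=[0.0, 1.0, 0.0], scale=0.05, size=(5, 3))
X = np.vstack([group_a, group_b])

scores, loadings = pca_scores(X, n_components=2)
# Samples of the same group cluster together along the first component,
# which is the basis of PCA-based pattern recognition.
```

In practice the scores would then feed a classifier such as kNN, which assigns a new observation to the category of its nearest calibration samples.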

Quantitative methods can determine the composition of the sample and the ratio of its ingredients. Most of these methods require calibration sets and a validated, reliable reference method; therefore, they cannot exceed the accuracy of the reference method. These models range from Classical Least Squares (CLS) and Multiple Linear Regression (MLR) techniques to more complex models such as Principal Component Regression (PCR) and Partial Least Squares Regression (PLS), and even machine learning methods such as Artificial Neural Networks (ANNs).
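The simplest of these, Classical Least Squares, can be sketched directly: a mixture spectrum is modeled as a linear combination of the pure-component spectra, and the composition is recovered by least squares. The spectra and ratios below are synthetic, noise-free values for illustration:

```python
import numpy as np

# Hypothetical pure-component spectra (rows) at four wavelengths
pure_spectra = np.array([[1.0, 0.5, 0.1, 0.0],    # component 1
                         [0.0, 0.2, 0.6, 1.0]])   # component 2

true_ratios = np.array([0.7, 0.3])
mixture = true_ratios @ pure_spectra  # noise-free mixture spectrum

# CLS: estimate the composition by solving  mixture ≈ ratios @ pure_spectra
est_ratios, *_ = np.linalg.lstsq(pure_spectra.T, mixture, rcond=None)
```

PCR, PLS, and ANN models address the same quantitative task but work on latent variables or nonlinear mappings instead of known pure spectra.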

During model building, the goodness of fit has to be evaluated with appropriate statistics, and overfitting of the model has to be avoided. The coefficient of determination, the Standard Errors of Calibration and Prediction, and the bias are the most commonly used goodness-of-fit statistics. Cross-validation and external validation should be used during model building. Overfitting is apparent when the quality statistics for validation do not improve together with the quality statistics for calibration.
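These goodness-of-fit statistics follow directly from the residuals between reference and predicted values; a minimal sketch with hypothetical concentrations:

```python
import numpy as np

def goodness_of_fit(y_ref, y_pred):
    """Coefficient of determination (R2), root mean square error,
    and bias between reference and predicted values."""
    y_ref, y_pred = np.asarray(y_ref, float), np.asarray(y_pred, float)
    residuals = y_ref - y_pred
    ss_res = (residuals ** 2).sum()
    ss_tot = ((y_ref - y_ref.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt((residuals ** 2).mean())
    bias = residuals.mean()
    return r2, rmse, bias

# Hypothetical reference vs. predicted concentrations
y_ref = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.1, 3.9, 5.1]
r2, rmse, bias = goodness_of_fit(y_ref, y_pred)
```

Computed on the calibration set the RMSE corresponds to the Standard Error of Calibration; computed on an independent validation set it corresponds to the Standard Error of Prediction, and comparing the two reveals overfitting.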

Variable Selection

Even though multivariate data analysis performs better due to the large amount of data included in the models, inadequate or unnecessary data should be removed to obtain the best models. Variable selection can be done manually, by examining the changes in the data and selecting the regions that change accordingly. Alternatively, automatic methods such as Genetic Algorithms (GA) or interval PLS can select the variables that provide the best models. These methods are based on extensive model building with different subsets of variables and on a fitness parameter that determines the quality of each model.
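The interval idea can be sketched with plain least squares in place of PLS: split the variables into contiguous blocks, fit a model on each block, and keep the block with the best fitness (here, the lowest calibration RMSE). The data are synthetic, with only one block carrying information:

```python
import numpy as np

def best_interval(X, y, n_intervals=4):
    """Interval-selection sketch: fit a least-squares model on each
    contiguous block of variables and keep the block with the lowest
    calibration RMSE (the fitness parameter)."""
    n_samples, n_vars = X.shape
    edges = np.linspace(0, n_vars, n_intervals + 1, dtype=int)
    best, best_rmse = None, np.inf
    for start, stop in zip(edges[:-1], edges[1:]):
        Xi = np.column_stack([X[:, start:stop], np.ones(n_samples)])
        coef, *_ = np.linalg.lstsq(Xi, y, rcond=None)
        rmse = np.sqrt(((Xi @ coef - y) ** 2).mean())
        if rmse < best_rmse:
            best, best_rmse = (int(start), int(stop)), rmse
    return best, best_rmse

# Synthetic data: only variables 4-7 carry information about y
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 16))
y = X[:, 4:8].sum(axis=1)
interval, rmse = best_interval(X, y, n_intervals=4)
```

A genetic algorithm replaces the exhaustive interval loop with an evolutionary search over variable subsets, but the fitness-driven selection is the same.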

Variable selection is another way to reduce the effect of the artifacts mentioned under data preprocessing. However, careless variable selection can lead to overfitting.