In machine learning, standardizing the data is crucial because many algorithms are sensitive to the scale of the features. For example, ridge and Lasso regression penalize the size of the coefficients, so, unlike in OLS, the scale of each feature directly affects how strongly it is penalized.
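As a quick illustration, here is a minimal sketch with synthetic data (all names made up) of how Lasso's penalty depends on scale: two features carry the same signal, but the one stored on a 1000x larger scale needs a 1000x smaller coefficient and thus largely escapes the L1 penalty, while the unit-scale feature gets shrunk.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
x_small = rng.normal(size=200)           # feature on a unit scale
x_large = 1000 * rng.normal(size=200)    # same kind of signal, scale x1000
X = np.column_stack([x_small, x_large])
y = x_small + x_large / 1000 + rng.normal(scale=0.1, size=200)

# True coefficients are (1, 0.001); the L1 penalty noticeably shrinks the
# first while barely touching the second, purely because of feature scale.
print(Lasso(alpha=0.1).fit(X, y).coef_)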
How we standardize the training and test sets also matters. Careless standardization can leak information from the test data into the training process, which biases model evaluation and the choice of tuning parameters.
Here I give three cases of standardization:
(1) Standardize with the whole data
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the pooled training and test data (the leaky scheme)
scaler = StandardScaler().fit(np.vstack([x_train, x_test]))
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)
(2) Standardize each set separately
from sklearn.preprocessing import StandardScaler

# Fit a separate scaler on each set
scaler_train = StandardScaler().fit(x_train)
scaler_test = StandardScaler().fit(x_test)
x_train_scaled = scaler_train.transform(x_train)
x_test_scaled = scaler_test.transform(x_test)
(3) Standardize with the training data
from sklearn.preprocessing import StandardScaler

# Fit only on the training data; reuse its mean and std for the test set
scaler = StandardScaler().fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)
The recommended scheme is (3), because it uses only information from the training data, even when transforming the test set for evaluation. I will give one intuition and two examples.
For intuition, suppose we receive only one test observation at a time; in that case it is impossible to estimate a mean and standard deviation from the test data itself. We therefore standardize each test observation with the mean and standard deviation estimated from the training data.
With this intuition in mind, one might object that in practice we often have plenty of test data, so we could do as in (2). However, in machine learning we assume the training and test data are sampled from the same distribution, and the model is trained on data standardized with the training statistics. This means the scale (or, loosely, the distribution) of the training data is embedded in the trained model. If we rescale the test set with its own statistics, we are potentially feeding a differently scaled distribution into that model.
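To make this concrete, here is a tiny sketch with made-up numbers: the test values are genuinely larger than the training values, but a scaler fitted on the test set itself, as in scheme (2), erases that shift, while the training-fitted scaler of scheme (3) preserves it.
import numpy as np
from sklearn.preprocessing import StandardScaler

x_train = np.array([[1.0], [2.0], [3.0]])  # training mean is 2
x_test = np.array([[4.0], [5.0], [6.0]])   # test values shifted upward

train_scaler = StandardScaler().fit(x_train)
test_scaler = StandardScaler().fit(x_test)

print(train_scaler.transform(x_test).ravel())  # all positive: shift preserved
print(test_scaler.transform(x_test).ravel())   # centered at 0: shift erased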
The second example is cross-validation: with 10 folds, each round produces a test error on the held-out fold. If we scale the data as in (1), information from each held-out fold leaks into the training folds and distorts the errors used to select tuning parameters; scikit-learn's standard remedy is sketched below.
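Here is a sketch of that remedy (the model and data are placeholders, not from the original post): wrapping StandardScaler and the estimator in a Pipeline makes cross_val_score refit the scaler on the training folds of each split, so the held-out fold never influences the scaling statistics.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=100)

# The scaler is refit inside each of the 10 training splits, so the
# held-out fold never contributes to the mean / std used for scaling.
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
print(cross_val_score(pipe, X, y, cv=10).mean())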
In conclusion, when standardizing training and test data, we should follow (3): standardize both sets with the mean and standard deviation of the training data.