About Feature Scaling and Normalization
The effect of standardization on machine learning algorithms.
- About Standardization
- About Min-Max scaling
- Z-Score Standardization or Min-Max Scaling
- Standardizing and Normalizing - How it can be done using scikit-learn
- Bottom-up Approaches
- The effect of standardization on PCA in a pattern classification task
- Appendix: The effect of scaling and mean centering of variables prior to PCA
About Standardization
The result of standardization (or Z-score normalization) is that the features are rescaled so that they have the properties of a standard normal distribution with: $$\mu = 0$$ And $$\sigma = 1$$
where $\mu$ is the mean (average) and $\sigma$ is the standard deviation from the mean; standard scores (also called Z-scores) of the samples are calculated as follows: $$z = \frac{x - \mu}{\sigma}$$
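For example (with hypothetical numbers), a sample $x = 15$ drawn from a feature with $\mu = 13$ and $\sigma = 2$ has a standard score of $z = \frac{15 - 13}{2} = 1$, i.e., it lies one standard deviation above the mean.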
Standardizing the features so that they are centered around $0$ with a standard deviation of $1$ is not only important if we are comparing measurements that have different units; it is also a general requirement for many machine learning algorithms. Intuitively, we can think of gradient descent (an optimization algorithm) as a prominent example, often used in:
- Logistic Regression
- Support Vector Machines
- Perceptrons
- Neural Networks
With features on different scales, certain weights may update faster than others since the feature values $x_j$ play a role in the weight updates: $$\Delta w_j = -\eta \frac{\partial J}{\partial w_j} = \eta \sum_i (t^{i} - o^{i})x_j^{i}$$
so that $w_j := w_j + \Delta w_j$, where $\eta$ is the learning rate, $t$ the target class label, and $o$ the actual output. Other intuitive examples include:
- K-Nearest Neighbors
- Clustering algorithms that use, for example, Euclidean distance measures

In fact, the only family of algorithms that I could think of as being scale-invariant is tree-based methods; they are probably the only classifiers where feature scaling doesn't make a difference.
Let's take a general CART decision tree algorithm. Without going into much depth regarding information gain and impurity measures, we can think of the decision as "is feature $x_i \geq$ some_value?" Intuitively, we can see that it really doesn't matter on which scale this feature is.
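To make the earlier weight-update argument concrete, here is a minimal NumPy sketch (with made-up data) showing that a feature on a roughly 1000x larger scale receives correspondingly larger updates under the rule $\Delta w_j = \eta \sum_i (t^{i} - o^{i})x_j^{i}$:
import numpy as np

rng = np.random.default_rng(0)
# two features with the same distribution shape but very different scales
X = np.column_stack([rng.normal(0, 1, 100),        # feature 1: std = 1
                     rng.normal(0, 1000, 100)])    # feature 2: std = 1000
t = rng.integers(0, 2, 100).astype(float)          # made-up target labels
o = np.zeros(100)                                  # outputs of an untrained model
eta = 0.01

delta_w = eta * (t - o) @ X                        # one batch update per weight
print(delta_w)   # the second weight update is typically about 1000x larger in magnitude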
Some examples of algorithms where feature scaling matters are:
- k-nearest neighbors with a Euclidean distance measure, if we want all features to contribute equally.
- k-means (similar to k-nearest neighbors).
- Linear Regression, Logistic Regression, Support Vector Machines, Perceptrons, and Neural Networks, as long as they use gradient descent optimization.
- Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA), as long as we want to find the directions that maximize the variance (under the constraint that those directions/eigenvectors/principal components are orthogonal); we want features on the same scale, since otherwise we'd emphasize variables on "larger measurement scales" more.
In addition, we'd also want to think about whether we want to standardize or normalize (here: scaling to the [0, 1] range) our data. Some algorithms assume that our data is centered at 0. For example, if we initialize the weights of a small multi-layer perceptron with tanh activation units to 0 or to small random values centered around zero, we want to update the model weights equally. As a rule of thumb: when in doubt, just standardize the data; it shouldn't hurt.
About Min-Max scaling
An alternative approach to Z-score normalization (also called standardization) is the so-called Min-Max scaling (often also simply called normalization - a common cause of ambiguities).
In this approach, the data is scaled to a fixed range - usually [0, 1].
The cost of having this bounded range - in contrast to standardization - is that we will end up with smaller standard deviations, which can suppress the effect of outliers.
Note: if the dataset has a lot of outliers and the algorithm is sensitive to outliers, consider using the Min-Max scaler.
A Min-Max scaling is typically done via the following equation:
$$X_{norm} = \frac{X_{i} - X_{min}}{X_{max} - X_{min}}$$
where $X_i$ is the $i^{th}$ sample of the dataset.
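For example, if a feature ranges from $X_{min} = 1$ to $X_{max} = 5$, a sample $X_i = 2$ is rescaled to $\frac{2 - 1}{5 - 1} = 0.25$.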
Z-Score Standardization or Min-Max Scaling
"Standardization or Min-Max scaling"? - There is no obvious answer to this question: it really depends on the application.
For example, in clustering analyses, standardization may be especially crucial in order to compare similarities between features based on certain distance measures. Another prominent example is the Principal Component Analysis
, where we usually prefer Standardization
over Min-Max Scaling
, since we are interested in the components that maximize the variance(depending on the question and if the PCA computes the components via the correaltion matrix instead of the covariance matrix)
However this doesn't mean that Min-Max Scaling
is not useful at all, A popular application is image processing
, where pixel intensities have to be normalized to fit withint a certain range (i.e., [0, 255]
for the RGB colour range). Also, typical Neural Network Algorithm require data that on a 0 - 1
scale.
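As a small illustration of the image-processing case (purely hypothetical pixel values), rescaling 8-bit intensities to [0, 1] is just a Min-Max scaling with the known bounds 0 and 255:
import numpy as np

# hypothetical 8-bit pixel intensities
pixels = np.array([[0, 64, 128],
                   [191, 255, 32]], dtype=np.uint8)

# Min-Max scaling with fixed bounds X_min = 0 and X_max = 255
pixels_01 = pixels.astype(np.float64) / 255.0
print(pixels_01)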
Standardizing and Normalizing - How it can be done using scikit-learn
Of course, we could make use of NumPy's vectorization capabilities to calculate the Z-scores for standardization and to normalize the data using the equations that were mentioned in the previous sections. However, there is an even more convenient approach using the preprocessing module from one of Python's open-source machine learning libraries, scikit-learn.
For the following examples and discussion, we will have a look at the free Wine dataset that is deposited in the UCI Machine Learning Repository:
Forina, M. et al., PARVUS - An Extendible Package for Data Exploration, Classification and Correlation. Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy.
Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
The Wine dataset consists of 3 different classes, where each row corresponds to a particular wine sample. The class labels (1, 2, 3) are listed in the first column, and columns 2-14 correspond to 13 different attributes (features):
- Alcohol
- Malic Acid
- ...
Loading the wine dataset
import pandas as pd
import numpy as np
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'
df = pd.read_csv(url, header=None, usecols=[0, 1, 2])
df.columns = ['Label', 'Alcohol', 'MalicAcid']
df.info()
df.head()
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
standard_scale = StandardScaler()
df_stdScale = standard_scale.fit_transform(df[['Alcohol', 'MalicAcid']])
df_stdScale[:5]
minmax_scale = MinMaxScaler()
df_minmax = minmax_scale.fit_transform(df[['Alcohol', 'MalicAcid']])
df_minmax[:5]
print(f"The mean of Standard Scaling: {np.mean(df_stdScale):.2f}")
print(f"The Standard Deviation of Standard Scaling: {np.std(df_stdScale):.2f}")
print(f"The min of Min-Max Scaling: {df_minmax.min():.2f}")
print(f"The max of Min-Max Scaling: {df_minmax.max():.2f}")
print(f"The mean of Min-Max Scaling: {np.mean(df_minmax):.2f}")
print(f"The Standard Deviation of Min-Max Scaling: {np.std(df_minmax):.2f}")
%matplotlib inline
from matplotlib import pyplot as plt
def plot():
    plt.figure(figsize=(8,6))
    plt.scatter(df['Alcohol'], df['MalicAcid'],
                color='green', label='Input Scale', alpha=0.5)
    plt.scatter(df_stdScale[:, 0], df_stdScale[:, 1],
                color='red', label='Standardized [mean=0, std=1]', alpha=0.3)
    plt.scatter(df_minmax[:, 0], df_minmax[:, 1], label='min-max scaled [min=0, max=1]', alpha=0.3)
    plt.title('Alcohol and Malic Acid content of the wine dataset')
    plt.xlabel('Alcohol')
    plt.ylabel('Malic Acid')
    plt.legend(loc='upper left')
    plt.grid()
    plt.tight_layout()

plot()
plt.show()
The plot above includes the wine datapoints on all 3 different scales:
- Input scale: the original input scale
- Standardized: using sklearn's StandardScaler
- Min-Max scaled: using sklearn's MinMaxScaler

In the next plot we will zoom in on each of the 3 scales to see the class label distribution; they should have the same location/distribution, differing only in scale.
fig, ax = plt.subplots(3, figsize=(6,14))

for a, d, l in zip(range(len(ax)),
                   (df[['Alcohol', 'MalicAcid']].values, df_stdScale, df_minmax),
                   ('Input scale',
                    'Standardized [mean=0, std=1]',
                    'min-max scaled [min=0, max=1]')):
    for i, c in zip(range(1, 4), ('red', 'blue', 'green')):
        ax[a].scatter(d[df['Label'].values == i, 0],
                      d[df['Label'].values == i, 1],
                      alpha=0.5,
                      color=c,
                      label='Class %s' % i)
    ax[a].set_title(l)
    ax[a].set_xlabel('Alcohol')
    ax[a].set_ylabel('Malic Acid')
    ax[a].legend(loc='upper left')
    ax[a].grid()

plt.tight_layout()
plt.show()
Bottom-up Approaches
We can also code the equations for standardization and [0, 1] Min-Max scaling manually. However, the scikit-learn methods are still useful if you are working with training and test datasets and want to scale them equally:
from sklearn import preprocessing

std_scale = preprocessing.StandardScaler().fit(X_train)  # learn mean and std from the training set only
X_train = std_scale.transform(X_train)
X_test = std_scale.transform(X_test)                     # apply the same parameters to the test set
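The same fit-on-training, transform-both pattern applies to Min-Max scaling; a minimal sketch, assuming X_train and X_test are defined as above (the *_minmax names are just illustrative):
from sklearn.preprocessing import MinMaxScaler

minmax_scale = MinMaxScaler().fit(X_train)        # learn min and max from the training set only
X_train_minmax = minmax_scale.transform(X_train)
X_test_minmax = minmax_scale.transform(X_test)    # reuse the same min/max for the test set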
Here, we perform the computation using native Python code and the more convenient NumPy solution, which is especially useful if we want to transform a whole matrix at once.
Recall the equations:
- Standardization: $$z = \frac{x - \mu}{\sigma}$$
- Mean: $$\mu = \frac{1}{N}\sum_{i=1}^{N}x_i$$
- Standard Deviation: $$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$
- Min-Max Scaling: $$X_{norm} = \frac{X_i - X_{min}}{X_{max} - X_{min}}$$
Native Python
x = [1, 4, 5, 6, 6, 2, 3]
mean = sum(x)/len(x)
std_dev = (1/len(x) * sum([(x_i - mean)**2 for x_i in x]))**0.5
z_scores = [(x_i - mean)/std_dev for x_i in x]
print("Native Python")
print(f"Array X: {x}\n")
print(f"Array X mean: {mean}\n")
print(f"Array X Standard Deviation: {std_dev}\n")
print(f"Array X Z Scores: {z_scores}\n")
print("NumPy for cross validation")
print(f"Numpy Array: {x}\n")
print(f"Numpy Array mean: {np.mean(x)}\n")
print(f"Numpy Array Standard Deviation: {np.std(x)}\n")
print(f"Numpy Z Scores: {[(x_i - np.mean(x))/np.std(x) for x_i in x]}\n")
print(f"Min-Max: {[(x_i - min(x))/ (max(x) - min(x)) for x_i in x]}")
import numpy as np
# Standardization
x_np = np.array(x)
z_scores_np = (x_np - x_np.mean()) / x_np.std()
print(f"X NumPy array: {x_np}\n")
print(f"Z scores in Numpy: \n{z_scores_np}")
np_minmax = (x_np - x_np.min()) / (x_np.max() - x_np.min())
print(f"Min-Max scaling in Numpy: \n{np_minmax}")
The effect of standardization on PCA in a pattern classification task
In a previous section we mentioned Principal Component Analysis (PCA) as an example where standardization is crucial, since it analyzes the variances of the different features. Now, let's see how standardization affects PCA and a subsequent supervised classification on the whole wine dataset.
We will go through the following steps:
- read_csv the whole dataset
- train_test_split the dataset into a training set and a testing set
- Standardize the features
- PCA to reduce dimensionality
- Train a classifier
- Evaluate the performance with and without standardization.
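As an aside, the standardized variant of this workflow could also be chained into a single estimator with scikit-learn's Pipeline; a minimal sketch (we stick with the explicit step-by-step version below so the intermediate results are easier to inspect):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB

# standardize -> project onto 2 principal components -> naive Bayes classifier
clf_pipeline = make_pipeline(StandardScaler(), PCA(n_components=2), GaussianNB())
# clf_pipeline.fit(X_train, y_train)
# clf_pipeline.score(X_test, y_test)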
Read full dataset
df = pd.read_csv(url, header=None)
df.info()
df.head()
from sklearn.model_selection import train_test_split
X = df.values[:, 1:]
y = df.values[:, 0]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12345)
print(f"The shape of X_train is: {X_train.shape}\n")
print(f"The shape of X_test is {X_test.shape}")
from sklearn import preprocessing
std_scale = preprocessing.StandardScaler().fit(X_train)
X_train_std = std_scale.transform(X_train)
X_test_std = std_scale.transform(X_test)
Dimensionality reduction via Principal Component Analysis (PCA)
Now, we perform PCA on the standardized and the non-standardized datasets to transform them onto a 2-dimensional feature subspace.
In a real application, a procedure like cross-validation would be done in order to find out which choice of features yields an optimal balance between preserving information and overfitting for different classifiers. However, we will omit this step, since we don't want to train a perfect classifier here, but merely compare the effects of standardization.
from sklearn.decomposition import PCA
# on non-standardized data
pca = PCA(n_components=2).fit(X_train)
X_train = pca.transform(X_train)
X_test = pca.transform(X_test)
# on standardized data
pca_std = PCA(n_components=2).fit(X_train_std)
X_train_std = pca_std.transform(X_train_std)
X_test_std = pca_std.transform(X_test_std)
Let us quickly visualize what our new feature subspace looks like (note that class labels are not considered in a PCA - in contrast to a Linear Discriminant Analysis - but I will add them in the plot for clarity).
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10,4))

for l, c, m in zip(range(1, 4), ('blue', 'red', 'green'), ('^', 's', 'o')):
    ax1.scatter(X_train[y_train == l, 0], X_train[y_train == l, 1],
                color=c,
                label='class %s' % l,
                alpha=0.5,
                marker=m)

for l, c, m in zip(range(1, 4), ('blue', 'red', 'green'), ('^', 's', 'o')):
    ax2.scatter(X_train_std[y_train == l, 0], X_train_std[y_train == l, 1],
                color=c,
                label='class %s' % l,
                alpha=0.5,
                marker=m)

ax1.set_title('Transformed NON-standardized training dataset after PCA')
ax2.set_title('Transformed standardized training dataset after PCA')

for ax in (ax1, ax2):
    ax.set_xlabel('1st principal component')
    ax.set_ylabel('2nd principal component')
    ax.legend(loc='upper right')
    ax.grid()

plt.tight_layout()
plt.show()
Before we train a classifier, we can already see clearly from the plot that, on the left, it would be difficult to find a decision boundary for the training set without standardization, whereas on the right, with standardization, the class separation is much clearer.
Train a naive Bayes Classifier
We will use a naive Bayes classifier for the classification task. If you are not familiar with it, the term "naive" comes from the assumption that all features are independent of each other, which is almost never true in the real world. It works based on Bayes' rule:
$$P(w_j | x) = \frac{p(x | w_j) \times P(w_j)}{p(x)}$$
where:
- $w$: class label
- $P(w|x)$: posterior probability
- $p(x|w)$: likelihood (class-conditional probability)
- $P(w)$: prior probability
- $p(x)$: evidence
And the decision rule: classify the sample as $w_1$ if $$P(w_1|x) > P(w_2|x)$$ and as $w_2$ otherwise, which by Bayes' rule is equivalent to
$$\frac{p(x|w_1) \times P(w_1)}{p(x)} > \frac{p(x|w_2) \times P(w_2)}{p(x)}$$
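As a tiny numerical illustration (with hypothetical numbers): if $p(x|w_1) = 0.6$, $P(w_1) = 0.3$, $p(x|w_2) = 0.2$, and $P(w_2) = 0.7$, then $p(x|w_1)P(w_1) = 0.18 > p(x|w_2)P(w_2) = 0.14$, so we would predict $w_1$; the evidence $p(x)$ cancels from both sides.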
from sklearn.naive_bayes import GaussianNB
# on non-standardized data
gnb = GaussianNB()
gnb.fit(X_train, y_train)
# on standardized data
gnb_std = GaussianNB()
gnb_std.fit(X_train_std, y_train)
from sklearn import metrics
pred_train = gnb.predict(X_train)
print('\nPrediction accuracy for the training dataset without Standardization')
print('{:.2%}'.format(metrics.accuracy_score(y_train, pred_train)))
pred_test = gnb.predict(X_test)
print('\nPrediction accuracy for the test dataset without Standardization')
print('{:.2%}\n'.format(metrics.accuracy_score(y_test, pred_test)))
pred_train_std = gnb_std.predict(X_train_std)
print('\nPrediction accuracy for the training dataset with Standardization')
print('{:.2%}'.format(metrics.accuracy_score(y_train, pred_train_std)))
pred_test_std = gnb_std.predict(X_test_std)
print('\nPrediction accuracy for the test dataset with Standardization')
print('{:.2%}\n'.format(metrics.accuracy_score(y_test, pred_test_std)))
# Repeat the comparison with a decision tree classifier
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
clf_std = DecisionTreeClassifier()
clf_std.fit(X_train_std, y_train)
pred_train_clf = clf.predict(X_train)
print(f"\nPrediction accuracy for the training dataset without Standardization")
print(f"{metrics.accuracy_score(y_train, pred_train_clf):.2%}")
pred_train_clf_test = clf.predict(X_test)
print("\nPrediction accuracy for the test dataset without Standardization")
print(f"{metrics.accuracy_score(y_test, pred_train_clf_test):.2%}")
pred_train_clf_std = clf_std.predict(X_train_std)
print('\nPrediction Accuracy for the training dataset with Standardization')
print(f"{metrics.accuracy_score(y_train, pred_train_clf_std):.2%}")
pred_train_clf_std_test = clf_std.predict(X_test_std)
print('\nPrediction Accuracy for the test dataset with Standardization')
print(f"{metrics.accuracy_score(y_test, pred_train_clf_std_test):.2%}")
As we can see, standardization prior to PCA definitely led to a decrease in the empirical error rate when classifying samples from the test dataset.
For the naive Bayes classifier, the test accuracy after PCA without standardization is around $64.81\%$; with standardization, performance increases to $98.15\%$.
Similarly, for the DecisionTreeClassifier, test set performance increases from $68.52\%$ to $96.30\%$.
Appendix: The effect of scaling and mean centering of variables prior to PCA
Mean centering does not affect the covariance matrix
Here, the rationale is: if the covariance is the same whether the variables are centered or not, the result of the PCA will be the same.
Let’s assume we have the 2 variables $x$ and $y$. Then the covariance between the attributes is calculated as
$$\sigma_{xy} = \frac{1}{n-1}\sum_i^n(x_i - \bar{x})(y_i - \bar{y})$$
Let's write the centered variables as: $$x^{'}= x - \bar{x}$$ And $$y^{'}= y - \bar{y}$$
The centered covariance would then be calculated as follows: $$\sigma^{'}_{xy} = \frac{1}{n-1}\sum_i^n(x^{'}_i - \bar{x}^{'})(y^{'}_i - \bar{y}^{'})$$
But since after centering, $\bar{x}^{'}=0$ and $\bar{y}^{'}=0$, we have $\sigma^{'}_{xy} = \frac{1}{n-1}\sum^n_i x_i^{'}y_i^{'}$, which is our original covariance if we substitute back the terms $x^{'}=x - \bar{x}$ and $y^{'} = y - \bar{y}$.
Even centering only one variable, let's say $x$, wouldn't affect the covariance:
$$\sigma_{xy} = \frac{1}{n-1}\sum_i^n(x_i^{'} - \bar{x}^{'})(y_i - \bar{y})$$ $$=\frac{1}{n-1}\sum_i^n(x_i^{'} - 0)(y_i - \bar{y})$$ $$=\frac{1}{n-1}\sum_i^n(x_i - \bar{x})(y_i - \bar{y})$$
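A quick numerical check of this claim with NumPy (random, purely illustrative data):
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(10, 2, 50)
y = rng.normal(-5, 3, 50)

cov_raw = np.cov(x, y)[0, 1]                               # covariance of the raw variables
cov_centered = np.cov(x - x.mean(), y - y.mean())[0, 1]    # covariance after mean centering
print(np.isclose(cov_raw, cov_centered))                   # True: centering leaves the covariance unchanged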
Scaling of variables does affect the covariance matrix
If one variable is scaled, e.g., from pounds into kilograms ($1$ pound $= 0.453592$ kg), it does affect the covariance and therefore influences the result of PCA.
Let $c$ be the scaling factor for $X$. Given that the original covariance is computed as: $$\sigma_{xy} = \frac{1}{n - 1}\sum_i^n(x_i - \bar{x})(y_i - \bar{y})$$
the covariance after scaling would be computed as: $$\sigma_{xy}^{'} = \frac{1}{n-1}\sum_i^n(c \times x_i - c \times \bar{x})(y_i - \bar{y})$$ $$=\frac{c}{n-1}\sum_i^n(x_i - \bar{x})(y_i - \bar{y})$$
So we have: $$\sigma_{xy}^{'} = c \times \sigma_{xy}, \quad \text{i.e.} \quad \sigma_{xy} = \frac{\sigma_{xy}^{'}}{c}$$
Therefore, scaling one attribute by the constant $c$ results in a rescaled covariance $c \times \sigma_{xy}$. So if we had scaled $X$ from pounds to kilograms, the covariance between $X$ and $Y$ would be $0.453592$ times its original value.
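Again, a quick numerical check (random data, with $x$ playing the role of a weight in pounds):
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(150, 20, 50)     # e.g., weights in pounds
y = rng.normal(-5, 3, 50)

c = 0.453592                    # pounds -> kilograms
cov_xy = np.cov(x, y)[0, 1]
cov_scaled = np.cov(c * x, y)[0, 1]
print(np.isclose(cov_scaled, c * cov_xy))   # True: scaling x by c scales the covariance by c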
Standardizing affects the covariance
Standardization of features will have an effect on the outcome of a PCA (assuming that the variables are not already standardized). This is because we are scaling the covariance between every pair of variables by the product of their standard deviations.
The equation for standardization of a variable is written as
$$z=\frac{x_i - \bar{x}}{\sigma}$$
The original covariance:
$$\sigma_{xy} = \frac{1}{n-1}\sum_i^n(x_i - \bar{x})(y_i - \bar{y})$$
And after standardizing both variables:
$$x^{'} = \frac{x - \bar{x}}{\sigma_x}$$ and $$y^{'}=\frac{y-\bar{y}}{\sigma_y}$$
$$\sigma^{'}_{xy} = \frac{1}{n-1}\sum_i^n(x_i^{'} - 0)(y_i^{'}-0)$$ $$=\frac{1}{n-1}\sum_i^n\left(\frac{x_i-\bar{x}}{\sigma_x}\right)\left(\frac{y_i-\bar{y}}{\sigma_y}\right)$$ $$=\frac{1}{(n-1) \, \sigma_x\sigma_y}\sum_i^n(x_i - \bar{x})(y_i - \bar{y})$$ $$=\frac{\sigma_{xy}}{\sigma_x\sigma_y}$$
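In other words, the covariance of the standardized variables is the Pearson correlation coefficient of the original variables, which we can verify numerically with random, purely illustrative data:
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(10, 2, 50)
y = 0.5 * x + rng.normal(0, 1, 50)

x_std = (x - x.mean()) / x.std(ddof=1)    # standardize using the sample standard deviation
y_std = (y - y.mean()) / y.std(ddof=1)

print(np.isclose(np.cov(x_std, y_std)[0, 1],    # covariance of the standardized variables...
                 np.corrcoef(x, y)[0, 1]))      # ...equals the correlation of the originals: True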