prev: Note of data science training EP 12: skimage – Look out carefully

There are always outliers in the data. How can we deal with an overfitting model?


We talked in EP 7 about two terms related to outliers that can ruin our models.

Bias is the state of inaccuracy. To reduce bias, we have to do proper data exploration and preparation.

Variance is the state of dissimilarity. Data with high variance is challenging because its patterns are hard to find. We can measure the variance with MSE. To reduce it, we can gather more data, optimize features, change models, or apply regularization.

Regularization

\(L_0\) means no penalty is applied: the model is fit on the raw data as-is.

\(L_1\), also called the Lasso penalty, removes unnecessary features by driving their weights to 0. As a result, the model becomes simpler and its performance improves.

$$L_1 = \|w\|_1 = |w_0|+|w_1|+\cdots+|w_n| = \sum_{i=0}^n|w_i|$$

\(L_2\), a.k.a. the Ridge penalty, shrinks the weights of the features to reduce variance.

$$L_2 = \|w\|_2 = \left(w_0^2+w_1^2+\cdots+w_n^2\right)^{1/2} = \left(\sum_{i=0}^n w_i^2\right)^{\frac{1}{2}}$$

Another one, not shown here, is Elastic Net, which is a mixture of Lasso and Ridge.
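Both penalties get added to the ordinary least-squares loss, scaled by a coefficient \(\alpha\) (the alpha parameter we will pass to the models below). Roughly:

$$\text{Loss} = \text{MSE}(y,\hat{y}) + \alpha\cdot\text{penalty}(w)$$

The larger \(\alpha\) is, the stronger the penalty and the more the weights are shrunk.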

Let’s begin

1. Prepare the data

Let’s say we have already loaded the wine data into a DataFrame.
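If you need to rebuild that DataFrame, a minimal sketch could look like this (the file name winequality-red.csv is an assumption; any wine-quality CSV with a quality column works):

import pandas as pd
# hypothetical file name; the UCI wine-quality CSV uses ';' as its separator
df = pd.read_csv("winequality-red.csv", sep=";")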

Then we use train_test_split to split the data into a train set and a test set, assigning the “quality” column as “y”.

from sklearn.model_selection import train_test_split
# df is the wine DataFrame loaded above
x = df.drop(columns=["quality"])  # feature columns
y = df["quality"]                 # target column
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.75)

2. Standard Scaler

Regularization penalizes the size (magnitude) of the weights, which depends on the scale of the data, so we need to scale the features first. This job can be done with StandardScaler.

from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
ss_train = ss.fit_transform(x_train)  # fit the scaler on the train set only
ss_test = ss.transform(x_test)        # reuse the train-set statistics on the test set

3. Linear Regression

Now we can create a Linear Regression model with the scaled data.

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(ss_train, y_train)

4. Metrics of Linear Regression

After running .fit(), we notice that the MSE and R2 score on the test set are worse than on the train set. The model is overfitting.

from sklearn.metrics import mean_squared_error, r2_score
# compare train-set and test-set performance
print(mean_squared_error(y_train, lr.predict(ss_train)), r2_score(y_train, lr.predict(ss_train)))
print(mean_squared_error(y_test, lr.predict(ss_test)), r2_score(y_test, lr.predict(ss_test)))

5. Lasso

OK, let’s move on to Lasso. Alpha is the coefficient that scales the penalty in the formula above. Here we set alpha to 0.1 (the default alpha value is 1.0).

from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.1)
lasso.fit(ss_train, y_train)
print(lasso.score(ss_train, y_train))  # R2 on the train set

lasso.score() returns the R2 score of the model on the train set.
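The original notes only show the train-set score; as a small extra check (not in the original), we can also look at the test set:

print(lasso.score(ss_test, y_test))  # R2 on the held-out test set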

6. LassoCV

We can run LassoCV to find the best alpha.

import numpy as np
from sklearn.linear_model import LassoCV
lasso_cv = LassoCV(alphas=np.logspace(-1, 1, 100), cv=5, max_iter=5000)
lasso_cv = lasso_cv.fit(ss_train, y_train.values.ravel())
print(lasso_cv.alpha_)   # the best alpha found by cross-validation
print(lasso_cv.coef_)    # the fitted weights, many of them zero
print(lasso_cv.score(ss_train, y_train))  # R2 on the train set

np.logspace(-1, 1, 100) generates an array of 100 elements from \(10^{-1}\) to \(10^1\); it becomes the list of alpha candidates. cv sets the number of cross-validation folds and max_iter defines the maximum number of iterations in the calculation. Finally, we got an R2 score of 0.2368 on the test set.
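As a quick sanity check of that alpha grid (illustrative only):

import numpy as np
alphas = np.logspace(-1, 1, 100)
print(alphas[0], alphas[-1], len(alphas))  # 0.1 10.0 100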

Look at .coef_: many of the weights are zero. Yes, Lasso gets rid of unnecessary features, as we can check below.
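To see exactly which columns were dropped, we can pair the feature names with the coefficients (assuming x is still the feature DataFrame from step 1):

dropped = [col for col, w in zip(x.columns, lasso_cv.coef_) if w == 0]
print(dropped)  # features whose weight Lasso shrank to exactly zero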

7. Ridge

Now we move on to Ridge.

from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1)
ridge.fit(ss_train, y_train)
print(ridge.score(ss_train, y_train))  # R2 on the train set

8. RidgeCV

We can run RidgeCV to find the best alpha, defining scoring as the R2 score.
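There is no snippet for this step in the original notes, so here is a minimal sketch mirroring the LassoCV call above (the alpha grid and cv=5 are assumptions):

from sklearn.linear_model import RidgeCV
ridge_cv = RidgeCV(alphas=np.logspace(-1, 1, 100), scoring="r2", cv=5)
ridge_cv = ridge_cv.fit(ss_train, y_train.values.ravel())
print(ridge_cv.alpha_)   # the best alpha found by cross-validation
print(ridge_cv.score(ss_train, y_train))  # R2 on the train set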

Here we are done with regularization. Our regularized models produce a lower MSE on the test set and on real data, and that’s the point.
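As a rough way to verify that claim, we can compare the test-set MSE of the plain and regularized models fitted above (ridge_cv comes from the sketch in step 8; the exact numbers will depend on your split):

for name, model in [("linear", lr), ("lasso", lasso_cv), ("ridge", ridge_cv)]:
    print(name, mean_squared_error(y_test, model.predict(ss_test)))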

Next episode is the epilogue of this series.

See ya there.

next: Note of data science training EP 14 END – Data scientists did their mistakes