Ridge regression (Tikhonov regularization): linear regression with an l2 regularization term ($\alpha \Sigma_i \theta_i^2$) added to the cost function
Lasso regression: linear regression with an l1 regularization term ($\alpha \Sigma_i |\theta_i|$) added to the cost function
Elastic net: linear regression with both l1 and l2 regularization terms added to the cost function
Remarks
This post is based on Hands-On Machine Learning with Scikit-Learn & TensorFlow (Aurélien Géron; Korean translation by 박해선, 한빛미디어).
1. Ridge regression
1) Cost function
\(J(\theta) = \sum_{i=1}^m (y^{(i)} - \hat{y}^{(i)})^2 + \alpha \sum_{i=1}^n\theta_i^2 \quad \text{(Bias } \theta_0 \text{ is not regularized)} \\ = ||\mathbf{y} - \mathbf{\hat{y}}||^2_2 + \alpha ||w||^2_2\)
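As a quick sanity check, here is a minimal NumPy sketch of this cost (the names are illustrative; X_b is assumed to carry a leading bias column of 1s):

import numpy as np

def ridge_cost(theta, X_b, y, alpha=1.0):
    """Sum of squared errors plus the l2 penalty; theta[0] (the bias) is excluded."""
    residuals = y - X_b @ theta  # X_b includes a leading bias column of 1s
    return np.sum(residuals ** 2) + alpha * np.sum(theta[1:] ** 2)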
2) Normal equation
$ \hat{\theta} = (X^TX + \alpha I')^{-1}X^Ty \quad \text{(} I' \text{ is the } (n+1) \times (n+1) \text{ identity matrix with its top-left (bias) entry set to 0)}$
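The closed-form solution is straightforward to write in NumPy; a minimal sketch under the same assumption that X_b carries a leading bias column of 1s (the function name is illustrative):

import numpy as np

def ridge_normal_equation(X_b, y, alpha=1.0):
    """Closed-form ridge solution; the top-left entry of I' is 0 so the bias is not penalized."""
    I_prime = np.eye(X_b.shape[1])
    I_prime[0, 0] = 0.0  # do not penalize the bias term theta_0
    return np.linalg.solve(X_b.T @ X_b + alpha * I_prime, X_b.T @ y)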
3) API function
from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=1, solver='cholesky')  # solver='saga': improved stochastic average gradient
ridge_reg.fit(X, y)  # closed-form solution (normal equation)
ridge_reg.predict(X_test)

from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(max_iter=1000, penalty='l2')  # the l2 penalty makes SGD equivalent to ridge
sgd_reg.fit(X, y.ravel())  # stochastic gradient descent
sgd_reg.predict(X_test)
2. Lasso regression
1) Cost function
\(J(\theta) = \frac{1}{2}\sum_{i=1}^m (y^{(i)} - \hat{y}^{(i)})^2 + \alpha \sum_{i=1}^n|\theta_i| \quad \text{(Bias } \theta_0 \text{ is not regularized)} \\ = \frac{1}{2}||\mathbf{y} - \mathbf{\hat{y}}||^2_2 + \alpha ||w||_1 \\\)
2) Ridge vs Lasso
Unlike ridge, lasso tends to drive the weights of the least important features all the way to zero; it thus performs feature selection automatically and yields a sparse model.
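A small illustration of this difference on toy data (the dataset and hyperparameters below are made up for demonstration):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(42)
X_toy = rng.randn(100, 5)  # 5 features, only the first is informative
y_toy = 3 * X_toy[:, 0] + 0.5 * rng.randn(100)

print(Ridge(alpha=1.0).fit(X_toy, y_toy).coef_)  # all five weights shrunk but nonzero
print(Lasso(alpha=0.1).fit(X_toy, y_toy).coef_)  # noise-feature weights typically driven exactly to 0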
3) API function
from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha=1)
lasso_reg.fit(X, y)  # coordinate descent
lasso_reg.predict(X_test)

from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(max_iter=1000, penalty='l1')  # the l1 penalty makes SGD behave like lasso
sgd_reg.fit(X, y.ravel())  # stochastic gradient descent
sgd_reg.predict(X_test)
3. Elastic net
1) Cost function
\(J(\theta) = \frac{1}{2}\sum_{i=1}^m (y^{(i)} - \hat{y}^{(i)})^2 + r \alpha \sum_{i=1}^n|\theta_i| + (1-r) \alpha \frac{1}{2}\sum_{i=1}^n\theta_i^2 \quad \text{(Bias } \theta_0 \text{ is not regularized)} \\ = \frac{1}{2}||\mathbf{y} - \mathbf{\hat{y}}||^2_2 + r \alpha ||w||_1 + (1-r) \alpha \frac{1}{2}||w||^2_2 \\\)
2) Ridge vs Lasso vs Elastic net
- Default choice: ridge
- Only a few features are likely to be useful: lasso / elastic net
- # features > # samples: elastic net (lasso can behave erratically here)
- Several features are strongly correlated: elastic net (same reason)
- Regularization that shrinks weight magnitudes can effectively ignore features with small scales, so all features should be brought to the same scale first (see the sketch below).
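A minimal sketch of such scaling with StandardScaler in a pipeline, assuming X, y, X_test are defined as in the earlier snippets:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Standardize features first so the penalty treats every weight on an equal footing
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
model.predict(X_test)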
3) API function
from sklearn.linear_model import ElasticNet
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)  # l1_ratio corresponds to r in the cost function above
elastic_net.fit(X, y)  # coordinate descent
elastic_net.predict(X_test)
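In practice alpha and l1_ratio are usually tuned by cross-validation; a sketch with ElasticNetCV (the candidate values below are illustrative):

from sklearn.linear_model import ElasticNetCV

elastic_net_cv = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5)  # searches an automatic grid of alphas per l1_ratio
elastic_net_cv.fit(X, y.ravel())
print(elastic_net_cv.alpha_, elastic_net_cv.l1_ratio_)  # selected hyperparameters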