Ridge regression (Tikhonov regularization): linear regression with an l2 regularization term ($\alpha \Sigma_i \theta_i^2$) added to the cost function
Lasso regression: linear regression with an l1 regularization term ($\alpha \Sigma_i |\theta_i|$) added to the cost function
Elastic net: linear regression with both l1 and l2 regularization terms added to the cost function
Remarks
This post is based on Hands-On Machine Learning with Scikit-Learn & TensorFlow (Aurélien Géron; Korean translation by 박해선, 한빛미디어).
1. Ridge regression
1) Cost function
\(J(\theta) = \sum_{i=1}^m (y^{(i)} - \hat{y}^{(i)})^2 + \alpha \sum_{i=1}^n\theta_i^2 \quad \text{(Bias } \theta_0 \text{ is not regularized)} \\ = ||\mathbf{y} - \mathbf{\hat{y}}||^2_2 + \alpha ||w||^2_2\)
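As a quick sanity check, here is a minimal NumPy sketch of this cost (the names are illustrative; X_b is assumed to carry a leading bias column of 1s):

import numpy as np

def ridge_cost(theta, X_b, y, alpha=1.0):
    """Sum of squared errors plus the l2 penalty; theta[0] (the bias) is excluded."""
    residuals = y - X_b @ theta  # X_b includes a leading bias column of 1s
    return np.sum(residuals ** 2) + alpha * np.sum(theta[1:] ** 2)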
2) Normal equation
$ \hat{\theta} = (X^TX + \alpha I')^{-1}X^Ty \quad \text{(} I' \text{ is the } (n+1) \times (n+1) \text{ identity matrix with its top-left (bias) entry set to 0)}$
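The closed-form solution is straightforward to write in NumPy; a minimal sketch under the same assumption that X_b carries a leading bias column of 1s (the function name is illustrative):

import numpy as np

def ridge_normal_equation(X_b, y, alpha=1.0):
    """Closed-form ridge solution; the top-left entry of I' is 0 so the bias is not penalized."""
    I_prime = np.eye(X_b.shape[1])
    I_prime[0, 0] = 0.0  # do not penalize the bias term theta_0
    return np.linalg.solve(X_b.T @ X_b + alpha * I_prime, X_b.T @ y)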
3) API function
from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=1, solver='cholesky')  # solver='saga': improved stochastic average gradient
ridge_reg.fit(X, y)  # closed-form solution (normal equation)
ridge_reg.predict(X_test)

from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(max_iter=1000, penalty='l2')  # the l2 penalty makes SGD equivalent to ridge
sgd_reg.fit(X, y.ravel())  # stochastic gradient descent
sgd_reg.predict(X_test)
2. Lasso regression
1) Cost function
\(J(\theta) = \frac{1}{2}\sum_{i=1}^m (y^{(i)} - \hat{y}^{(i)})^2 + \alpha \sum_{i=1}^n|\theta_i| \quad \text{(Bias } \theta_0 \text{ is not regularized)} \\ = \frac{1}{2}||\mathbf{y} - \mathbf{\hat{y}}||^2_2 + \alpha ||w||_1 \\\)
2) Ridge vs Lasso
Unlike ridge, lasso tends to drive the weights of the least important features all the way to zero; it thus performs feature selection automatically and yields a sparse model.
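A small illustration of this difference on toy data (the dataset and hyperparameters below are made up for demonstration):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(42)
X_toy = rng.randn(100, 5)  # 5 features, only the first is informative
y_toy = 3 * X_toy[:, 0] + 0.5 * rng.randn(100)

print(Ridge(alpha=1.0).fit(X_toy, y_toy).coef_)  # all five weights shrunk but nonzero
print(Lasso(alpha=0.1).fit(X_toy, y_toy).coef_)  # noise-feature weights typically driven exactly to 0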
3) API function
from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha=1)
lasso_reg.fit(X, y)  # coordinate descent
lasso_reg.predict(X_test)

from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(max_iter=1000, penalty='l1')  # the l1 penalty makes SGD behave like lasso
sgd_reg.fit(X, y.ravel())  # stochastic gradient descent
sgd_reg.predict(X_test)
3. Elastic net
1) Cost function
\(J(\theta) = \frac{1}{2}\sum_{i=1}^m (y^{(i)} - \hat{y}^{(i)})^2 + r \alpha \sum_{i=1}^n|\theta_i| + (1-r) \alpha \frac{1}{2}\sum_{i=1}^n\theta_i^2 \quad \text{(Bias } \theta_0 \text{ is not regularized)} \\ = \frac{1}{2}||\mathbf{y} - \mathbf{\hat{y}}||^2_2 + r \alpha ||w||_1 + (1-r) \alpha \frac{1}{2}||w||^2_2 \\\)
2) Ridge vs Lasso vs Elastic net
- Default choice: ridge
- Only a few features are likely to be useful: lasso / elastic net
- # features > # samples: elastic net (lasso can behave erratically here)
- Several features are strongly correlated: elastic net (same reason)
- Regularization that shrinks weight magnitudes can effectively ignore features with small scales, so all features should be brought to the same scale first (see the sketch below).
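A minimal sketch of such scaling with StandardScaler in a pipeline, assuming X, y, X_test are defined as in the earlier snippets:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Standardize features first so the penalty treats every weight on an equal footing
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
model.predict(X_test)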
3) API function
from sklearn.linear_model import ElasticNet
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)  # l1_ratio corresponds to r in the cost function above
elastic_net.fit(X, y)  # coordinate descent
elastic_net.predict(X_test)
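In practice alpha and l1_ratio are usually tuned by cross-validation; a sketch with ElasticNetCV (the candidate values below are illustrative):

from sklearn.linear_model import ElasticNetCV

elastic_net_cv = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5)  # searches an automatic grid of alphas per l1_ratio
elastic_net_cv.fit(X, y.ravel())
print(elastic_net_cv.alpha_, elastic_net_cv.l1_ratio_)  # selected hyperparameters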