Remarks
Scikit-learn에 구현된 model들을 정리한 글입니다.
Notation
- $n$: number of data
- $d$: dimension of features
- $c$: number of classes
- $X$: features
- $y$: target
- $\color{blue}{parameter}$: parameter of estimator
1. Linear regression
1.1 Linear regression(using SVD)
- Algorithm \(h(w) = Xw \\ Minimize \ L(w) ∝ ||y - h(w)||^2_2 \\ \hat w = (X'X)^{-1}X'y ≈ X^+y \\\)
- Time complexity: $O(nd^2 + d^3)$
Feature가 많은 경우 사용하기 어려움
1
2
3
4
5
6
7
8
9
10
11
12
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
# y = 1 * x_0 + 2 * x_1 + 3
y = np.dot(X, np.array([1, 2])) + 3
reg = LinearRegression(n_jobs=-1).fit(X, y)
print(reg.score(X, y)) # 1.0
print(reg.coef_) # array([1., 2.])
print(reg.intercept_) # 3.0
print(reg.predict(np.array([[3, 5]]))) # array([16.])
1.2 Linear regression(using Gradient Descent)
- Algorithm \(h(w) = Xw \\ Minimize \ L(w) ∝ ||y - h(w)||^2_2 \\ w^{(k+1)} ← w^{(k)} - {\color{blue}{\text{eta0}}} \nabla_wL(w) \quad (\text{more complicated in implementation})\\\)
- Gradient 기반 최적화 알고리즘은 모두 feature의 scale에 영향을 큰 영향을 받기 때문에, scaling이 필수!
- early stopping 사용 가능
$\color{blue}{\text{early_stopping=True}}$ (default=False)
($\color{blue}{\text{validation_fraction=0.1}}$: default validation size) - warm start 사용 가능
$\color{blue}{\text{warm_start=True}}$ (default=False)
1
2
3
4
5
6
7
8
9
10
11
12
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
n_samples, n_features = 10, 5
X = np.random.randn(n_samples, n_features)
y = np.random.randn(n_samples)
# Always scale the input. The most convenient way is to use a pipeline.
reg = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, eta0=0.01))
reg.fit(X, y)
1.3 Ridge regression(using cholesky decomposition)
- Algorithm \(h(w) = Xw \\ Minimize \ L(w) = ||y - h(w)||^2_2 + {\color{blue}{\text{alpha}}} ||w||^2_2 \quad (w_0 \text{(bias) is not regularized}) \\ \hat w = (X'X + {\color{blue}{\text{alpha}}} A)^{-1}X'Y \\ (A = \text{Identity matrix with 0 in the first diagonal element}(w_0))\)
- $w$ 자체에 값에 영향을 받기 때문에 feature의 scaling이 필수!
1
2
3
4
5
6
7
8
9
10
11
12
13
import numpy as np
from sklearn.linear_model import Ridge, SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
n_samples, n_features = 10, 5
X = np.random.randn(n_samples, n_features)
y = np.random.randn(n_samples)
model = Ridge(alpha=1.0)
# model = SGDRegressor() # default penalty: l2
full_model = make_pipeline(StandardScaler(), model)
full_model.fit(X, y)
1.4 Lasso regression(Least absolute shrinkage and selection operator)
- Algorithm \(h(w) = Xw \\ Minimize \ L(w) = \frac{1}{2n} ||y - h(w)||^2_2 + {\color{blue}{\text{alpha}}} ||w||_1 \quad (w_0 \text{(bias) is not regularized}) \\\)
- Feature selection(최대 $n$개), sparse model을 생성(${\color{blue}{\text{alpha}}}$가 클수록 sparsity가 강해짐)
- $w_i=0$ 인 부분에서 미분불가능하기 때문에(subgradient 사용) 학습 궤적이 진동함
- $w$ 자체에 값에 영향을 받기 때문에 feature의 scaling이 필수!
1
2
3
4
5
6
7
8
9
10
11
from sklearn.linear_model import Lasso, SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
model = Lasso(alpha=0.1) # use coordinate descent
# model = SGDRegressor(penalty='l1')
full_model = make_pipeline(StandardScaler(), model)
full_model.fit([[0,0], [1, 1], [2, 2]], [0, 1, 2])
print(model.coef_) # [0.85 0. ]
print(model.intercept_) # 0.15
1.5 Elastic net
- Algorithm \(h(w) = Xw \\ Minimize \ L(w) = \frac{1}{2n} ||y - h(w)||^2_2 + {\color{blue}{\text{alpha}}} ({\color{blue}{\text{l1_ratio}}} \cdot ||w||_1 + (1 - {\color{blue}{\text{l1_ratio}}}) \cdot \frac{1}{2} ||w||^2_2) \quad (w_0 \text{ is not regularized}) \\\)
- Ridge vs Lasso vs Elastic net
- Ridge regression: 기본 model
- Lasso regression: Feature selection이 필요, $d \leq n$
- Elastic net: Feature selection이 필요, $d > n$
- $w$ 자체에 값에 영향을 받기 때문에 feature의 scaling이 필수!
1
2
3
4
5
6
7
8
9
10
11
from sklearn.linear_model import Lasso, SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
model = Lasso(alpha=0.1) # use coordinate descent
# model = SGDRegressor(penalty='l1')
full_model = make_pipeline(StandardScaler(), model)
full_model.fit([[0,0], [1, 1], [2, 2]], [0, 1, 2])
print(model.coef_) # [0.85 0. ]
print(model.intercept_) # 0.15
2. Logistic regression
- Algorithm \(\hat y = \begin{cases} 0 & \quad \text{if } h(w) < 0.5 \\ 1 & \quad \text{if } h(w) \geq 0.5 \end{cases} \\ h(w) = p = \sigma(z) \\ z = Xw \quad (\text{a.k.a. logit, log-odds})\\ \sigma(z) = \frac{1}{1 + e^{-z}} \\ Minimize \ L(w) = \frac{1}{n}\sum_i \big[ y_i (-log(p_i)) + (1-y_i)(-log(1-p_i)) \big] \\ \nabla_wL(w) = \frac{1}{n} X'(P - Y) \\ w^{(k+1)} ← w^{(k)} - {\color{blue}{\text{eta0}}} \nabla_wL(w) \quad (\text{more complicated in implementation}) \\\)
- 기본적으론
penalty='l2'이나,penalty='elasticnet'으로 다음 parameter 사용가능- $\color{blue}{C (\propto \frac{1}{alpha})}$
- $\color{blue}{\text{l1_ratio}}$
1
2
3
4
5
6
7
8
9
10
11
12
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
X, y = load_iris(return_X_y=True)
y = (y == 2).astype(int)
clf = LogisticRegression(random_state=0, n_jobs=-1).fit(X, y)
clf.predict(X[:2, :]) # array([0, 0])
clf.predict_proba(X[:2, :])
# array([[9.99998823e-01, 1.17651820e-06],
# [9.99998354e-01, 1.64562311e-06]])
clf.score(X, y) # 0.97
3. Softmax regression(multinomial logistic regression)
- Algorithm \(\hat y = argmax_c h_c(w) \\ h_c(w) = p_c = \sigma(z_c) \\ z_c = Xw_c \quad (\text{a.k.a. logit, log-odds})\\ \sigma(z_c) = \frac{exp(z_c)}{\sum_c exp(z_c)} \\ Minimize \ L(w) = \frac{1}{n} \sum_i \sum_c \big[ y_{ic} (-log \hat p_{ic}) \big] \\ \nabla_wL(w) = \frac{1}{n} X'(P - Y) \\ w^{(k+1)} ← w^{(k)} - {\color{blue}{\text{eta0}}} \nabla_wL(w) \quad (\text{more complicated in implementation}) \\\)
1
2
3
4
5
6
7
8
9
10
11
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0, n_jobs=-1).fit(X, y)
clf.predict(X[:2, :]) # array([0, 0])
clf.predict_proba(X[:2, :])
# array([[9.8...e-01, 1.8...e-02, 1.4...e-08],
# [9.7...e-01, 2.8...e-02, ...e-08]])
clf.score(X, y) # 0.97
4. Polynomial regression
sklearn.preprocessing.PolynomialFeatures를 사용e.g. degree=2 (a, b) → (1, a, b, a^2, ab, b^2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
X = np.arange(6).reshape(3, 2) # features: a, b
y = np.arange(3)
poly = PolynomialFeatures(2) # degree=2
poly.fit_transform(X) # features: 1, a, b, a^2, ab, b^2
# array([[ 1., 0., 1., 0., 0., 1.],
# [ 1., 2., 3., 4., 6., 9.],
# [ 1., 4., 5., 16., 20., 25.]])
poly = PolynomialFeatures(interaction_only=True)
X_poly = poly.fit_transform(X) # features: 1, a, b, ab
X_poly
# array([[ 1., 0., 1., 0.],
# [ 1., 2., 3., 6.],
# [ 1., 4., 5., 20.]])
reg = make_pipeline(PolynomialFeatures(), StandardScaler(), SGDRegressor())
reg.fit(X, y)
5. Support Vector Machine(SVM)
5.1 Classification
- 중간 크기의 데이터셋에 대한 복잡한 분류 문제에 적합
- Large margin classification 이라 불리기도 함
- Feature의 scaling이 필수!
kernel=linear의 경우,sklearn.svm.LinearSVC를 사용LinearSVC: $O(n \times d)$SVC: $O(n^2 \times d)$ ~ $O(n^3 \times d)$
- Parameters
- $\color{blue}kernel$:
poly,rbf(default),sigmoid, etcpoly: $K(a, b) = ({\color{blue}gamma} \cdot a’b + {\color{blue}coef0})^{\color{blue}degree}$rbf: $K(a, b) = exp(-{\color{blue}gamma} \cdot \mid \mid a-b \mid \mid^2)$sigmoid: $K(a, b) = tanh({\color{blue}gamma} \cdot a’b + {\color{blue}coef0})$
- $\color{blue}C$: 규제강도와 반비례
- $\color{blue}gamma$: kernel coefficient (
scale(default) orauto), 규제강도와 반비례scale: $\frac{1}{d \times Var(X)}$auto: $\frac{1}{d}$
- $\color{blue}probability=True$:
predict_proba()를 지원 (cross validation)
- $\color{blue}kernel$:
- Algorithm \(Minimize \ L(w) = \frac{1}{2} w'w + {\color{blue}C} \sum_i \zeta_i \\ s.t. \ y_i(w' \phi(x_i) + b) \geq 1 - \zeta_i, \quad \zeta_i \geq 0, \ i=1, ..., n \\ (K(a, b) = \phi(a)' \phi(b))\)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC, SVC
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])
model = LinearSVC()
# model = SVC(kernel='poly', degree=3, coef0=1, C=5, probability=True)
# model = SVC(kernel='rbf', gamma=5, C=0.001)
full_model = make_pipeline(StandardScaler(), model)
full_model.fit(X, y)
full_model.predict([[-0.8, -1]])
5.2 Regression
- Feature의 scaling이 필수!
- Parameters
- $\color{blue}epsilon$: Margin의 폭
- Algorithm \(Minimize \ L(w) = \frac{1}{2} w'w + {\color{blue}C} \sum_i (\zeta_i + \zeta_i^*) \\ s.t. -({\color{blue}epsilon} + \zeta_i^*) \leq y_i - (w' \phi(x_i) + b) \leq {\color{blue}epsilon} + \zeta_i, \\ \zeta_i, \zeta_i^* \geq 0, \ i=1, ..., n\)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
from sklearn.svm import LinearSVR, SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_regression
X, y = make_regression(n_features=4, random_state=0)
model = LinearSVR(random_state=0, tol=1e-5)
# model = SVR(C=1.0, epsilon=0.2)
regr = make_pipeline(StandardScaler(), model)
regr.fit(X, y)
# Pipeline(steps=[('standardscaler', StandardScaler()),
# ('linearsvr', LinearSVR(random_state=0, tol=1e-05))])
print(regr.named_steps['linearsvr'].coef_) # [18.582... 27.023... 44.357... 64.522...]
print(regr.named_steps['linearsvr'].intercept_) # [-4...]
print(regr.predict([[0, 0, 0, 0]])) # [-2.384...]
6. Decision Tree
6.1 Classification
- Algorithm \(\text{Split node into two node(CART) with feature } k, \text{threshold } t_k, \\ Minimize \ L(k, t_k) = \frac{n_\text{left}}{n_\text{left} + n_\text{right}} G_\text{left} + \frac{n_\text{right}}{n_\text{left} + n_\text{right}} G_\text{right} \\ \quad \\ G(\text{Gini index}) = 1 - \sum_c p_c^2 \\ G(\text{Entropy}) = \sum_c \bigg[ p_c \log \frac{1}{p_c} \bigg] \\\)
- Parameters
min_값을 올리고,max_값을 내리면 regularization이 강해짐criterion: 분할 기준값 (gini(default),entropy)min_samples_split: 분할될 수 있는 node의 최소 sample 개수min_samples_leaf: leaf node가 가져야 하는 최소 sample 개수min_weight_fraction_leaf:min_samples_leaf의 fraction versionmax_leaf_nodes: leaf node의 최대 개수max_features: 사용할 feature의 최대 개수
- 단점
- 축과 수직한 decision boundary만을 생성하기 때문에, 데이터의 회전에 민감함
→ PCA 사용 - 학습 데이터에 있는 작은 변화에도 매우 민감
→ Ensemble(Random forest) 사용
- 축과 수직한 decision boundary만을 생성하기 때문에, 데이터의 회전에 민감함
1
2
3
4
5
6
7
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)
iris = load_iris()
cross_val_score(clf, iris.data, iris.target, cv=10)
6.2 Regression
- Algorithm \(\text{Split node into two node(CART) with feature } k, \text{threshold } t_k, \\ Minimize \ L(k, t_k) = \frac{n_\text{left}}{n_\text{left} + n_\text{right}} G_\text{left} + \frac{n_\text{right}}{n_\text{left} + n_\text{right}} G_\text{right} \\ \quad \\ G(\text{MSE}) = \sum_{i \in \text{node}} \big( \hat y_\text{node} - y_i \big)^2 \\ \hat y_\text{node} = \frac{1}{n_\text{node}} \sum_{i \in \text{node}} y_i\)
- Parameters
criterion: 분할 기준값 (mse(default),map,poisson, etc)
1
2
3
4
5
6
7
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
X, y = load_diabetes(return_X_y=True)
regressor = DecisionTreeRegressor(random_state=0)
cross_val_score(regressor, X, y, cv=10)
7. Ensemble model
- Weak learner도 충분히 많고 다양하다면 ensemble은 strong learner가 될 수 있음
e.g. 완벽하게 독립적이고 오차에 상관관계가 없는 accuracy 51% classifier 1000개에 대하여 hard voting을 적용하면, 약 75%의 정확도를 기대할 수 있음
1
2
3
from scipy.stats import binom
1 - binom.cdf(499, 1000, 0.51) # 0.75
- 예측기들이 서로 독립적일 때, 최고의 ensemble 성능을 발휘
→ 서로 다른 알고리즘으로 학습 (서로 다른 종류의 오차를 생성) - 일반적으로 ensemble의 결과는 base model과 비교하여, bias는 비슷, variance는 감소함
→ 학습 데이터에 대한 오차는 비슷하나, decision boundary는 덜 불규칙함(매끄러움)
→ 일반화 성능 향상!
7.1 Voting
- Algorithm
- Hard voting:
predict()에 대한 다수결 투표 (classification) - Soft voting:
predict_proba()에 대한 argmax (classification) - Averaging:
predict()의 평균값 (regression)
- Hard voting:
- 일반적으로 확률 기반 방식인 soft voting의 성능이 더 좋음
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
clf1 = LogisticRegression(multi_class='multinomial', random_state=1)
clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
clf3 = GaussianNB()
# clf4 = SVC(probability=True)
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])
eclf1 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='hard')
eclf1 = eclf1.fit(X, y)
print(eclf1.predict(X)) # [1 1 1 2 2 2]
np.array_equal(eclf1.named_estimators_.lr.predict(X), eclf1.named_estimators_['lr'].predict(X)) # True
eclf2 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='soft')
eclf2 = eclf2.fit(X, y)
print(eclf2.predict(X)) # [1 1 1 2 2 2]
eclf3 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='soft', weights=[2,1,1], flatten_transform=True)
eclf3 = eclf3.fit(X, y)
print(eclf3.predict(X)) # [1 1 1 2 2 2]
print(eclf3.transform(X).shape) # (6, 6)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import VotingRegressor
X, y = load_diabetes(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
base_models = [
LinearRegression(n_jobs=-1),
SVR(),
RandomForestRegressor(n_jobs=-1),
KNeighborsRegressor(n_jobs=-1),
]
model = VotingRegressor([(model.__class__.__name__, model) for model in base_models], n_jobs=-1)
model.fit(X, y)
7.2 Bagging(Bootstrap aggregating)
- Bagging vs Pasting
Bootstrapping(복원추출)은 학습 데이터의 subset에 다양성을 증가시킴
→ Bias: bagging > pasting
→ Variance: bagging < pasting (예측기들의 상관관계를 감소) - oob score
1) 학습 데이터의 전체 개수만큼 복원추출 시, 평균적으로 각 예측기는 학습 데이터의 63.2%만 사용
→ $n$개의 학습 데이터 중 한 sample이 $n$번 복원추출하는 동안 선택되지 않을 확률은 약 36.8% \(\lim_{n → \infty} \bigg( 1 - \frac{1}{n} \bigg)^n \\ = \lim_{t → 0} \bigg[ \bigg( 1 + t \bigg)^\frac{1}{t} \bigg]^{-1} \\ = e^{-1} \\ ≈ 0.368\) 2) 각 예측기마다 사용되지 않은 37%의 데이터를 out-of-bag(oob) sample이라 부름 3)oob_score=Trueoption을 통해 각 예측기의 oob 평가의 평균값을 얻을 수 있음 (validation score) \
1
2
3
4
5
6
from sklearn.ensemble import BaggingClassifier
clf = BaggingClassifier(oob_score=True, ...)
clf.fit(X, y)
print(clf.oob_score_)
print(clf.oob_decision_function_)
- Parameters
max_samples: subset의 크기bootstrap: bagging or pasting
-
Feature sampling 1)
max_features,bootstrap_featuresoption을 통해 feature sampling 가능 2) Random patches method: data, feature 모두 sampling하는 방법
Random subspace method: feature만 sampling하는 방법 - Algorithm
학습 데이터의 subset을 무작위로 구성하여 모델들을 학습하고hard(default, classification) orsoft(classification) voting or averaging(default, regression) - Soft Voting
모든 estimator가predict_proba()method를 가지고 있다면softvoting을 수행
1
2
3
4
5
6
7
8
9
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_features=4,
n_informative=2, n_redundant=0,
random_state=0, shuffle=False)
clf = BaggingClassifier(base_estimator=SVC(), n_estimators=10, random_state=0).fit(X, y)
clf.oob_score_
1
2
3
4
5
6
from sklearn.svm import SVR
from sklearn.ensemble import BaggingRegressor
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=100, n_features=4, n_informative=2, n_targets=1, random_state=0, shuffle=False)
regr = BaggingRegressor(base_estimator=SVR(), n_estimators=10, random_state=0).fit(X, y)
7.3 Pasting
- Algorithm
중복을 허용하지 않고 학습 데이터의 subset을 생성하여 예측값을 집계 - Parameters
Bagging과 동일한 API(sklearn.ensemble.Bagging-)를 공유하며,bootstrap=Falseoption을 통해 pasting을 사용할 수 있음
1
2
3
4
5
6
from sklearn.svm import SVR
from sklearn.ensemble import BaggingRegressor
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=100, n_features=4, n_informative=2, n_targets=1, random_state=0, shuffle=False)
regr = BaggingRegressor(base_estimator=SVR(), n_estimators=10, random_state=0, bootstrap=False).fit(X, y)
7.4 Boosting
- Algorithm
Weak learner를 sequential하게 연결하고 각 learner가 이전의 learner를 보완하여 strong learner를 만드는 ensemble algorithm
7.4.1 Adaboost(Adaptive boosting)
- Algorithm
- 이전 model이 underfitting 했던 sample의 가중치를 더 높여 학습
- sample의 가중치: 큰 가중치를 가진 sample은 개수를 늘리는 것으로 간주할 수 있음
- 학습된 model들의 예측값에 가중치를 반영하여 최종 예측값을 반환(voting or average)
- 이전 model이 underfitting 했던 sample의 가중치를 더 높여 학습
- 기반 예측기(weak learner)
- 일반적으로,
max_depth=1decision tree가 사용 - SVM은 속도가 느리고 adaboost 알고리즘과 함께 사용했을 때 불안정한 경향이 있음
- 일반적으로,
- Parameters
learning_rate: 가중치 update의 세기 (GD의 learning rate와 유사)
sklearn.ensemble.AdaBoostClassifier: Classification
sklearn.ensemble.AdaBoostRegressor: Regression
1
2
3
4
5
6
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=4, n_informative=2, n_redundant=0, random_state=0, shuffle=False)
clf = AdaBoostClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
7.4.2 Gradient boosting
- Algorithm
이전 예측기의 residual error를 input으로 학습하여 learner를 보강 \(\text{Input}: X, y \\ \text{Model }1: \hat y_1 = f_1(X) \quad → \quad r^{(1)} = y - \hat y_1 \\ \text{Model }2: \hat r_2^{(1)} = f_2(X) \quad → \quad r^{(2)} = r^{(1)} - \hat r_2^{(1)} \\ \vdots \\ \text{Model }k: \hat r_k^{(k-1)} = f_k(X) \quad → \quad r^{(k)} = r^{(k-1)} - \hat r_k^{(k-1)} \\ \quad \\ \text{Prediction}: \hat y = \sum_i f_i(X)\) - 기반 예측기(weak learner)
Decision tree로 고정 (classifier, regressor) - Parameters
learning_rate: 각 tree의 기여 정도subsample: 각 tree가 학습할 때 사용할 학습 데이터의 비율
- Shrinkage
learning_rate로 낮은 값을 사용하면 많은 tree가 필요하지만, 일반적으로 예측의 성능이 향상됨 이를 regularization의 일종인, shrinkage(축소) 라 한다. - Early stopping
staged_predict()(각 훈련 단계마다 학습된 모델의 예측값에 대한 iterator를 반환)을 사용하여 간단히 구현가능 - Stochastic gradient boosting
subsample을 작은 값으로 사용 시, 학습 속도 증가, bias 감소, variance 상승- 이러한 기법을 stochastic gradient boosting이라 한다.
- XGBoost
최적화된 gradient boosting의 구현 package
1
2
3
4
5
6
7
8
9
10
11
12
13
14
import numpy as np
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
X, y = make_hastie_10_2(random_state=0)
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0).fit(X_train, y_train)
scores = [accuracy_score(y_test, y_pred) for y_pred in clf.staged_predict(X_test)]
bst_n_estimators = np.argmax(scores) + 1
best_clf = GradientBoostingClassifier(n_estimators=bst_n_estimators, learning_rate=1.0, max_depth=1, random_state=0)
7.5 Stacking(Stacked generalization)
- Algorithm
- 예측기(base model)들의 예측값들을 input으로 받는 meta-learner(blender)를 사용
- Base model들을 학습시키기 위한 데이터와 meta-learner를 학습시키기 위한 데이터가 구별되어야 함
- Out-of-fold: k-fold cross validation 사용 (stacking)
- Hold-out: 데이터를 두 개의 subset으로 분리하여 사용 (blending)
- 여러 개의 blender를 layer로 쌓아서 학습시키는 것도 가능
from sklearn.ensemble import StackingClassifier: Classifier
from sklearn.ensemble import StackingRegressor: Regressor
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import StackingClassifier
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
estimators = [
('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
('svr', make_pipeline(StandardScaler(), LinearSVC(random_state=42)))
]
clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
clf.fit(X_train, y_train).score(X_test, y_test)
8. Bagging with Decision Tree
8.1 Random forest
- Algorithm
- Decision tree에 bagging(or pasting) 적용
- 전체 feature들 중 무작위로 선택한 feature 후보 중에서 최적의 feature를 선택하는 무작위성 추가(random)
- $\forall k \in K L(k, t_k) \quad \mid K \mid = {\color{blue}{\text{max_features}}}$
- Bias 증가, Variance 감소
DecisionTree-와 마찬가지로feature_importances_계산 가능
몇 개의 feature들만 사용하는 decision tree와는 달리 random forest는 대부분의 feature을 평가할 수 있음
1
2
3
4
5
6
7
from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)
for feat, score in zip(features, rnd_clf.feature_importances_):
print(name, score)
rnd_clf.predict(X_test)
8.2 Extra tree(Extra random tree)
- Algorithm
무작위로 feature($k$)를 선택(random forest)할 뿐만 아니라, threshold($t_k$)까지 무작위로 선택한 후, 최상의 분할을 선택
→ Optimal $t_k$를 찾는 과정이 없어져 학습 속도가 매우 빠름 - Bias 증가, Variance 감소
sklearn.ensemble.RandomForest와 동일한 option으로 사용가능sklearn.ensemble.ExtraTrees-: ensemble modelsklearn.ensemble.ExtraTree-: single model
1
2
3
4
5
6
7
from sklearn.ensemble import ExtraTreesClassifier
rnd_clf = ExtraTreesClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)
for feat, score in zip(features, rnd_clf.feature_importances_):
print(name, score)
rnd_clf.predict(X_test)
PREVIOUSEtc