Sklearn Estimators

Remarks

Scikit-learn에 구현된 model들을 정리한 글입니다.

Notation

$n$: number of data
$d$: dimension of features
$c$: number of classes
$X$: features
$y$: target
$\color{blue}{parameter}$: parameter of estimator

1. Linear regression

1.1 Linear regression(using SVD)

Algorithm $h(w) = Xw \\ Minimize \ L(w) ∝ ||y - h(w)||^2_2 \\ \hat w = (X'X)^{-1}X'y ≈ X^+y \\$
Time complexity: $O(nd^2 + d^3)$
Feature가 많은 경우 사용하기 어려움

1
2
3
4
5
6
7
8
9
10
11
12
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
# y = 1 * x_0 + 2 * x_1 + 3
y = np.dot(X, np.array([1, 2])) + 3
reg = LinearRegression(n_jobs=-1).fit(X, y)

print(reg.score(X, y))  # 1.0
print(reg.coef_)        # array([1., 2.])
print(reg.intercept_)   # 3.0
print(reg.predict(np.array([[3, 5]])))  # array([16.])

1.2 Linear regression(using Gradient Descent)

Algorithm $h(w) = Xw \\ Minimize \ L(w) ∝ ||y - h(w)||^2_2 \\ w^{(k+1)} ← w^{(k)} - {\color{blue}{\text{eta0}}} \nabla_wL(w) \quad (\text{more complicated in implementation})\\$
Gradient 기반 최적화 알고리즘은 모두 feature의 scale에 영향을 큰 영향을 받기 때문에, scaling이 필수!
early stopping 사용 가능
$\color{blue}{\text{early_stopping=True}}$ (default=False)
($\color{blue}{\text{validation_fraction=0.1}}$: default validation size)
warm start 사용 가능
$\color{blue}{\text{warm_start=True}}$ (default=False)

1
2
3
4
5
6
7
8
9
10
11
12
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

n_samples, n_features = 10, 5
X = np.random.randn(n_samples, n_features)
y = np.random.randn(n_samples)

# Always scale the input. The most convenient way is to use a pipeline.
reg = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, eta0=0.01))
reg.fit(X, y)

1.3 Ridge regression(using cholesky decomposition)

Algorithm $h(w) = Xw \\ Minimize \ L(w) = ||y - h(w)||^2_2 + {\color{blue}{\text{alpha}}} ||w||^2_2 \quad (w_0 \text{(bias) is not regularized}) \\ \hat w = (X'X + {\color{blue}{\text{alpha}}} A)^{-1}X'Y \\ (A = \text{Identity matrix with 0 in the first diagonal element}(w_0))$
$w$ 자체에 값에 영향을 받기 때문에 feature의 scaling이 필수!

1
2
3
4
5
6
7
8
9
10
11
12
13
import numpy as np
from sklearn.linear_model import Ridge, SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

n_samples, n_features = 10, 5
X = np.random.randn(n_samples, n_features)
y = np.random.randn(n_samples)

model = Ridge(alpha=1.0)
# model = SGDRegressor()  # default penalty: l2
full_model = make_pipeline(StandardScaler(), model)
full_model.fit(X, y)

1.4 Lasso regression(Least absolute shrinkage and selection operator)

Algorithm $h(w) = Xw \\ Minimize \ L(w) = \frac{1}{2n} ||y - h(w)||^2_2 + {\color{blue}{\text{alpha}}} ||w||_1 \quad (w_0 \text{(bias) is not regularized}) \\$
Feature selection(최대 $n$개), sparse model을 생성(${\color{blue}{\text{alpha}}}$가 클수록 sparsity가 강해짐)
$w_i=0$ 인 부분에서 미분불가능하기 때문에(subgradient 사용) 학습 궤적이 진동함
$w$ 자체에 값에 영향을 받기 때문에 feature의 scaling이 필수!

1
2
3
4
5
6
7
8
9
10
11
from sklearn.linear_model import Lasso, SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model = Lasso(alpha=0.1)  # use coordinate descent
# model = SGDRegressor(penalty='l1')
full_model = make_pipeline(StandardScaler(), model)

full_model.fit([[0,0], [1, 1], [2, 2]], [0, 1, 2])
print(model.coef_)       # [0.85 0.  ]
print(model.intercept_)  # 0.15

1.5 Elastic net

Algorithm $h(w) = Xw \\ Minimize \ L(w) = \frac{1}{2n} ||y - h(w)||^2_2 + {\color{blue}{\text{alpha}}} ({\color{blue}{\text{l1_ratio}}} \cdot ||w||_1 + (1 - {\color{blue}{\text{l1_ratio}}}) \cdot \frac{1}{2} ||w||^2_2) \quad (w_0 \text{ is not regularized}) \\$
Ridge vs Lasso vs Elastic net
1. Ridge regression: 기본 model
2. Lasso regression: Feature selection이 필요, $d \leq n$
3. Elastic net: Feature selection이 필요, $d > n$
$w$ 자체에 값에 영향을 받기 때문에 feature의 scaling이 필수!

1
2
3
4
5
6
7
8
9
10
11
from sklearn.linear_model import Lasso, SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model = Lasso(alpha=0.1)  # use coordinate descent
# model = SGDRegressor(penalty='l1')
full_model = make_pipeline(StandardScaler(), model)
full_model.fit([[0,0], [1, 1], [2, 2]], [0, 1, 2])

print(model.coef_)       # [0.85 0.  ]
print(model.intercept_)  # 0.15

2. Logistic regression

Algorithm $\hat y = \begin{cases} 0 & \quad \text{if } h(w) < 0.5 \\ 1 & \quad \text{if } h(w) \geq 0.5 \end{cases} \\ h(w) = p = \sigma(z) \\ z = Xw \quad (\text{a.k.a. logit, log-odds})\\ \sigma(z) = \frac{1}{1 + e^{-z}} \\ Minimize \ L(w) = \frac{1}{n}\sum_i \big[ y_i (-log(p_i)) + (1-y_i)(-log(1-p_i)) \big] \\ \nabla_wL(w) = \frac{1}{n} X'(P - Y) \\ w^{(k+1)} ← w^{(k)} - {\color{blue}{\text{eta0}}} \nabla_wL(w) \quad (\text{more complicated in implementation}) \\$
기본적으론 penalty='l2' 이나, penalty='elasticnet'으로 다음 parameter 사용가능
- $\color{blue}{C (\propto \frac{1}{alpha})}$
- $\color{blue}{\text{l1_ratio}}$

1
2
3
4
5
6
7
8
9
10
11
12
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
y = (y == 2).astype(int)
clf = LogisticRegression(random_state=0, n_jobs=-1).fit(X, y)

clf.predict(X[:2, :])  # array([0, 0])
clf.predict_proba(X[:2, :])
# array([[9.99998823e-01, 1.17651820e-06],
#        [9.99998354e-01, 1.64562311e-06]])
clf.score(X, y)  # 0.97

3. Softmax regression(multinomial logistic regression)

Algorithm $\hat y = argmax_c h_c(w) \\ h_c(w) = p_c = \sigma(z_c) \\ z_c = Xw_c \quad (\text{a.k.a. logit, log-odds})\\ \sigma(z_c) = \frac{exp(z_c)}{\sum_c exp(z_c)} \\ Minimize \ L(w) = \frac{1}{n} \sum_i \sum_c \big[ y_{ic} (-log \hat p_{ic}) \big] \\ \nabla_wL(w) = \frac{1}{n} X'(P - Y) \\ w^{(k+1)} ← w^{(k)} - {\color{blue}{\text{eta0}}} \nabla_wL(w) \quad (\text{more complicated in implementation}) \\$

1
2
3
4
5
6
7
8
9
10
11
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0, n_jobs=-1).fit(X, y)

clf.predict(X[:2, :])  # array([0, 0])
clf.predict_proba(X[:2, :])
# array([[9.8...e-01, 1.8...e-02, 1.4...e-08],
#        [9.7...e-01, 2.8...e-02, ...e-08]])
clf.score(X, y)  # 0.97

4. Polynomial regression

sklearn.preprocessing.PolynomialFeatures를 사용

e.g. degree=2
(a, b) → (1, a, b, a^2, ab, b^2)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

X = np.arange(6).reshape(3, 2)  # features: a, b
y = np.arange(3)
poly = PolynomialFeatures(2)  # degree=2
poly.fit_transform(X)  # features: 1, a, b, a^2, ab, b^2
# array([[ 1.,  0.,  1.,  0.,  0.,  1.],
#        [ 1.,  2.,  3.,  4.,  6.,  9.],
#        [ 1.,  4.,  5., 16., 20., 25.]])
poly = PolynomialFeatures(interaction_only=True)
X_poly = poly.fit_transform(X)  # features: 1, a, b, ab
X_poly
# array([[ 1.,  0.,  1.,  0.],
#        [ 1.,  2.,  3.,  6.],
#        [ 1.,  4.,  5., 20.]])
reg = make_pipeline(PolynomialFeatures(), StandardScaler(), SGDRegressor())
reg.fit(X, y)

5. Support Vector Machine(SVM)

5.1 Classification

중간 크기의 데이터셋에 대한 복잡한 분류 문제에 적합
Large margin classification 이라 불리기도 함
Feature의 scaling이 필수!
kernel=linear의 경우, sklearn.svm.LinearSVC를 사용
- LinearSVC: $O(n \times d)$
- SVC: $O(n^2 \times d)$ ~ $O(n^3 \times d)$
Parameters
- $\color{blue}kernel$: poly, rbf(default), sigmoid, etc
  - poly: $K(a, b) = ({\color{blue}gamma} \cdot a’b + {\color{blue}coef0})^{\color{blue}degree}$
  - rbf: $K(a, b) = exp(-{\color{blue}gamma} \cdot \mid \mid a-b \mid \mid^2)$
  - sigmoid: $K(a, b) = tanh({\color{blue}gamma} \cdot a’b + {\color{blue}coef0})$
- $\color{blue}C$: 규제강도와 반비례
- $\color{blue}gamma$: kernel coefficient (scale(default) or auto), 규제강도와 반비례
  - scale: $\frac{1}{d \times Var(X)}$
  - auto: $\frac{1}{d}$
- $\color{blue}probability=True$: predict_proba()를 지원 (cross validation)
Algorithm $Minimize \ L(w) = \frac{1}{2} w'w + {\color{blue}C} \sum_i \zeta_i \\ s.t. \ y_i(w' \phi(x_i) + b) \geq 1 - \zeta_i, \quad \zeta_i \geq 0, \ i=1, ..., n \\ (K(a, b) = \phi(a)' \phi(b))$

1
2
3
4
5
6
7
8
9
10
11
12
13
14
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC, SVC

X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])

model = LinearSVC()
# model      = SVC(kernel='poly', degree=3, coef0=1, C=5, probability=True)
# model      = SVC(kernel='rbf', gamma=5, C=0.001)
full_model = make_pipeline(StandardScaler(), model)
full_model.fit(X, y)
full_model.predict([[-0.8, -1]])

5.2 Regression

Feature의 scaling이 필수!
Parameters
- $\color{blue}epsilon$: Margin의 폭
Algorithm $Minimize \ L(w) = \frac{1}{2} w'w + {\color{blue}C} \sum_i (\zeta_i + \zeta_i^*) \\ s.t. -({\color{blue}epsilon} + \zeta_i^*) \leq y_i - (w' \phi(x_i) + b) \leq {\color{blue}epsilon} + \zeta_i, \\ \zeta_i, \zeta_i^* \geq 0, \ i=1, ..., n$

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
from sklearn.svm import LinearSVR, SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_regression

X, y = make_regression(n_features=4, random_state=0)
model = LinearSVR(random_state=0, tol=1e-5)
# model = SVR(C=1.0, epsilon=0.2)
regr = make_pipeline(StandardScaler(), model)

regr.fit(X, y)
# Pipeline(steps=[('standardscaler', StandardScaler()),
#                 ('linearsvr', LinearSVR(random_state=0, tol=1e-05))])

print(regr.named_steps['linearsvr'].coef_)       # [18.582... 27.023... 44.357... 64.522...]
print(regr.named_steps['linearsvr'].intercept_)  # [-4...]
print(regr.predict([[0, 0, 0, 0]]))              # [-2.384...]

6. Decision Tree

6.1 Classification

Algorithm $\text{Split node into two node(CART) with feature } k, \text{threshold } t_k, \\ Minimize \ L(k, t_k) = \frac{n_\text{left}}{n_\text{left} + n_\text{right}} G_\text{left} + \frac{n_\text{right}}{n_\text{left} + n_\text{right}} G_\text{right} \\ \quad \\ G(\text{Gini index}) = 1 - \sum_c p_c^2 \\ G(\text{Entropy}) = \sum_c \bigg[ p_c \log \frac{1}{p_c} \bigg] \\$
Parameters min_ 값을 올리고, max_ 값을 내리면 regularization이 강해짐
- criterion: 분할 기준값 (gini(default), entropy)
- min_samples_split: 분할될 수 있는 node의 최소 sample 개수
- min_samples_leaf: leaf node가 가져야 하는 최소 sample 개수
- min_weight_fraction_leaf: min_samples_leaf의 fraction version
- max_leaf_nodes: leaf node의 최대 개수
- max_features: 사용할 feature의 최대 개수
단점
1. 축과 수직한 decision boundary만을 생성하기 때문에, 데이터의 회전에 민감함
  → PCA 사용
2. 학습 데이터에 있는 작은 변화에도 매우 민감
  → Ensemble(Random forest) 사용

1
2
3
4
5
6
7
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=0)
iris = load_iris()
cross_val_score(clf, iris.data, iris.target, cv=10)

6.2 Regression

Algorithm $\text{Split node into two node(CART) with feature } k, \text{threshold } t_k, \\ Minimize \ L(k, t_k) = \frac{n_\text{left}}{n_\text{left} + n_\text{right}} G_\text{left} + \frac{n_\text{right}}{n_\text{left} + n_\text{right}} G_\text{right} \\ \quad \\ G(\text{MSE}) = \sum_{i \in \text{node}} \big( \hat y_\text{node} - y_i \big)^2 \\ \hat y_\text{node} = \frac{1}{n_\text{node}} \sum_{i \in \text{node}} y_i$
Parameters
- criterion: 분할 기준값 (mse(default), map, poisson, etc)

1
2
3
4
5
6
7
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
regressor = DecisionTreeRegressor(random_state=0)
cross_val_score(regressor, X, y, cv=10)

7. Ensemble model

Weak learner도 충분히 많고 다양하다면 ensemble은 strong learner가 될 수 있음
e.g. 완벽하게 독립적이고 오차에 상관관계가 없는 accuracy 51% classifier 1000개에 대하여 hard voting을 적용하면, 약 75%의 정확도를 기대할 수 있음

1
2
3
    from scipy.stats import binom
    1 - binom.cdf(499, 1000, 0.51)  # 0.75

예측기들이 서로 독립적일 때, 최고의 ensemble 성능을 발휘
→ 서로 다른 알고리즘으로 학습 (서로 다른 종류의 오차를 생성)
일반적으로 ensemble의 결과는 base model과 비교하여, bias는 비슷, variance는 감소함
→ 학습 데이터에 대한 오차는 비슷하나, decision boundary는 덜 불규칙함(매끄러움)
→ 일반화 성능 향상!

7.1 Voting

Algorithm
- Hard voting: predict()에 대한 다수결 투표 (classification)
- Soft voting: predict_proba()에 대한 argmax (classification)
- Averaging: predict()의 평균값 (regression)
일반적으로 확률 기반 방식인 soft voting의 성능이 더 좋음

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC

clf1 = LogisticRegression(multi_class='multinomial', random_state=1)
clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
clf3 = GaussianNB()
# clf4 = SVC(probability=True)

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

eclf1 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='hard')
eclf1 = eclf1.fit(X, y)
print(eclf1.predict(X))  # [1 1 1 2 2 2]

np.array_equal(eclf1.named_estimators_.lr.predict(X), eclf1.named_estimators_['lr'].predict(X))  # True

eclf2 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='soft')
eclf2 = eclf2.fit(X, y)
print(eclf2.predict(X))  # [1 1 1 2 2 2]

eclf3 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='soft', weights=[2,1,1], flatten_transform=True)
eclf3 = eclf3.fit(X, y)
print(eclf3.predict(X))  # [1 1 1 2 2 2]
print(eclf3.transform(X).shape)  # (6, 6)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import VotingRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val  = scaler.transform(X_val)

base_models = [
    LinearRegression(n_jobs=-1),
    SVR(),
    RandomForestRegressor(n_jobs=-1),
    KNeighborsRegressor(n_jobs=-1),
]
model = VotingRegressor([(model.__class__.__name__, model) for model in base_models], n_jobs=-1)
model.fit(X, y)

7.2 Bagging(Bootstrap aggregating)

Bagging vs Pasting
Bootstrapping(복원추출)은 학습 데이터의 subset에 다양성을 증가시킴
→ Bias: bagging > pasting
→ Variance: bagging < pasting (예측기들의 상관관계를 감소)
oob score 1) 학습 데이터의 전체 개수만큼 복원추출 시, 평균적으로 각 예측기는 학습 데이터의 63.2%만 사용
→ $n$개의 학습 데이터 중 한 sample이 $n$번 복원추출하는 동안 선택되지 않을 확률은 약 36.8% $\lim_{n → \infty} \bigg( 1 - \frac{1}{n} \bigg)^n \\ = \lim_{t → 0} \bigg[ \bigg( 1 + t \bigg)^\frac{1}{t} \bigg]^{-1} \\ = e^{-1} \\ ≈ 0.368$ 2) 각 예측기마다 사용되지 않은 37%의 데이터를 out-of-bag(oob) sample이라 부름 3) oob_score=True option을 통해 각 예측기의 oob 평가의 평균값을 얻을 수 있음 (validation score) \

1
2
3
4
5
6
    from sklearn.ensemble import BaggingClassifier
    clf = BaggingClassifier(oob_score=True, ...)
    clf.fit(X, y)
    print(clf.oob_score_)
    print(clf.oob_decision_function_)

Parameters
- max_samples: subset의 크기
- bootstrap: bagging or pasting
Feature sampling 1) max_features, bootstrap_features option을 통해 feature sampling 가능 2) Random patches method: data, feature 모두 sampling하는 방법
Random subspace method: feature만 sampling하는 방법
Algorithm
학습 데이터의 subset을 무작위로 구성하여 모델들을 학습하고 hard(default, classification) or soft(classification) voting or averaging(default, regression)
Soft Voting
모든 estimator가 predict_proba() method를 가지고 있다면 soft voting을 수행

1
2
3
4
5
6
7
8
9
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)
clf = BaggingClassifier(base_estimator=SVC(), n_estimators=10, random_state=0).fit(X, y)
clf.oob_score_

1
2
3
4
5
6
from sklearn.svm import SVR
from sklearn.ensemble import BaggingRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=4, n_informative=2, n_targets=1, random_state=0, shuffle=False)
regr = BaggingRegressor(base_estimator=SVR(), n_estimators=10, random_state=0).fit(X, y)

7.3 Pasting

Algorithm
중복을 허용하지 않고 학습 데이터의 subset을 생성하여 예측값을 집계
Parameters
Bagging과 동일한 API(sklearn.ensemble.Bagging-)를 공유하며, bootstrap=False option을 통해 pasting을 사용할 수 있음

1
2
3
4
5
6
from sklearn.svm import SVR
from sklearn.ensemble import BaggingRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=4, n_informative=2, n_targets=1, random_state=0, shuffle=False)
regr = BaggingRegressor(base_estimator=SVR(), n_estimators=10, random_state=0, bootstrap=False).fit(X, y)

7.4 Boosting

Algorithm
Weak learner를 sequential하게 연결하고 각 learner가 이전의 learner를 보완하여 strong learner를 만드는 ensemble algorithm

7.4.1 Adaboost(Adaptive boosting)

Algorithm
1. 이전 model이 underfitting 했던 sample의 가중치를 더 높여 학습
  - sample의 가중치: 큰 가중치를 가진 sample은 개수를 늘리는 것으로 간주할 수 있음
2. 학습된 model들의 예측값에 가중치를 반영하여 최종 예측값을 반환(voting or average)
기반 예측기(weak learner)
- 일반적으로, max_depth=1 decision tree가 사용
- SVM은 속도가 느리고 adaboost 알고리즘과 함께 사용했을 때 불안정한 경향이 있음
Parameters
- learning_rate: 가중치 update의 세기 (GD의 learning rate와 유사)
sklearn.ensemble.AdaBoostClassifier: Classification
sklearn.ensemble.AdaBoostRegressor: Regression

1
2
3
4
5
6
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=4, n_informative=2, n_redundant=0, random_state=0, shuffle=False)
clf = AdaBoostClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

7.4.2 Gradient boosting

Algorithm
이전 예측기의 residual error를 input으로 학습하여 learner를 보강 $\text{Input}: X, y \\ \text{Model }1: \hat y_1 = f_1(X) \quad → \quad r^{(1)} = y - \hat y_1 \\ \text{Model }2: \hat r_2^{(1)} = f_2(X) \quad → \quad r^{(2)} = r^{(1)} - \hat r_2^{(1)} \\ \vdots \\ \text{Model }k: \hat r_k^{(k-1)} = f_k(X) \quad → \quad r^{(k)} = r^{(k-1)} - \hat r_k^{(k-1)} \\ \quad \\ \text{Prediction}: \hat y = \sum_i f_i(X)$
기반 예측기(weak learner)
Decision tree로 고정 (classifier, regressor)
Parameters
- learning_rate: 각 tree의 기여 정도
- subsample: 각 tree가 학습할 때 사용할 학습 데이터의 비율
Shrinkage
learning_rate로 낮은 값을 사용하면 많은 tree가 필요하지만, 일반적으로 예측의 성능이 향상됨 이를 regularization의 일종인, shrinkage(축소) 라 한다.
Early stopping
staged_predict()(각 훈련 단계마다 학습된 모델의 예측값에 대한 iterator를 반환)을 사용하여 간단히 구현가능
Stochastic gradient boosting
- subsample을 작은 값으로 사용 시, 학습 속도 증가, bias 감소, variance 상승
- 이러한 기법을 stochastic gradient boosting이라 한다.
XGBoost
최적화된 gradient boosting의 구현 package

1
2
3
4
5
6
7
8
9
10
11
12
13
14
import numpy as np
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

X, y = make_hastie_10_2(random_state=0)
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0).fit(X_train, y_train)

scores = [accuracy_score(y_test, y_pred) for y_pred in clf.staged_predict(X_test)]
bst_n_estimators = np.argmax(scores) + 1
best_clf = GradientBoostingClassifier(n_estimators=bst_n_estimators, learning_rate=1.0, max_depth=1, random_state=0)

7.5 Stacking(Stacked generalization)

Algorithm
1. 예측기(base model)들의 예측값들을 input으로 받는 meta-learner(blender)를 사용
2. Base model들을 학습시키기 위한 데이터와 meta-learner를 학습시키기 위한 데이터가 구별되어야 함
  - Out-of-fold: k-fold cross validation 사용 (stacking)
  - Hold-out: 데이터를 두 개의 subset으로 분리하여 사용 (blending)
3. 여러 개의 blender를 layer로 쌓아서 학습시키는 것도 가능
from sklearn.ensemble import StackingClassifier: Classifier
from sklearn.ensemble import StackingRegressor: Regressor

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import StackingClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

estimators = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('svr', make_pipeline(StandardScaler(), LinearSVC(random_state=42)))
]
clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
clf.fit(X_train, y_train).score(X_test, y_test)

8. Bagging with Decision Tree

8.1 Random forest

Algorithm
1. Decision tree에 bagging(or pasting) 적용
2. 전체 feature들 중 무작위로 선택한 feature 후보 중에서 최적의 feature를 선택하는 무작위성 추가(random)
  - $\forall k \in K L(k, t_k) \quad \mid K \mid = {\color{blue}{\text{max_features}}}$
  - Bias 증가, Variance 감소
DecisionTree-와 마찬가지로 feature_importances_ 계산 가능
몇 개의 feature들만 사용하는 decision tree와는 달리 random forest는 대부분의 feature을 평가할 수 있음

1
2
3
4
5
6
7
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)
for feat, score in zip(features, rnd_clf.feature_importances_):
   print(name, score)
rnd_clf.predict(X_test)

8.2 Extra tree(Extra random tree)

Algorithm
무작위로 feature($k$)를 선택(random forest)할 뿐만 아니라, threshold($t_k$)까지 무작위로 선택한 후, 최상의 분할을 선택
→ Optimal $t_k$를 찾는 과정이 없어져 학습 속도가 매우 빠름
Bias 증가, Variance 감소
sklearn.ensemble.RandomForest와 동일한 option으로 사용가능
- sklearn.ensemble.ExtraTrees-: ensemble model
- sklearn.ensemble.ExtraTree-: single model

1
2
3
4
5
6
7
from sklearn.ensemble import ExtraTreesClassifier

rnd_clf = ExtraTreesClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)
for feat, score in zip(features, rnd_clf.feature_importances_):
   print(name, score)
rnd_clf.predict(X_test)

PREVIOUSEtc