Gradient
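The gradient is the vector of first-order partial derivatives of a multivariable function. Its direction points the way the function value increases most steeply, and its magnitude is the rate of change in that direction.
Let's derive the gradient of the MSE for linear regression.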
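As a small illustration of this definition, the sketch below approximates each partial derivative with central differences and compares the result with the hand-computed gradient. The function `f(x, y) = x^2 + 3y^2`, the evaluation point, and the helper name `numerical_gradient` are assumptions chosen for this example only.

```python
import numpy as np

def f(p):
    # Illustrative function (an assumption for this example): f(x, y) = x^2 + 3y^2
    x, y = p
    return x**2 + 3 * y**2

def numerical_gradient(func, p, eps=1e-6):
    # Approximate each first-order partial derivative with a central difference
    p = np.asarray(p, dtype=float)
    grad = np.zeros_like(p)
    for j in range(p.size):
        step = np.zeros_like(p)
        step[j] = eps
        grad[j] = (func(p + step) - func(p - step)) / (2 * eps)
    return grad

point = np.array([1.0, 2.0])
grad = numerical_gradient(f, point)

# Hand-computed partial derivatives: df/dx = 2x, df/dy = 6y
analytic = np.array([2 * point[0], 6 * point[1]])

# The directional derivative along a unit vector u is grad . u;
# it is largest when u points along the gradient itself.
u_grad = analytic / np.linalg.norm(analytic)
u_other = np.array([1.0, 0.0])
slope_along_grad = analytic @ u_grad
slope_along_other = analytic @ u_other
```

Here `slope_along_grad` equals the norm of the gradient, which is the largest slope any unit direction can achieve at this point.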
1. $\hat{\mathbf{y}} = X \mathbf{\theta}$
\(\begin{pmatrix}
\hat{y}^{(1)} \\
\vdots \\
\hat{y}^{(m)}
\end{pmatrix} =
\begin{pmatrix}
& \mathbf{x^{(1)}}^T & \\
& \vdots & \\
& \mathbf{x^{(m)}}^T & \\
\end{pmatrix}
\begin{pmatrix}
\theta_0 \\
\vdots \\
\theta_n \\
\end{pmatrix}\)
2. $ MSE(\mathbf{\theta}) $
$ = \frac{1}{m}\sum_{i=1}^m(\hat{y}^{(i)} - y^{(i)})^2 = \frac{1}{m}\sum_{i=1}^m(\mathbf{x^{(i)}}^T \theta - y^{(i)})^2 $
3. $ \frac{\partial}{\partial \theta_j} MSE(\mathbf{\theta}) $
$ = \frac{2}{m} \sum_{i=1}^m(\mathbf{x^{(i)}}^T \theta - y^{(i)}) x^{(i)}_j $
4. $ \frac{\partial}{\partial \mathbf{\theta}} MSE(\mathbf{\theta}) = \nabla_{\theta}MSE(\theta) $
Let $ \alpha_i = \frac{2}{m}(\mathbf{x^{(i)}}^T \theta - y^{(i)}) $, so that $ \frac{\partial}{\partial \theta_j} MSE(\mathbf{\theta}) = \sum_i \alpha_i x_j^{(i)} $
\(\frac{\partial}{\partial \mathbf{\theta}} MSE(\mathbf{\theta}) =
\begin{pmatrix}
\frac{\partial}{\partial \theta_0} MSE(\mathbf{\theta}) \\
\vdots \\
\frac{\partial}{\partial \theta_n} MSE(\mathbf{\theta}) \\
\end{pmatrix} =
\begin{pmatrix}
\sum_i \alpha_i x_0^{(i)} \\
\vdots \\
\sum_i \alpha_i x_n^{(i)} \\
\end{pmatrix} =
\alpha_1
\begin{pmatrix}
x_0^{(1)} \\
\vdots \\
x_n^{(1)} \\
\end{pmatrix} + \cdots +
\alpha_m
\begin{pmatrix}
x_0^{(m)} \\
\vdots \\
x_n^{(m)} \\
\end{pmatrix} =
\sum_i \alpha_i \mathbf{x}^{(i)}\)
\(= \frac{2}{m} \sum_{i=1}^m(\mathbf{x^{(i)}}^T \theta - y^{(i)}) \mathbf{x}^{(i)} \quad \cdots \quad \textit{linear combination of } \ \mathbf{x}^{(i)}\)
\(= \frac{2}{m}
\begin{pmatrix}
\\
\mathbf{x}^{(1)} & \cdots & \mathbf{x}^{(m)}\\
\\
\end{pmatrix}
\begin{pmatrix}
\mathbf{x^{(1)}}^T \theta- y^{(1)} \\
\vdots \\
\mathbf{x^{(m)}}^T \theta- y^{(m)} \\
\end{pmatrix}\)
\(= \frac{2}{m}
\begin{pmatrix}
\\
\mathbf{x^{(1)}} & \cdots & \mathbf{x^{(m)}}\\
\\
\end{pmatrix} \left(
\begin{pmatrix}
& \mathbf{x^{(1)}}^T & \\
& \vdots & \\
& \mathbf{x^{(m)}}^T & \\
\end{pmatrix}
\begin{pmatrix}
\\
\theta\\
\\
\end{pmatrix} -
\begin{pmatrix}
\\
\mathbf{y}\\
\\
\end{pmatrix}\right)\)
\(= \frac{2}{m} X^T(X\theta - \mathbf{y})\)
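
The derivation above can be checked numerically. The following is a minimal NumPy sketch, not a definitive implementation: it computes the gradient in the vectorized form $\frac{2}{m} X^T(X\theta - \mathbf{y})$, in the per-sample sum form from the linear-combination step, and by central differences on the MSE itself. The function names, the random data, and the convention that the first column of `X` is a constant 1 (so that $\theta_0$ acts as a bias term) are assumptions made for this example.

```python
import numpy as np

def mse(X, y, theta):
    # MSE(theta) = (1/m) * sum_i (x_i^T theta - y_i)^2
    residual = X @ theta - y
    return np.mean(residual ** 2)

def mse_gradient(X, y, theta):
    # Vectorized form: (2/m) * X^T (X theta - y)
    m = X.shape[0]
    return (2.0 / m) * (X.T @ (X @ theta - y))

def mse_gradient_sum(X, y, theta):
    # Per-sample form: (2/m) * sum_i (x_i^T theta - y_i) * x_i
    m = X.shape[0]
    grad = np.zeros_like(theta)
    for i in range(m):
        grad += (2.0 / m) * (X[i] @ theta - y[i]) * X[i]
    return grad

def mse_gradient_fd(X, y, theta, eps=1e-6):
    # Central-difference approximation of each partial derivative
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (mse(X, y, theta + step) - mse(X, y, theta - step)) / (2 * eps)
    return grad

# Hypothetical random data: m samples, n features plus an assumed bias column x_0 = 1
rng = np.random.default_rng(0)
m, n = 5, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
y = rng.normal(size=m)
theta = rng.normal(size=n + 1)

grad_vec = mse_gradient(X, y, theta)
grad_sum = mse_gradient_sum(X, y, theta)
grad_fd = mse_gradient_fd(X, y, theta)
```

All three results should agree up to floating-point error. The vectorized form is usually preferred in practice because it replaces the Python loop over samples with a single matrix product.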