Andrew Ng ML Open Course Notes (3) - Multivariate Linear Regression

Multivariate Linear Regression

Gradient Descent for Multiple Variables

  1. Suppose we have $n$ features; set the hypothesis to be: $$h_\theta(\mathbf{x}) = \sum_{i=0}^n \theta_i x_i = \theta^T \mathbf{x}, \quad x_0 = 1$$ in which $\mathbf{x} = \begin{bmatrix}x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix}$, $\mathbf{\theta} = \begin{bmatrix}\theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}$

  2. Cost Function $$J(\theta) = \frac{1}{2m} \sum_{i=1}^m \left(h_\theta(\mathbf{x}^{(i)}) - y^{(i)}\right)^2$$

  3. Gradient Descent Algorithm $$\begin{aligned} &\text{repeat until convergence: } \lbrace \\ &\quad \theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left((h_\theta(\mathbf{x}^{(i)}) - y^{(i)}) x_j^{(i)}\right) \\ &\rbrace \end{aligned}$$ (update every $\theta_j$, $j = 0, \dots, n$, simultaneously; a NumPy sketch follows this list)

  4. Feature Scaling

  • Get every feature into approximately $[-1, 1]$. Just normalize all the features :)
  • $x_i := \frac{x_i - \mu_i}{\sigma_i}$, where $\mu_i$ is the mean and $\sigma_i$ the standard deviation (or the range) of feature $i$
  • Do not apply to $x_0$
  5. Learning Rate
  • Not too big (may fail to converge), not too small (converges too slowly)
  6. Polynomial Regression
  • Use feature scaling, since polynomial terms like $x^2$, $x^3$ make the feature ranges differ wildly
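
A minimal NumPy sketch of the pieces above (hypothesis, cost, batch gradient descent with feature scaling). The toy data, the function names (`feature_scale`, `gradient_descent`) and the settings `alpha=0.1`, `iters=1000` are illustrative choices, not from the course:

```python
import numpy as np

def feature_scale(X):
    """Mean-normalize each feature column: x := (x - mu) / sigma."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma  # keep mu, sigma to scale future inputs

def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum((h - y)^2); X already contains the x0 = 1 column."""
    m = len(y)
    residual = X @ theta - y
    return residual @ residual / (2 * m)

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """theta_j := theta_j - alpha/m * sum((h - y) * x_j), all j updated simultaneously."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m
        theta -= alpha * grad
    return theta

# toy data: y ~ 1 + 2*x1 + 3*x2 plus noise (made up for illustration)
rng = np.random.default_rng(0)
X_raw = rng.uniform(0, 100, size=(50, 2))
y = 1 + X_raw @ np.array([2.0, 3.0]) + rng.normal(0, 0.1, 50)

X_scaled, mu, sigma = feature_scale(X_raw)        # do not scale x0
X = np.column_stack([np.ones(len(y)), X_scaled])  # prepend x0 = 1

theta = gradient_descent(X, y)
print("theta:", theta, " final cost:", cost(theta, X, y))
```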

Normal Equation

  1. Set $\mathbf{X} = \begin{bmatrix}\mathbf{x_1} & \mathbf{x_2} & \cdots & \mathbf{x_m} \end{bmatrix}$ (each column $\mathbf{x_i}$ is one training example, with $x_0 = 1$) and $\mathbf{y^T} = \begin{bmatrix} y_1 & y_2 & \cdots & y_m \end{bmatrix}$ as our training data:
  • If $\exists \mathbf{\theta_h}\ \forall \mathbf{x_i} \in \mathbf{X}\ (\theta_h^T \mathbf{x_i} = y_i)$, then $\mathbf{\theta_h}$ would be a perfect fit for the hypothesis. However, in most cases such a $\mathbf{\theta_h}$ does not exist, so we look for a $\mathbf{\theta}$ that is as close to a perfect fit as possible.
  • The problem above is equivalent to solving the linear system $\mathbf{X^T} \mathbf{\theta} = \mathbf{y}$, which in most cases has no exact solution. Instead we try to find a $\mathbf{\theta}$ that makes $\mathbf{X^T} \mathbf{\theta}$ as close to $\mathbf{y}$ as possible.
  • Let $\Delta = \mathbf{y} - \mathbf{X^T} \mathbf{\theta}$; we want $\|\Delta\|$ to be as small as possible. With a little geometric intuition we get the answer: the best $\mathbf{\theta}$ makes $\mathbf{X^T} \mathbf{\theta}$ the projection of $\mathbf{y}$ onto the column space of $\mathbf{X^T}$, which means $\Delta$ is orthogonal to that column space, i.e. $$\mathbf{X} \Delta = \mathbf{0}$$ So $\mathbf{X} (\mathbf{y} - \mathbf{X^T} \mathbf{\theta}) = \mathbf{0}$, which gives $\mathbf{X}\mathbf{X^T}\mathbf{\theta} = \mathbf{X}\mathbf{y}$ and thus $\mathbf{\theta} = {(\mathbf{X}\mathbf{X^T})}^{-1} \mathbf{X}\mathbf{y}$, assuming $\mathbf{X}\mathbf{X^T}$ is invertible (a code sketch follows this list).
  • Notice that our $\mathbf{X}$ is a little different from the course video: there each $\mathbf{x_i}$ is arranged as a row of the design matrix, while here each $\mathbf{x_i}$ is a column, so our $\mathbf{X}$ is the transpose of the course's and the course's formula reads $\mathbf{\theta} = {(\mathbf{X^T}\mathbf{X})}^{-1} \mathbf{X^T}\mathbf{y}$.
  • Feature scaling is not necessary here.
  2. Gradient Descent vs Normal Equation
  • Gradient Descent still works well when $n$ is very large, while the Normal Equation becomes slow because computing ${(\mathbf{X}\mathbf{X^T})}^{-1}$ takes roughly $O(n^3)$ time.
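
A short NumPy sketch of the normal equation with $\mathbf{X}$ arranged as in these notes (one example per column, $x_0 = 1$ in the first row); the toy data and variable names are made up for illustration:

```python
import numpy as np

# toy data, same convention as the notes: each COLUMN of X is one example x_i (with x0 = 1)
rng = np.random.default_rng(0)
m = 50
features = rng.uniform(0, 10, size=(2, m))   # 2 features, m examples
X = np.vstack([np.ones(m), features])        # shape (n+1, m)
y = np.array([1.0, 2.0, 3.0]) @ X + rng.normal(0, 0.1, m)

# theta = (X X^T)^{-1} X y  -- solve the linear system rather than forming the inverse
theta = np.linalg.solve(X @ X.T, X @ y)
print("normal equation:", theta)   # should be close to [1, 2, 3]

# sanity check against NumPy's least-squares solver for X^T theta = y
theta_lstsq, *_ = np.linalg.lstsq(X.T, y, rcond=None)
print("lstsq          :", theta_lstsq)
```

Using `np.linalg.solve` (or `np.linalg.pinv`) instead of explicitly forming ${(\mathbf{X}\mathbf{X^T})}^{-1}$ is both cheaper and numerically safer.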