安東尼的筆記屋: 2017

Friday, October 13, 2017

Evaluating a Hypothesis

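To evaluate a hypothesis, split the data into a training set and a test set.

Typically 70% of the data is used as the training set and 30% as the test set.

The new procedure is:

Learn $\Theta$ from the training set by minimizing $J_{train}(\Theta)$
Compute the test set error $J_{test}(\Theta)$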
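The 70/30 split can be sketched as follows. This is a minimal illustration, not part of the original notes: the helper name `train_test_split`, the fixed `seed`, and the toy data are all assumptions. Shuffling first matters when the data is stored in some order.

```python
import random

def train_test_split(examples, train_fraction=0.7, seed=0):
    """Shuffle a copy of `examples` and split it into (train, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Toy data set of (x, y) pairs, purely for illustration.
data = [(x, 2 * x) for x in range(10)]
train_set, test_set = train_test_split(data)
print(len(train_set), len(test_set))
```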

The test set error

For linear regression

$J_{test}(\Theta) = \dfrac{1}{2m_{test}} \sum_{i=1}^{m_{test}}(h_\Theta(x^{(i)}_{test}) - y^{(i)}_{test})^2$

For classification

Misclassification error (aka 0/1 misclassification error)
$err(h_\Theta(x),y) = \begin{cases} 1 & \text{if } (h_\Theta(x) \geq 0.5 \text{ and } y = 0) \text{ or } (h_\Theta(x) < 0.5 \text{ and } y = 1) \newline 0 & \text{otherwise} \end{cases}$

The error is 1 when the prediction is wrong and 0 when it is correct.

The average test error is:

$\text{Test Error} = \dfrac{1}{m_{test}} \sum^{m_{test}}_{i=1} err(h_\Theta(x^{(i)}_{test}), y^{(i)}_{test})$
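A small sketch of the 0/1 misclassification error and its average over a test set. The function names and the sample probabilities are hypothetical; `probs` stands for the values $h_\Theta(x^{(i)}_{test})$.

```python
def misclassification_error(prob, y):
    """Return 1 if thresholding prob at 0.5 disagrees with the label y, else 0."""
    predicted = 1 if prob >= 0.5 else 0
    return 0 if predicted == y else 1

def average_test_error(probs, labels):
    """Mean of the 0/1 errors over all test examples."""
    errors = [misclassification_error(p, y) for p, y in zip(probs, labels)]
    return sum(errors) / len(labels)

# Hypothetical hypothesis outputs and true labels.
probs = [0.9, 0.2, 0.6, 0.4]
labels = [1, 0, 0, 1]
error = average_test_error(probs, labels)
print(error)  # the last two examples are misclassified, so this is 0.5
```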

Model Selection and Train/Validation/Test Sets

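A hypothesis that performs well on the training set does not necessarily perform well on the test set, so it cannot be judged by its training-set performance alone.
One way to address this is to add another data set, called the cross validation set.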

Training set: 60%
Cross validation set: 20%
Test set: 20%

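To choose the polynomial degree of the model, follow these steps:

For each polynomial degree, find the Θ that is optimal on the training set
Apply each Θ to the validation set and pick the degree with the smallest error
Compute the error of that degree on the test set and use it as the generalization error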
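The three steps above can be sketched like this. The training routine is passed in as `fit`, and `toy_fit` is only a hypothetical stand-in that ignores the data, so the example stays self-contained. A real `fit` would minimize $J_{train}(\Theta)$ for the given degree.

```python
def squared_error(h, data):
    """J = 1/(2m) * sum of (h(x) - y)^2 over the (x, y) pairs in data."""
    return sum((h(x) - y) ** 2 for x, y in data) / (2 * len(data))

def select_degree(fit, degrees, train, validation, held_out):
    # Step 1: fit one hypothesis per degree on the training set.
    fitted = {d: fit(d, train) for d in degrees}
    # Step 2: pick the degree with the smallest validation error.
    best = min(degrees, key=lambda d: squared_error(fitted[d], validation))
    # Step 3: the test error of that degree estimates the generalization error.
    return best, squared_error(fitted[best], held_out)

# Hypothetical stand-in for a training routine: ignores the data, returns x^d.
def toy_fit(degree, train):
    return lambda x: x ** degree

points = [(x, x ** 2) for x in (1.0, 2.0, 3.0, 4.0)]
best_degree, generalization_error = select_degree(
    toy_fit, [1, 2, 3], points[:2], points[2:3], points[3:])
print(best_degree, generalization_error)
```

The test set is touched only once, after the degree has been chosen, so its error is not biased by the selection step.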

Learning Curves

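Effect of high bias

When N (the training set size) is small, it is easy to find a quadratic curve that fits the data, so the training error is small.
When N is small, the learned hypothesis is much more likely to be inaccurate on the test set, so the test error is high.
As N grows, both errors plateau.
If the learned hypothesis has high bias, increasing the training set size does not help.

Effect of high variance

With high variance, N has to be very large before the errors plateau.
If the learned hypothesis has high variance, increasing the training set size does help.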

Deciding What to Do

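When predictions are inaccurate, the following fixes are available:

Get more training data

Helps with high variance

Try a smaller set of features

Helps with high variance

Add new features

Helps with high bias

Try polynomial features

Helps with high bias

Decrease λ

Helps with high bias

Increase λ

Helps with high variance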

Error Analysis

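The recommended approach to solving a machine learning problem is:

Quickly implement the simplest possible algorithm and test it early on the cross validation set
Plot learning curves to decide whether more data or more features are needed
Manually examine the examples that were misclassified on the cross validation set and look for the most common cause of errors

The algorithm must be evaluated with a single number (e.g. the error rate). For example, when deciding whether to add a feature, if the error rate goes up after adding it, the feature should be left out.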

Error metrics for skewed classes

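In a classification problem, when one outcome is far more likely than the other, the classes are called skewed classes

Example: only 0.5% of the examples in the data have cancer
A classifier that always predicts y = 0 then has an error rate of only 0.5%, which may be lower than that of a trained model, so the error rate alone is misleading

Instead, use Precision and Recall to evaluate the algorithm's performance

Assign y = 1 to the rare class
and define the following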

Precision

$\text{Precision} = \frac{\text{True positives}}{\text{# predicted as positive}} = \frac{\text{True positives}}{\text{True positives + False positives}}$

Recall

$\text{Recall} = \frac{\text{True positives}}{\text{# actual positives}} = \frac{\text{True positives}}{\text{True positives + False negatives}}$

How can Precision and Recall be summarized as a single number?

Use the F1 Score (also called the F Score), where P is precision and R is recall:

$F_1 = 2\frac{PR}{P+R}$
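A minimal sketch of the three metrics, computed from lists of predicted and actual labels. The function name and the sample labels are hypothetical. Returning 0.0 when a denominator is zero is a convention assumed here, not something stated in the notes.

```python
def precision_recall_f1(predicted, actual):
    """Compute (precision, recall, F1), treating y = 1 as the rare positive class."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)  # true positives
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)  # false positives
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)  # false negatives
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
p, r, f1 = precision_recall_f1(predicted, actual)
print(p, r, f1)

# A classifier that always predicts 0 gets recall 0, and therefore F1 = 0.
always_zero = precision_recall_f1([0] * len(actual), actual)
print(always_zero)
```

The second call shows why F1 is useful on skewed classes: the trivial always-0 classifier has a low error rate but scores 0.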

Sunday, September 24, 2017

Cost Function

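First, define some variables used in neural networks:

L: the total number of layers in the network
$s_l$: the number of units in layer l (not counting the bias unit)
K: the number of output units, i.e. the number of classes
$h_\Theta(x)_k$: the hypothesis that produces the k-th output

The cost function of a neural network is defined as:

$J(\Theta) = - \frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \left[y^{(i)}_k \log ((h_\Theta (x^{(i)}))_k) + (1 - y^{(i)}_k)\log (1 - (h_\Theta(x^{(i)}))_k)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} ( \Theta_{j,i}^{(l)})^2$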

Backpropagation Algorithm

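Backpropagation is the neural-network term for minimizing the cost function, just as gradient descent was used in logistic and linear regression

The goal is to compute: $\min_\Theta J(\Theta)$

This section looks at how to compute the partial derivatives of J(Θ)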

$\dfrac{\partial}{\partial \Theta_{i,j}^{(l)}}J(\Theta)$

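Algorithm:

Given a training set $\lbrace (x^{(1)}, y^{(1)}) \cdots (x^{(m)}, y^{(m)})\rbrace$

Set $\Delta^{(l)}_{i,j} := 0$ for all (l,i,j)
$\Delta$ represents the error (accumulated over all training examples)

For training example t = 1 to m:

Set $a^{(1)} := x^{(t)}$
Use forward propagation to compute $a^{(l)}$ for l=2,3,…,L

Using $y^{(t)}$, compute $\delta^{(L)} = a^{(L)} - y^{(t)}$

This is the error of the last layer
$\delta$ is the lowercase form of $\Delta$
It is the output of the last layer minus the correct value y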

Compute $\delta^{(L-1)}, \delta^{(L-2)},\dots,\delta^{(2)}$ for the layers before the last one

Use $\delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)})\ .*\ a^{(l)}\ .*\ (1 - a^{(l)})$
The last two factors are $g'(z^{(l)}) = a^{(l)}\ .*\ (1 - a^{(l)})$, the derivative of $g(z^{(l)})$ (g-prime)

$\Delta^{(l)}_{i,j} := \Delta^{(l)}_{i,j} + a_j^{(l)} \delta_i^{(l+1)}$

Or, in vectorized form: $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T$

Finally we obtain the updated $\Delta$ matrix, from which D is computed:

If j≠0

$D^{(l)}_{i,j} := \dfrac{1}{m}\left(\Delta^{(l)}_{i,j} + \lambda\Theta^{(l)}_{i,j}\right)$

If j=0

$D^{(l)}_{i,j} := \dfrac{1}{m}\Delta^{(l)}_{i,j}$

D acts as an accumulator that adds up all the values to obtain the final partial derivatives

$\frac \partial {\partial \Theta_{ij}^{(l)}} J(\Theta)$ = $D_{ij}^{(l)}$

Backpropagation Intuition

$\delta_j^{(l)}$ is the error of $a^{(l)}_j$; it can also be viewed as the derivative of the cost function:

$\delta_j^{(l)} = \dfrac{\partial}{\partial z_j^{(l)}} cost(t)$

Take the network in the figure below as an example:

$\delta_2^{(2)}$ in the second layer can be obtained from $\delta_1^{(3)}$ and $\delta_2^{(3)}$ in the third layer

$\delta_2^{(2)} = \Theta_{12}^{(2)}\delta_1^{(3)} + \Theta_{22}^{(2)}\delta_2^{(3)}$

Gradient Checking

Gradient checking is used to make sure our back propagation is correct
The derivative of the cost function can be approximated with:

$\dfrac{\partial}{\partial\Theta}J(\Theta) \approx \dfrac{J(\Theta + \epsilon) - J(\Theta - \epsilon)}{2\epsilon}$

For a matrix of multiple thetas, the derivative with respect to $\Theta_j$ is approximated by:

$\dfrac{\partial}{\partial\Theta_j}J(\Theta) \approx \dfrac{J(\Theta_1, \dots, \Theta_j + \epsilon, \dots, \Theta_n) - J(\Theta_1, \dots, \Theta_j - \epsilon, \dots, \Theta_n)}{2\epsilon}$

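A small epsilon (${\epsilon = 10^{-4}}$) keeps the approximation accurate
But if epsilon is too small, numerical problems arise
Compare the approximation with the gradient obtained from back propagation; if the implementation is correct, the two should differ very little
Because computing the approximation is very slow in practice, stop running it once the back propagation algorithm has been verified!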
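A sketch of the two-sided approximation, checked against a gradient known in closed form. The cost `J` here is a hypothetical toy function standing in for the network cost, and `analytic_gradient` stands in for the result of back propagation.

```python
def numerical_gradient(J, theta, epsilon=1e-4):
    """Approximate dJ/dtheta_j by (J(theta_j + eps) - J(theta_j - eps)) / (2 eps)."""
    grad = []
    for j in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[j] += epsilon   # perturb only component j
        minus[j] -= epsilon
        grad.append((J(plus) - J(minus)) / (2 * epsilon))
    return grad

# Hypothetical toy cost: J(theta) = theta_0^2 + 3 * theta_0 * theta_1.
def J(theta):
    return theta[0] ** 2 + 3 * theta[0] * theta[1]

# Its exact gradient, playing the role of the back propagation output.
def analytic_gradient(theta):
    return [2 * theta[0] + 3 * theta[1], 3 * theta[0]]

theta = [1.5, -2.0]
approx = numerical_gradient(J, theta)
exact = analytic_gradient(theta)
max_diff = max(abs(a - e) for a, e in zip(approx, exact))
print(max_diff)  # should be tiny if the analytic gradient is right
```

Each component needs two evaluations of J, which is why the check is slow for a real network and is switched off after verification.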

Random Initialization

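If all weights in a neural network are initialized to 0, back propagation updates every node to the same value, which greatly reduces accuracy
So each $\Theta^{(l)}_{ij}$ should be initialized to a random value in $[-\epsilon,\epsilon]$

This epsilon is unrelated to the one used in gradient checking
A good choice of epsilon is shown in the figure below, where Lin/Lout are the numbers of input/output units of the layer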
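A sketch of random initialization for one layer's weight matrix. The function name and the `seed` parameter are hypothetical. The default $\epsilon = \sqrt{6}/\sqrt{L_{in}+L_{out}}$ is a commonly used heuristic assumed here, since the figure with the notes' own formula is not reproduced. Pass `epsilon` explicitly to use a different value.

```python
import math
import random

def random_init(l_in, l_out, epsilon=None, seed=None):
    """Return an l_out x (l_in + 1) matrix with entries uniform in [-epsilon, epsilon]."""
    if epsilon is None:
        # Assumed heuristic, not taken from the notes.
        epsilon = math.sqrt(6) / math.sqrt(l_in + l_out)
    rng = random.Random(seed)
    # The extra column holds the weights of the bias unit.
    return [[rng.uniform(-epsilon, epsilon) for _ in range(l_in + 1)]
            for _ in range(l_out)]

weights = random_init(3, 2, seed=1)
for row in weights:
    print(row)
```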

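First choose the network architecture: how many hidden units, and how many layers

hidden units: the more the better, but weigh this against the computation cost
hidden layers: default to 1; with more than 1, the same number of hidden units in every layer is recommended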

Training a Neural Network

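Randomly initialize the weights
Implement forward propagation to get $h_\Theta(x^{(i)})$
Implement the cost function
Implement back propagation to compute the partial derivatives of the cost function
Run gradient checking once to confirm that back propagation is correct
Use gradient descent or a built-in optimization routine to find the weights that minimize the cost function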

Wednesday, September 20, 2017

Neural Networks

In the model, the inputs are the features $x_1\cdots x_n$ and the output is the result of the hypothesis function
The $x_0$ input node is sometimes called the bias unit; its value is always 1
Neural networks use the same logistic function as in classification:

$\frac{1}{1 + e^{-\theta^Tx}}$
This is sometimes called the sigmoid (logistic) activation function

The "theta" parameters are sometimes also called "weights"
The simplest representation is:

$\begin{bmatrix}x_0 \newline x_1 \newline x_2 \newline \end{bmatrix}\rightarrow\begin{bmatrix}\ \ \ \newline \end{bmatrix}\rightarrow h_\theta(x)$
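The input nodes are in the input layer (layer 1), pass through another node (layer 2), and finally produce the hypothesis function in the output layer
There may be more than one layer between the input and output layers; these are called hidden layers

The nodes of a hidden layer are called activation units
With one hidden layer, the network looks like this:
The value of each activation node is:
Each layer has its own $\Theta^{(j)}$, whose dimension is defined as follows:

If layer j has $s_j$ units and layer j+1 has $s_{j+1}$ units, then $\Theta^{(j)}$ has dimension $s_{j+1} \times (s_j + 1)$
The +1 accounts for the bias node $x_0$ and $\Theta_0^{(j)}$
The output side does not include a bias node; only the input side does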

In vectorized form:

Write the activation nodes as a vector: $a^{(j)} = g(z^{(j)})$
where $z^{(j)} = \Theta^{(j-1)}a^{(j-1)}$
To compute the next layer, add a bias unit to $a^{(j)}$
and then compute $z^{(j+1)} = \Theta^{(j)}a^{(j)}$
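The layer-by-layer computation can be sketched in plain Python. The weight values are arbitrary and purely illustrative. Each `theta` is a list of rows with $s_{j+1}$ rows and $s_j + 1$ columns, matching the dimension rule above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(thetas, x):
    """Propagate the input x through every layer and return the output activations."""
    a = list(x)
    for theta in thetas:
        a_with_bias = [1.0] + a  # prepend the bias unit
        z = [sum(w * v for w, v in zip(row, a_with_bias)) for row in theta]
        a = [sigmoid(v) for v in z]  # a^(j+1) = g(z^(j+1))
    return a

# Hypothetical 2-2-1 network: theta1 is 2 x 3, theta2 is 1 x 3.
theta1 = [[0.1, 0.2, -0.3], [0.0, -0.5, 0.4]]
theta2 = [[0.3, -0.2, 0.1]]
output = forward([theta1, theta2], [1.0, 0.0])
print(output)
```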

Examples and Intuitions I

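A simple example is using a neural network to predict $x_1$ AND $x_2$; the function is as follows:

$x_0$ is the bias node, whose value is 1

The Theta matrix is: $\Theta^{(1)} =\begin{bmatrix}-30 & 20 & 20\end{bmatrix}$
The sigmoid function is close to 1 for z > 4 and close to 0 for z < -4

So the results are:

\begin{align*}& h_\Theta(x) = g(-30 + 20x_1 + 20x_2) \newline& x_1 = 0 \text{ and } x_2 = 0: \ g(-30) \approx 0 \newline& x_1 = 0 \text{ and } x_2 = 1: \ g(-10) \approx 0 \newline& x_1 = 1 \text{ and } x_2 = 0: \ g(-10) \approx 0 \newline& x_1 = 1 \text{ and } x_2 = 1: \ g(10) \approx 1\end{align*}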

Examples and Intuitions II

In the previous section, AND/OR/NOR could all be computed without a hidden layer
XNOR, however, needs an extra hidden layer, which combines AND/OR/NOR
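A sketch of XNOR built from three single units. The AND weights come from the notes. The NOR and OR weights are assumed values that follow the same pattern, and the helper names are hypothetical. The hidden layer computes AND and NOR, and the output unit ORs them together.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def unit(weights, x1, x2):
    """One sigmoid unit; weights[0] multiplies the bias input x0 = 1."""
    return sigmoid(weights[0] + weights[1] * x1 + weights[2] * x2)

AND = [-30, 20, 20]   # from the notes
NOR = [10, -20, -20]  # assumed: fires only when both inputs are 0
OR = [-10, 20, 20]    # assumed: fires when at least one input is 1

def xnor(x1, x2):
    a1 = unit(AND, x1, x2)   # hidden unit 1
    a2 = unit(NOR, x1, x2)   # hidden unit 2
    return unit(OR, a1, a2)  # output unit

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xnor(x1, x2)))
```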

Multiclass Classification

Suppose the result has 4 classes instead of just 2; it can then be represented by a vector of size 4:

$h_\Theta(x)$ will be one of these four possible vectors

Monday, September 18, 2017

Classification

Binary classification problem

y takes only the values 0 or 1, i.e. y∈{0,1}

0 is also called the negative class, written "-"
1 is also called the positive class, written "+"

Hypothesis Representation

Change the form of $h_\theta(x)$ so that it satisfies

\begin{align*}0 \leq h_\theta (x) \leq 1 \end{align*}

The new form is called the Sigmoid Function or Logistic Function

\begin{align*}& h_\theta (x) = g ( \theta^T x ) \newline \newline& z = \theta^T x \newline& g(z) = \dfrac{1}{1 + e^{-z}}\end{align*}

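The figure below shows the shape of the Sigmoid Function

$h_\theta(x)$ is the probability that the output is 1

If $h_\theta(x)=0.7$, there is a 70% probability that the output is 1
and the probability that the output is 0 is 1-70%=30%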

Decision Boundary

To classify the result as 0 or 1, the output of the hypothesis function is interpreted as follows:

\begin{align*}& h_\theta(x) \geq 0.5 \rightarrow y = 1 \newline& h_\theta(x) < 0.5 \rightarrow y = 0 \newline\end{align*}

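From the plot of the Sigmoid Function above, g(z) ≥ 0.5 when z ≥ 0
If z is $\theta^T x$, this means:

\begin{align*}& h_\theta(x) = g(\theta^T x) \geq 0.5 \newline& \text{when } \theta^T x \geq 0\end{align*}

So

\begin{align*}& \theta^T x \geq 0 \Rightarrow y = 1 \newline& \theta^T x < 0 \Rightarrow y = 0 \newline\end{align*}

The Decision Boundary is the line that separates the region where y=0 from the region where y=1

It does not have to be a straight line; it can be a circle or any other shape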
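Because g(z) ≥ 0.5 exactly when z ≥ 0, a prediction only needs the sign of $\theta^T x$. The sketch below uses a hypothetical θ = [-3, 1, 1], whose decision boundary is the straight line $x_1 + x_2 = 3$. The function name is also an assumption.

```python
def predict(theta, x):
    """Return 1 if theta^T x >= 0 (i.e. h_theta(x) >= 0.5), else 0."""
    features = [1.0] + list(x)  # x0 = 1 is the bias term
    z = sum(t * v for t, v in zip(theta, features))
    return 1 if z >= 0 else 0

# Hypothetical parameters: the boundary is the line x1 + x2 = 3.
theta = [-3.0, 1.0, 1.0]
below = predict(theta, [1.0, 1.0])        # x1 + x2 = 2 < 3
above = predict(theta, [2.0, 2.0])        # x1 + x2 = 4 > 3
on_boundary = predict(theta, [1.0, 2.0])  # x1 + x2 = 3, exactly on the line
print(below, above, on_boundary)
```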

Cost Function

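If logistic regression used the Cost Function of Linear Regression, the result would be wavy (non-convex) with many Local Optima
The Cost Function of Logistic Regression is:

\begin{align*}& J(\theta) = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) \newline& \mathrm{Cost}(h_\theta(x),y) = -\log(h_\theta(x)) \quad \text{if } y = 1 \newline& \mathrm{Cost}(h_\theta(x),y) = -\log(1-h_\theta(x)) \quad \text{if } y = 0\end{align*}

When y=1, we get the following plot of $J(\theta)$ vs. $h_\theta (x)$

When y=0, we get the following plot of $J(\theta)$ vs. $h_\theta (x)$

In summary:

\begin{align*}& \mathrm{Cost}(h_\theta(x),y) = 0 \text{ if } h_\theta(x) = y \newline& \mathrm{Cost}(h_\theta(x),y) \rightarrow \infty \text{ if } y = 0 \text{ and } h_\theta(x) \rightarrow 1 \newline& \mathrm{Cost}(h_\theta(x),y) \rightarrow \infty \text{ if } y = 1 \text{ and } h_\theta(x) \rightarrow 0\end{align*}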

Simplified Cost Function and Gradient Descent

The two cases of Cost() above can be combined into a single expression

$\mathrm{Cost}(h_\theta(x),y) = - y \; \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))$

The full Cost Function is then:

$J(\theta) = - \frac{1}{m} \displaystyle \sum_{i=1}^m [y^{(i)}\log (h_\theta (x^{(i)})) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)}))]$
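A direct, loop-based sketch of this $J(\theta)$. The data values are hypothetical, and each row of `X` is assumed to start with $x_0 = 1$. With θ = 0 every prediction is 0.5, so the cost equals log 2 whatever the labels are, which makes a handy sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, X, y):
    """J(theta) = -1/m * sum of y*log(h) + (1 - y)*log(1 - h)."""
    m = len(y)
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, x_i)))
        total += y_i * math.log(h) + (1 - y_i) * math.log(1 - h)
    return -total / m

# Hypothetical data; the first column is the bias feature x0 = 1.
X = [[1.0, 0.5], [1.0, 2.0], [1.0, -1.0]]
y = [0, 1, 0]
j_zero = cost([0.0, 0.0], X, y)
print(j_zero)  # log(2), about 0.693
```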

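A vectorized implementation is:

\begin{align*} & h = g(X\theta)\newline & J(\theta) = \frac{1}{m} \cdot \left(-y^{T}\log(h)-(1-y)^{T}\log(1-h)\right) \end{align*}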

Gradient Descent

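The Gradient Descent update has the same form as in Linear Regression (with the sigmoid-based $h_\theta$)
A vectorized implementation is:

$\theta := \theta - \frac{\alpha}{m} X^{T} (g(X \theta ) - y)$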

Advanced Optimization

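Besides Gradient Descent, there are other, more sophisticated optimization methods:

Conjugate gradient
BFGS
L-BFGS

It is better not to implement these complex algorithms yourself; use a Library that has already been optimized
We need to provide a Function that computes the following two results: $J(\theta)$ and $\dfrac{\partial}{\partial \theta_j}J(\theta)$
First write a single Function that returns both: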

function [jVal, gradient] = costFunction(theta)
  jVal = [...code to compute J(theta)...];
  gradient = [...code to compute derivative of J(theta)...];
end

Then use the following function to find the optimal solution:

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);

Multiclass Classification: One-vs-all

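When there are more than two classes, define y ∈ {0,1...n}
Split the problem into n+1 Binary Classification Problems (one per class) and predict the class whose hypothesis returns the highest value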
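A sketch of the prediction step of one-vs-all. Each row of `all_theta` is assumed to hold the already-trained parameters of one binary classifier. The numbers here are made up for illustration, and the function name is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_one_vs_all(all_theta, x):
    """Return the index of the class whose classifier gives the highest h_theta(x)."""
    features = [1.0] + list(x)  # x0 = 1
    scores = [sigmoid(sum(t * v for t, v in zip(theta, features)))
              for theta in all_theta]
    return max(range(len(scores)), key=lambda k: scores[k])

# Hypothetical parameters for 3 classes and a single feature.
all_theta = [[3.0, -2.0], [0.0, 0.1], [-3.0, 2.0]]
predictions = [predict_one_vs_all(all_theta, [x]) for x in (0.0, 1.5, 3.0)]
print(predictions)
```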

Regularization

To avoid overfitting, add a penalty on the weights to the cost function, which becomes:

$\min_\theta\ \dfrac{1}{2m}\left[ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\ \sum_{j=1}^n \theta_j^2 \right]$

Regularized Linear Regression

Gradient Descent

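The algorithm is rewritten as follows; note that $\theta_0$ is not penalized

\begin{align*} & \text{Repeat}\ \lbrace \newline & \ \ \ \ \theta_0 := \theta_0 - \alpha\ \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)} \newline & \ \ \ \ \theta_j := \theta_j - \alpha\ \left[ \left( \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \right) + \frac{\lambda}{m}\theta_j \right] \quad j \in \lbrace 1,2...n\rbrace\newline & \rbrace \end{align*}

After rearranging, the update for j ≥ 1 becomes: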

$\theta_j := \theta_j(1 - \alpha\frac{\lambda}{m}) - \alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$
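Since $1 - \alpha\frac{\lambda}{m}$ is always less than 1 while the second term is identical to the update without regularization, each iteration shrinks $\theta_j$ by the factor $1 - \alpha\frac{\lambda}{m}$ before applying the usual gradient step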
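One regularized update step for linear regression can be sketched as below. The function name, the data, and the values of α and λ are hypothetical. The data is chosen so that the current θ fits it exactly, which makes the gradient term zero and isolates the shrink factor.

```python
def regularized_step(theta, X, y, alpha, lam):
    """One gradient descent step with L2 regularization; theta[0] is not penalized."""
    m = len(y)
    errors = [sum(t * v for t, v in zip(theta, x_i)) - y_i
              for x_i, y_i in zip(X, y)]
    new_theta = []
    for j in range(len(theta)):
        grad_j = sum(e * x_i[j] for e, x_i in zip(errors, X)) / m
        shrink = 1.0 if j == 0 else 1.0 - alpha * lam / m  # no penalty on theta_0
        new_theta.append(theta[j] * shrink - alpha * grad_j)
    return new_theta

# Hypothetical data generated by y = 1 + 2x; theta = [1, 2] fits it exactly.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [3.0, 5.0, 7.0, 9.0]
updated = regularized_step([1.0, 2.0], X, y, alpha=0.1, lam=4.0)
print(updated)  # theta_0 unchanged, theta_1 multiplied by 1 - 0.1 * 4 / 4 = 0.9
```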

Normal Equation

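The equation is rewritten as:

\begin{align*}& \theta = \left( X^TX + \lambda \cdot L \right)^{-1} X^Ty \newline& \text{where}\ \ L = \begin{bmatrix} 0 & & & & \newline & 1 & & & \newline & & 1 & & \newline & & & \ddots & \newline & & & & 1 \newline\end{bmatrix}\end{align*}

As mentioned earlier, $X^TX$ is non-invertible if m<n, but after the rewrite $X^TX$ + λ⋅L is invertible (for λ > 0).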

Regularized Logistic Regression

The approach is the same as for Regularized Linear Regression

Multiple Features

Linear Regression with multiple variables is called multivariate linear regression
The notation is:

\begin{align*}x_j^{(i)} &= \text{value of feature } j \text{ in the }i^{th}\text{ training example} \newline x^{(i)}& = \text{the input (features) of the }i^{th}\text{ training example} \newline m &= \text{the number of training examples} \newline n &= \text{the number of features} \end{align*}

Its hypothesis function has the form:

\begin{align*}h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \cdots + \theta_n x_n\end{align*}

It can also be written in matrix form:

\begin{align*}h_\theta(x) =\begin{bmatrix}\theta_0 \hspace{2em} \theta_1 \hspace{2em} ... \hspace{2em} \theta_n\end{bmatrix}\begin{bmatrix}x_0 \newline x_1 \newline \vdots \newline x_n\end{bmatrix}= \theta^T x\end{align*}

Gradient Descent For Multiple Variables

The algorithm is:

\begin{align*} & \text{repeat until convergence:} \; \lbrace \newline \; & \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)}\newline \; & \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_1^{(i)} \newline \; & \theta_2 := \theta_2 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_2^{(i)} \newline & \cdots \newline \rbrace \end{align*}
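The update rule above can be sketched in plain Python. The function name, learning rate, iteration count, and data are assumptions for illustration. All $\theta_j$ are updated simultaneously from the same set of errors.

```python
def gradient_descent(X, y, alpha=0.1, iterations=2000):
    """Batch gradient descent for linear regression; rows of X start with x0 = 1."""
    m, n = len(y), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        errors = [sum(t * v for t, v in zip(theta, x_i)) - y_i
                  for x_i, y_i in zip(X, y)]
        gradient = [sum(e * x_i[j] for e, x_i in zip(errors, X)) / m
                    for j in range(n)]
        # Simultaneous update: every theta_j uses the errors computed above.
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
    return theta

# Hypothetical data generated by y = 1 + 2*x1 + 3*x2.
X = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0]
theta = gradient_descent(X, y)
print([round(t, 4) for t in theta])
```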

Feature Scaling

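When the ranges of the Features are very large or very different, Gradient Descent converges slowly
The ideal range for each Input Variable is

$-1 \le x_i \le 1$
This only needs to hold approximately; ranges that are much larger or much smaller are both bad

Mean normalization

$x_i := \dfrac{x_i - \mu_i}{s_i}$

$\mu_i$ is the mean of feature i
$s_i$ is the range of values (max - min) or the standard deviation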
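A small sketch of mean normalization for one feature, using the range (max - min) as $s_i$. The function name and sample values are hypothetical, and the feature is assumed not to be constant, since the range would otherwise be zero.

```python
def mean_normalize(values):
    """Return (v - mean) / (max - min) for every value of one feature."""
    mu = sum(values) / len(values)
    s = max(values) - min(values)  # assumed non-zero, i.e. the feature is not constant
    return [(v - mu) / s for v in values]

# Hypothetical feature values.
prices = [100, 200, 300, 400, 500]
scaled = mean_normalize(prices)
print(scaled)  # centered on 0 and spanning a range of exactly 1
```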

Normal Equation

To minimize the Cost Function J, take the partial derivative with respect to each $\theta_j$ and set it to 0; solving gives the optimal θ:

\begin{align*}\theta = (X^T X)^{-1}X^T y\end{align*}

An example of X and y is shown below

With the Normal Equation there is no need for Feature Scaling
Gradient Descent and the Normal Equation compare as follows:

Gradient Descent	Normal Equation
Need to choose alpha	No need to choose alpha
Needs many iterations	No need to iterate
$O(kn^2)$	$O(n^3)$, need to calculate inverse of $X^TX$
Works well when n is large	Slow if n is very large