安東尼的筆記屋: 9月 2017

2017年9月24日星期日

Cost Function

首先定義一些在neural network會用到的變數

L：在這個network中有幾層layer
$s_l$：在這個layer中有幾個unit (不含bias unit)
K：總共有幾個output unit，即結果分成幾類
$h_\Theta(x)_k$：產生第k個output的hypothesis

Neural network的cost function定義如下：

Backpropagation Algorithm

Backpropagation是neural network中用來表示最小化cost function的術語，就如同在logistic及linear regression中所用的gradient descent

目標是計算：$\min_\Theta J(\Theta)$

在這一節中我們來看看如何計算J(Θ)的偏微分

$\dfrac{\partial}{\partial \Theta_{i,j}^{(l)}}J(\Theta)$

演算法：

給定一組training set $\lbrace (x^{(1)}, y^{(1)}) \cdots (x^{(m)}, y^{(m)})\rbrace$

設定$\Delta^{(l)}_{i,j}$ for all (l,i,j)
$\Delta$代表的是error

For training example t =1 to m:

設定$a^{(1)} := x^{(t)}$
使用forward propogation計算$a^{(l)}$ for l=2,3,…,L

利用$y^{(t)}$計算$\delta^{(L)} = a^{(L)} - y^{(t)}$

這是最後一個layer的error
$\delta$是$\Delta$的小寫
將最後一層得到的結果跟正確值y相減

計算最後一層layer之前的$\delta^{(L-1)}, \delta^{(L-2)},\dots,\delta^{(2)}$

使用$\delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)})\ .*\ a^{(l)}\ .*\ (1 - a^{(l)})$
後面兩項代表的是$g'(z^{(l)}) = a^{(l)}\ .*\ (1 - a^{(l)})$，也就是對$g(z^{(l)})$的微分(g-prime derivative)

$\Delta^{(l)}_{i,j} := \Delta^{(l)}_{i,j} + a_j^{(l)} \delta_i^{(l+1)}$

或是用vector計算：$\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T$

最後可以得到新的$\Delta$ matrix

若j≠0

$D^{(l)}_{i,j} := \dfrac{1}{m}\left(\Delta^{(l)}_{i,j} + \lambda\Theta^{(l)}_{i,j}\right)$

若j=0

$D^{(l)}_{i,j} := \dfrac{1}{m}\Delta^{(l)}_{i,j}$

D是作為accumulator用來將所有的值加起來以求出最後的偏微分

$\frac \partial {\partial \Theta_{ij}^{(l)}} J(\Theta)$ = $D_{ij}^{(l)}$

Backpropagation Intuition

$\delta_j^{(l)}$代表的是在$a^{(l)}_j$的error，實際上也可以看成cost function的微分

$\delta_j^{(l)} = \dfrac{\partial}{\partial z_j^{(l)}} cost(t)$

以下面這張圖為例：

第二層的$\delta_2^{(2)}$可透過第三層的$\delta_1^{(3)}$及$\delta_2^{(3)}$來得到

$\delta_2^{(2)}$ == $\Theta_{12}^{(2)}$*$\delta_1^{(3)}$+$\Theta_{22}^{(2)}$*$\delta_2^{(3)}$

Gradient Checking

Gradient checking可以用來確保我們的back propagation是正確的
Cost function的微分結果可以用下列式子來做近似：

$\dfrac{\partial}{\partial\Theta}J(\Theta) \approx \dfrac{J(\Theta + \epsilon) - J(\Theta - \epsilon)}{2\epsilon}$

而對於有多個theta的矩陣，對於$Θ_j$微分的近似值則如下：

$\dfrac{\partial}{\partial\Theta_j}J(\Theta) \approx \dfrac{J(\Theta_1, \dots, \Theta_j + \epsilon, \dots, \Theta_n) - J(\Theta_1, \dots, \Theta_j - \epsilon, \dots, \Theta_n)}{2\epsilon}$

將epsilon設小(${\epsilon = 10^{-4}}$)可確保數學式子是正確的
淡epsilon若設的過小會遇到數值上的問題
將近似的結果跟用back propagation得到的delta做比較，若結果正確的話數值不會差太多
由於實作上近似值的計算很慢，所以在檢查完back propagation演算法後就不要再多做了!

Random Initialization

在neural network中若將weight的初始值都設為0的話，在做back propagate時所有的node都會被update成相同的數值，這樣會大大降低準確度
所以我們應該將$\Theta^{(l)}_{ij}$設成介於$[-\epsilon,\epsilon]$間的隨機值

這裡的epsilon跟gradient checking的沒有關聯
epsilon比較好的選擇如下圖，Lin/Lout為該Layer input/output的unit數量

首先要選擇network architecture，hidden unit有幾個，要分成幾個layer

hidden unit：越多越好，但要衡量計算成本
hidden layer：預設為1層，若超過1層，則建議每一層的hidden unit數目一樣\

Training a Neural Network

隨機初始化weights
實作forward propagation以求得$h_\Theta(x^{(i)})$
實作cost function
實作back propagation以算出cost function的偏微分
跑一次gradient checking確認back propagation是對的
使用gradient descent或內建的最佳化方式求出讓cost function最小的weights

2017年9月20日星期三

Neural Networks

在model中，input是$x_1\cdots x_n$的features，output是hypothesis function的結果
$x_0$ input node有時稱為bias unit，其值永遠為1
在neural network中，會使用跟classification一樣的logistic function

$\frac{1}{1 + e^{-\theta^Tx}}$
有時稱之為sigmoid (logistic) activation function

"theta" parameters有時也稱為"weights"
最簡單的表示方式如下：

$\begin{bmatrix}x_0 \newline x_1 \newline x_2 \newline \end{bmatrix}\rightarrow\begin{bmatrix}\ \ \ \newline \end{bmatrix}\rightarrow h_\theta(x)$
input nodes位於input layer (layer 1)，中間經過另一個node (layer 2)，最後產出hypothesis function位於output layer
在input跟output layer之間可能會有不只一層layer，稱其為hidden layers

我們把hidden layer的node稱為activation unit
假設有一層hidden layer，會長的像下面這樣：
每一個activation node的值如下：
每一個layer有它自己的$\Theta^{(j)}$，則其dimension的定義如下：

假設layer j有$s_j$個unit，layer j+1有$s_{j+1}$個unit，則$\Theta^{(j)}$的dimension為$s_{j+1} \times (s_j + 1)$
+1是因為多考慮bias node $x_0$跟$\Theta_0^{(j)}$
output不須考慮bias node，只有input需要考慮

用vector的方式處理：

將activation node以vector表示：$a(j)=g(z(j))$
而$z^{(j)} = \Theta^{(j-1)}a^{(j-1)}$
若還要計算下一層，就在$a^{(j)}$中加入一個bias unit
就可以計算$z^{(j+1)} = \Theta^{(j)}a^{(j)}$

Examples and Intuitions I

一個簡單的例子是以neural network預測$x_1$ AND $x_2$的結果，function會是如下：

$x_0$是bias node，其值為1

Theta matrix則會是：$\Theta^{(1)} =\begin{bmatrix}-30 & 20 & 20\end{bmatrix}$
sigmoid function在z>4後趨近於1，z < 4後趨近於0

所以可以算出如下結果：

Examples and Intuitions II

在上一節中，AND/OR/NOR都可以無須hidden layer就算出來
但XNOR就需要多一層hidden layer才能求出，中間透過AND/OR/NOR的轉換

Multiclass Classification

假設最後的結果不是只有2類，而是4類，那可用大小為4的vector來表示：

$h_\Theta(x)$ 會是這四種可能的vector其中之一

2017年9月18日星期一

Classification

Binary classification problem

y只考慮是0或1，即y∈{0,1}

0又稱為negative class，可用"-"表示
1又稱為positive class，可用"+"表示

Hypothesis Representation

將$h_\theta(x)$的形式改成如下形式

\begin{align*}0 \leq h_\theta (x) \leq 1 \end{align*}

新的形式稱為Sigmoid Function 或Logistic Function

\begin{align*}& h_\theta (x) = g ( \theta^T x ) \newline \newline& z = \theta^T x \newline& g(z) = \dfrac{1}{1 + e^{-z}}\end{align*}

下圖是Sigmoid Function的樣子

$h_\theta(x)$代表的是output為1的機率

假設$h_\theta(x)=0.7$，代表有70%的機率output會是1
而output是0的機率就是1-70%=30%

Decision Boundary

為了要把結果做0跟1的分類，可以把hypothesis function的output轉譯成如下：

\begin{align*}& h_\theta(x) \geq 0.5 \rightarrow y = 1 \newline& h_\theta(x) < 0.5 \rightarrow y = 0 \newline\end{align*}

從前面Sigmoid Function的圖可知當z > 0時，g(z) >= 0.5
若z為$\theta^T X$則代表：
所以
Decision Boundary就是用來區分y=0跟y=1區域的那條線

它不一定要直線，可以是圓形線或任何形狀

Cost Function

若Logistic Function使用Linear Regression的Cost Function，則會是波浪的形狀有很多Local Optima
Logistic Regression的Cost Function如下
當y=1時會得到$J(\theta)$跟$h_\theta (x)$的圖

當y=0時會得到$J(\theta)$跟$h_\theta (x)$的圖

整理如下：

Simplified Cost Function and Gradient Descent

我們可以把上面的Cost()整合成一條式子

$\mathrm{Cost}(h_\theta(x),y) = - y \; \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))$

整體的Cost Function如下表示：

$J(\theta) = - \frac{1}{m} \displaystyle \sum_{i=1}^m [y^{(i)}\log (h_\theta (x^{(i)})) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)}))]$

用vector的方式實作如下：

Gradient Descent

Gradient Descent的表示式跟Linear Regression一樣
用vector的方式實作如下：

Advanced Optimization

除了Gradient Descent之外有其他更好的方法

Conjugate gradient
BFGS
L-BFGS

但是建議不要自己實做這些複雜的演算法，而是找已經最佳化過的Library
我們需要提供一個Function去計算下面兩個結果
首先寫一個單一Function回傳這兩個結果

function [jVal, gradient] = costFunction(theta)
  jVal = [...code to compute J(theta)...];
  gradient = [...code to compute derivative of J(theta)...];
end

接著用下面的function算出最佳解

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);

Multiclass Classification: One-vs-all

當類別超過兩類時，定義 y = {0,1...n}
把這個問題分成n+1個Binary Classification Problem，將最大的結果當成預測值

Regularization

為了避免overfitting，可以在cost function中加上對weight的懲罰，所以cost function改寫如下：

$min_\theta\ \dfrac{1}{2m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\ \sum_{j=1}^n \theta_j^2$

Regularized Linear Regression

Gradient Descent

演算法改寫如下，要注意的是並沒有對$\theta_0$做懲罰
在經過整理後如下：

$\theta_j := \theta_j(1 - \alpha\frac{\lambda}{m}) - \alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$
由於$1 - \alpha\frac{\lambda}{m}$ 永遠會小於1，而後面的項目跟沒做regularization前一樣，所以weight會減小的範圍就決定在$1 - \alpha\frac{\lambda}{m}$

Normal Equation

式子改寫如下：
前面提過若m<n則$X^TX$是不可逆的，但在改寫後$X^TX$ + λ⋅L是可逆的。

Regularized Logistic Regression

和Regularized Linear Regression的方法一樣

Multiple Features

有多個變數的Linear Regression稱為multivariate linear regression
符號說明如下：

\begin{align*}x_j^{(i)} &= \text{value of feature } j \text{ in the }i^{th}\text{ training example} \newline x^{(i)}& = \text{the input (features) of the }i^{th}\text{ training example} \newline m &= \text{the number of training examples} \newline n &= \text{the number of features} \end{align*}

其hypothesis function的形式為：

\begin{align*}h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \cdots + \theta_n x_n\end{align*}

也可用matrix的形式來表示：

\begin{align*}h_\theta(x) =\begin{bmatrix}\theta_0 \hspace{2em} \theta_1 \hspace{2em} ... \hspace{2em} \theta_n\end{bmatrix}\begin{bmatrix}x_0 \newline x_1 \newline \vdots \newline x_n\end{bmatrix}= \theta^T x\end{align*}

Gradient Descent For Multiple Variables

演算法如下：

\begin{align*} & \text{repeat until convergence:} \; \lbrace \newline \; & \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)}\newline \; & \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_1^{(i)} \newline \; & \theta_2 := \theta_2 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_2^{(i)} \newline & \cdots \newline \rbrace \end{align*}

Feature Scaling

當Feature的range很大時，Descent的速度會很慢
Input Variable的理想range

−1 ≤ x(i) ≤ 1
大約就好，太大太小都不好

Mean normalization

μi 是feature (i)的平均值
si 是the range of values (max - min)或標準差

Normal Equation

為了得到Cost Function J的最小值，可以針對各個θi做偏微分後設成0，就可以求出θ的最佳解如下：

\begin{align*}\theta = (X^T X)^{-1}X^T y\end{align*}

X, y的例子如下

使用Normal Equation就不需要做Feature Scaling
Gradient Descent跟Normal Equation的比較如下：

Gradient Descent	Normal Equation
Need to choose alpha	No need to choose alpha
Needs many iterations	No need to iterate
O (kn2)	O (n3), need to calculate inverse of XTX
Works well when n is large	Slow if n is very large

實作上當n超過10000的時候就會考慮用Gradient Descent

Normal Equation Noninvertibility

如果X^T X是不可逆的，可能的原因如下：

有重複的Feature出現 (如線性相關的兩個Feature(ft/m))
Feature數量太多 (m≤ n)

刪除一些Feature或做Regularization

2017年9月6日星期三

Week1

Supervised Learning

定義:

有一筆已知正確結果的資料，找出input和output之間的關係。

分成兩類的問題

Regression

在一組連續的output中預測結果
把input variables map到連續性的function

Classification

預測的結果為一組離散的(discrete) output
把input variables map到離散的類別

Unsupervised Learning

對於結果會長怎樣所知甚少甚至是完全不知道
從資料中的variables找出關聯性並將資料分成小的group
例子:

Clustering

將基因做分群

Non-Clustering

將音訊檔區別人聲和音樂聲

Model Representation

對於supervised learning problem，流程如下圖。從Training set train出一個function h，這個h由於歷史因素稱為hypothesis。

Cost Function

用來計算hypothesis精確度的function稱為cost function如下

他又稱為Squared error function 或 Mean squared error
1/2是用來做gradient descent時較方便計算，作微分後就可以把1/2給消掉

Gradient Descent

以下是 Gradient Descent的演算法，一直repeat直到收斂為止

不同的起點可能會導致最後找到的local minimum點不同
α代表的是Step的參數

太小=>收斂很慢
太大=>無法收斂或甚至變成發散

越接近local minimum，斜率就會越小，整體Step自然也越小，所以不需要去減小α

訂閱：意見 (Atom)

安東尼的筆記屋

2017年9月24日星期日

Week5

Cost Function

Backpropagation Algorithm

Backpropagation Intuition

Gradient Checking

Random Initialization

2017年9月20日星期三

Week4

Neural Networks

Examples and Intuitions I

Examples and Intuitions II

Multiclass Classification

2017年9月18日星期一

Week3

Classification

Hypothesis Representation

Decision Boundary

Cost Function

Simplified Cost Function and Gradient Descent

Gradient Descent

Advanced Optimization

Multiclass Classification: One-vs-all

Regularization

Week2

Multiple Features

Gradient Descent For Multiple Variables

Normal Equation

Normal Equation Noninvertibility

2017年9月6日星期三

Week1

Supervised Learning

Unsupervised Learning

Model Representation

Cost Function

Gradient Descent

熱門文章

2017年9月24日 星期日

Cost Function

Backpropagation Algorithm

Backpropagation Intuition

Gradient Checking

Random Initialization

2017年9月20日 星期三

Neural Networks

Examples and Intuitions I

Examples and Intuitions II

Multiclass Classification

2017年9月18日 星期一

Classification

Hypothesis Representation

Decision Boundary

Cost Function

Simplified Cost Function and Gradient Descent

Gradient Descent

Advanced Optimization

Multiclass Classification: One-vs-all

Regularization

Multiple Features

Gradient Descent For Multiple Variables

Normal Equation

Normal Equation Noninvertibility

2017年9月6日 星期三

Supervised Learning

Unsupervised Learning

Model Representation

Cost Function

Gradient Descent

2017年9月24日星期日

2017年9月20日星期三

2017年9月18日星期一

2017年9月6日星期三