安東尼的筆記屋: October 2017

Friday, October 13, 2017

Week 6

Evaluating a Hypothesis

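To evaluate a hypothesis, split the data into a training set and a test set.

A typical split is 70% for the training set and 30% for the test set.

The procedure then becomes:

Learn $\Theta$ by minimizing $J_{train}(\Theta)$ on the training set
Compute the test set error $J_{test}(\Theta)$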
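A minimal sketch of the 70/30 split, assuming the data is a list of (x, y) pairs. The function name `train_test_split`, the `seed` parameter, and the toy data are illustrative choices, not part of the course material.

```python
import random

def train_test_split(examples, train_fraction=0.7, seed=0):
    """Shuffle a copy of the examples and split it into (train, test)."""
    shuffled = list(examples)
    # Shuffle first so that any ordering in the data does not leak into the split.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

data = [(x, 2 * x + 1) for x in range(10)]  # toy (x, y) pairs
train, test = train_test_split(data)
print(len(train), len(test))
```

Shuffling before cutting matters when the data is sorted; otherwise the test set may cover a different range than the training set.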

The test set error

For linear regression

$J_{test}(\Theta) = \dfrac{1}{2m_{test}} \sum_{i=1}^{m_{test}}(h_\Theta(x^{(i)}_{test}) - y^{(i)}_{test})^2$
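The formula above can be sketched in a few lines, assuming a single-feature hypothesis $h_\Theta(x) = \theta_0 + \theta_1 x$. The name `j_test` and the sample values are made up for illustration.

```python
def j_test(theta, test_set):
    """Squared-error cost of h(x) = theta[0] + theta[1] * x over the test set."""
    m_test = len(test_set)
    total = sum((theta[0] + theta[1] * x - y) ** 2 for x, y in test_set)
    return total / (2 * m_test)

test_set = [(1.0, 3.0), (2.0, 5.0), (3.0, 8.0)]  # toy (x, y) pairs
# Only the last example is off, by 1, so the cost is 1 / (2 * 3).
print(j_test((1.0, 2.0), test_set))
```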

For classification

Misclassification error (aka 0/1 misclassification error)
$err(h_\Theta(x),y) = \begin{cases} 1 & \text{if } (h_\Theta(x) \geq 0.5 \text{ and } y = 0) \text{ or } (h_\Theta(x) < 0.5 \text{ and } y = 1) \\ 0 & \text{otherwise} \end{cases}$

A wrong prediction scores 1 and a correct prediction scores 0.

The average test error is:

$\text{Test Error} = \dfrac{1}{m_{test}} \sum^{m_{test}}_{i=1} err(h_\Theta(x^{(i)}_{test}), y^{(i)}_{test})$

Model Selection and Train/Validation/Test Sets

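A hypothesis that performs well on the training set will not necessarily perform well on the test set, so it cannot be judged by its training set performance alone.
One way to address this is to set aside an additional data set, called the cross validation set, which is used to choose among models so that the test set stays untouched until the final estimate.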

Training set: 60%
Cross validation set: 20%
Test set: 20%

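To choose the polynomial degree of the model, follow these steps:

For each polynomial degree, find the Θ that minimizes the error on the training set
Evaluate each Θ on the cross validation set and pick the degree with the lowest error
Compute the error of that degree on the test set as the estimate of the generalization error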
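The three steps above can be sketched as follows. To stay short, this sketch only compares degree 0 (a constant) with degree 1 (a line) using closed-form least squares; higher degrees would need a general polynomial fit. The names `fit`, `cost`, and `select_degree` and the toy data are assumptions for illustration.

```python
def fit(degree, data):
    """Least-squares fit: degree 0 is the mean of y, degree 1 is a line."""
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    if degree == 0:
        return (mean_y, 0.0)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = sxy / sxx
    return (mean_y - slope * mean_x, slope)

def cost(theta, data):
    """Squared-error cost of h(x) = theta[0] + theta[1] * x."""
    return sum((theta[0] + theta[1] * x - y) ** 2 for x, y in data) / (2 * len(data))

def select_degree(degrees, train, cv, test):
    # Step 1: fit every candidate on the training set only.
    fitted = {d: fit(d, train) for d in degrees}
    # Step 2: pick the degree with the lowest cross validation error.
    best = min(degrees, key=lambda d: cost(fitted[d], cv))
    # Step 3: report the test error of that single choice.
    return best, cost(fitted[best], test)

train = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 9), (5, 11)]  # toy data, y = 2x + 1
cv = [(6, 13), (7, 15)]
test = [(8, 17), (9, 19)]
best_degree, generalization_error = select_degree([0, 1], train, cv, test)
print(best_degree, generalization_error)
```

The test set is touched exactly once, after the degree has been fixed, which is what keeps the reported error an honest estimate.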

Learning Curves

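Effect of high bias

When N is small, it is easy to find a quadratic curve that fits the data, so the train error is small.
When N is small, the learned hypothesis is much more likely to be inaccurate on the test set, so the test error is high.
As N grows, both errors level off.
If the learned hypothesis has high bias, increasing the training set size will not help.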

Effect of high variance

With high variance, N must become very large before the errors level off.
If the learned hypothesis has high variance, increasing the training set size is likely to help.
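A learning curve can be computed by refitting on the first N training examples for increasing N and recording both errors. The sketch below assumes a deliberately simple constant predictor as a stand-in for a high-bias model. The names `fit_mean`, `cost`, and `learning_curve` and the toy data are illustrative.

```python
def fit_mean(data):
    """High-bias model: always predict the mean of the training labels."""
    return sum(y for _, y in data) / len(data)

def cost(c, data):
    """Squared-error cost of the constant prediction c."""
    return sum((c - y) ** 2 for _, y in data) / (2 * len(data))

def learning_curve(train, cv):
    """Return (N, train error, cross validation error) for N = 1..len(train)."""
    curve = []
    for n in range(1, len(train) + 1):
        subset = train[:n]
        c = fit_mean(subset)
        # Train error is measured on the N examples used for fitting,
        # cross validation error always on the full cross validation set.
        curve.append((n, cost(c, subset), cost(c, cv)))
    return curve

train = [(x, 2 * x + 1) for x in range(6)]  # toy data, y = 2x + 1
cv = [(1.5, 4.0), (3.5, 8.0)]
for n, j_train, j_cv in learning_curve(train, cv):
    print(n, round(j_train, 3), round(j_cv, 3))
```

With a single example the constant fits perfectly, so the train error starts at zero and can only rise as more examples are added.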

Deciding What to Do

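When predictions are inaccurate, the following fixes are available:

Get more training data

Fixes high variance

Try a smaller set of features

Fixes high variance

Add new features

Fixes high bias

Try polynomial features

Fixes high bias

Decrease λ

Fixes high bias

Increase λ

Fixes high variance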
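One rough way to turn the list above into a lookup is to compare the training and cross validation errors first. This is only a sketch: the function `diagnose` and its two thresholds (`acceptable_error`, `gap_tolerance`) are hypothetical and would have to be chosen per problem.

```python
REMEDIES = {
    "high bias": ["add new features", "try polynomial features", "decrease lambda"],
    "high variance": ["get more training data", "try a smaller set of features",
                      "increase lambda"],
}

def diagnose(j_train, j_cv, acceptable_error, gap_tolerance):
    """Heuristic diagnosis; both thresholds are assumptions, not course values."""
    if j_train > acceptable_error:
        # The model cannot even fit the training set well: underfitting.
        return "high bias"
    if j_cv - j_train > gap_tolerance:
        # Fits the training set but does not generalize: overfitting.
        return "high variance"
    return "ok"

problem = diagnose(j_train=0.02, j_cv=0.25, acceptable_error=0.10, gap_tolerance=0.05)
print(problem, REMEDIES.get(problem, []))
```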

Error Analysis

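The recommended approach to a machine learning problem is:

Quickly implement the simplest possible algorithm and test it early on the cross validation set
Plot learning curves to decide whether more data or more features are needed
Manually inspect the examples that were misclassified on the cross validation set to find the most common causes of error

The algorithm must be evaluable with a single number (for example, the error rate). For instance, when deciding whether to add a feature, if the error rate goes up after adding it, the feature should be left out.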

Error metrics for skewed classes

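When one class in a classification problem is far rarer than the other, the classes are said to be skewed.

Example: only 0.5% of the patients in the data have cancer
A classifier that always predicts y = 0 then has an error rate of only 0.5%, which can be lower than that of a trained model, so the error rate alone is misleading.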
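A quick illustration of why the error rate misleads here, using a made-up data set of 1000 patients in which 5 (0.5%) have cancer:

```python
labels = [1] * 5 + [0] * 995        # 0.5% positives, synthetic data
predictions = [0] * len(labels)     # trivial classifier: always predict y = 0

error_rate = sum(p != y for p, y in zip(predictions, labels)) / len(labels)
detected = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)

print(error_rate)  # 5 / 1000 = 0.005
print(detected)    # 0, not a single cancer case is found
```

The error rate looks excellent even though the classifier is useless for the cases that matter.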

Use Precision and Recall instead to evaluate the algorithm's performance.

By convention, the rare class is labeled y = 1.
With that convention, define:

Precision

$\text{Precision} = \frac{\text{True positives}}{\text{# predicted as positive}} = \frac{\text{True positives}}{\text{True positives + False positives}}$

Recall

$\text{Recall} = \frac{\text{True positives}}{\text{# actual positives}} = \frac{\text{True positives}}{\text{True positives + False negatives}}$

How can Precision and Recall be combined into a single number?

Use the F1 Score (also called the F Score)

$F_1 = 2\dfrac{PR}{P+R}$
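The three metrics can be computed directly from predictions and labels. In this sketch, returning 0.0 when a denominator is zero is an assumed convention, and the function name and sample data are illustrative.

```python
def precision_recall_f1(predictions, labels):
    """Return (precision, recall, F1) with y = 1 as the positive (rare) class."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)  # true positives
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)  # false positives
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)  # false negatives
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

predictions = [1, 1, 0, 0, 1, 0]
labels = [1, 0, 1, 0, 1, 0]
# tp = 2, fp = 1, fn = 1, so precision = recall = 2/3.
print(precision_recall_f1(predictions, labels))
```

Because F1 is zero whenever either precision or recall is zero, a classifier that always predicts y = 0 scores 0 instead of looking nearly perfect.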
