Saturday 22 February 2014

Supervised Learning : A Mathematical Foundation

Supervised Learning as per Wikipedia

Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances.


In Short : Learning, from historical data, a mapping between input and output variables, and applying this mapping to predict the output for unseen data.

STEPS of SUPERVISED LEARNING

  • Determine the type of training example.
    • What kind of data will be used?
      • Handwriting analysis: a single handwritten character, a word, or a whole line of text.
  • Gather a training set.
  • Determine the input feature representation of the learned function.
    • Each input object is transformed into a feature vector.
  • Determine the structure of the learned function and the corresponding learning algorithm.
    • e.g. a support vector machine (SVM) or a decision tree (DT).
  • Complete the design.
    • Run the learning algorithm on the training set.
  • Evaluate accuracy on unseen (test) data; a minimal end-to-end sketch follows this list.
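
The steps above can be summarised in a minimal end-to-end sketch (assuming scikit-learn is available; the synthetic dataset and the linear SVM are illustrative choices, not part of the original post):

# A minimal sketch of the supervised learning steps above (assumes scikit-learn).
from sklearn.datasets import make_classification      # steps 1-2: gather labelled examples
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC                            # step 4: choose an SVM as the learned function
from sklearn.metrics import accuracy_score

# Steps 1-3: training examples are (feature vector, label) pairs.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Step 5: run the learning algorithm on the training set.
clf = SVC(kernel="linear").fit(X_train, y_train)

# Step 6: evaluate accuracy on unseen data.
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))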

 Mathematics behind Supervised Learning

Goal : Infer a function (classifier)
f : X → Y
from sample data A_n,
A_n = ((x_1, y_1), ..., (x_n, y_n)) ∈ (X × Y)^n,
with input points x_i ∈ X and output points y_i ∈ Y.


y_i ∈ ℝ for regression problems, and y_i is discrete for classification problems, e.g. y_i ∈ {−1, +1}.

 
Two ingredients of the hypothesis:
1) Find a function f (from a hypothesis class H) that models the dependency between x and y in P(x, y).
2) Measure the error or loss between the prediction f(x) and the desired output y.

Loss function : L : Y × Y → ℝ_+

For binary classification with Y = {−1, +1}, the zero-one loss is L(h(x), y) = ½ |h(x) − y|.
For unsupervised learning, a loss can be defined on the prediction alone, e.g. the negative log-likelihood
L_u(h(x)) = −log(h(x)).
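
As a small illustration in plain Python (the function name is chosen here for illustration), the zero-one loss above behaves as follows:

# Zero-one loss for binary labels in {-1, +1}: L(h(x), y) = 1/2 * |h(x) - y|
# (a minimal sketch; the function name is illustrative, not from the source paper)
def zero_one_loss(prediction, y):
    return 0.5 * abs(prediction - y)

print(zero_one_loss(+1, +1))  # 0.0  correct prediction
print(zero_one_loss(-1, +1))  # 1.0  wrong prediction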
Risk of a function, or generalization error:

R(h) = ∫ L(h(x), y) dP(x, y).

Classification : find the function f that minimises R(f); the difficulty is that the joint probability P(x, y) is unknown.
Treating the input and output as random variables, the risk can equivalently be written as an expectation:

R(h) = \mathbf{E}[L(h(x), y)] = \int L(h(x), y)\,dP(x, y).


Empirical Risk Minimisation : 

The learning algorithm should find the hypothesis h ∈ H for which the risk R(h) is minimal:




h^* = \arg \min_{h \in \mathcal{H}} R(h).
R(h) cannot be computed directly because P(x, y) is unknown, so it is approximated by the empirical risk: the average of the loss over the training sample.





R_{\mathrm{emp}}(h) = \frac{1}{n} \sum_{i=1}^n L(h(x_i), y_i).
By the law of large numbers, as n goes to infinity the empirical risk R_emp(h) converges pointwise to R(h).
Empirical risk minimisation picks the hypothesis that minimises R_emp:
\hat{h} = \arg \min_{h \in \mathcal{H}} R_{\mathrm{emp}}(h).
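
A minimal sketch of empirical risk minimisation over a small finite hypothesis class (the threshold classifiers and the toy data below are assumptions made purely for illustration):

# Empirical risk minimisation over a tiny finite hypothesis class.
# Hypotheses are threshold classifiers h_t(x) = +1 if x >= t else -1 (an illustrative choice).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)            # inputs
y = np.where(x >= 0.2, 1, -1)                  # labels from an unknown "true" threshold

def zero_one_loss(pred, y):
    return 0.5 * np.abs(pred - y)

thresholds = np.linspace(-1.0, 1.0, 41)        # the finite hypothesis class H
emp_risk = [zero_one_loss(np.where(x >= t, 1, -1), y).mean() for t in thresholds]

best = thresholds[int(np.argmin(emp_risk))]    # h_hat = argmin_h R_emp(h)
print("selected threshold:", best, "empirical risk:", min(emp_risk))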
 
Consistency is important: as the sample size grows, the hypothesis chosen by minimising the empirical risk should perform as well as the best hypothesis in H.
 
Thus what can be learned depends on the class of functions chosen for h (f in the setting above).
The learned function should have smooth decision boundaries, so a roughness penalty is applied: this is the subject of regularization theory.

Regularized Risk :

R_{\mathrm{reg}}(f) = R_{\mathrm{emp}}(f) + \lambda\, \Omega(f),

where Ω(f) is a roughness penalty on f and λ > 0 controls the trade-off between fitting the data and keeping f smooth.
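
A hedged sketch of a regularized empirical risk with a squared loss and a squared-norm penalty (a ridge-style choice of Ω; the toy data and the value of λ are arbitrary):

# Regularized empirical risk for a linear model: R_reg(w) = R_emp(w) + lambda * ||w||^2
# (a toy sketch; the squared-norm penalty is one common choice of Omega(f))
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

def regularized_risk(w, lam):
    emp = np.mean((X @ w - y) ** 2)            # empirical risk R_emp(w)
    penalty = lam * np.dot(w, w)               # roughness penalty lambda * Omega(w)
    return emp + penalty

# Closed-form minimiser of this regularized risk (ridge regression).
lam = 0.1
w_hat = np.linalg.solve(X.T @ X / len(y) + lam * np.eye(3), X.T @ y / len(y))
print("w_hat:", w_hat, "regularized risk:", regularized_risk(w_hat, lam))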


Risk Bounds 

There exist bounds, depending on H, A_n and δ, such that, for any f ∈ H, with probability at least 1 − δ the true risk is controlled by the empirical risk plus a complexity term.

Consider the case of a finite class of functions, |H| = N. Applying Hoeffding's inequality together with a union bound over the set gives, with probability at least 1 − δ, for all f ∈ H:

R(f) \leq R_{\mathrm{emp}}(f) + \sqrt{\frac{\ln N + \ln(1/\delta)}{2n}}.

The risk is thus bounded by the sum of two terms: the empirical error and a term that depends on the size of the class and of the sample.
As n → ∞ the second term tends to 0.
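
A small sketch of how that second term behaves as the sample grows (N and δ below are arbitrary illustrative values):

# The complexity term sqrt((ln N + ln(1/delta)) / (2n)) shrinks as the sample size n grows.
import math

N, delta = 1000, 0.05                     # illustrative class size and confidence level
for n in (10, 100, 1000, 10000, 100000):
    term = math.sqrt((math.log(N) + math.log(1 / delta)) / (2 * n))
    print(f"n = {n:6d}  bound term = {term:.4f}")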
Bias-Variance Dilemma
If H is large, one can always find an f that fits the training data, but it also fits the noise in the data points, resulting in poor performance on new data.
This is called overfitting.
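
A toy illustration of overfitting (the polynomial degrees and noise level are arbitrary choices): a high-degree polynomial typically fits the training points better yet predicts held-out points worse.

# Overfitting sketch: training error keeps falling with model complexity, test error does not.
import numpy as np

rng = np.random.default_rng(2)
x_train = np.sort(rng.uniform(-1, 1, 20)); y_train = np.sin(3 * x_train) + 0.3 * rng.normal(size=20)
x_test = np.sort(rng.uniform(-1, 1, 200)); y_test = np.sin(3 * x_test) + 0.3 * rng.normal(size=200)

for degree in (1, 3, 15):
    coeffs = np.polyfit(x_train, y_train, degree)             # fit polynomial of given degree
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_err:.3f}  test MSE {test_err:.3f}")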

Overfitting according to Wiki
Generally, a learning algorithm is said to overfit relative to a simpler one if it is more accurate in fitting known data (hindsight) but less accurate in predicting new data (foresight).

VC dimension
The VC dimension measures the capacity of a class of functions: the largest number of points that the class can label (shatter) in every possible way.
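
As a hedged sketch (assuming scipy is available), the code below uses a linear-programming feasibility test to check which labelings a linear classifier in the plane can realise: every labeling of 3 points in general position is achievable, but the XOR labeling of 4 points is not, which is why the VC dimension of linear classifiers in the plane is 3.

# Check linear separability of a labelling via an LP feasibility problem:
# find (w, b) with y_i * (w . x_i + b) >= 1 for all i.
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    # Constraints: -y_i * (w . x_i + b) <= -1, with variables (w1, w2, b).
    A = np.array([[-y * p[0], -y * p[1], -y] for p, y in zip(points, labels)])
    b_ub = -np.ones(len(points))
    res = linprog(c=[0, 0, 0], A_ub=A, b_ub=b_ub, bounds=[(None, None)] * 3)
    return res.status == 0                    # status 0 means a feasible (w, b) was found

three = [(0, 0), (1, 0), (0, 1)]
print(all(separable(three, lab) for lab in itertools.product([-1, 1], repeat=3)))  # True: shattered

four = [(0, 0), (1, 1), (0, 1), (1, 0)]
print(separable(four, [1, 1, -1, -1]))        # False: the XOR labelling is not separable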
 

 


STRUCTURAL RISK MINIMISATION
Structural risk minimization seeks to prevent overfitting by incorporating a regularization penalty into the optimization. The regularization penalty can be viewed as implementing a form of Occam's razor that prefers simpler functions over more complex ones.
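
A hedged sketch of the idea using the finite-class bound from above (the nested threshold classes and their grid sizes are assumptions made for illustration): among nested classes of growing size, pick the one minimising empirical risk plus the capacity term.

# Structural risk minimisation over nested finite classes H_1 ⊂ H_2 ⊂ ...
# H_k = threshold classifiers h_t(x) = sign(x - t) with t on a grid of N_k = 2**k points.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 60)
y = np.where(x >= 0.37, 1, -1)                      # labels from an unknown threshold
n, delta = len(x), 0.05

def srm_score(k):
    thresholds = np.linspace(-1, 1, 2 ** k)         # the finite class H_k, |H_k| = 2**k
    emp = min(np.mean(np.where(x >= t, 1, -1) != y) for t in thresholds)
    bound = np.sqrt((k * np.log(2) + np.log(1 / delta)) / (2 * n))
    return emp + bound                               # empirical risk + capacity term

best_k = min(range(1, 9), key=srm_score)
print("selected class index k:", best_k)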


 
 

For separating hyperplanes (as in the SVM), this amounts to selecting the classifier that maximizes the margin γ.
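
A hedged sketch (assuming scikit-learn; the toy blobs are an assumption): a linear SVM selects, among separating hyperplanes, the one with the largest geometric margin, which in the canonical formulation equals 1/||w||.

# A linear SVM maximizes the margin gamma; for the canonical formulation gamma = 1 / ||w||.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Two well-separated blobs so that a separating hyperplane exists.
X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.6, random_state=0)

clf = SVC(kernel="linear", C=1e6).fit(X, y)      # large C approximates the hard-margin SVM
w = clf.coef_[0]
print("margin gamma ~", 1.0 / np.linalg.norm(w))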



*Formulas taken from the paper by Cunningham:
http://tinyurl.com/ll4jxhq