The average value of some function f(x) under a probability distribution p(x) is called the expectation of f(x).
For a discrete distribution
E[f] = \sum_x p(x) f(x)
For a continuous distribution
E[f] = \int p(x) f(x) \, dx
If we are given a finite number N of points drawn from the probability distribution or probability density, the expectation can be approximated as a finite sum over these points. The approximation becomes exact in the limit N \to \infty:
E[f] \simeq \frac{1}{N} \sum_{n=1}^{N} f(x_n)
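As a quick numerical check, here is a minimal NumPy sketch; the choice of f(x) = x² and a standard normal p(x) is an arbitrary illustration, not from the text.

```python
import numpy as np

# Approximate E[f] for f(x) = x**2 under a standard normal p(x).
# The exact value is Var[x] + E[x]**2 = 1.
rng = np.random.default_rng(0)

def f(x):
    return x ** 2

for N in (10, 1_000, 100_000):
    samples = rng.normal(loc=0.0, scale=1.0, size=N)   # draws from p(x)
    estimate = np.mean(f(samples))                      # (1/N) * sum_n f(x_n)
    print(f"N={N:>6d}  E[f] ~ {estimate:.4f}  (exact: 1.0)")
```

The estimate fluctuates for small N and settles near the exact value as N grows, as the limit above suggests.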
Sometimes we will be considering expectations of functions of several variables, in which case we can use a subscript to indicate which variable is being averaged over. So the expectation of the function f(x, y) with respect to the distribution of x is denoted by
E_x[f(x, y)]
Note that this expectation is a function of y. We use E_x[f ∣ y] to denote a conditional expectation with respect to a conditional distribution,
E_x[f \mid y] = \sum_x p(x \mid y) f(x)
for a discrete distribution, or the corresponding integral \int p(x \mid y) f(x) \, dx in the continuous case.
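A small discrete sketch of this formula; the table p(x, y) and the values of f are made up for illustration.

```python
import numpy as np

# Hypothetical joint distribution p(x, y) over x in {0,1,2}, y in {0,1}.
p_xy = np.array([[0.10, 0.20],
                 [0.30, 0.10],
                 [0.15, 0.15]])          # rows index x, columns index y
f_x = np.array([1.0, 2.0, 4.0])          # f evaluated at each value of x

p_y = p_xy.sum(axis=0)                   # marginal p(y)
p_x_given_y = p_xy / p_y                 # p(x|y), one column per value of y

# E_x[f | y] = sum_x p(x|y) f(x): one number per value of y, i.e. a function of y
cond_expectation = f_x @ p_x_given_y
print(cond_expectation)                  # array of shape (2,)
```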
Bias
The bias of an estimator θ̂ is defined as
B(\hat{\theta}) = E[\hat{\theta} - \theta] = E[\hat{\theta}] - \theta
The estimator θ̂ is an unbiased estimator of θ if and only if B(θ̂) = 0.
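A minimal simulation sketch of this definition; the Uniform(0, θ) data and the sample-maximum estimator are illustrative choices, not from the text.

```python
import numpy as np

# Estimate B(theta_hat) = E[theta_hat] - theta by simulation.
# Data: N draws from Uniform(0, theta). Estimator: the sample maximum.
# Its expectation is theta * N / (N + 1), so the bias is -theta / (N + 1).
rng = np.random.default_rng(0)
theta, N, trials = 5.0, 10, 200_000

estimates = rng.uniform(0.0, theta, size=(trials, N)).max(axis=1)
print("simulated bias  :", estimates.mean() - theta)   # ~ -theta/(N+1) = -0.4545
print("theoretical bias:", -theta / (N + 1))
```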
Bayes's theorem allows us to evaluate the uncertainty in w in the form of the posterior probability p(w ∣ D) after we have incorporated the evidence provided by the observed data D.
p(w \mid D) = \frac{p(D \mid w) \, p(w)}{p(D)}
The quantity p(D ∣ w) is called the likelihood function. It expresses how probable the observed data set is for the specified parameter vector w. p(w) is the prior probability distribution of w before observing the data. We can state Bayes's theorem in words:
posterior ∝ likelihood × prior
The denominator is the normalization constant, which ensures that the posterior distribution integrates to one.
p(D) = \int p(D \mid w) \, p(w) \, dw
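As an illustration of the theorem and its normalization constant, here is a sketch that evaluates a posterior on a grid; the coin-flip likelihood and uniform prior are assumptions made for the example.

```python
import numpy as np

# Hypothetical example: w is the unknown probability of heads for a coin and
# D is "7 heads in 10 tosses". The prior over w is uniform on [0, 1], and the
# posterior is evaluated on a grid, so the integral p(D) becomes a sum.
w = np.linspace(0.0, 1.0, 1001)
dw = w[1] - w[0]
prior = np.ones_like(w)                              # uniform prior density p(w) = 1

heads, tosses = 7, 10
likelihood = w**heads * (1.0 - w)**(tosses - heads)  # p(D|w), up to a constant factor

evidence = np.sum(likelihood * prior) * dw           # p(D) = ∫ p(D|w) p(w) dw
posterior = likelihood * prior / evidence            # now integrates (numerically) to one

print("posterior mean of w:", np.sum(w * posterior) * dw)   # close to (7+1)/(10+2) ≈ 0.667
```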
In a frequentist setting, w is determined by some form of "estimator". A widely used one is maximum likelihood, in which w is set to the value that maximizes the likelihood function p(D ∣ w).
The Gaussian Distribution
For the case of a single real-valued variable x, the Gaussian distribution is defined by
N(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \right\}
The reciprocal of the variance, written as β = 1/σ², is called the precision. The Gaussian distribution defined over a D-dimensional vector x of continuous variables, with mean μ and covariance Σ, is given by
N(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}
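A minimal sketch that evaluates this density directly with NumPy; the 2-dimensional mean and covariance values are arbitrary.

```python
import numpy as np

# Evaluate the D-dimensional Gaussian density from the formula above.
def gaussian_pdf(x, mu, Sigma):
    D = mu.shape[0]
    diff = x - mu
    norm = 1.0 / ((2.0 * np.pi) ** (D / 2) * np.linalg.det(Sigma) ** 0.5)
    quad = diff @ np.linalg.solve(Sigma, diff)     # (x-mu)^T Sigma^{-1} (x-mu)
    return norm * np.exp(-0.5 * quad)

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([0.5, 0.5])
print(gaussian_pdf(x, mu, Sigma))
# Cross-check against scipy, if available:
# from scipy.stats import multivariate_normal
# print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))
```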
Suppose that we have a data set X of N observations that are independent and identically distributed. Using the fact that the joint probability of independent events is given by the product of the marginal probabilities, the probability of the data set, given μ and σ², is
p(X \mid \mu, \sigma^2) = \prod_{n=1}^{N} N(x_n \mid \mu, \sigma^2)
Taking the log of the likelihood function results in
\ln p(X \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi)
Maximizing with respect to μ and σ² gives the maximum likelihood solutions
\mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad \sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2
The bias of the maximum likelihood solution becomes less significant as the number N of data points increases, and in the limit N \to \infty the maximum likelihood solution for the variance equals the true variance of the distribution that generated the data.
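A short simulation sketch of this point (the true mean, variance, and sample size are arbitrary choices); it shows that the 1/N variance estimate is low by a factor of (N − 1)/N on average.

```python
import numpy as np

# Maximum likelihood estimates for a Gaussian, repeated over many data sets,
# to expose the downward bias of the 1/N variance estimate for small N.
rng = np.random.default_rng(1)
mu_true, sigma2_true, N, trials = 0.0, 4.0, 5, 200_000

X = rng.normal(mu_true, np.sqrt(sigma2_true), size=(trials, N))
mu_ml = X.mean(axis=1)
sigma2_ml = ((X - mu_ml[:, None]) ** 2).mean(axis=1)   # 1/N, the ML estimate

print("E[sigma2_ml]      :", sigma2_ml.mean())          # ~ (N-1)/N * 4 = 3.2
print("(N-1)/N * sigma^2 :", (N - 1) / N * sigma2_true)
```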
Regression
Suppose we observe a real-valued input variable x and we wish to use this observation to predict the value of a real-valued target variable t. Now suppose that we are given a training set comprising N observations of x, written x_1, ..., x_N, together with corresponding observations of the values of t, denoted t_1, ..., t_N. Individual observations are corrupted by random noise. This noise might arise from intrinsically stochastic (i.e. random) processes such as radioactive decay, but more typically is due to there being sources of variability that are themselves unobserved.
Our goal is to exploit this training set in order to make predictions of the value t̂ of the target variable for some new value x̂ of the input variable. This involves implicitly trying to discover the underlying function that generated the data. This is intrinsically a difficult problem, as we have to generalize from a finite data set. Furthermore, the observed data are corrupted with noise, and so for a given x̂ there is uncertainty as to the appropriate value for t̂.
Polynomial Curve Fitting
If we fit the data using a polynomial function of the form
y(x, w) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M = \sum_{j=0}^{M} w_j x^j
where M is the order of the polynomial, then for larger values of M the fitted coefficients become increasingly tuned to the random noise on the target values. The polynomial coefficients w_0, ..., w_M are collectively denoted by the vector w. Although the polynomial function y(x, w) is a nonlinear function of x, it is a linear function of the coefficients w. The values of the coefficients will be determined by fitting the polynomial to the training data. This can be done by minimizing a cost function that measures the error between the function y(x, w), for any given value of w, and the training set data points. A widely used cost function is
J(w) = \frac{1}{2} \sum_{i=1}^{N} \left( y(x_i, w) - t_i \right)^2
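A minimal sketch of this fit using a design matrix and ordinary least squares; the synthetic sin(2πx) data, noise level, and choice of M are assumptions for illustration.

```python
import numpy as np

# Fit the polynomial by minimizing J(w): build the design matrix with columns
# 1, x, x^2, ..., x^M and solve the least-squares problem.
rng = np.random.default_rng(2)
N, M = 10, 3
x = np.linspace(0.0, 1.0, N)
t = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.2, size=N)

Phi = np.vander(x, M + 1, increasing=True)         # Phi[n, j] = x_n**j
w_hat, *_ = np.linalg.lstsq(Phi, t, rcond=None)    # minimizes the sum of squared errors

def y(x_new, w):
    return np.vander(np.atleast_1d(x_new), len(w), increasing=True) @ w

print("fitted coefficients:", w_hat)
print("prediction at x=0.5:", y(0.5, w_hat))
```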
One technique to control the over-fitting phenomenon is that of regularization, which involves adding a penalty term to the cost function J(w) in order to discourage the coefficients from reaching large values, leading to a modified error function
\tilde{J}(w) = J(w) + \frac{\lambda}{2} \|w\|^2
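A sketch of minimizing this regularized cost in closed form, w = (λI + ΦᵀΦ)⁻¹ Φᵀ t; the data and the value of λ are illustrative, and unlike some treatments w_0 is regularized here as well.

```python
import numpy as np

# Regularized least squares: setting the gradient of J~(w) to zero gives the
# linear system (lambda*I + Phi^T Phi) w = Phi^T t.
rng = np.random.default_rng(2)
N, M, lam = 10, 9, 1e-3
x = np.linspace(0.0, 1.0, N)
t = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.2, size=N)

Phi = np.vander(x, M + 1, increasing=True)
w_reg = np.linalg.solve(lam * np.eye(M + 1) + Phi.T @ Phi, Phi.T @ t)
print("regularized coefficients:", w_reg)
```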
We assume that, given the value of x, the corresponding value of t has a Gaussian distribution with a mean equal to the value y(x, w). Thus we can express our uncertainty over the value of the target variable using a probability distribution,
p(t \mid x, w, \beta) = N(t \mid y(x, w), \beta^{-1})
We use the training data {x, t} to determine the values of the unknown parameters w and β by maximum likelihood. If the data are assumed to be drawn independently from the distribution, then the likelihood function is given by
p(t \mid x, w, \beta) = \prod_{n=1}^{N} N(t_n \mid y(x_n, w), \beta^{-1})
Substituting for the form of the Gaussian distribution and taking the logarithm gives
\ln p(t \mid x, w, \beta) = -\frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2 + \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi)
Maximizing the likelihood with respect to w is equivalent to minimizing the sum-of-squares error function J(w) defined above. Maximizing the likelihood with respect to β gives
\frac{1}{\hat{\beta}} = \frac{1}{N} \sum_{n=1}^{N} \{ y(x_n, \hat{w}) - t_n \}^2
Having determined the parameters w and β, we can express our probabilistic model in terms of the predictive distribution that gives the probability distribution over t,
p(t \mid x, \hat{w}, \hat{\beta}) = N(t \mid y(x, \hat{w}), \hat{\beta}^{-1})
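A sketch that estimates β̂ from the fitted residuals and reports the predictive mean and standard deviation at a new input; the synthetic data and the query point are illustrative.

```python
import numpy as np

# After the least-squares fit, 1/beta_hat is the mean squared residual, and the
# predictive distribution at a new x is Gaussian with mean y(x, w_hat) and
# variance 1/beta_hat.
rng = np.random.default_rng(2)
N, M = 10, 3
x = np.linspace(0.0, 1.0, N)
t = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.2, size=N)

Phi = np.vander(x, M + 1, increasing=True)
w_hat, *_ = np.linalg.lstsq(Phi, t, rcond=None)

residuals = Phi @ w_hat - t
beta_hat = 1.0 / np.mean(residuals ** 2)

x_new = 0.25
mean = (np.vander([x_new], M + 1, increasing=True) @ w_hat)[0]
print("predictive mean:", mean)
print("predictive std :", beta_hat ** -0.5)
```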
We introduce a prior distribution over the polynomial coefficients w. Here we use a Gaussian distribution for simplicity,
p(w \mid \alpha) = N(w \mid 0, \alpha^{-1} I) = \left( \frac{\alpha}{2\pi} \right)^{(M+1)/2} \exp\left\{ -\frac{\alpha}{2} w^T w \right\}
where α is the precision of the distribution, and M + 1 is the total number of elements in the vector w for an Mth-order polynomial.
Variables such as α, which control the distribution of model parameters, are called hyperparameters. Using Bayes's theorem,
p(w \mid x, t, \alpha, \beta) \propto p(t \mid x, w, \beta) \, p(w \mid \alpha)
We can now determine w by finding its most probable value, that is, by maximizing the posterior distribution. Taking the negative logarithm of the posterior, we find that the maximum of the posterior is given by the minimum of
\frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2 + \frac{\alpha}{2} w^T w
Minimizing this expression is therefore equivalent to minimizing the regularized error function \tilde{J}(w), with a regularization parameter given by λ = α/β.
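A short numerical check of this equivalence (the data, α, and β are arbitrary choices): the MAP solution and the regularized least-squares solution with λ = α/β coincide.

```python
import numpy as np

# MAP for the Gaussian prior/likelihood model vs. regularized least squares.
# Minimizing (beta/2)*sum-of-squares + (alpha/2)*w^T w gives the linear system
# (alpha*I + beta*Phi^T Phi) w = beta*Phi^T t, which matches ridge with lambda = alpha/beta.
rng = np.random.default_rng(3)
N, M = 10, 3
alpha, beta = 0.5, 25.0
x = np.linspace(0.0, 1.0, N)
t = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.2, size=N)

Phi = np.vander(x, M + 1, increasing=True)

# MAP solution
w_map = np.linalg.solve(alpha * np.eye(M + 1) + beta * Phi.T @ Phi, beta * Phi.T @ t)

# Regularized least squares with lambda = alpha / beta
lam = alpha / beta
w_reg = np.linalg.solve(lam * np.eye(M + 1) + Phi.T @ Phi, Phi.T @ t)

print(np.allclose(w_map, w_reg))    # True
```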