Probability

Author: Gao

Tags:

Expectations

The average value of some function f(x)f(x) under a probability distribution p(x)p(x) is called the expectation of f(x)f(x).

For a discrete distribution

E[f]=xp(x)f(x)\mathop{\mathbb{E}}[f] = \sum_{x}p(x)f(x)

For a continuous distribution

E[f]=p(x)f(x)dx\mathop{\mathbb{E}}[f] = \int p(x)f(x)dx

If we are given a finite number NN of points drawn from the probability distribution or probability density, the expectation can be approximated as a finite sum over these points. The approximation becomes exact in the limit NN \to \infty:

E[f]1Nn=1Nf(xn)\mathop{\mathbb{E}}[f] \simeq \frac{1}{N}\sum^{N}_{n=1}{f(x_{n})}

Sometimes we will be considering expectations of functions of several variables, in which case we can use a subscript to indicate which variable is being averaged over a function of several variables. So the expectation of the function f(x,y)f(x, y) with respect to the distribution of xx is denoted by

Ex[f(x,y)]\mathop{\mathbb{E}_{x}}[f(x, y)]

Note the expectation will be a function of yy. And we use Ex[fy]\mathop{\mathbb{E}_{x}}[f\mid y] to denote a conditional expectation with repect to a conditional distribution.

Ex[fy]=xp(xy)f(x)=p(xy)f(x)dx\mathop{\mathbb{E}_{x}}[f\mid y] = \sum_{x} p(x\mid y)f(x) = \int{p(x\mid y)f(x)\,dx}

Bias

The bias of an estimator θ^\hat{\theta} is defined as

B(θ^)=E[θ^θ]=E[θ^]θB(\hat{\theta}) = \mathop{\mathbb{E}}[\hat{\theta} - \theta] = \mathop{\mathbb{E}}[\hat{\theta}] - \theta

The estimator θ^\hat{\theta} is an unbiased estimator of θ\theta if and only if B(θ^)=0B(\hat{\theta}) = 0

Covariances

The covariance and variance is defined by

cov(f,g)=Ex,y[(f(x)E[f(x)])(g(y)E[g(y)])]=Ex,y[(f(x)g(y)]Ex,y[f(x)E[g(y)]]Ex,y[g(y)E[f(x)]]+Ex,y[E[f(x)]E[g(y)]]=Ex,y[(f(x)g(y)]E[f(x)]E[g(y)]E[g(y)]E[f(x)]+E[f(x)]E[g(y)]=Ex,y[(f(x)g(y)]E[f(x)]E[g(y)]\begin{equation} \begin{split} cov(f,g) & = \mathop{\mathbb{E}_{x, y}}\big[(f(x) - \mathop{\mathbb{E}}[f(x)])(g(y) - \mathop{\mathbb{E}}[g(y)])\big]\\ & = \mathop{\mathbb{E}_{x, y}}\big[(f(x)g(y)\big] - \mathop{\mathbb{E}_{x, y}}\big[f(x)\mathop{\mathbb{E}}[g(y)]\big] - \mathop{\mathbb{E}_{x, y}}\big[g(y)\mathop{\mathbb{E}}[f(x)]\big] + \mathop{\mathbb{E}_{x, y}}\big[\mathop{\mathbb{E}}[f(x)]\mathop{\mathbb{E}}[g(y)]\big]\\ & = \mathop{\mathbb{E}_{x, y}}[(f(x)g(y)] - \mathop{\mathbb{E}}[f(x)]\mathop{\mathbb{E}}[g(y)] - \mathop{\mathbb{E}}[g(y)]\mathop{\mathbb{E}}[f(x)] + \mathop{\mathbb{E}}[f(x)]\mathop{\mathbb{E}}[g(y)]\\ & = \mathop{\mathbb{E}_{x, y}}[(f(x)g(y)] - \mathop{\mathbb{E}}[f(x)]\mathop{\mathbb{E}}[g(y)]\\ \end{split} \end{equation} var(f)=E[(f(x)E[f(x)])2]var(f) = \mathop{\mathbb{E}}\big[\big(f(x) - \mathop{\mathbb{E}}[f(x)]\big)^{2}\big] var(f)=E[(f(x)2]E[f(x)]2var(f) = \mathop{\mathbb{E}}[(f(x)^{2}] - \mathop{\mathbb{E}}[f(x)]^{2}

If f(x)=xf(x) = x, g(y)=yg(y) = y

var(x)=cov(x,x)=E[x2]E[x]2var(x) = cov(x, x) = \mathop{\mathbb{E}}[x^{2}] - \mathop{\mathbb{E}}[x]^{2} cov(x,y)=Ex,y[xy]E[x]E[y]cov(x, y) = \mathop{\mathbb{E}_{x, y}}[xy] - \mathop{\mathbb{E}}[x]\mathop{\mathbb{E}}[y]

for xRm\textbf{x}\in R^{m} and yRn\textbf{y}\in R^{n}, the result is a matrix.

cov(x,y)=Ex,y[xyT]E[x]E[y]Tcov(\textbf{x}, \textbf{y}) = \mathop{\mathbb{E}_{\textbf{x}, \textbf{y}}}[\textbf{x}\textbf{y}^{T}] - \mathop{\mathbb{E}}[\textbf{x}]\mathop{\mathbb{E}}[\textbf{y}]^{T}

The covariance matrix generalizes the notion of variance to multiple dimensions.

Σ(x)=cov(x,x)=[E[(x1E[x1])(x1E[x1])]E[(x1E[x1])(x2E[x2])]E[(x1E[x1])(xnE[xn])]E[(x2E[x2])(x1E[x1])]E[(x2E[x2])(x2E[x2])]E[(x2E[x2])(xnE[xn])]E[(xnE[xn])(x1E[x1])]E[(xnE[xn])(x2E[x2])]E[(xnE[xn])(xnE[xn])]]\begin{equation} \begin{split} \Sigma(\textbf{x}) & = cov(\textbf{x}, \textbf{x})\\ & = \begin{bmatrix} \mathop{\mathbb{E}}\bigg[\Big(x_{1} - \mathop{\mathbb{E}}[x_{1}]\Big)\Big(x_{1} - \mathop{\mathbb{E}}[x_{1}]\Big)\bigg] & \mathop{\mathbb{E}}\bigg[\Big(x_{1} - \mathop{\mathbb{E}}[x_{1}]\Big)\Big(x_{2} - \mathop{\mathbb{E}}[x_{2}]\Big)\bigg] & \dots & \mathop{\mathbb{E}}\bigg[\Big(x_{1} - \mathop{\mathbb{E}}[x_{1}]\Big)\Big(x_{n} - \mathop{\mathbb{E}}[x_{n}]\Big)\bigg]\\ \mathop{\mathbb{E}}\bigg[\Big(x_{2} - \mathop{\mathbb{E}}[x_{2}]\Big)\Big(x_{1} - \mathop{\mathbb{E}}[x_{1}]\Big)\bigg] & \mathop{\mathbb{E}}\bigg[\Big(x_{2} - \mathop{\mathbb{E}}[x_{2}]\Big)\Big(x_{2} - \mathop{\mathbb{E}}[x_{2}]\Big)\bigg] & \dots & \mathop{\mathbb{E}}\bigg[\Big(x_{2} - \mathop{\mathbb{E}}[x_{2}]\Big)\Big(x_{n} - \mathop{\mathbb{E}}[x_{n}]\Big)\bigg] \\ \vdots & \vdots & \ddots & \vdots \\ \mathop{\mathbb{E}}\bigg[\Big(x_{n} - \mathop{\mathbb{E}}[x_{n}]\Big)\Big(x_{1} - \mathop{\mathbb{E}}[x_{1}]\Big)\bigg] & \mathop{\mathbb{E}}\bigg[\Big(x_{n} - \mathop{\mathbb{E}}[x_{n}]\Big)\Big(x_{2} - \mathop{\mathbb{E}}[x_{2}]\Big)\bigg] & \dots & \mathop{\mathbb{E}}\bigg[\Big(x_{n} - \mathop{\mathbb{E}}[x_{n}]\Big)\Big(x_{n} - \mathop{\mathbb{E}}[x_{n}]\Big)\bigg] \end{bmatrix} \end{split} \end{equation}

Bayesian Probabilities

Bayes's theorem allows us to evaluate the uncertainty in w\textbf{w} in the form of the posterior probability p(wD)p(\textbf{w}\mid \mathcal{D}) after we have incorporated the evidence provided by the observed data D\mathcal{D}.

p(wD)=p(Dw)p(w)p(D)p(\textbf{w}\mid \mathcal{D}) = \frac{p(\mathcal{D}\mid\textbf{w})p(\textbf{w})}{p(\mathcal{D})}

The quantity p(Dw)p(\mathcal{D}\mid\textbf{w}) is called the likelihood function. It expresses how probable the observed data ser is for the specified parameter vector w\textbf{w}. p(w)p(\textbf{w}) is the prior probability distribution of w\textbf{w} before observing the data. We can state Bayes's theorem in words

posteriorlikelihood×priorposterior \propto likelihood \times prior

The denominator is the normalization constant. which ensures that the posterior distribution integrates to one.

p(D)=p(Dw)p(w)dwp(\mathcal{D}) = \int p(\mathcal{D}\mid\textbf{w})p(\textbf{w})d\textbf{w}

In a frequentist setting, w\textbf{w} is determined by some form of "estimator". A widely used one is maximum likelihood, in which w\textbf{w} is set to the value that maximizes the likelihood function p(Dw)p(\mathcal{D}\mid\textbf{w}).

The Gaussian Distribution

For the case of a single real-value variable xx, the Gaussian distribution is defined by

N(xμ,σ2)=1(2πσ2)1/2exp{12σ2(xμ)2}\mathcal{N}(x\mid\mu, \sigma^{2}) = \frac{1}{(2\pi\sigma^{2})^{1/2}}exp\left\{-\frac{1}{2\sigma^{2}}(x - \mu)^{2}\right\} E[x]=μ="mean"\mathop{\mathbb{E}}[x] = \mu = \text{"mean"} E[x2]=μ2+σ2\mathop{\mathbb{E}}[x^{2}] = \mu^{2} + \sigma^{2} var[x]=E[x2]E[x]2=σ2="variance"var[x] = \mathop{\mathbb{E}}[x^{2}] - \mathop{\mathbb{E}}[x]^{2} = \sigma^{2} =\text{"variance"}

The reciprocal of the variance, written as β=1σ2\beta = \frac{1}{\sigma^{2}}, is called the precision.

N\mathcal{N} defined over a D-dimensional vector x of continuous variables with the covariance Σ\Sigma is given by

N(xμ,Σ)=1(2π)D/21Σ1/2exp{12(xμ)TΣ1(xμ)}\mathcal{N}(x\mid\mu, \Sigma) = \frac{1}{(2\pi)^{D/2}}\frac{1}{\left|\Sigma\right|^{1/2}}exp\left\{-\frac{1}{2}(x - \mu)^{T}\Sigma^{-1}(x - \mu)\right\}

Suppose that we have a data set of N observation that are independent and identically distributed X\textbf{X}, the using the fact that the joint probability of two independet events is given by the product of marginal probabilities. The probability of the data set, given μ\mu and σ2\sigma^{2} is

p(Xμ,σ2)=n=1NN(xnμ,σ2)p(\textbf{X}\mid\mu, \sigma^{2}) = \prod^{N}_{n=1}\mathcal{N}(x_{n}|\mu, \sigma^{2})

Taking the log of the likelihhod function, results in the form

lnp(Xμ,σ2)=12σ2n=1N(xnμ)2N2lnσ2N2ln(2π)\begin{equation} ln\,p(\textbf{X}\mid\mu, \sigma^{2}) = -\frac{1}{2\sigma^{2}}\sum^{N}_{n=1}(x_{n} - \mu)^{2} - \frac{N}{2} ln\,\sigma^{2} - \frac{N}{2} ln\,(2\pi) \end{equation}

Sample Mean

Maximizing eq(3)eq(3) with respect to μ\mu, we obtain the maximum likelihood solution

μ^ML=arg maxμeq(3)    μeq(3)=0    n=1Nxn=Nμ=1Nn=1Nxn\begin{split} \hat{\mu}_{ML} &= \argmax_{\mu} eq(3) \implies\frac{\partial}{\partial\mu}eq(3) = 0 \implies \sum^{N}_{n=1}x_{n} = N\mu\\ &= \frac{1}{N}\sum^{N}_{n=1}x_{n} \end{split}

The μ^ML\hat{\mu}_{ML} is unbiased and it is often written as μ^\hat{\mu} which is called the sample mean.

E[μ^]=E[1Nn=1Nxn]=1Nn=1NE[xn]=μ\mathop{\mathbb{E}}[\hat{\mu}] = \mathop{\mathbb{E}}\bigg[\frac{1}{N}\sum^{N}_{n=1}x_{n}\bigg] = \frac{1}{N}\sum^{N}_{n=1}\mathop{\mathbb{E}}[x_{n}] = \mu

Sample Variance

Similarly, maximizing eq(3)eq(3) with respect to σ2\sigma^{2} , we obtain the maximum likelihood solution for the variance

σ^ML2=argmaxσ2eq(3)    σ2eq(3)=0    12(σ2)2n=1N(xnμ)2=N2σ2=1Nn=1N(xnμ^)2\begin{split} \hat{\sigma}^{2}_{ML} &= \arg\max_{\sigma^{2}}eq(3) \implies\frac{\partial}{\partial\sigma^{2}}eq(3) = 0 \implies \frac{1}{2(\sigma^{2})^{2}}\sum^{N}_{n=1}(x_{n} - \mu)^{2} = \frac{N}{2\sigma^{2}}\\ &= \frac{1}{N}\sum^{N}_{n=1}(x_{n} - \hat{\mu})^{2} \end{split}

σ^ML2\hat{\sigma}^{2}_{ML} is called biased sample variance and it is often written as s~2\tilde{s}^{2}

The maximum likelihood approach underestimates the variance of the distribution by a factor (N1)/N(N - 1) / N.

E[s~2]=E[1Ni=1N(xi1Nj=1Nxj)2]=1ni=1NE[xi22Nxij=1Nxj+1N2j=1Nxjk=1Nxk]=1ni=1N[N2NE[xi2]2NjiNE[xixj]+1N2j=1NkjNE[xjxk]+1N2j=1NE[xj2]]=1ni=1N[N2N(μ2+σ2)2N(N1)μ2+1N2N(N1)μ2+1N(μ2+σ2)]=N1Nσ2\begin{split} \mathop{\mathbb{E}}[\tilde{s}^{2}] & = \mathop{\mathbb{E}}\bigg[\frac{1}{N}\sum^{N}_{i=1}(x_{i} - \frac{1}{N}\sum^{N}_{j=1}x_{j})^{2}\bigg]\\ & = \frac{1}{n}\sum^{N}_{i=1}\mathop{\mathbb{E}}\bigg[x_{i}^{2} - \frac{2}{N}x_{i}\sum^{N}_{j=1}x_{j} + \frac{1}{N^{2}}\sum^{N}_{j=1}x_{j}\sum^{N}_{k=1}x_{k}\bigg]\\ & = \frac{1}{n}\sum^{N}_{i=1}\bigg[\frac{N-2}{N}\mathop{\mathbb{E}}[x_{i}^{2}] - \frac{2}{N}\sum^{N}_{j\neq i}\mathop{\mathbb{E}}[x_{i}x_{j}] + \frac{1}{N^{2}}\sum^{N}_{j=1}\sum^{N}_{k\neq j}\mathop{\mathbb{E}}[x_{j}x_{k}] + \frac{1}{N^{2}}\sum^{N}_{j=1}\mathop{\mathbb{E}}[x_{j}^{2}]\bigg]\\ & = \frac{1}{n}\sum^{N}_{i=1}\bigg[\frac{N-2}{N}(\mu^{2} + \sigma^{2}) - \frac{2}{N}(N - 1)\mu^{2} + \frac{1}{N^{2}}N(N - 1)\mu^{2} + \frac{1}{N}(\mu^{2} + \sigma^{2})\bigg]\\ & = \frac{N-1}{N}\sigma^{2}\\ \end{split}

It follows that the following estimate for the variance parameter is unbiased

s2=NN1s~2=1N1n=1N(xnμ^)2E[s2]=E[NN1s~2]=NN1E[s~2]=NN1N1Nσ2=σ2\begin{split} s^{2} &= \frac{N}{N-1}\tilde{s}^{2} = \frac{1}{N-1}\sum^{N}_{n=1}(x_{n} - \hat{\mu})^{2} \\ \mathop{\mathbb{E}}[s^{2}] &= \mathop{\mathbb{E}}[\frac{N}{N-1}\tilde{s}^{2}] = \frac{N}{N-1}\mathop{\mathbb{E}}[\tilde{s}^{2}] = \frac{N}{N-1}\frac{N-1}{N}\sigma^{2} = \sigma^{2} \\ \end{split}

s2s^{2} is called the unbiased sample variance.

The bias of the maximum likelihood solution becomes less significant as the number N of data points icreases, and in the limit NN \to \infty the maximum likelihood solution for the variance equals the true variance of the distribution that generated the data.

Regression

Suppose we observe a real-valued input variable xx and we wish to use this observation to predict the value of a real-valued target variable tt. Now suppose that we are given a training set comprising NN observations of xx, written x1,...,xNx_1,...,x_N, together with corresponding observations of the values of tt, denoted t1,...,tNt_1,...,t_N. Individual observations are corrupted by random noise. This noise might arise from intrinsically stochastic (i.e. random) processes such as radioactive decay but more typically is due to there being sources of variability that are themselves unobserved.

Our goal is to exploit this training set in order to make predictions of the value t^\hat{t} of the target variable for some new value x^{\hat{x}} of the input variable. This involves implicitly trying to discover the underlying function that generated the data. This is intrinsically a difficult problem as we have to generalize from a finite data set. Furthermore the observed data are corrupted with noise, and so for a given x^\hat{x} there is uncertainty as to the appropriate value for t^\hat{t}.

Polynomial Curve Fitting

If we fit the data using a polynomial function of the form

y(x,w)=w0+w1x+w2x2+...+wMxM=j=0Mwjxjy(x, \mathbf{w}) = w_0 + w_1 x + w_2 x_2 + ... + w_M x_M = \sum_{j=0}^{M} w_j x^j

where MM is the order of the polynomial, larger values of MM are becoming increasingly tuned to the random noise on the target values. The polynomial coefficients w0,...,wMw_0,...,w_M are collectively denoted by the vector w\mathbf{w}. Although the polynomial function y(x,w)y(x, \mathbf{w}) is a nonlinear function of xx, it is a linear function of the coefficients w\mathbf{w}. The values of the coefficients will be determined by fitting the polynomial to the training data. This can be done by minimizing an cost function that measures the error between the function y(x,w)y(x, \mathbf{w}), for any given value of w\mathbf{w}, and the training set data points. A widely used cost function is

J(w)=12i=1N(y(xi,w)ti)2\begin{equation} J(\mathbf{w}) = \frac{1}{2}\sum^{N}_{i=1}{(y(x_i, \mathbf{w}) − t_i)^2} \end{equation}

One technique to control the over-fitting phenomenon is that of regularization, which involves adding a penalty term to the cost function eq(4) in order to discourage the coefficients from reaching large values, leading to a modified error function

J~(w)=J(w)+λ2w2\begin{equation} \tilde{J}(\mathbf{w}) = J(\mathbf{w}) + \frac{\lambda}{2} ||\mathbf{w}||^2 \end{equation}

We assume that given the value of xx, the corresponding value of tt has a Gaussian distribution with a mean equal to the value y(x,w)y(x,\textbf{w}). Thus we can express our uncertainty over the value of the target variable using a probability distribution.

p(tx,w,β)=N(ty(x,w),β1)p(t\mid x, \textbf{w}, \beta) = \mathcal{N}(t\mid y(x, \textbf{w}), \beta^{-1})

We use the training data {x,w}\{\textbf{x},\textbf{w}\} to determine the values of the unknown parameters w\textbf{w} and β\beta by maximum likelihood. If the data are assumed to be drawn independently from the distribution, then the likelihood function is given by

p(tx,w,β)=n=1NN(tny(xn,w),β1)p(\textbf{t}\mid\textbf{x},\textbf{w},\beta) = \prod^{N}_{n=1}\mathcal{N}(t_{n}\mid y(x_{n}, \textbf{w}), \beta^{-1})

Substituting for the form of the Gaussian distribution, and take the logarithm

lnp(tx,w,β)=β2n=1N{y(xn,w)tn}2+N2lnβN2ln(2π)ln\,p(\textbf{t}\mid\textbf{x},\textbf{w},\beta) = -\frac{\beta}{2}\sum^{N}_{n=1}\mathcal\{y(x_{n}, \textbf{w}) - t_{n}\}^{2} + \frac{N}{2}ln\,\beta - \frac{N}{2}ln\,(2\pi)

Maximizes likelihood with respect to w\textbf{w} we can obtain the sum-of-squares-error-function defined by eq(4). Maximizing likehood with respect to β\beta gives

1β^=1Nn=1N{y(xn,w^)tn}2\frac{1}{\hat{\beta}} = \frac{1}{N}\sum^{N}_{n=1}\{y(x_{n}, \hat{\textbf{w}}) - t_{n}\}^{2}

Having determined the parameters w\textbf{w} and β\beta, we can express our probabilistic model in terms of the predictive distribution that gives the probability distribution over t.

p(tx,w^,β^)=N(ty(x,w^),β^1)p(t\mid x,\hat{\textbf{w}},\hat{\beta}) = \mathcal{N}(t\mid y(x, \hat{\textbf{w}}), \hat{\beta}^{-1})

We introduce a prior distribution over the polynomial coefficients w\textbf{w}. Here use a Gaussian distribution just for simplicity.

p(wα)=N(w0,α1I)=(α2π)(M+1)/2exp{α2wTw}p(\textbf{w}\mid\alpha) = \mathcal{N}(\textbf{w}\mid 0, \alpha^{-1}\textbf{I}) = \Big(\frac{\alpha}{2\pi}\Big)^{(M + 1) / 2} exp\big\{-\frac{\alpha}{2}\textbf{w}^{T}\textbf{w}\big\}

Where α\alpha is the precision of the distribution, and M+1M + 1 is the total number of elements in the vector w\textbf{w} for an MthM^{th} oder polynomial

Variables such as α\alpha, which control the distribution of model parameters, are called hyperparameters. Using Bayes's theorem

p(wx,t,α,β)p(tx,w,β)p(wα)\begin{equation} p(\textbf{w}\mid\textbf{x}, \textbf{t}, \alpha, \beta) \propto p(\textbf{t}\mid\textbf{x}, \textbf{w}, \beta)p(\textbf{w}\mid\alpha) \end{equation}

We can now determine w\textbf{w} by finding the most probable value of w\textbf{w} by maximizing the posterior distribution. Taking the negative logarithm of eq(6), we find that the maximum of the posterior is given by the minimum of

β2n=1N{y(xn,w^)tn}2+α2wTw\frac{\beta}{2}\sum^{N}_{n=1}\{y(x_{n}, \hat{\textbf{w}}) - t_{n}\}^{2} + \frac{\alpha}{2}\textbf{w}^{T}\textbf{w}

eq(5) is equivalent to minimize above equation with a regularization parameter given λ=αβ\lambda = \frac{\alpha}{\beta}

References

  • Pattern Recognition and Machine Learning, Chapter 1.1 Example: Polynomial Curve Fitting
  • Pattern Recognition and Machine Learning, Chapter 1.2 Example: Probability Theory