Probability

Expectations

The average value of some function $f(x)$ under a probability distribution $p(x)$ is called the expectation of $f(x)$ .

For a discrete distribution

\mathop{\mathbb{E}}[f] = \sum_{x}p(x)f(x)

For a continuous distribution

\mathop{\mathbb{E}}[f] = \int p(x)f(x)dx

If we are given a finite number $N$ of points drawn from the probability distribution or probability density, the expectation can be approximated as a finite sum over these points. The approximation becomes exact in the limit $N \to \infty$ :

\mathop{\mathbb{E}}[f] \simeq \frac{1}{N}\sum^{N}_{n=1}{f(x_{n})}
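
The finite-sum approximation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper name `monte_carlo_expectation`, the choice of a standard normal for $p(x)$ and the choice $f(x) = x^{2}$ are all assumptions made for the example. For that choice the exact expectation is 1, so the printed estimates should approach 1 as $N$ grows.

```python
import random

def monte_carlo_expectation(f, sampler, n_samples):
    """Approximate E[f] by averaging f over n_samples draws from sampler()."""
    return sum(f(sampler()) for _ in range(n_samples)) / n_samples

rng = random.Random(0)  # fixed seed so the run is repeatable

def draw():
    # One draw from a standard normal (assumed p(x) for this example).
    return rng.gauss(0.0, 1.0)

def square(x):
    return x * x

# For a standard normal, E[x^2] = 1 exactly.
for n in (100, 10_000, 200_000):
    print(n, monte_carlo_expectation(square, draw, n))
```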

Sometimes we will be considering expectations of functions of several variables, in which case we can use a subscript to indicate which variable is being averaged over. For example, the expectation of the function $f(x, y)$ with respect to the distribution of $x$ is denoted by

$\mathop{\mathbb{E}_{x}}[f(x, y)]$

Note that this expectation is a function of $y$ . We use $\mathop{\mathbb{E}_{x}}[f\mid y]$ to denote a conditional expectation with respect to a conditional distribution. For a discrete $x$ it is a sum, and for a continuous $x$ it is an integral:

\mathop{\mathbb{E}_{x}}[f\mid y] = \sum_{x} p(x\mid y)f(x) \quad\text{or}\quad \mathop{\mathbb{E}_{x}}[f\mid y] = \int{p(x\mid y)f(x)\,dx}
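
For the discrete case, the conditional expectation can be computed directly from a joint probability table. The sketch below assumes a small, made-up joint distribution `joint` over $x \in \{0, 1, 2\}$ and $y \in \{a, b\}$ ; the table values and the helper name `conditional_expectation` are hypothetical.

```python
# Hypothetical joint distribution p(x, y); the probabilities sum to one.
joint = {
    (0, "a"): 0.10, (1, "a"): 0.20, (2, "a"): 0.10,
    (0, "b"): 0.30, (1, "b"): 0.10, (2, "b"): 0.20,
}

def conditional_expectation(f, joint, y):
    """Return E_x[f | y] = sum_x p(x | y) f(x) for a discrete joint table."""
    # Marginal p(y), used to turn joint probabilities into p(x | y).
    p_y = sum(p for (_, y_i), p in joint.items() if y_i == y)
    return sum(f(x_i) * p / p_y for (x_i, y_i), p in joint.items() if y_i == y)

def identity(x):
    return x

# The result changes with y, which is why the expectation is a function of y.
for y in ("a", "b"):
    print(y, conditional_expectation(identity, joint, y))
```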

Bias

The bias of an estimator $\hat{\theta}$ is defined as

B(\hat{\theta}) = \mathop{\mathbb{E}}[\hat{\theta} - \theta] = \mathop{\mathbb{E}}[\hat{\theta}] - \theta

The estimator $\hat{\theta}$ is an unbiased estimator of $\theta$ if and only if $B(\hat{\theta}) = 0$ .

Covariances

The covariance and variance are defined by

\begin{equation} \begin{split} cov(f,g) & = \mathop{\mathbb{E}_{x, y}}\big[(f(x) - \mathop{\mathbb{E}}[f(x)])(g(y) - \mathop{\mathbb{E}}[g(y)])\big]\\ & = \mathop{\mathbb{E}_{x, y}}\big[f(x)g(y)\big] - \mathop{\mathbb{E}_{x, y}}\big[f(x)\mathop{\mathbb{E}}[g(y)]\big] - \mathop{\mathbb{E}_{x, y}}\big[g(y)\mathop{\mathbb{E}}[f(x)]\big] + \mathop{\mathbb{E}_{x, y}}\big[\mathop{\mathbb{E}}[f(x)]\mathop{\mathbb{E}}[g(y)]\big]\\ & = \mathop{\mathbb{E}_{x, y}}[f(x)g(y)] - \mathop{\mathbb{E}}[f(x)]\mathop{\mathbb{E}}[g(y)] - \mathop{\mathbb{E}}[g(y)]\mathop{\mathbb{E}}[f(x)] + \mathop{\mathbb{E}}[f(x)]\mathop{\mathbb{E}}[g(y)]\\ & = \mathop{\mathbb{E}_{x, y}}[f(x)g(y)] - \mathop{\mathbb{E}}[f(x)]\mathop{\mathbb{E}}[g(y)] \end{split} \end{equation}

var(f) = \mathop{\mathbb{E}}\big[\big(f(x) - \mathop{\mathbb{E}}[f(x)]\big)^{2}\big]

var(f) = \mathop{\mathbb{E}}[f(x)^{2}] - \mathop{\mathbb{E}}[f(x)]^{2}

If $f(x) = x$ and $g(y) = y$ , these reduce to

var(x) = cov(x, x) = \mathop{\mathbb{E}}[x^{2}] - \mathop{\mathbb{E}}[x]^{2}

cov(x, y) = \mathop{\mathbb{E}_{x, y}}[xy] - \mathop{\mathbb{E}}[x]\mathop{\mathbb{E}}[y]

For $\textbf{x}\in \mathbb{R}^{m}$ and $\textbf{y}\in \mathbb{R}^{n}$ , the covariance is an $m \times n$ matrix.

cov(\textbf{x}, \textbf{y}) = \mathop{\mathbb{E}_{\textbf{x}, \textbf{y}}}[\textbf{x}\textbf{y}^{T}] - \mathop{\mathbb{E}}[\textbf{x}]\mathop{\mathbb{E}}[\textbf{y}]^{T}

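The covariance matrix generalizes the notion of variance to multiple dimensions. For a vector $\textbf{x}$ with components $x_{1}, \dots, x_{n}$ , its $(i, j)$ entry is the covariance between $x_{i}$ and $x_{j}$ .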

\begin{equation} \begin{split} \Sigma(\textbf{x}) & = cov(\textbf{x}, \textbf{x})\\ & = \begin{bmatrix} \mathop{\mathbb{E}}\bigg[\Big(x_{1} - \mathop{\mathbb{E}}[x_{1}]\Big)\Big(x_{1} - \mathop{\mathbb{E}}[x_{1}]\Big)\bigg] & \mathop{\mathbb{E}}\bigg[\Big(x_{1} - \mathop{\mathbb{E}}[x_{1}]\Big)\Big(x_{2} - \mathop{\mathbb{E}}[x_{2}]\Big)\bigg] & \dots & \mathop{\mathbb{E}}\bigg[\Big(x_{1} - \mathop{\mathbb{E}}[x_{1}]\Big)\Big(x_{n} - \mathop{\mathbb{E}}[x_{n}]\Big)\bigg]\\ \mathop{\mathbb{E}}\bigg[\Big(x_{2} - \mathop{\mathbb{E}}[x_{2}]\Big)\Big(x_{1} - \mathop{\mathbb{E}}[x_{1}]\Big)\bigg] & \mathop{\mathbb{E}}\bigg[\Big(x_{2} - \mathop{\mathbb{E}}[x_{2}]\Big)\Big(x_{2} - \mathop{\mathbb{E}}[x_{2}]\Big)\bigg] & \dots & \mathop{\mathbb{E}}\bigg[\Big(x_{2} - \mathop{\mathbb{E}}[x_{2}]\Big)\Big(x_{n} - \mathop{\mathbb{E}}[x_{n}]\Big)\bigg] \\ \vdots & \vdots & \ddots & \vdots \\ \mathop{\mathbb{E}}\bigg[\Big(x_{n} - \mathop{\mathbb{E}}[x_{n}]\Big)\Big(x_{1} - \mathop{\mathbb{E}}[x_{1}]\Big)\bigg] & \mathop{\mathbb{E}}\bigg[\Big(x_{n} - \mathop{\mathbb{E}}[x_{n}]\Big)\Big(x_{2} - \mathop{\mathbb{E}}[x_{2}]\Big)\bigg] & \dots & \mathop{\mathbb{E}}\bigg[\Big(x_{n} - \mathop{\mathbb{E}}[x_{n}]\Big)\Big(x_{n} - \mathop{\mathbb{E}}[x_{n}]\Big)\bigg] \end{bmatrix} \end{split} \end{equation}
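
As a rough numerical check of the covariance matrix, the sketch below estimates $\Sigma$ from samples by replacing each expectation in $\mathop{\mathbb{E}}[\textbf{x}\textbf{x}^{T}] - \mathop{\mathbb{E}}[\textbf{x}]\mathop{\mathbb{E}}[\textbf{x}]^{T}$ with a sample average (so it divides by $N$ ). The two-dimensional data generator is an assumption chosen so that the true values are known: $var(x_{1}) = 1$ , $cov(x_{1}, x_{2}) = 0.5$ and $var(x_{2}) = 1.25$ .

```python
import random

def mean_vector(samples):
    n, d = len(samples), len(samples[0])
    return [sum(s[i] for s in samples) / n for i in range(d)]

def covariance_matrix(samples):
    """Estimate Sigma = E[x x^T] - E[x] E[x]^T with sample averages (divides by N)."""
    n, d = len(samples), len(samples[0])
    mu = mean_vector(samples)
    return [
        [sum(s[i] * s[j] for s in samples) / n - mu[i] * mu[j] for j in range(d)]
        for i in range(d)
    ]

rng = random.Random(1)
samples = []
for _ in range(50_000):
    x1 = rng.gauss(0.0, 1.0)
    x2 = 0.5 * x1 + rng.gauss(0.0, 1.0)  # x2 depends on x1, so they covary
    samples.append((x1, x2))

sigma = covariance_matrix(samples)
for row in sigma:
    print(row)
```

The diagonal entries are the variances of the individual components, and the matrix is symmetric because $cov(x_{i}, x_{j}) = cov(x_{j}, x_{i})$ .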

Bayesian Probabilities

Bayes's theorem allows us to evaluate the uncertainty in $\textbf{w}$ in the form of the posterior probability $p(\textbf{w}\mid \mathcal{D})$ after we have incorporated the evidence provided by the observed data $\mathcal{D}$ .

p(\textbf{w}\mid \mathcal{D}) = \frac{p(\mathcal{D}\mid\textbf{w})p(\textbf{w})}{p(\mathcal{D})}

The quantity $p(\mathcal{D}\mid\textbf{w})$ is called the likelihood function. It expresses how probable the observed data set is for the specified parameter vector $\textbf{w}$ . $p(\textbf{w})$ is the prior probability distribution of $\textbf{w}$ before observing the data. We can state Bayes's theorem in words:

posterior \propto likelihood \times prior

The denominator is the normalization constant, which ensures that the posterior distribution integrates to one.

p(\mathcal{D}) = \int p(\mathcal{D}\mid\textbf{w})p(\textbf{w})d\textbf{w}
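
To make the roles of likelihood, prior and normalization constant concrete, here is a minimal sketch that evaluates Bayes's theorem on a grid. It assumes a scalar parameter $w$ (the probability of heads for a coin), a hypothetical data set of 7 heads and 3 tails, and a uniform prior; the integral for $p(\mathcal{D})$ is replaced by a sum over the grid.

```python
def grid_posterior(grid, prior, likelihood):
    """Return the normalized posterior on a grid and the evidence p(D)."""
    unnormalized = [likelihood(w) * p for w, p in zip(grid, prior)]
    evidence = sum(unnormalized)  # discrete stand-in for the integral over w
    posterior = [u / evidence for u in unnormalized]
    return posterior, evidence

# Hypothetical data set D: 7 heads and 3 tails from a coin with P(heads) = w.
heads, tails = 7, 3

def likelihood(w):
    return w ** heads * (1.0 - w) ** tails

grid = [i / 100 for i in range(101)]   # candidate values of w
prior = [1.0 / len(grid)] * len(grid)  # uniform prior over the grid
posterior, evidence = grid_posterior(grid, prior, likelihood)

# Most probable grid value and posterior mean.
w_mode = grid[max(range(len(grid)), key=lambda i: posterior[i])]
posterior_mean = sum(w * p for w, p in zip(grid, posterior))
print(w_mode, posterior_mean)
```

Dividing by `evidence` is what makes the posterior sum to one. With a uniform prior the posterior is proportional to the likelihood, so its mode coincides with the value that maximizes the likelihood.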

In a frequentist setting, $\textbf{w}$ is determined by some form of "estimator". A widely used one is maximum likelihood, in which $\textbf{w}$ is set to the value that maximizes the likelihood function $p(\mathcal{D}\mid\textbf{w})$ .

The Gaussian Distribution

For the case of a single real-valued variable $x$ , the Gaussian distribution is defined by

\mathcal{N}(x\mid\mu, \sigma^{2}) = \frac{1}{(2\pi\sigma^{2})^{1/2}}\exp\left\{-\frac{1}{2\sigma^{2}}(x - \mu)^{2}\right\}

\mathop{\mathbb{E}}[x] = \mu = \text{"mean"}

\mathop{\mathbb{E}}[x^{2}] = \mu^{2} + \sigma^{2}

var[x] = \mathop{\mathbb{E}}[x^{2}] - \mathop{\mathbb{E}}[x]^{2} = \sigma^{2} =\text{"variance"}

The reciprocal of the variance, written as $\beta = \frac{1}{\sigma^{2}}$ , is called the precision.
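
The density and its moments can be checked numerically. The sketch below evaluates the univariate Gaussian on a fine grid and uses a Riemann sum in place of each integral; the parameter values $\mu = 1.5$ and $\sigma^{2} = 4$ are arbitrary choices for illustration.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density N(x | mu, sigma2)."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

mu, sigma2 = 1.5, 4.0  # hypothetical parameters
sigma = math.sqrt(sigma2)
step = 0.01
# Riemann sum over mu +/- 10 standard deviations; the tails beyond are negligible.
xs = [mu - 10 * sigma + step * i for i in range(int(20 * sigma / step) + 1)]
weights = [gaussian_pdf(x, mu, sigma2) * step for x in xs]

total = sum(weights)                                         # should be close to 1
first_moment = sum(x * w for x, w in zip(xs, weights))       # close to mu
second_moment = sum(x * x * w for x, w in zip(xs, weights))  # close to mu^2 + sigma^2
variance = second_moment - first_moment ** 2
precision = 1.0 / variance
print(total, first_moment, second_moment, variance, precision)
```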

The Gaussian distribution defined over a $D$-dimensional vector $x$ of continuous variables, with mean vector $\mu$ and covariance matrix $\Sigma$ , is given by

\mathcal{N}(x\mid\mu, \Sigma) = \frac{1}{(2\pi)^{D/2}}\frac{1}{\left|\Sigma\right|^{1/2}}\exp\left\{-\frac{1}{2}(x - \mu)^{T}\Sigma^{-1}(x - \mu)\right\}

Suppose that we have a data set $\textbf{X} = (x_{1}, \dots, x_{N})$ of $N$ observations that are independent and identically distributed. Because the joint probability of independent events is the product of their marginal probabilities, the probability of the data set, given $\mu$ and $\sigma^{2}$ , is

p(\textbf{X}\mid\mu, \sigma^{2}) = \prod^{N}_{n=1}\mathcal{N}(x_{n}|\mu, \sigma^{2})

Taking the log of the likelihood function gives

\begin{equation} ln\,p(\textbf{X}\mid\mu, \sigma^{2}) = -\frac{1}{2\sigma^{2}}\sum^{N}_{n=1}(x_{n} - \mu)^{2} - \frac{N}{2} ln\,\sigma^{2} - \frac{N}{2} ln\,(2\pi) \end{equation}
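
As a sanity check on eq(3), the sketch below compares the closed form with a direct sum of the log densities of the individual observations. The data are synthetic draws from an assumed Gaussian with $\mu = 3$ and $\sigma^{2} = 4$ , and the function names are hypothetical.

```python
import math
import random

def gaussian_log_pdf(x, mu, sigma2):
    """Log of the univariate Gaussian density."""
    return -((x - mu) ** 2) / (2.0 * sigma2) - 0.5 * math.log(2.0 * math.pi * sigma2)

def log_likelihood(xs, mu, sigma2):
    """Closed form of eq(3) for i.i.d. Gaussian data."""
    n = len(xs)
    squared_error = sum((x - mu) ** 2 for x in xs)
    return (-squared_error / (2.0 * sigma2)
            - 0.5 * n * math.log(sigma2)
            - 0.5 * n * math.log(2.0 * math.pi))

rng = random.Random(2)
data = [rng.gauss(3.0, 2.0) for _ in range(50)]  # standard deviation 2, variance 4

# The product of densities becomes a sum of log densities.
direct = sum(gaussian_log_pdf(x, 3.0, 4.0) for x in data)
closed_form = log_likelihood(data, 3.0, 4.0)
print(direct, closed_form)
```

Working with the log also avoids numerical underflow, since a product of many small densities quickly becomes too small to represent in floating point.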

Sample Mean

Maximizing $eq(3)$ with respect to $\mu$ , we obtain the maximum likelihood solution

\begin{split} \hat{\mu}_{ML} &= \arg\max_{\mu} eq(3) \implies\frac{\partial}{\partial\mu}eq(3) = 0 \implies \sum^{N}_{n=1}x_{n} = N\mu\\ &= \frac{1}{N}\sum^{N}_{n=1}x_{n} \end{split}

The estimator $\hat{\mu}_{ML}$ is unbiased. It is often written as $\hat{\mu}$ and called the sample mean.

\mathop{\mathbb{E}}[\hat{\mu}] = \mathop{\mathbb{E}}\bigg[\frac{1}{N}\sum^{N}_{n=1}x_{n}\bigg] = \frac{1}{N}\sum^{N}_{n=1}\mathop{\mathbb{E}}[x_{n}] = \mu

Sample Variance

Similarly, maximizing $eq(3)$ with respect to $\sigma^{2}$ , we obtain the maximum likelihood solution for the variance

\begin{split} \hat{\sigma}^{2}_{ML} &= \arg\max_{\sigma^{2}}eq(3) \implies\frac{\partial}{\partial\sigma^{2}}eq(3) = 0 \implies \frac{1}{2(\sigma^{2})^{2}}\sum^{N}_{n=1}(x_{n} - \mu)^{2} = \frac{N}{2\sigma^{2}}\\ &= \frac{1}{N}\sum^{N}_{n=1}(x_{n} - \hat{\mu})^{2} \end{split}

$\hat{\sigma}^{2}_{ML}$ is called the biased sample variance and is often written as $\tilde{s}^{2}$ .

The maximum likelihood approach systematically underestimates the variance of the distribution: the expectation of $\tilde{s}^{2}$ is smaller than the true variance by a factor $(N - 1) / N$ .

\begin{split} \mathop{\mathbb{E}}[\tilde{s}^{2}] & = \mathop{\mathbb{E}}\bigg[\frac{1}{N}\sum^{N}_{i=1}(x_{i} - \frac{1}{N}\sum^{N}_{j=1}x_{j})^{2}\bigg]\\ & = \frac{1}{N}\sum^{N}_{i=1}\mathop{\mathbb{E}}\bigg[x_{i}^{2} - \frac{2}{N}x_{i}\sum^{N}_{j=1}x_{j} + \frac{1}{N^{2}}\sum^{N}_{j=1}x_{j}\sum^{N}_{k=1}x_{k}\bigg]\\ & = \frac{1}{N}\sum^{N}_{i=1}\bigg[\frac{N-2}{N}\mathop{\mathbb{E}}[x_{i}^{2}] - \frac{2}{N}\sum^{N}_{j\neq i}\mathop{\mathbb{E}}[x_{i}x_{j}] + \frac{1}{N^{2}}\sum^{N}_{j=1}\sum^{N}_{k\neq j}\mathop{\mathbb{E}}[x_{j}x_{k}] + \frac{1}{N^{2}}\sum^{N}_{j=1}\mathop{\mathbb{E}}[x_{j}^{2}]\bigg]\\ & = \frac{1}{N}\sum^{N}_{i=1}\bigg[\frac{N-2}{N}(\mu^{2} + \sigma^{2}) - \frac{2}{N}(N - 1)\mu^{2} + \frac{1}{N^{2}}N(N - 1)\mu^{2} + \frac{1}{N}(\mu^{2} + \sigma^{2})\bigg]\\ & = \frac{N-1}{N}\sigma^{2} \end{split}

It follows that the following estimate for the variance parameter is unbiased

\begin{split} s^{2} &= \frac{N}{N-1}\tilde{s}^{2} = \frac{1}{N-1}\sum^{N}_{n=1}(x_{n} - \hat{\mu})^{2} \\ \mathop{\mathbb{E}}[s^{2}] &= \mathop{\mathbb{E}}[\frac{N}{N-1}\tilde{s}^{2}] = \frac{N}{N-1}\mathop{\mathbb{E}}[\tilde{s}^{2}] = \frac{N}{N-1}\frac{N-1}{N}\sigma^{2} = \sigma^{2} \\ \end{split}

$s^{2}$ is called the unbiased sample variance.

The bias of the maximum likelihood solution becomes less significant as the number $N$ of data points increases, and in the limit $N \to \infty$ the maximum likelihood solution for the variance equals the true variance of the distribution that generated the data.
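
The bias results for the sample mean and the two variance estimators can be illustrated with a small simulation. The sketch below assumes Gaussian data with $\mu = 1$ and $\sigma^{2} = 4$ and a deliberately small sample size $N = 5$ , so that the factor $(N - 1)/N = 0.8$ is easy to see; averaged over many repeated data sets, the biased estimator should come out near $3.2$ and the unbiased one near $4$ . All names and settings are illustrative.

```python
import random

def sample_mean(xs):
    return sum(xs) / len(xs)

def biased_variance(xs):
    """Maximum likelihood estimate: divides by N."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def unbiased_variance(xs):
    """Corrected estimate: divides by N - 1."""
    return biased_variance(xs) * len(xs) / (len(xs) - 1)

rng = random.Random(3)
mu, sigma2, n, trials = 1.0, 4.0, 5, 20_000  # hypothetical settings

means, biased, unbiased = [], [], []
for _ in range(trials):
    xs = [rng.gauss(mu, sigma2 ** 0.5) for _ in range(n)]
    means.append(sample_mean(xs))
    biased.append(biased_variance(xs))
    unbiased.append(unbiased_variance(xs))

# Averaging over many data sets approximates the expectation of each estimator.
avg_mean = sample_mean(means)
avg_biased = sample_mean(biased)
avg_unbiased = sample_mean(unbiased)
print(avg_mean, avg_biased, avg_unbiased)
```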

Regression

Suppose we observe a real-valued input variable $x$ and we wish to use this observation to predict the value of a real-valued target variable $t$ . Now suppose that we are given a training set comprising $N$ observations of $x$ , written $x_1,...,x_N$ , together with corresponding observations of the values of $t$ , denoted $t_1,...,t_N$ . Individual observations are corrupted by random noise. This noise might arise from intrinsically stochastic (i.e. random) processes such as radioactive decay but more typically is due to there being sources of variability that are themselves unobserved.

Our goal is to exploit this training set in order to make predictions of the value $\hat{t}$ of the target variable for some new value ${\hat{x}}$ of the input variable. This involves implicitly trying to discover the underlying function that generated the data. This is intrinsically a difficult problem as we have to generalize from a finite data set. Furthermore the observed data are corrupted with noise, and so for a given $\hat{x}$ there is uncertainty as to the appropriate value for $\hat{t}$ .

Polynomial Curve Fitting

If we fit the data using a polynomial function of the form

y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j

where $M$ is the order of the polynomial. The polynomial coefficients $w_0,...,w_M$ are collectively denoted by the vector $\mathbf{w}$ . Although the polynomial function $y(x, \mathbf{w})$ is a nonlinear function of $x$ , it is a linear function of the coefficients $\mathbf{w}$ . As $M$ grows, the fitted polynomial becomes increasingly tuned to the random noise on the target values, a phenomenon known as over-fitting. The values of the coefficients are determined by fitting the polynomial to the training data. This can be done by minimizing a cost function that measures the error between the function $y(x, \mathbf{w})$ , for any given value of $\mathbf{w}$ , and the training set data points. A widely used cost function is

\begin{equation} J(\mathbf{w}) = \frac{1}{2}\sum^{N}_{i=1}{(y(x_i, \mathbf{w}) - t_i)^2} \end{equation}

One technique to control the over-fitting phenomenon is that of regularization, which involves adding a penalty term to the cost function eq(4) in order to discourage the coefficients from reaching large values, leading to a modified error function

\begin{equation} \tilde{J}(\mathbf{w}) = J(\mathbf{w}) + \frac{\lambda}{2} ||\mathbf{w}||^2 \end{equation}
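
The effect of the penalty term can be seen in a small experiment. Because $y(x, \mathbf{w})$ is linear in $\mathbf{w}$ , setting the gradient of the regularized cost to zero gives the linear system $(\Phi^{T}\Phi + \lambda I)\mathbf{w} = \Phi^{T}\mathbf{t}$ , where $\Phi$ has entries $\Phi_{ij} = x_i^{j}$ . The sketch below solves that system with a plain Gaussian elimination routine. The training data (noisy samples of $\sin(2\pi x)$ ), the order $M = 6$ , the value $\lambda = 10^{-3}$ and all function names are assumptions made for illustration.

```python
import math
import random

def solve_linear_system(a, b):
    """Solve a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (aug[i][n] - tail) / aug[i][i]
    return w

def fit_polynomial(xs, ts, order, lam=0.0):
    """Minimize the (regularized) cost by solving (Phi^T Phi + lam I) w = Phi^T t."""
    phi = [[x ** j for j in range(order + 1)] for x in xs]
    m = order + 1
    a = [[sum(row[i] * row[j] for row in phi) + (lam if i == j else 0.0)
          for j in range(m)] for i in range(m)]
    b = [sum(row[i] * t for row, t in zip(phi, ts)) for i in range(m)]
    return solve_linear_system(a, b)

def predict(x, w):
    return sum(w_j * x ** j for j, w_j in enumerate(w))

def cost(xs, ts, w, lam=0.0):
    """Sum-of-squares cost plus (lam / 2) ||w||^2."""
    data_term = 0.5 * sum((predict(x, w) - t) ** 2 for x, t in zip(xs, ts))
    return data_term + 0.5 * lam * sum(w_j ** 2 for w_j in w)

# Hypothetical training set: noisy samples of sin(2 pi x) on [0, 1].
rng = random.Random(4)
xs = [i / 9 for i in range(10)]
ts = [math.sin(2 * math.pi * x) + rng.gauss(0.0, 0.3) for x in xs]

w_plain = fit_polynomial(xs, ts, order=6)            # lambda = 0
w_ridge = fit_polynomial(xs, ts, order=6, lam=1e-3)  # with penalty term
print(sum(w * w for w in w_plain), sum(w * w for w in w_ridge))
```

The regularized fit has a smaller squared norm $||\mathbf{w}||^2$ at the price of a slightly larger sum-of-squares error on the training set. Solving the normal equations directly becomes numerically fragile for high orders, so a library least-squares routine would be a more robust choice in practice.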

We assume that, given the value of $x$ , the corresponding value of $t$ has a Gaussian distribution with a mean equal to the value $y(x,\textbf{w})$ and a precision $\beta$ . Thus we can express our uncertainty over the value of the target variable using a probability distribution.

p(t\mid x, \textbf{w}, \beta) = \mathcal{N}(t\mid y(x, \textbf{w}), \beta^{-1})

We use the training data $\{\textbf{x},\textbf{t}\}$ to determine the values of the unknown parameters $\textbf{w}$ and $\beta$ by maximum likelihood. If the data are assumed to be drawn independently from the distribution, then the likelihood function is given by

p(\textbf{t}\mid\textbf{x},\textbf{w},\beta) = \prod^{N}_{n=1}\mathcal{N}(t_{n}\mid y(x_{n}, \textbf{w}), \beta^{-1})

Substituting the form of the Gaussian distribution and taking the logarithm gives

ln\,p(\textbf{t}\mid\textbf{x},\textbf{w},\beta) = -\frac{\beta}{2}\sum^{N}_{n=1}\{y(x_{n}, \textbf{w}) - t_{n}\}^{2} + \frac{N}{2}ln\,\beta - \frac{N}{2}ln\,(2\pi)

Maximizing the likelihood with respect to $\textbf{w}$ is equivalent to minimizing the sum-of-squares error function defined by eq(4), which gives $\hat{\textbf{w}}$ . Maximizing the likelihood with respect to $\beta$ gives

\frac{1}{\hat{\beta}} = \frac{1}{N}\sum^{N}_{n=1}\{y(x_{n}, \hat{\textbf{w}}) - t_{n}\}^{2}

Having determined the parameters $\textbf{w}$ and $\beta$ , we can express our probabilistic model in terms of the predictive distribution that gives the probability distribution over $t$ .

p(t\mid x,\hat{\textbf{w}},\hat{\beta}) = \mathcal{N}(t\mid y(x, \hat{\textbf{w}}), \hat{\beta}^{-1})
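
The two maximum likelihood steps and the resulting predictive distribution can be sketched for the simplest case, an order-1 polynomial, where the least-squares solution has a closed form. The data generator (a straight line with noise precision $\beta = 25$ ) and all variable names are assumptions made for illustration; with enough data, $1/\hat{\beta}$ should land near the true noise variance $0.04$ .

```python
import math
import random

rng = random.Random(5)
# Hypothetical generator: t = 0.5 + 2 x + Gaussian noise with variance 1 / 25.
true_w, true_beta = (0.5, 2.0), 25.0
xs = [rng.uniform(0.0, 1.0) for _ in range(2000)]
ts = [true_w[0] + true_w[1] * x + rng.gauss(0.0, true_beta ** -0.5) for x in xs]

# Maximum likelihood w for an order-1 polynomial: ordinary least squares.
n = len(xs)
x_bar, t_bar = sum(xs) / n, sum(ts) / n
w1 = (sum((x - x_bar) * (t - t_bar) for x, t in zip(xs, ts))
      / sum((x - x_bar) ** 2 for x in xs))
w0 = t_bar - w1 * x_bar

def y(x):
    """Fitted curve y(x, w_hat)."""
    return w0 + w1 * x

# 1 / beta_hat is the mean squared residual of the fitted curve.
beta_hat = n / sum((y(x) - t) ** 2 for x, t in zip(xs, ts))

def predictive_density(t, x):
    """p(t | x, w_hat, beta_hat) = N(t | y(x, w_hat), 1 / beta_hat)."""
    return math.sqrt(beta_hat / (2.0 * math.pi)) * math.exp(-0.5 * beta_hat * (t - y(x)) ** 2)

print(w0, w1, 1.0 / beta_hat, predictive_density(1.5, 0.5))
```

Note that $\hat{\textbf{w}}$ is found first and does not depend on $\beta$ ; $\hat{\beta}$ is then computed from the residuals of that fit.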

We now introduce a prior distribution over the polynomial coefficients $\textbf{w}$ . Here we use a Gaussian distribution for simplicity.

p(\textbf{w}\mid\alpha) = \mathcal{N}(\textbf{w}\mid 0, \alpha^{-1}\textbf{I}) = \Big(\frac{\alpha}{2\pi}\Big)^{(M + 1) / 2} \exp\big\{-\frac{\alpha}{2}\textbf{w}^{T}\textbf{w}\big\}

where $\alpha$ is the precision of the distribution, and $M + 1$ is the total number of elements in the vector $\textbf{w}$ for an $M^{th}$ order polynomial.

Variables such as $\alpha$ , which control the distribution of model parameters, are called hyperparameters. Using Bayes's theorem

\begin{equation} p(\textbf{w}\mid\textbf{x}, \textbf{t}, \alpha, \beta) \propto p(\textbf{t}\mid\textbf{x}, \textbf{w}, \beta)p(\textbf{w}\mid\alpha) \end{equation}

We can now determine $\textbf{w}$ by finding its most probable value given the data, that is, by maximizing the posterior distribution. Taking the negative logarithm of eq(6) and dropping the terms that do not depend on $\textbf{w}$ , we find that the maximum of the posterior is given by the minimum of

\frac{\beta}{2}\sum^{N}_{n=1}\{y(x_{n}, \textbf{w}) - t_{n}\}^{2} + \frac{\alpha}{2}\textbf{w}^{T}\textbf{w}

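Minimizing this expression is equivalent to minimizing the regularized cost function eq(5) with a regularization parameter $\lambda = \frac{\alpha}{\beta}$ , since dividing by the positive constant $\beta$ does not change the location of the minimum.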
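
The correspondence between the two objectives can be verified numerically: for any $\textbf{w}$ , the negative log posterior (up to terms independent of $\textbf{w}$ ) equals $\beta$ times the regularized cost with $\lambda = \alpha / \beta$ . The data points, hyperparameter values and function names in this sketch are hypothetical.

```python
def y(x, w):
    """Polynomial y(x, w) = sum_j w_j x^j."""
    return sum(w_j * x ** j for j, w_j in enumerate(w))

def regularized_cost(w, xs, ts, lam):
    """eq(5): sum-of-squares cost plus (lam / 2) ||w||^2."""
    return (0.5 * sum((y(x, w) - t) ** 2 for x, t in zip(xs, ts))
            + 0.5 * lam * sum(w_j ** 2 for w_j in w))

def negative_log_posterior(w, xs, ts, alpha, beta):
    """Terms of -ln p(w | x, t, alpha, beta) that depend on w."""
    return (0.5 * beta * sum((y(x, w) - t) ** 2 for x, t in zip(xs, ts))
            + 0.5 * alpha * sum(w_j ** 2 for w_j in w))

# Hypothetical data and hyperparameters.
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ts = [0.1, 0.9, 0.2, -0.8, -0.1]
alpha, beta = 0.005, 11.1
lam = alpha / beta

# The two objectives differ only by the constant factor beta.
for w in ([0.0, 1.0], [0.3, -2.0, 1.5], [1.0, 0.0, -4.0, 2.0]):
    lhs = negative_log_posterior(w, xs, ts, alpha, beta)
    rhs = beta * regularized_cost(w, xs, ts, lam)
    print(lhs, rhs)
```

Because the two functions differ only by a positive scale factor, they share the same minimizer, which is why maximizing the posterior reproduces the regularized least-squares fit.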

References

Pattern Recognition and Machine Learning, Chapter 1.1 Example: Polynomial Curve Fitting
Pattern Recognition and Machine Learning, Chapter 1.2 Probability Theory