Normal Distribution

Toy Example

Let’s look at a toy example of the normal distribution. Assume we have a dataset of the heights of people in a city; its frequency histogram is shown in the following figure.

Data Source

../_images/ad6ce109cdd383863cd7375e1e7584065b9ace933cf9bef1478c1e501b618d22.png

As part of our modeling approach, we hypothesize that the actual distribution of this height data follows a Normal Distribution characterized by two parameters: a mean \(\mu\) and a standard deviation \(\sigma\). Our next step is to determine the optimal values for \(\mu\) and \(\sigma\) that best align the normal distribution with the existing height data, as depicted by the red curve in the figure below.

\[p(x) = \mathcal{N}(x; \mu, \sigma)\]

../_images/50415683b2394925e2b430c55eeecb9750e704cb423b4430838ca60411f6528d.png

We will introduce the parameter estimation method later; for now, let’s assume that we have already obtained the estimated distribution shown as the red curve in the figure. By sampling from this distribution, we can generate new (pseudo) data that look similar to the original data.

../_images/9e1279232b75ec215ffe5e75aeecd955ae97991510cea3cfb7283d8171d14ee7.png
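The sampling step can be sketched with Python's standard library. The values of `mu_hat` and `sigma_hat` below are hypothetical placeholders, not values read from the figure or the dataset; they stand in for whatever estimates the fitting step produces.

```python
import random
import statistics

# Hypothetical fitted parameters (assumed for illustration only).
mu_hat = 170.0    # mean height, in cm
sigma_hat = 7.0   # standard deviation, in cm

rng = random.Random(0)  # fixed seed so the run is reproducible

# Draw pseudo heights from the normal distribution N(mu_hat, sigma_hat).
pseudo_heights = [rng.gauss(mu_hat, sigma_hat) for _ in range(10_000)]

# The pseudo data should have roughly the same mean and spread as the model.
print(f"sample mean: {statistics.fmean(pseudo_heights):.2f}")
print(f"sample std:  {statistics.pstdev(pseudo_heights):.2f}")
```

A histogram of `pseudo_heights` would then be expected to resemble the original histogram, provided the normal assumption is reasonable for the data.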

Maximum Likelihood Estimation

One method for estimating the parameters of a distribution (model) is Maximum Likelihood Estimation (MLE). This method aims to find the parameters under which the model best fits the samples we have in hand.

Assume that we have \(N\) samples of observed data \(\mathcal{D}=\{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}\). Assuming that each sample is drawn independently from the distribution \(p(x;\theta)\), where \(\theta\) denotes the model’s parameters, we can express the joint probability density of the sample set \(\mathcal{D}\) as follows.

\[p(\mathcal{D};\theta) = \prod_{n=1}^{N}{p(x^{(n)};\theta)}\]

We can regard this probability density as a function of \(\theta\), written as follows. This \(L(\theta)\) is called the Likelihood or Likelihood Function.

\[L(\theta) = p(\mathcal{D};\theta) \]

In MLE, we aim to find \(\theta = \hat{\theta}\) that maximizes the likelihood function \(L(\theta)\). In most applications, however, we maximize the Log-Likelihood Function instead, because it turns the product into a sum and is much more convenient to compute. Both approaches yield the same parameters because \(\log\) is a monotonically increasing function.

\[\begin{split}\begin{align*} \log{L(\theta)} &= \log{p(\mathcal{D};\theta)}\\ &=\log{\prod_{n=1}^{N}{p(x^{(n)};\theta)}} \\ &= \sum_{n=1}^{N}{\log{p(x^{(n)};\theta)}} \end{align*}\end{split}\]
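One practical reason for preferring the sum of logs can be seen numerically. The sketch below uses synthetic samples and a standard normal as the assumed model \(p(x;\theta)\); both are illustrative choices, not part of the original example.

```python
import math
import random
from statistics import NormalDist

rng = random.Random(1)
model = NormalDist(mu=0.0, sigma=1.0)              # assumed example model
data = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # synthetic samples

# Likelihood: product of the per-sample densities.
# Every factor is below 0.4, so the product of 2000 of them
# is far smaller than the smallest positive float.
likelihood = math.prod(model.pdf(x) for x in data)

# Log-likelihood: sum of the per-sample log densities.
log_likelihood = sum(math.log(model.pdf(x)) for x in data)

print(likelihood)       # underflows to 0.0
print(log_likelihood)   # a finite negative number
```

With the product collapsed to zero, different parameter values can no longer be compared, while the log-likelihood remains usable for any sample size.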

Thus, the model parameters can be obtained as follows.

\[\hat{\theta} = \arg\max_\theta \log L(\theta)\]
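When no closed-form solution is available, the \(\arg\max\) can be approximated numerically. The following is a minimal sketch using a brute-force grid search over a single parameter (the mean), with the standard deviation fixed at 1; the synthetic data, the grid range, and the step size are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, fmean

rng = random.Random(2)
# Synthetic samples; the true mean of 3.0 is an assumed value.
data = [rng.gauss(3.0, 1.0) for _ in range(500)]

def log_likelihood(mu, samples, sigma=1.0):
    """Sum of log densities of the samples under N(mu, sigma)."""
    model = NormalDist(mu, sigma)
    return sum(math.log(model.pdf(x)) for x in samples)

# Candidate values for mu: 0.00, 0.01, ..., 6.00.
step = 0.01
candidates = [i * step for i in range(601)]

# Pick the candidate with the highest log-likelihood.
mu_hat = max(candidates, key=lambda mu: log_likelihood(mu, data))

print(f"grid-search estimate: {mu_hat:.2f}")
print(f"sample mean:          {fmean(data):.2f}")
```

Grid search scales poorly with the number of parameters, so in practice a gradient-based optimizer or an analytical solution is usually preferred.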

Parameter Estimation for Normal Distribution

The log-likelihood function of the normal distribution can be obtained as follows.

\[\begin{split}\begin{align*} \log L(\mu,\sigma) &= \sum_{n=1}^N\log \mathcal{N}(x^{(n)};\mu,\sigma) \\ &= \sum_{n=1}^N\log \left[\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{\left(x^{(n)}-\mu\right)^2}{2\sigma^2}\right)\right] \\ &= -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{n=1}^N\left(x^{(n)}-\mu\right)^2 \end{align*}\end{split}\]
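As a quick sanity check, the first and last lines of the derivation above can be compared numerically. The parameter values and the synthetic samples in this sketch are arbitrary assumptions chosen only for the check.

```python
import math
import random
from statistics import NormalDist

rng = random.Random(3)
mu, sigma = 1.5, 2.0  # arbitrary parameter values (assumed)
samples = [rng.gauss(0.0, 1.0) for _ in range(200)]  # synthetic data
n = len(samples)

# First line of the derivation: sum of log densities.
model = NormalDist(mu, sigma)
by_sum = sum(math.log(model.pdf(x)) for x in samples)

# Last line of the derivation: closed-form expression.
closed_form = (
    -n / 2 * math.log(2 * math.pi * sigma**2)
    - sum((x - mu) ** 2 for x in samples) / (2 * sigma**2)
)

# The two values should agree up to floating-point rounding.
print(by_sum, closed_form)
```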

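The log-likelihood takes its maximum value where its partial derivatives with respect to \(\mu\) and \(\sigma\) are both zero:

\[\begin{split}\begin{align*} \frac{1}{\sigma^2}\sum_{n=1}^N\left(x^{(n)}-\mu\right) &= 0 \\ -\frac{N}{\sigma} + \frac{1}{\sigma^3}\sum_{n=1}^N\left(x^{(n)}-\mu\right)^2 &= 0 \end{align*}\end{split}\]

Solving these equations analytically, we obtain the estimates \(\hat{\mu}\) and \(\hat{\sigma}\) as follows.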

\[\begin{split}\begin{align*} \hat{\mu} &= \frac{1}{N}\sum_{n=1}^Nx^{(n)} \\ \hat{\sigma}&=\sqrt{\frac{1}{N}\sum_{n=1}^N\left(x^{(n)}-\hat\mu\right)^2} \end{align*}\end{split}\]
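These closed-form estimates translate directly into code. The sketch below applies them to synthetic "height" data (generated with assumed parameters 170 and 7, since the original dataset is not reproduced here) and checks that nudging either parameter away from the estimate does not increase the log-likelihood. The helper names `fit_normal_mle` and `log_likelihood` are hypothetical.

```python
import math
import random
from statistics import fmean, pstdev

rng = random.Random(4)
# Synthetic stand-in for the height data (assumed parameters).
heights = [rng.gauss(170.0, 7.0) for _ in range(1000)]

def fit_normal_mle(samples):
    """Closed-form maximum likelihood estimates of mu and sigma."""
    n = len(samples)
    mu_hat = sum(samples) / n
    # Note the divisor n (not n - 1): this is the MLE of sigma.
    sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in samples) / n)
    return mu_hat, sigma_hat

def log_likelihood(samples, mu, sigma):
    """Closed-form log-likelihood of the samples under N(mu, sigma)."""
    n = len(samples)
    return (
        -n / 2 * math.log(2 * math.pi * sigma**2)
        - sum((x - mu) ** 2 for x in samples) / (2 * sigma**2)
    )

mu_hat, sigma_hat = fit_normal_mle(heights)
best = log_likelihood(heights, mu_hat, sigma_hat)

print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
# Perturbed parameters should give a lower log-likelihood.
print(best >= log_likelihood(heights, mu_hat + 0.1, sigma_hat))
print(best >= log_likelihood(heights, mu_hat, sigma_hat * 1.05))
```

The estimate \(\hat{\sigma}\) corresponds to `statistics.pstdev` (divisor \(N\)) rather than `statistics.stdev` (divisor \(N-1\)).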
