Linear Regression as MLE Under Gaussian Noise

by dokuDoku Statistics

## Maximum Likelihood Estimation of Data with Gaussian Noise

It is common for datasets to have an underlying linear relationship (with respect to the parameters) with some Gaussian noise in the observed outputs.

Gaussians are widely used because they have properties that are easy to work with, such as:
- Symmetry around the average $\mu$
- Shape and spread controlled by one parameter $\sigma$ (standard deviation)
    - About 68% of data lies within $\mu \pm \sigma$
    - About 95% of data lies within $\mu \pm 2\sigma$
    - About 99.7% of data lies within $\mu \pm 3\sigma$
- Sum of independent Gaussians is a Gaussian
    - $X \sim N(\mu_1, \sigma_1^2)$, $Y \sim N(\mu_2, \sigma_2^2)$, $X,Y$ independent, then $X + Y \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$
- Standardization: any Gaussian can be converted to a standard normal
    - $Z = \frac{X - \mu}{\sigma}$, $Z \sim N(0,1)$
- Linear Transformations: if $X \sim N(\mu, \sigma^2)$ then for constants $a,b$ (with $a \neq 0$), $aX+b \sim N(a\mu + b, a^2\sigma^2)$
- **Central Limit Theorem: suitably normalized sums/averages of many independent random variables with finite variance are approximately Gaussian**
    - Great statistical property!
- Maximum Entropy: among all distributions with a fixed mean and variance, the Gaussian maximizes entropy, making it the least assumptive distribution under those constraints
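
The closure properties in the list above are easy to check empirically. The following is a minimal sketch using NumPy; the particular means, standard deviations, constants `a` and `b`, and the sample size are arbitrary values chosen for illustration, not anything prescribed by the theory.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000  # sample size (arbitrary, large enough for stable estimates)

# Two independent Gaussians with illustrative parameters.
mu1, sigma1 = 1.0, 2.0
mu2, sigma2 = -3.0, 1.5
x = rng.normal(mu1, sigma1, size=n)
y = rng.normal(mu2, sigma2, size=n)

# Sum of independent Gaussians: means add and variances add.
s = x + y
sum_mean, sum_var = s.mean(), s.var()

# Linear transformation aX + b: mean a*mu + b, variance a^2 * sigma^2.
a, b = -0.5, 4.0
t = a * x + b
lin_mean, lin_var = t.mean(), t.var()

# Standardization: (X - mu) / sigma should have mean ~0 and variance ~1.
z = (x - mu1) / sigma1
z_mean, z_var = z.mean(), z.var()

# Fraction of samples within one standard deviation of the mean (about 68%).
within_one_sigma = np.mean(np.abs(x - mu1) <= sigma1)

print("sum:   ", sum_mean, sum_var)    # close to mu1 + mu2 and sigma1^2 + sigma2^2
print("linear:", lin_mean, lin_var)    # close to a*mu1 + b and a^2 * sigma1^2
print("within one sigma:", within_one_sigma)
```

The sample statistics only approximate the theoretical values, so expect small deviations that shrink as `n` grows.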

We assume Gaussian noise, often justified by aggregation effects or maximum entropy arguments.

Specifically, we are going to find the maximum likelihood estimate for the following model with scalar outputs:
$$ y_i = \beta^T x_i + \epsilon_i \quad \epsilon_i \sim N(0, \sigma^2)$$
- $y_i \in \mathbb{R}$ ground truth output for input $x_i$
- $x_i \in \mathbb{R}^d$ input 
- $\beta \in \mathbb{R}^d$ weights that we dot product with $x_i$ to give us our estimate $\hat{y}_i = \beta^T x_i$ of $y_i$
- $\epsilon_i \in \mathbb{R}$ which is our noise term
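
To make the model concrete, here is a minimal sketch that draws synthetic data from it. The function name `simulate_linear_gaussian`, the choice of standard normal inputs, and the specific `beta` and `sigma` values are all assumptions made for illustration; the model itself places no requirement on how the $x_i$ are distributed.

```python
import numpy as np

def simulate_linear_gaussian(n, beta, sigma, seed=0):
    """Draw n samples from y_i = beta^T x_i + eps_i with eps_i ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta, dtype=float)
    # Rows of X are the inputs x_i; sampling them from N(0, 1) is an arbitrary choice.
    X = rng.normal(size=(n, beta.shape[0]))
    eps = rng.normal(0.0, sigma, size=n)  # Gaussian noise on the outputs
    y = X @ beta + eps                    # deterministic part plus noise
    return X, y, eps

beta_true = np.array([2.0, -1.0, 0.5])    # hypothetical "true" weights
X, y, eps = simulate_linear_gaussian(n=1000, beta=beta_true, sigma=0.3)
print(X.shape, y.shape)  # (1000, 3) (1000,)
```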

Note that linear regression only requires the model to be *linear with respect to the model weights $\beta$*. It is very common to have a *basis function expansion* $\phi(x)$ that takes each individual $x_i$ and transforms it into another space. For instance, a common one for a scalar input $x$ is the polynomial basis function of degree $k$
$$ \phi(x) = [1, x, x^2, ..., x^k]$$

Making the model equation
$$ y_i = \beta^T \phi(x_i) + \epsilon_i \quad \epsilon_i \sim N(0, \sigma^2)$$

which, as we can see, is still linear with respect to $\beta$ but is now much more expressive.
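
As a rough illustration of the polynomial basis, the sketch below builds the feature matrix for scalar inputs with `np.vander`. The helper name `polynomial_features` and the weight values are hypothetical; the point is only that the prediction is a nonlinear function of $x$ while remaining a plain dot product in $\beta$.

```python
import numpy as np

def polynomial_features(x, degree):
    """Map scalar inputs x to rows [1, x, x^2, ..., x^degree]."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, N=degree + 1, increasing=True)

x = np.array([0.0, 1.0, 2.0])
Phi = polynomial_features(x, degree=3)
# Rows of Phi: [1, 0, 0, 0], [1, 1, 1, 1], [1, 2, 4, 8]

beta = np.array([1.0, 0.0, -2.0, 0.5])  # hypothetical weights, one per basis function
y_hat = Phi @ beta                       # cubic in x, but linear in beta
print(y_hat)
```

Because the model is linear in $\beta$, scaling the weights scales the predictions by the same factor, which is the property the rest of the derivation relies on.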

#### Quick Aside about Basis Function $\phi(x)$
For the basis function $\phi$ we are applying a map
$$ \phi: \mathbb{R}^{d} \rightarrow \mathbb{R}^p$$
which embeds the input into another vector space. Note that our parameters $\beta$ need to live in the codomain of $\phi$, i.e. $\beta \in \mathbb{R}^p$. The best mental model to have is that we are choosing a new basis for representing the data. This is important when, for example, we have data that is not linearly separable.

<img src="/media/images/20260421_012513_Screenshot_2026-04-20_at_6.24.22_PM.png" alt="Screenshot 2026 04 20 at 6.24.22 PM" style="max-width: min(800px, 90%); height: auto;">
<center>
$\textbf{Figure 1}$ Basis Function Applied to Non-Linear Data to make Linearly Separable for Linear Regression
</center>

As we can see in **Figure 1**, this is a classic example of using a basis function to turn data that is not linearly separable into data that is, showing one use of this powerful technique.

On a much cooler note, **neural networks learn this feature map $\phi(x)$** automatically. This is very powerful since we may not know what this optimal function is for our specific dataset. This is out of scope for this blog post, but stay posted for a future blog post covering this!

### Derivation of Linear Regression, Least Squares Error, from Data with Gaussian Noise

Recall the model whose maximum likelihood estimate we are going to find:
$$ y_i = \beta^T x_i + \epsilon_i \quad \epsilon_i \sim N(0, \sigma^2)$$

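For a fixed input $x_i$, the term $\beta^T x_i$ is deterministic and the only random quantity is $\epsilon_i$, so $y_i$ is a constant plus Gaussian noise. By the linear transformation property stated earlier (with $a = 1$ and $b = \beta^T x_i$), shifting a Gaussian by a constant moves its mean and leaves its variance unchanged. Equivalently, we could treat $\beta^T x_i$ as the degenerate distribution $N(\beta^T x_i, 0)$, a normal distribution with 0 variance (deterministic), and use the property that a sum of independent Gaussians is a Gaussian. Either way, we get that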

$$ y_i = \beta^Tx_i + \epsilon_i \rightarrow N(y_i | \beta^Tx_i, \sigma^2)$$

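Great, now let's find the $\beta$ that maximizes the likelihood of $N(y_i | \beta^Tx_i, \sigma^2)$ over all data samples $(x_i, y_i)$. Assuming the samples are independent, the joint likelihood is the product of the individual densities:
$$\arg\max_{\beta} \left(\prod_{i=1}^n N(y_i | \beta^Tx_i, \sigma^2) \right)$$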

Since the logarithm is monotonically increasing, the argmax of the original function is the same as the argmax of its logarithm. Additionally, we know that $\log(ab) = \log(a) + \log(b)$, so the product becomes a sum:
$$\arg\max_{\beta} \sum_{i=1}^n \log \left( N(y_i | \beta^Tx_i, \sigma^2) \right)$$
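
The step from a product of densities to a sum of log-densities can be checked numerically. This is a small sketch on synthetic data; the helper names `gaussian_pdf` and `likelihood_and_log_likelihood`, and all parameter values, are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_pdf(y, mean, sigma):
    """Density of N(mean, sigma^2) evaluated at y."""
    return np.exp(-0.5 * ((y - mean) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(1)
beta_true = np.array([1.5, -0.7])   # hypothetical weights used to generate the data
sigma = 0.5
X = rng.normal(size=(20, 2))
y = X @ beta_true + rng.normal(0.0, sigma, size=20)

def likelihood_and_log_likelihood(beta):
    densities = gaussian_pdf(y, X @ beta, sigma)
    return np.prod(densities), np.sum(np.log(densities))

lik_a, loglik_a = likelihood_and_log_likelihood(beta_true)
lik_b, loglik_b = likelihood_and_log_likelihood(np.array([1.0, -0.5]))

# log(product) equals sum(log), and both rank the two candidates the same way.
print(np.log(lik_a), loglik_a)
print(lik_a > lik_b, loglik_a > loglik_b)
```

A small sample is used on purpose: with many samples the raw product quickly becomes too small to represent in floating point, which is a practical reason to work with the sum of logs.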

Note that maximizing a function is the same as minimizing its negative, thus
$$\arg\min_{\beta} \; -\sum_{i=1}^n \log \left( N(y_i | \beta^Tx_i, \sigma^2) \right)$$

Simplify using: 
$$ N(y_i | \beta^Tx_i, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-0.5\left(\frac{y_i - \beta^Tx_i}{\sigma}\right)^2\right)$$

giving

$$\arg\min_{\beta} \; -\sum_{i=1}^n \log \left(  \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-0.5\left(\frac{y_i - \beta^Tx_i}{\sigma}\right)^2\right) \right)$$

$$ \arg\min_{\beta} \sum_{i=1}^n \left[ -\log\left(\frac{1}{\sigma \sqrt{2\pi}}\right) + 0.5\left(\frac{y_i - \beta^Tx_i}{\sigma} \right)^2 \right] $$

$$
\arg\min_{\beta} \; -n\log\left(\frac{1}{\sigma \sqrt{2\pi}}\right) + \frac{1}{2\sigma^2} \sum_{i=1}^n  \left(y_i - \beta^Tx_i\right)^2
$$

Recognize that $-n\log(\frac{1}{\sigma \sqrt{2\pi}})$ is a constant term with respect to $\beta$, which we are optimizing over, and that $\frac{1}{2 \sigma^2}$ is a positive constant multiplier on the second term. Neither changes the argmin (think derivatives: the constant term vanishes, and a positive factor does not move the point where the gradient is zero). Thus both are inconsequential in determining the argmin, giving us
$$ \arg\min_{\beta} \sum_{i=1}^n \left(y_i - \beta^Tx_i\right)^2$$
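
The claim that the additive constant and the factor $\frac{1}{2\sigma^2}$ do not move the minimizer can be sanity-checked with a one-dimensional grid search. This is only a sketch on synthetic data: the scalar model, the grid range, and the parameter values are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
beta_true, sigma, n = 2.0, 0.4, 200   # illustrative values
x = rng.uniform(-1.0, 1.0, size=n)
y = beta_true * x + rng.normal(0.0, sigma, size=n)

def sse(beta):
    """Sum of squared errors for a scalar weight beta."""
    return np.sum((y - beta * x) ** 2)

def negative_log_likelihood(beta):
    """Gaussian negative log-likelihood: constant term plus SSE / (2 sigma^2)."""
    return n * np.log(sigma * np.sqrt(2.0 * np.pi)) + sse(beta) / (2.0 * sigma**2)

# Evaluate both objectives on the same grid of candidate weights.
betas = np.linspace(0.0, 4.0, 401)
sse_values = np.array([sse(b) for b in betas])
nll_values = np.array([negative_log_likelihood(b) for b in betas])

beta_sse = betas[np.argmin(sse_values)]
beta_nll = betas[np.argmin(nll_values)]
beta_closed_form = np.sum(x * y) / np.sum(x * x)  # 1-D least squares solution

print(beta_sse, beta_nll, beta_closed_form)
```

Both objectives pick the same grid point, and that point sits within one grid step of the closed-form one-dimensional least squares solution.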

As we can see, this is exactly the least squares error of the ground truth $y_i$ against its prediction $\beta^Tx_i$. We can also rewrite this in matrix form, where $Y \in \mathbb{R}^n$ stacks all the $y_i$'s and $X \in \mathbb{R}^{n \times d}$ has the $x_i^T$'s as its rows.
$$\arg\min_{\beta} \sum_{i=1}^n \left(y_i - \beta^Tx_i\right)^2 = \arg\min_{\beta} (Y-X\beta)^T(Y - X\beta) = \arg\min_{\beta} ||Y-X\beta||^2$$

Thus we have shown that finding the maximum likelihood estimate for data with Gaussian noise is equivalent to finding the optimal least squares solution, i.e. linear regression.
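
The sketch below checks that the per-sample sum and the matrix form of the squared error agree, and then minimizes it with NumPy's `np.linalg.lstsq`. The synthetic data, the candidate `beta`, and the parameter values are assumptions made for illustration; `lstsq` is just one convenient least squares solver, not the only way to compute the minimizer.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 3
beta_true = np.array([1.0, -2.0, 0.5])   # hypothetical weights used to generate data
sigma = 0.2
X = rng.normal(size=(n, d))              # row i holds x_i^T
Y = X @ beta_true + rng.normal(0.0, sigma, size=n)

# Three equivalent ways to write the squared error for an arbitrary candidate beta.
beta = np.array([0.8, -1.5, 0.0])
sum_form = sum((Y[i] - beta @ X[i]) ** 2 for i in range(n))
residual = Y - X @ beta
vector_form = residual @ residual        # (Y - X beta)^T (Y - X beta)
norm_form = np.linalg.norm(residual) ** 2

# Least squares solution, which is also the Gaussian maximum likelihood estimate.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
sse_hat = np.sum((Y - X @ beta_hat) ** 2)
sse_true = np.sum((Y - X @ beta_true) ** 2)

print(sum_form, vector_form, norm_form)
print(beta_hat)
```

Note that `beta_hat` achieves a squared error no larger than that of `beta_true` on this sample, since it is fitted to the noisy data rather than being the generating weights.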

<img src="/media/images/20260421_020029_linear_regression_MLE_residuals_thumb.png" alt="linear regression MLE residuals thumb" style="max-width: min(800px, 100%); height: auto;">
<center>
$\textbf{Figure 2}$ Residual Showing that the Best Fitted Line gives Maximum Likelihood Estimation with Gaussian Noise
</center>

We can see from **Figure 2** (also the title image) that the residual (error) from estimating $y$ with $\hat{y} = \beta^Tx$ is Gaussian distributed, provided that the model assumptions hold and $\beta$ has been optimized. This makes sense from the definition of our model
$$ y_i = \beta^Tx_i + \epsilon_i \rightarrow N(y_i | \beta^Tx_i, \sigma^2)$$

since even if $\beta$ is optimal, there will still be the error $\epsilon$, which is Gaussian distributed.
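
One way to see this numerically is to fit a line to synthetic data and inspect the residuals. The following is a minimal sketch under the assumption that the data really is generated by the model above; the intercept column, the input range, and the parameter values are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma = 5000, 0.5
beta_true = np.array([1.0, 3.0])   # hypothetical intercept and slope

# Design matrix with an intercept column and one feature.
X = np.column_stack([np.ones(n), rng.uniform(0.0, 10.0, size=n)])
y = X @ beta_true + rng.normal(0.0, sigma, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

residual_mean = residuals.mean()
# ddof accounts for the two fitted parameters when estimating the noise scale.
residual_std = residuals.std(ddof=X.shape[1])
frac_within_1 = np.mean(np.abs(residuals) <= sigma)
frac_within_2 = np.mean(np.abs(residuals) <= 2 * sigma)

print(residual_mean, residual_std)   # near 0 and near sigma
print(frac_within_1, frac_within_2)  # roughly 0.68 and 0.95
```

If the noise were not Gaussian, or the relationship not linear, these summary statistics (and a histogram of `residuals`) would be a quick first place to notice the mismatch.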

<img src="/media/images/20260421_020421_linear_regression_MLE_normal.png" alt="linear regression MLE normal" style="max-width: min(800px, 100%); height: auto;">
<center>
$\textbf{Figure 3}$ Linear Regression with Gaussian Noise Data showing Best Fit Line with Gaussian around it showing Point Errors following Gaussian Distribution 
</center>

**Figure 3** shows what this means from a pragmatic standpoint: when a linear model fits a dataset with Gaussian noise, the least squares solution is also the most likely fit under the Gaussian model, and the point errors around the fitted line follow a Gaussian distribution, as can be seen from the green shading.