Different Forms of Linear Regression

by dokuDoku Statistics

**Patrick Star**: You got it set to ‘M’ for mini, when it should be ‘W’ for wumbo.
*Turns the M on the belt upside down*
**SpongeBob SquarePants**: Patrick, I don't think "wumbo" is a real word.
**Patrick Star**: I wumbo, you wumbo, he, she, me, wumbo… wumbology—the study of wumbo. It’s first grade!...
**SpongeBob SquarePants**: Patrick, sorry I doubted you.

-Patrick Star on how the same object can have multiple representations. 
 Mini and Wumbo.

---

## Different Formulations of Linear Regression

Just as the letters **M** and **W** are the same shape flipped upside down, Linear Regression (least squares optimization) has several equivalent representations. Many of them appear in the literature (and even on this blog), so this post serves as a reference for identifying the common ones and relating them to each other.

The unifying principle across all formulations of linear regression stated here is that we have an input vector $x \in \mathbb{R}^d$ that we dot with a weight vector $w \in \mathbb{R}^d$ to give an estimate $\hat{y} \in \mathbb{R}$ of the ground truth $y \in \mathbb{R}$:

$$ \hat{y} = w^Tx $$
All formulations are linear maps applied to inputs, with an optional bias.

We fit the weights by minimizing the least squares loss
$$ || \hat{y} - y||^2 = ||w^Tx -y ||^2$$

The formulations below differ in how they answer the following questions:
- How many data samples $x$ are we considering at a time?
    - Do we stack each sample $x_i$ in the rows or columns of $X$?
- What if we have multiple dimensions for the output $y \in \mathbb{R}^k$?
    - We then need $k$ weight vectors $w$, one for each output dimension
- What if both cases occur?

### Single Input 
Consider that we only have one input $x \in \mathbb{R}^d$ that's of dimension $d$. We have that our estimate $\hat{y}$ is equal to 
$$ \hat{y} = w^Tx + b$$
Giving loss 
$$ ||w^Tx + b - y||^2$$
- $x \in \mathbb{R}^d$, input column vector $x$
- $w \in \mathbb{R}^d$, weight vector dot product'd with $x$ 
    - The parameters of our model 
- $b \in \mathbb{R}$, scalar bias for our model 
- $y \in \mathbb{R}$, scalar ground truth
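
To make the single-input case concrete, here is a minimal sketch in plain Python. The numbers and the helper name `dot` are made up purely for illustration (with $d = 3$); they are not from any dataset.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    assert len(u) == len(v)
    return sum(ui * vi for ui, vi in zip(u, v))

# Hypothetical values, chosen only for illustration (d = 3).
x = [1.0, 2.0, 3.0]    # input, x in R^d
w = [0.5, -1.0, 2.0]   # weights, w in R^d
b = 0.5                # scalar bias
y = 4.0                # scalar ground truth

y_hat = dot(w, x) + b      # w^T x + b
loss = (y_hat - y) ** 2    # for scalars, ||w^T x + b - y||^2 is a plain square

print(y_hat, loss)  # 5.0 1.0
```

Since the output is a scalar, the norm in the loss reduces to an ordinary squared difference.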

#### Equivalent flipped form 
Consider that we can express the dot product $w^Tx$ as $x^Tw$, giving us
$$ \hat{y} = x^Tw + b$$
Giving loss 
$$ ||x^Tw + b - y||^2$$
- $x \in \mathbb{R}^d$, input column vector $x$
- $w \in \mathbb{R}^d$, weight vector dot product'd with $x$ 
    - The parameters of our model 
- $b \in \mathbb{R}$, scalar bias for our model 
- $y \in \mathbb{R}$, scalar ground truth

#### Norm Form 
Consider that we include the bias term as part of the weights $w$ and add an extra 1 to $x$ to compensate. This gives us the cleaner forms 
$$ \hat{y} = x^Tw $$
Giving loss 
$$ ||w^Tx - y||^2$$
and equivalently 
$$ ||x^Tw - y||^2$$
- $x \in \mathbf{\mathbb{R}^{d+1}}$, input column vector $x$
- $w \in \mathbf{\mathbb{R}^{d+1}}$, weight vector dot product'd with $x$ 
    - The parameters of our model 
- The bias $b \in \mathbb{R}$ no longer appears separately; it is the entry of $w$ that multiplies the extra 1 in $x$
- $y \in \mathbb{R}$, scalar ground truth
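
As a quick sanity check, the sketch below shows that absorbing the bias gives the same prediction as keeping it separate. One assumption to flag: we append the constant 1 (and the bias) as the *last* entry. The position is arbitrary as long as $x$ and $w$ agree on it, and the values are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

# Hypothetical values, chosen only for illustration (d = 3).
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 2.0]
b = 0.5

# Assumption: the extra 1 and the bias go in the last position.
x_aug = x + [1.0]   # x in R^{d+1}
w_aug = w + [b]     # w in R^{d+1}, bias folded in

with_bias = dot(w, x) + b       # w^T x + b
absorbed = dot(x_aug, w_aug)    # x^T w with the augmented vectors

print(with_bias, absorbed)  # 5.0 5.0
```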

### Matrix & Batch Formulation 
Suppose that now we have $n$ inputs, instead of just one.

#### Each Row is a Sample
This convention is very common in machine learning and statistics. Each sample is entered as a row, so if we have $n$ data samples, each of dimension $d$, the shape of our data is
$$ X \in \mathbb{R}^{n \times d} \rightarrow 
X = \begin{bmatrix}
-\hspace{0.5em}\mathbf{x}_1\hspace{0.5em}- \\
-\hspace{0.5em}\mathbf{x}_2\hspace{0.5em}- \\
\vdots \\
-\hspace{0.5em}\mathbf{x}_n\hspace{0.5em}-
\end{bmatrix}
$$
Assuming that $w \in \mathbb{R}^d$ we have that our prediction $Xw$ will be every row of $X$ dot producted with $w$, giving us $n$ predictions, one for each data entry.

$$
\begin{bmatrix}
-\hspace{0.5em}\mathbf{x}_1\hspace{0.5em}- \\
-\hspace{0.5em}\mathbf{x}_2\hspace{0.5em}- \\
\vdots \\
-\hspace{0.5em}\mathbf{x}_n\hspace{0.5em}-
\end{bmatrix}
\begin{bmatrix}
w_1 \\
w_2 \\
\vdots \\
w_d
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{x}_1^\top \mathbf{w} \\
\mathbf{x}_2^\top \mathbf{w} \\
\vdots \\
\mathbf{x}_n^\top \mathbf{w}
\end{bmatrix}
\in \mathbb{R}^{n}
$$

Given that $y \in \mathbb{R}^n$, which is a scalar output for each input in $X$, we have that

$$\hat{y} = Xw \in \mathbb{R}^n$$
Giving loss 
$$ ||Xw - y||^2$$
- $X \in \mathbb{R}^{n \times d}$
- $w \in \mathbb{R}^d$
- $y \in \mathbb{R}^n$
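
Here is a small sketch of the rows-as-samples form, with $X$ stored as a list of rows. The helper `matvec` and all the numbers are hypothetical ($n = 3$, $d = 2$); in practice you would likely reach for a linear algebra library instead.

```python
def matvec(X, w):
    """Multiply an n x d matrix (a list of rows) by a length-d vector."""
    return [sum(xij * wj for xij, wj in zip(row, w)) for row in X]

# Hypothetical data: n = 3 samples, each of dimension d = 2.
X = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
w = [1.0, -1.0]
y = [0.0, -1.0, 1.0]

y_hat = matvec(X, w)   # one prediction per row of X
loss = sum((p - t) ** 2 for p, t in zip(y_hat, y))   # ||Xw - y||^2

print(y_hat)  # [-1.0, -1.0, -1.0]
print(loss)   # 5.0
```

Each entry of `y_hat` is the dot product of one row of `X` with `w`, which is exactly the matrix-vector picture above.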

#### Each Column is a Sample

Now suppose that every column of $X$ is a sample instead of every row. For $n$ data samples, each of dimension $d$, we have

$$
X \in \mathbb{R}^{d \times n} \rightarrow
X = \begin{bmatrix}
\mid & \mid & & \mid \\
\mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_n \\
\mid & \mid & & \mid
\end{bmatrix}
\in \mathbb{R}^{d \times n}
$$

Keeping $w \in \mathbb{R}^d$ as a column vector, our predictions are now $w^TX$, so that we take the dot product of $w$ with each column (data entry) of $X$

$$
\begin{bmatrix} w_1 & w_2 & \cdots & w_d \end{bmatrix}
\begin{bmatrix}
\mid & \mid & & \mid \\
\mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_n \\
\mid & \mid & & \mid
\end{bmatrix}
= 
\begin{bmatrix} w^T \mathbf{x}_1 & w^T \mathbf{x}_2 & \cdots & w^T \mathbf{x}_n \end{bmatrix}
\in \mathbb{R}^{1 \times n}
$$

Given that $y \in \mathbb{R}^n$, which holds a scalar output for each input in $X$, we have that
$$\hat{y} = w^TX \in \mathbb{R}^{1 \times n}$$

Giving loss 
$$||w^TX - y^T||^2 \leftrightarrow ||X^Tw - y||^2$$
- $X \in \mathbb{R}^{d \times n}$
- $w \in \mathbb{R}^d$
- $y \in \mathbb{R}^n$
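
The two storage conventions give the same numbers, just laid out differently. The sketch below builds a column-sample matrix by transposing a row-sample one and checks that $w^TX$ matches the row-convention predictions. The helper names (`transpose`, `matvec`, `vecmat`) and the data are hypothetical.

```python
def transpose(M):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]

def matvec(A, v):
    """Matrix (list of rows) times a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def vecmat(w, X):
    """Row vector w^T (length d) times a d x n matrix, giving n entries."""
    n = len(X[0])
    return [sum(wi * X[i][j] for i, wi in enumerate(w)) for j in range(n)]

# Hypothetical data: n = 3 samples, each of dimension d = 2.
X_rows = [[1.0, 2.0],
          [3.0, 4.0],
          [5.0, 6.0]]           # n x d, each row is a sample
X_cols = transpose(X_rows)      # d x n, each column is a sample
w = [1.0, -1.0]

preds_cols = vecmat(w, X_cols)  # w^T X in the column convention
preds_rows = matvec(X_rows, w)  # X w in the row convention

print(preds_cols)  # [-1.0, -1.0, -1.0]
print(preds_rows)  # [-1.0, -1.0, -1.0]
```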

### Outputs as vectors

Consider now that we have $y \in \mathbb{R}^k$ for each data entry $x \in \mathbb{R}^d$. This means that we must have $W$ as a matrix $W \in \mathbb{R}^{k \times d}$ such that 
$$ \hat{y} = Wx + b \in \mathbb{R}^k$$
and $b \in \mathbb{R}^k$

Giving loss 
$$ ||Wx + b - y||^2$$
- $W \in \mathbb{R}^{k \times d}$
- $x \in \mathbb{R}^d$ column vector 
- $Wx \in \mathbb{R}^k$
- $y \in \mathbb{R}^k$
- $b \in \mathbb{R}^k$, the bias vector of dimension $k$
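
A minimal sketch of the vector-output case, where each row of $W$ acts as its own weight vector for one output dimension. The values are hypothetical ($k = 2$, $d = 3$).

```python
# Hypothetical values: k = 2 outputs, d = 3 inputs.
W = [[1.0, 0.0, -1.0],
     [0.5, 2.0, 0.0]]   # W in R^{k x d}
x = [1.0, 2.0, 3.0]     # x in R^d
b = [0.5, -0.5]         # b in R^k
y = [-1.0, 4.0]         # y in R^k

# Each output is one row of W dotted with x, plus that row's bias.
y_hat = [sum(wij * xj for wij, xj in zip(row, x)) + bi
         for row, bi in zip(W, b)]

# ||Wx + b - y||^2 is the sum of squared errors over the k outputs.
loss = sum((p - t) ** 2 for p, t in zip(y_hat, y))

print(y_hat)  # [-1.5, 4.0]
print(loss)   # 0.25
```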

#### With multiple data entries $X$
Suppose that we want to compute a vector output $y \in \mathbb{R}^k$ for each of $n$ inputs. As before, there are two cases, depending on whether the samples are the rows or the columns of $X$.

#### Each Row is a Sample
We have $n$ data samples, where each sample is a row vector of dimension $d$.

$$
X \in \mathbb{R}^{n \times d}, \qquad
X =
\begin{bmatrix}
-\,\mathbf{x}_1\,- \\
-\,\mathbf{x}_2\,- \\
\vdots \\
-\,\mathbf{x}_n\,-
\end{bmatrix}
$$

where each $\mathbf{x}_i \in \mathbb{R}^{1 \times d}$.

We then define a weight matrix $W \in \mathbb{R}^{d \times k}$, where each column is a different weight vector:

$$
W =
\begin{bmatrix}
\mid & \mid & & \mid \\
\mathbf{w}_1 & \mathbf{w}_2 & \cdots & \mathbf{w}_k \\
\mid & \mid & & \mid
\end{bmatrix}
\in \mathbb{R}^{d \times k}
$$

Thus, our prediction matrix is

$$
XW =
\begin{bmatrix}
-\,\mathbf{x}_1\,- \\
-\,\mathbf{x}_2\,- \\
\vdots \\
-\,\mathbf{x}_n\,-
\end{bmatrix}
\begin{bmatrix}
\mid & \mid & & \mid \\
\mathbf{w}_1 & \mathbf{w}_2 & \cdots & \mathbf{w}_k \\
\mid & \mid & & \mid
\end{bmatrix}
$$

and the $i$-th row of the result is

$$
\mathbf{x}_i W
=
\begin{bmatrix}
\mathbf{x}_i \mathbf{w}_1 & \mathbf{x}_i \mathbf{w}_2 & \cdots & \mathbf{x}_i \mathbf{w}_k
\end{bmatrix}.
$$

So the full product is

$$
XW
=
\begin{bmatrix}
\mathbf{x}_1 \mathbf{w}_1 & \mathbf{x}_1 \mathbf{w}_2 & \cdots & \mathbf{x}_1 \mathbf{w}_k \\
\mathbf{x}_2 \mathbf{w}_1 & \mathbf{x}_2 \mathbf{w}_2 & \cdots & \mathbf{x}_2 \mathbf{w}_k \\
\vdots & \vdots & & \vdots \\
\mathbf{x}_n \mathbf{w}_1 & \mathbf{x}_n \mathbf{w}_2 & \cdots & \mathbf{x}_n \mathbf{w}_k
\end{bmatrix}
\in \mathbb{R}^{n \times k}.
$$

That is, each sample $\mathbf{x}_i$ is mapped to a $k$-dimensional output vector, where the $j$-th component is given by the dot product $\mathbf{x}_i \mathbf{w}_j$.

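Our prediction $\hat{Y} = XW \in \mathbb{R}^{n \times k}$ yields the following loss, where the norm of a matrix is the Frobenius norm (so the squared norm is the sum of squared entries)
$$ || XW - Y ||^2 \leftrightarrow ||W^TX^T - Y^T||^2 $$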
- $X \in \mathbb{R}^{n \times d}$ where every row is a data entry of dimension $d$
- $W \in \mathbb{R}^{d \times k}$ that maps from a vector space in $\mathbb{R}^d \rightarrow \mathbb{R}^k$
    - $W: \mathbb{R}^d \rightarrow \mathbb{R}^k$ is the learned linear mapping
- $Y \in \mathbb{R}^{n \times k}$ for each of the $n$ entries in $X$ mapped to $\mathbb{R}^k$
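
To tie the batch and vector-output cases together, here is a small sketch that computes $XW$, the sum-of-squared-entries loss, and checks that the transposed form $W^TX^T$ gives the same predictions and loss. The helpers (`matmul`, `transpose`, `frob_sq`) and the data are hypothetical ($n = 3$, $d = 2$, $k = 3$).

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(M):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]

def frob_sq(A, B):
    """Squared Frobenius norm of A - B (sum of squared entries)."""
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))

# Hypothetical data: n = 3 samples, d = 2 features, k = 3 outputs.
X = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]               # n x d
W = [[1.0, 0.0, 1.0],
     [0.0, 1.0, -1.0]]         # d x k, each column is a weight vector
Y = [[1.0, 2.0, 0.0],
     [3.0, 4.0, 0.0],
     [5.0, 6.0, 0.0]]          # n x k

Y_hat = matmul(X, W)           # n x k, row i is the prediction for sample i
loss = frob_sq(Y_hat, Y)       # ||XW - Y||^2

# Transposed form: W^T X^T is k x n and holds the same numbers.
Y_hat_T = matmul(transpose(W), transpose(X))
loss_T = frob_sq(Y_hat_T, transpose(Y))   # ||W^T X^T - Y^T||^2

print(Y_hat)         # [[1.0, 2.0, -1.0], [3.0, 4.0, -1.0], [5.0, 6.0, -1.0]]
print(loss, loss_T)  # 3.0 3.0
```

Transposing only rearranges the entries, so the sum of squared errors is unchanged.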