# Cross Entropy Method: Sample Based Model Predictive Control

by dokuDoku robotics

This is a quick summary of my findings on the Cross Entropy Method (CEM), meant to give a quick overview based on [this blog](https://sanyamkapoor.com/kb/deriving-the-cross-entropy-method/). CEM is related to reinforcement learning methods and Markov Decision Processes (MDPs) in that it tries to optimize a trajectory; here, CEM considers finite horizon control.

In general, MPC (Model Predictive Control) is a control strategy that predicts the future states of a system over a finite horizon using a dynamics model, optimizing a sequence of control actions to minimize a cost function. Typically only the first action of the optimized sequence is applied, and the optimization repeats at the next time step. Because we re-optimize at every time step, this is called receding horizon control.

Traditionally, MPC relies on gradient based optimization, which can be computationally expensive or get stuck in local minima (less of an issue for larger neural networks). Sample-based MPC (like CEM) instead *generates random samples of action sequences, evaluates them, and selects/refines the best ones*. It's useful for high dimensional & complex problems.

### Cross Entropy Method as Sample Based MPC

The Cross Entropy Method (CEM) searches for optimal action sequences by sampling from a probability distribution, then iteratively refines that distribution by focusing on the highest performing samples. In this sense it's a form of importance sampling and an evolutionary strategy.

Some key advantages of the Cross Entropy Method:

- Works well in nonlinear & stochastic environments
- Easy to parallelize, since samples are evaluated independently
- Starts broad and narrows in on promising regions
- Efficient in practice

Some cons:

- Computationally expensive for high dimensional actions or long horizons
- Requires a good dynamics model
- Requires hyperparameter tuning (number of samples, elite fraction, iterations)

### The Algorithm

This is how we can apply CEM in a sample based MPC loop.

**Setup.** We start with a dynamics model $x_{t+1} = f(x_t, u_t) + w_t$, where $x$ is the state, $u$ is the action/input, and $w$ is noise. The cost function over horizon $H$ is $J(U) = \sum_{t=0}^{H-1} c(x_t, u_t)$. The distribution over action sequences $U = [u_0, u_1, \dots, u_{H-1}]$ is usually a Gaussian $\mathcal{N}(\mu, \Sigma)$ with mean $\mu$ (e.g. the previous optimal sequence) and covariance $\Sigma$ for exploration.

**Initialization.** Typically $\mu_0$ is set to zero or to the last MPC output, and $\Sigma_0$ is a diagonal matrix, so no two parameters have covariance. We also need to choose the hyperparameters (number of samples, number of elites, number of iterations).

**Sample.** Draw $K$ action sequences $U^1, U^2, \dots, U^K$ from the current distribution $p(U) = \mathcal{N}(\mu, \Sigma)$.

**Evaluate.** For each sample $U^k$, simulate the system forward from the current state $x_0$ using the dynamics model to get the predicted states $x_1^k, \dots, x_H^k$, and compute the cost $J^k$ of each trajectory. For stochastic systems, average over multiple rollouts per sample.

**Select.** Sort the samples by their costs $J^k$ in ascending order (lower cost is better) and keep the top (elite) samples.

**Update.** Recompute the mean and covariance from the best samples:

- New mean: $\mu' = \frac{1}{|\text{best}|} \sum_{k \in \text{best}} U^k$
- New covariance: $\Sigma' = \frac{1}{|\text{best}|} \sum_{k \in \text{best}} (U^k - \mu')(U^k - \mu')^T$

Optionally, smoothing can be added to prevent abrupt changes, e.g. $\mu_{\text{new}} = \alpha \mu' + (1-\alpha)\mu$ or a similar moving average for $\Sigma$. Typically you repeat for a few iterations until the distribution converges, execute the first action of the resulting mean sequence, and replan at the next time step.
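To make the loop concrete, here is a minimal sketch of one CEM planning step in Python with NumPy. The dynamics model `f`, stage cost `c`, and all hyperparameter values are placeholder assumptions for illustration (they are not from the original post), and it uses a diagonal standard deviation instead of the full covariance update described above for simplicity.

```python
import numpy as np

def cem_plan(x0, f, c, H, act_dim,
             n_samples=64, n_elites=8, n_iters=5,
             mu=None, sigma=1.0, alpha=0.5):
    """One CEM planning step (sketch).

    f(x, u) -> next state   (assumed dynamics model)
    c(x, u) -> scalar cost  (assumed stage cost)
    Returns the first action of the optimized sequence and the final mean,
    which can warm-start the next MPC step.
    """
    # Initialize the sampling distribution: zero mean (or warm start) and a
    # diagonal spread, so action dimensions start out uncorrelated.
    if mu is None:
        mu = np.zeros((H, act_dim))
    std = np.full((H, act_dim), sigma)

    for _ in range(n_iters):
        # Sample: draw K action sequences U^1..U^K from N(mu, diag(std^2)).
        U = mu + std * np.random.randn(n_samples, H, act_dim)

        # Evaluate: roll each sequence forward through the model, summing costs.
        costs = np.zeros(n_samples)
        for k in range(n_samples):
            x = x0
            for t in range(H):
                costs[k] += c(x, U[k, t])
                x = f(x, U[k, t])

        # Select: keep the lowest-cost (elite) sequences.
        elites = U[np.argsort(costs)[:n_elites]]

        # Update: refit mean/std to the elites, smoothing toward the old values.
        mu = alpha * elites.mean(axis=0) + (1 - alpha) * mu
        std = alpha * elites.std(axis=0) + (1 - alpha) * std

    # Receding horizon: execute only the first action, replan at the next step.
    return mu[0], mu
```

In a full MPC loop you would call `cem_plan` at every time step with the current state, apply the returned first action to the real system, and pass the (optionally time-shifted) mean back in as the warm start for the next call.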