We are interested in solving:
\[\hat{x} = \text{argmin}_{x} \frac{1}{2} \|b - Ax\|_2^2 + \lambda \|x\|_1.\]Denote the objective function as $f(x) = \frac{1}{2} \|b - Ax\|_2^2 + \lambda \|x\|_1$. We will derive an algorithm for minimizing it.
Let $A = I$. Then the problem has the objective:
\[f(x) = \frac{1}{2} \|b - x\|_2^2 + \lambda \|x\|_1 = \sum_{i=1}^n \left( \frac{1}{2} |b_i - x_i|^2 + \lambda |x_i| \right).\]Since the sum separates across coordinates, it suffices to minimize each scalar term, which has the form:
\(g(\tau) = \frac{1}{2} |\gamma - \tau|^2 + \lambda |\tau|, \qquad \gamma \in \mathbb{R}.\) We compute the subdifferential
\[\partial g(\tau) = \tau - \gamma + \lambda \partial|\tau|,\]where
\[\partial|\tau| = \begin{cases} -1 & \text{for } \tau < 0, \\ 1 & \text{for } \tau > 0, \\ [-1, 1] & \text{for } \tau = 0, \end{cases}\]which follows directly from the definition: away from $0$ the function $|\tau|$ is differentiable, and $v \in \partial|0|$ means $|s| \geq vs$ for all $s \in \mathbb{R}$, which holds exactly when $|v| \leq 1$. Since $g$ is strictly convex, it has a unique minimizer $\tau^*$ (for each $\lambda, \gamma$), at which we must have
\[0 \in \partial g(\tau^*).\]We can now compute an expression for the optimal $\tau^*$ as a function of $\lambda, \gamma$. If $\tau^* > 0$, then $\partial g(\tau^*) = \{\tau^* - \gamma + \lambda\}$, so $0 \in \partial g(\tau^*)$ implies that
\[\tau^* = \gamma - \lambda.\]Since $\tau^* > 0$, this can only hold when $\gamma > \lambda$; including the boundary point $\gamma = \lambda$ (where the formula gives $\tau^* = 0$), we have determined
\[\tau^* = \gamma - \lambda \quad \text{on} \quad [\lambda, \infty).\]Similarly, for $\tau^* < 0$ we compute that
\[\tau^* = \gamma + \lambda \quad \text{on} \quad (-\infty, -\lambda].\]Finally, we have the case $\tau^* = 0$, which implies that
\[0 \in -\gamma + \lambda [-1, 1],\]and hence $\tau^* = 0$ precisely when $\gamma \in [-\lambda, \lambda]$.
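For instance, with $\lambda = 1$: $\gamma = 3$ gives $\tau^* = 2$, $\gamma = -3$ gives $\tau^* = -2$, and any $\gamma \in [-1, 1]$ gives $\tau^* = 0$.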
Taken together, the three cases define $\tau^*$ as a piecewise linear function of $\gamma$. Another way of representing this is
\[\tau^* = ST_\lambda(\gamma).\]
Definition 1. The soft-thresholding function is
\[ST_\lambda(\gamma) := \text{sgn}(\gamma) \max(|\gamma| - \lambda, 0).\]Hence, for $A = I$, we have characterized the optimal $\hat{x}$ as having coordinates $\hat{x}_i = ST_\lambda(b_i)$.
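A minimal NumPy sketch of this closed-form solution (function and variable names are our choices, not from the source):

```python
import numpy as np

def soft_threshold(v, mu):
    """Soft-thresholding ST_mu(v), applied elementwise:
    sgn(v) * max(|v| - mu, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - mu, 0.0)

# For A = I, the optimal x is coordinatewise soft-thresholding of b.
b = np.array([3.0, -0.5, 1.2, -4.0])
lam = 1.0
x_hat = soft_threshold(b, lam)  # approx. [ 2., -0., 0.2, -3. ]
```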
We now derive the algorithm for a general matrix $A$. Introduce the auxiliary function
\[h(x, x^{(0)}) = \frac{c}{2} \|x - x^{(0)}\|_2^2 - \frac{1}{2} \|Ax - Ax^{(0)}\|_2^2 = \frac{1}{2} (x - x^{(0)})^\top (cI - A^\top A)(x - x^{(0)}).\]For this function to be strictly convex in $x$, we need $cI - A^\top A \succ 0$, i.e.,
\[c > \|A^\top A\|_2 = \lambda_{\max}(A^\top A).\]Define the new objective function:
\[\tilde{f}(x, x^{(0)}) = f(x) + h(x, x^{(0)}) = \frac{1}{2} \|b - Ax\|_2^2 + \lambda \|x\|_1 + \frac{c}{2} \|x - x^{(0)}\|_2^2 - \frac{1}{2} \|Ax - Ax^{(0)}\|_2^2.\]Our goal will be to reduce this to something resembling the case with $A=I$, so we introduce the variable:
\[\nu_0 = x^{(0)} + \frac{1}{c} A^\top (b - Ax^{(0)}).\]Then, by some algebra, we have:
\[\tilde{f}(x, x^{(0)}) = \frac{1}{2} \|b - Ax\|_2^2 + \lambda \|x\|_1 + \frac{c}{2} \|x - x^{(0)}\|_2^2 - \frac{1}{2} \|Ax - Ax^{(0)}\|_2^2 = C + \lambda \|x\|_1 + \frac{c}{2} \|x - \nu_0\|_2^2,\]where $C$ does not depend on $x$ (it depends only on $b$ and $x^{(0)}$), and so does not affect the minimization over $x$.
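To spell out the algebra: expanding the squared norms, the quadratic terms $\frac{1}{2} x^\top A^\top A x$ (from $\frac{1}{2}\|b - Ax\|_2^2$) and $-\frac{1}{2} x^\top A^\top A x$ (from $-\frac{1}{2}\|Ax - Ax^{(0)}\|_2^2$) cancel, and the $x$-dependent terms collect into
\[\frac{c}{2} \|x\|_2^2 - c\, x^\top \nu_0 + \lambda \|x\|_1 = \frac{c}{2} \|x - \nu_0\|_2^2 + \lambda \|x\|_1 - \frac{c}{2} \|\nu_0\|_2^2.\]From the $A = I$ case, we see that this is minimized by: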
\[x^{(0),*} = ST_{\frac{\lambda}{c}}(\nu_0) = ST_{\frac{\lambda}{c}} \left( x^{(0)} + \frac{1}{c} A^\top (b - Ax^{(0)}) \right).\]Thus,
\[x^{(0),*} = \text{argmin}_x \tilde{f}(x, x^{(0)}) = \text{argmin}_x \left( f(x) + h(x, x^{(0)}) \right).\]Since $h(x, x^{(0)}) \geq 0$ (by the condition on $c$) with equality at $x = x^{(0)}$, each such minimization cannot increase $f$: indeed, $f(x^{(0),*}) \leq \tilde{f}(x^{(0),*}, x^{(0)}) \leq \tilde{f}(x^{(0)}, x^{(0)}) = f(x^{(0)})$. This yields the proximal algorithm for this objective function and regularization:
\[x^{(i+1)} = ST_{\frac{\lambda}{c}} \left( x^{(i)} + \frac{1}{c} A^\top (b - Ax^{(i)}) \right).\]This is ISTA, the iterative shrinkage-thresholding algorithm.
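A minimal, self-contained sketch of the iteration (the iteration count, the $1.01$ safety factor on $c$, and all names are our choices, not from the source):

```python
import numpy as np

def soft_threshold(v, mu):
    # ST_mu(v), elementwise, as defined above.
    return np.sign(v) * np.maximum(np.abs(v) - mu, 0.0)

def ista(A, b, lam, n_iter=500):
    """ISTA: x <- ST_{lam/c}(x + (1/c) A^T (b - A x)),
    with c chosen just above lambda_max(A^T A) = ||A||_2^2."""
    c = 1.01 * np.linalg.norm(A, 2) ** 2  # squared spectral norm, plus margin
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x + A.T @ (b - A @ x) / c, lam / c)
    return x

# Example: recover a sparse vector from noisy linear measurements (synthetic).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))
x_true = np.zeros(100)
x_true[:5] = [3.0, -2.0, 1.5, 4.0, -1.0]
b = A @ x_true + 0.01 * rng.standard_normal(50)
x_hat = ista(A, b, lam=0.1)
```

More generally, we can change the $\ell^1$ regularization to get iterates of the form: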
\[x^{(i+1)} = T_{\frac{\lambda}{c}} \left( x^{(i)} + \frac{1}{c} A^\top (b - Ax^{(i)}) \right),\]where $T$ plays the role of $ST$: the thresholding (proximal) operator associated with the new regularizer.
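As one illustration (our example; the source does not single out a particular $T$): replacing the $\ell^1$ penalty $\lambda \|x\|_1$ with the $\ell^0$ penalty $\lambda \|x\|_0$ makes $T_{\frac{\lambda}{c}}$ the hard-thresholding operator, which zeroes out entries with $|v_i| \leq \sqrt{2\lambda/c}$, giving iterative hard thresholding:

```python
import numpy as np

def hard_threshold(v, mu):
    # Proximal map of mu * ||.||_0: keep v_i if |v_i| > sqrt(2 mu), else 0.
    return np.where(np.abs(v) > np.sqrt(2.0 * mu), v, 0.0)

def iht(A, b, lam, n_iter=500):
    """Same iteration as ista(), with hard_threshold in place of soft_threshold."""
    c = 1.01 * np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = hard_threshold(x + A.T @ (b - A @ x) / c, lam / c)
    return x
```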