Theory of Optimization: Projected (Sub)Gradient Descent

6 minute read

Published: February 04, 2019

In this post, we will continue our analysis for gradient descent. Different from the previous post, we will not assume that the function is smooth. We will only assume that the function is convex and has some Lipschitz constant.

In the previous lecture, we assume that all of the functions has \(L\)-Lipschitz gradient. For general \(L\)-smooth functions, the gradient descent algorithm can get a first order \(\epsilon\)-critical proint in \(O(\frac{1}{\epsilon^2})\) iteration. When the function is convex, we show that we can use \(O(\frac{1}{\epsilon})\) iterations to get a solution differ from the optimal for at most \(\epsilon\). When the function is strongly convex and smooth, we show that the number of iterations can be reduced to \(O(poly(\log \frac{1}{\epsilon}))\).

However, in the previous post, we assume that the function is smooth, which implies that the function has gradient at all points. In this post, we will first assume that the function is convex but not smooth. Besides, in the previous post, we also focus on the unconstraint case, and in this posts, we will also introduce the analysis for constraint minimization.

Projected Subgradient Descent

In this post, we assume that the convex optimization problem has the following form:

\[\min_{x\in\mathcal X}f(x).\]

In the problem, \(\mathcal X\) is the constraint set, which could be \(\mathbb R^n\).

Subgradient: Let \(\mathcal X\subset \mathbb R^n\), and \(f:\mathcal X\to \mathbb R\). Then \(g\in\mathbb R^n\) is a subgradient of \(f\) at \(x\in\mathcal X\) if for any \(y\in\mathcal X\), one has
\[f(y) \ge f(x) + g^T(y-x).\]
We use \(\partial f(x)\) to denote the set of subgradient at \(x\), i.e.
\[\partial f(x) := \{g\in\mathbb R^n: g \text{ is a subgradient of } f \text{ at } x\}.\]

Note that if \(f\) is differentiable at point \(x\), then \(g\) is just \(\nabla f(x)\). So the notion of subgradient is compatible with that of the original gradient.

We will also introduce the projection operator \(\Pi_{\mathcal X}\) on \(\mathcal X\) by

\[\Pi_{\mathcal X}(x) = \arg\min_{y\in\mathcal X}||x-y||^2.\]

We have the following lemma for projection operator.

Let \(x\in\mathcal X\) and \(y\in\mathbb R^n\), then
\[(\Pi_{\mathcal X}(y) - x)^T(\Pi_{\mathcal X}(y) - y)\le 0,\]
which also implies \(||\Pi_{\mathcal X}(y) - x||^2 + ||y-\Pi_{\mathcal X}(y)||^2 \le ||y-x||^2.\)

Then, we will introduce the projected gradient descent algorithm. The algorithm works as follow:

For \(t = 1,2,\dots\)
\(y^{(t+1)} = x^{(t)} - \eta g_t, g_t\in\partial f(x^{(t)})\) and \(x^{(t+1)} = \Pi_{\mathcal X}(y^{(t+1)})\)

Analysis for Lipschitz Functions

Suppose \(f\) is convex. Furthermore, suppose that for any \(x\in\mathcal X, g\in\partial f(x)\), we have \(||g|| \le L\), then the projected subgradient method with \(\eta = \frac{R}{L\sqrt{t}}\) satisfies
\[f\left(\frac{1}{t}\sum_{s=1}^t x_s\right) - f(x^*) \le \frac{RL}{\sqrt{t}}.\]

Proof: By the convexity of \(f\), we have

\begin{align} &f(x^{(s)}) - f(x^) \newline \le& g_s(x^{(s)} - x^) \newline =& \frac{1}{\eta}(x^{(s)} - y^{(s+1)})^T(x^{(s)} - x^) \newline =& \frac{1}{2\eta}(||x^{(s)} - x^||^2 + ||x^{(s)} - y^{(s+1)}||^2 - ||x^{} - y^{(s+1)}||^2) \newline =& \frac{1}{2\eta}(||x^{(s)} - x^||^2 - ||x^{*} - y^{(s+1)}||^2) + \frac{\eta}{2}||g_s||^2. \end{align}

Now, from \(||g|| \le L\) and \(||x^{*} - y^{(s+1)}||^2 \ge ||x^{*} - x^{(s+1)}||^2\), we have

\[f(x^{(s)}) - f(x^*) \le \frac{1}{2\eta}(||x^{(s)} - x^*||^2 - ||x^{*} - x^{(s+1)}||^2) + \frac{\eta}{2}L^2.\]

Summing up the equations, we complete the proof.

⬜

Analysis for Smooth Functions(Lipschitz Gradient)

In this section, we assume that \(f\) is convex and \(\beta\)-smooth. In the previous post when the optimization problem is not constraint, we have the following inequality(change parameters)

\[f(x^{(k+1)}) \le f(x^{(k)}) - \frac{1}{2\beta}||\nabla f(x^{(k)})||_2^2.\]

However, this inequality may not be true in the constraint case, since we need to apply the projection operation. The next lemma shows the `right’ quantity to measure the descent procedure.

Let \(x,y\in\mathcal X,x^{+} = \Pi_{\mathcal X}(x-\frac{1}{\beta}\nabla f(x))\) and \(g_{\mathcal X}(x) = \beta (x-x^{+})\). Then the following holds true:
\[f(x^+) - f(y) \le g_{\mathcal X}(x)^T(x-y) - \frac{1}{2\beta}||g_{\mathcal X}(x)||^2.\]

Proof: We first observe that

\[\nabla f(x)^T(x^+ - y) \le g_{\mathcal X}(x)^T(x^+ - y),\]

since the above inequality is equivalent to

\[\left(x^+ - \left(x - \frac{1}{\beta}\nabla f(x)\right)\right)^T(x^+ - y) \le 0.\]

Then, we have

\begin{align} &f(x^+) - f(y) \newline =& f(x^+) -f(x) + f(x) - f(y) \newline \le& \nabla f(x)^T(x^+ - x) + \frac{\beta}{2}||x^+ - x||^2 + \nabla f(x)^T(x-y) \newline =& \nabla f(x)^T(x^+ - y) + \frac{1}{2\beta}||g_{\mathcal X}(x)||^2 \newline \le& g_{\mathcal X}(x)^T(x^+ - y) + \frac{1}{2\beta}||g_{\mathcal X}(x)||^2 \newline =& g_{\mathcal X}(x)^T(x - y) - \frac{1}{2\beta}||g_{\mathcal X}(x)||^2. \end{align}

⬜

Now we can show the following theorem for the convergence of PGD.

Let \(f\) be convex and \(\beta\)-smooth on \(\mathcal X\). Then projected gradient descent with \(\eta = \frac{1}{\beta}\) satisfies
\[f(x^{(t)}) - f(x^*) \le \frac{3\beta ||x^{(1)} - x^*||^2 + f(x^{(1)}) - f(x^*)}{t}.\]

Proof: From the previous lemma, we have

\[f(x^{(s+1)}) - f(x^{(s)}) \le -\frac{1}{2\beta}||g_{\mathcal X}(x^{(s)})||^2,\]

and

\[f(x^{(s+1)}) - f(x^*) \le ||f_{\mathcal X}(x^{(s)}||\cdot ||x^{(s)} - x^*||.\]

Then we show that \(||x^{(s)} - x^*||\) is decreasing with \(s\). From the previous lemma, we can also find

\[g_{\mathcal X}(x^{(s)})^T(x^{(s)} - x^*) \ge \frac{1}{2\beta}||g_{\mathcal X}(x^{(s)})||^2,\]

then we have

\begin{align} &||x^{(s+1)} - x^||^2 \newline =& ||x^{(s)} - \frac{1}{\beta}g_{\mathcal X}(x^{(s)}) - x^||^2 \newline =& ||x^{(s)} - x^||^2 - \frac{2}{\beta}g_{\mathcal X}(x^{(s)})^T(x^{(s)} - x^) + \frac{1}{\beta^2}||g_{\mathcal X}(x^{(s)})||^2\newline \le& ||x^{(s)} - x^*||^2. \end{align}

Let \(\epsilon_s = f(x^{(s)}) - f(x^*)\), we have

\begin{align} \epsilon_{s+1} \le& \epsilon_s - \frac{1}{2\beta}||g_{\mathcal X}(x^{(s)})||^2 \newline \le& \epsilon_s - \frac{1}{2\beta||x^{(s)} - x^||^2}\epsilon_{s+1}^2 \newline \le& \epsilon_s - \frac{1}{2\beta||x^{(1)} - x^||^2}\epsilon_{s+1}^2. \end{align}

Then we can finish the proof by simple induction.

⬜

Analysis for Strongly Convex Functions

In this section, we consider the projected gradient descent with time-varying step size \((\eta_t)_{t\ge 1}\), that is

For \(t = 1,2,\dots\)
\(y^{(t+1)} = x^{(t)} - \eta_t g_t, g_t\in\partial f(x^{(t)})\) and \(x^{(t+1)} = \Pi_{\mathcal X}(y^{(t+1)})\)

Then we have the following theorem

Let \(f\) be \(\alpha\)-strongly convex and \(L\)-Lipschitz on \(\mathcal X\). Then projected subgradient descent with \(\eta_s = \frac{2}{\alpha (s+1)}\) satisfies,
\[f\left(\sum_{s=1}^t\frac{2s}{t(t+1)}x_s\right) - f(x^*) \le \frac{2L^2}{\alpha(t+1)}.\]

Proof: Similar with the projected subgradient descent for lipschitz functions, we have

\[f(x^{(s)}) - f(x^*) \le \left(\frac{1}{2\eta_s} - \frac{\alpha}{2}\right)||x^{(s)} - x^*||^2 - \frac{1}{2\eta_s}||x^{*} - x^{(s+1)}||^2 + \frac{\eta_s}{2}L^2.\]

Multiplying the inequalities by \(s\), we have

\[s(f(x^{(s)}) - f(x^*)) \le \frac{\alpha}{4}\left(s(s-1)||x^{(s)} - x^*||^2 - s(s+1)||x^{*} - x^{(s+1)}||^2\right) + \frac{L^2}{\alpha}.\]

Summing up the above inequalities and applying the Jensen’s inequality, we complete the proof.

⬜

Analysis for Strongly Convex and Smooth Functions

The key improvement compared with convex and smooth function is that, one can show

\[f(x^+) - f(y) \le g_{\mathcal X}(x)^T(x-y) - \frac{1}{2\beta}||g_{\mathcal X}(x)||^2 - \frac{\alpha}{2}||x-y||^2.\]

Based on this result, we have the following theorem

Let \(f\) be \(\alpha\)-strongly convex and \(\beta\)-smooth on \(\mathcal X\). Then projected gradient descent with \(\eta = \frac{1}{\beta}\) satisfies,
\[||x^{(t+1)} - x^*||^2 = \left(1-\frac{\alpha}{\beta}\right)^t||x^{(1)} - x^*||^2.\]

Proof: Using the previous inequality with \(y = x^*\), we have

\begin{align} & ||x^{(t+1)} - x^||^2 \newline = & ||x^{(t)} - \frac{1}{\beta}g_{\mathcal X}(x^{(t)}) - x^||^2 \newline = & ||x^{(t)} - x^||^2 - \frac{2}{\beta}g_{\mathcal X}(x^{(t)})^T(x^{(t)} - x^) + \frac{1}{\beta^2}||g_{\mathcal X}(x^{(t)})||^2 \newline \le & ||x^{(t)} - x^||^2 - \frac{\alpha}{\beta}||x^{(t)} - x^||^2 \newline = & \left(1-\frac{\alpha}{\beta}\right)^t||x^{(1)} - x^*||^2. \end{align}

⬜

Share on

Twitter Facebook LinkedIn

Haoyu Zhao

Theory of Optimization: Projected (Sub)Gradient Descent

Projected Subgradient Descent

Analysis for Lipschitz Functions

Analysis for Smooth Functions(Lipschitz Gradient)

Analysis for Strongly Convex Functions

Analysis for Strongly Convex and Smooth Functions

Share on

You May Also Enjoy

Deep Reinforcement Learning: Model Based Reinforcement Learning

Deep Reinforcement Learning: Policy Gradient and Actor-Critic

Theory of Optimization: More on Mirror Descent

Theory of Optimization: Frank-Wolfe Algorithm