Theory of Optimization: Gradient Descent


In this post, we will review the most basic and most intuitive optimization method: gradient descent.

Gradient Descent

The gradient descent algorithm works as follows. It requires an initial point \(x^{(0)}\in\mathbb R^n\) and a step size \(h > 0\), and it repeatedly executes the update

\[x^{(t+1)} = x^{(t)} - h\cdot\nabla f(x^{(t)}),\]

until \(||\nabla f(x^{(t)})|| \le \epsilon\). In the rest of this section, we will assume that the gradient of \(f\) is L-Lipschitz, i.e.

\[||\nabla f(x) - \nabla f(y)|| \le L||x-y||,\]

and we say that \(f\) has an L-Lipschitz gradient, or that \(f\) is L-smooth.
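As a quick illustration, the update rule and stopping criterion above can be sketched in a few lines of Python (the names `grad_f`, `x0`, and `h` are ours, chosen for this sketch, not from any particular library):

```python
import numpy as np

def gradient_descent(grad_f, x0, h, eps=1e-6, max_iter=10_000):
    """Repeat x_{t+1} = x_t - h * grad_f(x_t) until ||grad_f(x_t)|| <= eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:
            break
        x = x - h * g
    return x

# Toy example: f(x) = ||x||^2 / 2, so grad f(x) = x and L = 1; take h = 1/L.
x = gradient_descent(lambda x: x, x0=[3.0, -4.0], h=1.0)
```

For this toy quadratic a single step with \(h = 1/L\) already lands on the minimizer; in general the loop runs until the gradient norm drops below the tolerance.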

Analysis for General Smooth Functions

In order to show the convergence of gradient descent for general smooth functions, we need the following lemma.

Suppose \(f\) has an L-Lipschitz gradient. Then

\[|f(y) - f(x) - \nabla f(x)^T(y-x)| \le \frac{L}{2}||x - y||^2.\]

Proof: From basic calculus and the assumption that \(f\) has an L-Lipschitz gradient, we have

\begin{align} &|f(y) - f(x) - \nabla f(x)^T(y-x)| \newline =& \left|\int_{0}^1 \nabla f(x + t(y-x))^T(y-x)dt - \nabla f(x)^T(y-x)\right|\newline =& \left|\int_{0}^1 (\nabla f(x + t(y-x))-\nabla f(x))^T(y-x)dt\right| \newline \le& \int_{0}^1 ||\nabla f(x + t(y-x))-\nabla f(x)||\cdot ||y-x||dt \newline \le& \int_{0}^1 tL||y-x||^2dt \newline =& \frac{L}{2}||y-x||^2. \end{align}
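We can sanity-check the lemma numerically on a quadratic \(f(x) = \frac{1}{2}x^TAx\) with \(A\) symmetric positive semidefinite, where \(\nabla f(x) = Ax\) is L-Lipschitz with \(L = \lambda_{\max}(A)\). This is a small sketch assuming numpy; the test function is our choice:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = M @ M.T                      # symmetric PSD
L = np.linalg.eigvalsh(A).max()  # Lipschitz constant of grad f

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x, y = rng.standard_normal(3), rng.standard_normal(3)
lhs = abs(f(y) - f(x) - grad(x) @ (y - x))
rhs = 0.5 * L * np.linalg.norm(y - x) ** 2
assert lhs <= rhs + 1e-12        # the lemma's bound holds
```

For this quadratic the left-hand side equals \(\frac{1}{2}(y-x)^TA(y-x)\), so the bound is tight when \(y - x\) is a top eigenvector of \(A\).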

The previous lemma gives an upper and a lower bound on the function value around a point when the gradient of the function is Lipschitz. From it, we can now show the convergence of gradient descent.

Let \(f\) be a function with an L-Lipschitz gradient and \(x^*\) be any minimizer of \(f\). Gradient descent with step size \(h = \frac{1}{L}\) outputs a point \(x\) such that \(||\nabla f(x)||\le \epsilon\) within \(\frac{2L}{\epsilon^2}(f(x^{(0)}) - f(x^*))\) iterations.

Proof: From the previous lemma, we have

\begin{align} f(x^{(t+1)}) \le& f(x^{(t)}) + \nabla f(x^{(t)})^T(x^{(t+1)} - x^{(t)}) + \frac{L}{2}||x^{(t+1)} - x^{(t)}||^2 \newline =& f(x^{(t)}) - \nabla f(x^{(t)})^T\cdot \frac{1}{L}\nabla f(x^{(t)}) + \frac{L}{2}\cdot\frac{1}{L^2}||\nabla f(x^{(t)})||^2 \newline =& f(x^{(t)}) - \frac{1}{2L}||\nabla f(x^{(t)})||^2. \end{align}

Rearranging, we have \(||\nabla f(x^{(t)})||^2 \le 2L \left(f(x^{(t)}) - f(x^{(t+1)})\right).\)

Summing these inequalities for \(t=0,1,\dots,T\), we get

\[\sum_{t=0}^T ||\nabla f(x^{(t)})||^2 \le 2L(f(x^{(0)}) - f(x^{(T+1)})) \le 2L(f(x^{(0)}) - f(x^{*})).\]

Since the left-hand side has \(T+1\) terms, some iterate \(x^{(t)}\) with \(t \le T\) satisfies \(||\nabla f(x^{(t)})||^2 \le \frac{2L}{T+1}(f(x^{(0)}) - f(x^*))\). Hence gradient descent outputs a point \(x\) such that \(||\nabla f(x)||\le \epsilon\) within \(\frac{2L}{\epsilon^2}(f(x^{(0)}) - f(x^*))\) iterations.
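The telescoping bound is easy to verify numerically. Here is a sketch on a quadratic \(f(x) = \frac{1}{2}x^TAx\) (our choice of test function), with \(L = \lambda_{\max}(A)\) and step size \(h = 1/L\):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M @ M.T                      # symmetric PSD, so f(x) = x^T A x / 2 is smooth
L = np.linalg.eigvalsh(A).max()  # gradient Lipschitz constant

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = rng.standard_normal(4)
f0, grad_sq_sum = f(x), 0.0
for _ in range(50):
    g = grad(x)
    grad_sq_sum += g @ g         # accumulate sum_t ||grad f(x_t)||^2
    x = x - g / L                # step size h = 1/L
# Telescoping: sum of squared gradient norms <= 2L (f(x_0) - f(x_{T+1}))
assert grad_sq_sum <= 2 * L * (f0 - f(x)) + 1e-9
```

Each step decreases \(f\) by at least \(\frac{1}{2L}||\nabla f(x^{(t)})||^2\), which is exactly what the accumulated sum checks.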

Analysis for Convex Functions

In this section, we assume that the function \(f\), which we want to optimize, is convex with an L-Lipschitz gradient. As a result of convexity, we can show that the difference between \(f(x^{(t)})\) and \(f(x^*)\) is \(O(\frac{1}{t})\). Before we state the theorem, we prove an auxiliary lemma.

For any convex function \(f\in\mathcal C^1(\mathbb R^n)\), we have

\[f(x) - f(y) \le ||\nabla f(x)||_2\cdot ||x-y||_2.\]

Proof: Since \(f\) is convex, the first-order condition gives

\[f(y) \ge f(x) + \nabla f(x)^T(y-x).\]

Rearranging the terms and applying the Cauchy-Schwarz inequality, we have

\[f(x) - f(y) \le \nabla f(x)^T(x-y) \le ||\nabla f(x)||\cdot ||x-y||.\]

With the help of the previous lemma, we can show the convergence bound for convex functions now.

Let \(f\in\mathcal C^2(\mathbb R^n)\) be convex with an L-Lipschitz gradient and \(x^*\) be any minimizer of \(f\). With step size \(h = \frac{1}{L}\), the sequence \(x^{(k)}\) produced by gradient descent satisfies

\[f(x^{(k)}) - f(x^{*}) \le \frac{2LD^2}{k+4},\]

where \(D = \max_{f(x) \le f(x^{(0)})}||x-x^*||_2.\)

Proof: Let \(\epsilon_k = f(x^{(k)}) - f(x^*)\). Since \(f\) has an L-Lipschitz gradient, the descent estimate from the previous section gives

\[\epsilon_{k+1} \le \epsilon_k - \frac{1}{2L}||\nabla f(x^{(k)})||_2^2.\]

Then from the previous lemma, we have

\[\epsilon_k \le ||\nabla f(x^{(k)})||_2\cdot ||x^{(k)} - x^*||_2 \le ||\nabla f(x^{(k)})||_2\cdot D.\]

Therefore, we have

\[\epsilon_{k+1} \le \epsilon_k - \frac{1}{2L}\left(\frac{\epsilon_k}{D}\right)^2.\]

Moreover, since \(\nabla f(x^*) = 0\), we can also bound \(\epsilon_0\) using the smoothness lemma:

\[\epsilon_0 = f(x^{(0)}) - f(x^*) \le \nabla f(x^*)^T(x^{(0)} - x^*) + \frac{L}{2}||x^{(0)} - x^*||^2 \le \frac{LD^2}{2}.\]

Then we have

\[\frac{1}{\epsilon_{k+1}} - \frac{1}{\epsilon_k} = \frac{\epsilon_k-\epsilon_{k+1}}{\epsilon_{k+1}\epsilon_k} \ge \frac{\epsilon_k-\epsilon_{k+1}}{\epsilon_k^2} \ge \frac{1}{2LD^2}.\]

Then

\[\frac{1}{\epsilon_k} \ge \frac{1}{\epsilon_0} + \frac{k}{2LD^2} \ge \frac{k+4}{2LD^2},\]

which completes the proof.
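We can check the \(O(\frac{1}{k})\) rate on a one-dimensional convex function that is not strongly convex; the test function \(f(x) = \log\cosh(x)\) is our choice for this sketch. It has a 1-Lipschitz gradient (\(f''(x) = \operatorname{sech}^2 x \le 1\)) and minimizer \(x^* = 0\), and since \(f\) is increasing in \(|x|\), the sublevel set of \(x^{(0)}\) is \([-x^{(0)}, x^{(0)}]\), so \(D = |x^{(0)}|\):

```python
import numpy as np

f = lambda x: np.log(np.cosh(x))   # convex, L = 1, minimizer x* = 0 with f(x*) = 0
grad = np.tanh                     # f'(x) = tanh(x)
L = 1.0

x0 = 5.0
D = abs(x0)                        # max ||x - x*|| over the sublevel set of x0
x = x0
for k in range(1, 101):
    x = x - grad(x) / L            # gradient descent step, h = 1/L
    assert f(x) <= 2 * L * D**2 / (k + 4)   # theorem's bound at iteration k
```

Near the origin the gradient \(\tanh(x) \approx x\) is almost a full contraction, so the iterates actually converge much faster than the worst-case \(O(\frac{1}{k})\) guarantee; the bound simply holds at every step.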

Strongly Convex Functions

In the previous analyses for general smooth functions and for convex functions, we showed that gradient descent needs \(O(\frac{1}{\epsilon^2})\) and \(O(\frac{1}{\epsilon})\) iterations, respectively, to converge to an \(\epsilon\)-approximation. (Note: here \(\epsilon\)-approximation does not refer to the formal definition used in approximation algorithms; for general functions it means the gradient norm is small, and for convex functions it means the error to the optimum is small.) We now introduce strongly convex functions and analyze the convergence of gradient descent on them. In particular, we will show that it needs only \(O(\log\frac{1}{\epsilon})\) iterations to converge to an \(\epsilon\)-approximation.

Strongly convex: We say a function \(f \in \mathcal C^1(\mathbb R^n)\) is \(\mu\)-strongly convex if for all \(x, y \in \mathbb R^n\),

\[f(y) \ge f(x) + \nabla f(x)^T(y-x) + \frac{\mu}{2}||y-x||^2.\]

The following theorem gives an equivalent definition of strongly convex functions when the function is twice differentiable.

Strongly convex: Let \(f \in \mathcal C^2(\mathbb R^n)\). Then \(f\) is \(\mu\)-strongly convex if and only if

\[\nabla^2 f(x)\succeq \mu\cdot I,\forall x\in\mathbb R^n.\]

We will not prove this theorem; it follows from Taylor's expansion in the multivariable case.
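Both characterizations are easy to check on a quadratic \(f(x) = \frac{1}{2}x^TAx\) with \(A \succ 0\), where the Hessian is \(A\) everywhere and \(\mu = \lambda_{\min}(A)\). A small sketch, assuming numpy and our own choice of test matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((3, 3))
A = M @ M.T + np.eye(3)            # Hessian A >= I > 0, so f is strongly convex
mu = np.linalg.eigvalsh(A).min()   # f is mu-strongly convex with mu = lambda_min(A)

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

# First-order definition: f(y) >= f(x) + grad f(x)^T (y-x) + (mu/2)||y-x||^2
x, y = rng.standard_normal(3), rng.standard_normal(3)
lhs = f(y)
rhs = f(x) + grad(x) @ (y - x) + 0.5 * mu * np.linalg.norm(y - x) ** 2
assert lhs >= rhs - 1e-12
```

For this quadratic the gap \(f(y) - f(x) - \nabla f(x)^T(y-x)\) equals \(\frac{1}{2}(y-x)^TA(y-x) \ge \frac{\mu}{2}||y-x||^2\), mirroring the Hessian condition \(\nabla^2 f \succeq \mu I\).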

With the definition of strongly convex functions, we can now prove the following theorem.

Let \(f\in\mathcal C^2(\mathbb R^n)\) be \(\mu\)-strongly convex with an L-Lipschitz gradient and \(x^*\) be any minimizer of \(f\). With step size \(h = \frac{1}{L}\), the sequence \(x^{(k)}\) produced by gradient descent satisfies

\[f(x^{(k)}) - f(x^{*}) \le \left(1-\frac{\mu}{L}\right)^k(f(x^{(0)}) - f(x^{*})).\]

Proof: From the previous analysis, since \(f\) has an L-Lipschitz gradient, we have

\[\epsilon_{k+1} \le \epsilon_k - \frac{1}{2L}||\nabla f(x^{(k)})||_2^2,\]

where \(\epsilon_k = f(x^{(k)}) - f(x^*)\).

Then since \(f\) is \(\mu\)-strongly convex, we have

\[f(x^*) \ge f(x^{(k)}) + \nabla f(x^{(k)})^T(x^* - x^{(k)}) + \frac{\mu}{2}||x^* - x^{(k)}||^2.\]

Rearranging the terms, we have

\begin{align} f(x^{(k)}) - f(x^*) \le& \nabla f(x^{(k)})^T(x^{(k)} - x^*) - \frac{\mu}{2}||x^* - x^{(k)}||^2 \newline \le& \max_{\Delta}\left(\nabla f(x^{(k)})^T\Delta - \frac{\mu}{2}||\Delta||^2\right)\newline =& \frac{1}{2\mu}||\nabla f(x^{(k)})||^2. \end{align}

Putting the above inequality into the first inequality in the proof, we have

\begin{align} \epsilon_{k+1} \le& \epsilon_k - \frac{1}{2L}||\nabla f(x^{(k)})||_2^2 \newline \le& \epsilon_k - \frac{\mu}{L}\epsilon_k = \left(1-\frac{\mu}{L}\right)\epsilon_k \newline \le& \left(1-\frac{\mu}{L}\right)^{k+1}\epsilon_0, \end{align}

which completes the proof.
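Finally, the linear (geometric) rate is visible numerically. The sketch below runs gradient descent on a strongly convex quadratic \(f(x) = \frac{1}{2}x^TAx\) (our choice; minimizer \(x^* = 0\), \(f(x^*) = 0\)) and checks \(\epsilon_k \le (1-\frac{\mu}{L})^k \epsilon_0\) at every step:

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((3, 3))
A = M @ M.T + np.eye(3)            # positive definite Hessian
L = np.linalg.eigvalsh(A).max()    # smoothness constant
mu = np.linalg.eigvalsh(A).min()   # strong convexity constant

f = lambda x: 0.5 * x @ A @ x      # minimizer x* = 0 with f(x*) = 0
grad = lambda x: A @ x

x = rng.standard_normal(3)
eps0 = f(x)                        # eps_0 = f(x^(0)) - f(x*)
for k in range(1, 51):
    x = x - grad(x) / L            # step size h = 1/L
    assert f(x) <= (1 - mu / L) ** k * eps0 + 1e-12
```

Each factor of \(1-\frac{\mu}{L}\) shrinks the error geometrically, which is why only \(O(\frac{L}{\mu}\log\frac{1}{\epsilon})\) iterations are needed to reach error \(\epsilon\).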