Theory of Optimization: Frank-Wolfe Algorithm

In this post, we describe a new, geometry-dependent algorithm that relies on a different set of assumptions. The algorithm is called conditional gradient descent, also known as Frank-Wolfe.

Frank-Wolfe

Algorithm

The Frank-Wolfe algorithm solves the following convex optimization problem

\[\min_{x\in\mathcal D}f(x),\]

for \(f\) such that \(\nabla f(x)\) is "Lipschitz" in a certain sense, made precise in the analysis below. The algorithm reduces the problem to a sequence of linear optimization problems over \(\mathcal D\).

The Frank-Wolfe algorithm proceeds as follows (a minimal code sketch appears after the list):

  1. Initial point \(x^{(0)}\in\mathcal D\), step sizes \(h_k\in(0,1]\).
  2. For \(k=0,1,\dots\) do
  3. Compute \(y^{(k)} = \arg\min_{y\in\mathcal D}\langle y, \nabla f(x^{(k)})\rangle\).
  4. \(x^{(k+1)} \leftarrow (1-h_k)x^{(k)} + h_ky^{(k)}\) with \(h_k = \frac{2}{k+2}\).
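
To make the procedure concrete, here is a minimal Python sketch (not part of the original post) for the special case where \(\mathcal D\) is the probability simplex; in that case the linear minimization in step 3 simply picks the coordinate of the gradient with the smallest value:

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, num_iters=500):
    """Frank-Wolfe over the probability simplex {x >= 0, sum(x) = 1}.

    grad: function returning the gradient of f at a point.
    The linear step argmin_{y in simplex} <y, g> is attained at the
    vertex e_i with i = argmin_i g_i.
    """
    x = x0.copy()
    for k in range(num_iters):
        g = grad(x)
        y = np.zeros_like(x)
        y[np.argmin(g)] = 1.0          # vertex returned by the linear oracle
        h = 2.0 / (k + 2)              # step size h_k = 2/(k+2)
        x = (1 - h) * x + h * y        # convex combination stays in the simplex
    return x

# Example (hypothetical): minimize f(x) = ||x - u||^2 over the simplex.
u = np.array([0.6, 0.1, 0.5])
x = frank_wolfe_simplex(lambda x: 2 * (x - u), np.ones(3) / 3)
```

The same template works for any domain over which the linear minimization step can be solved efficiently; only the oracle changes.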

Analysis

We have the following theorem for the Frank-Wolfe algorithm.

Given a convex function \(f\) on a convex set \(\mathcal D\) and a constant \(C_f\) such that

\[f((1-h)x + hy)\le f(x) + h\langle \nabla f(x), y-x\rangle + \frac{1}{2}C_f h^2\]

for any \(x,y\in\mathcal D\) and \(h\in [0,1]\), we have, for every \(k\ge 1\),

\[f(x^{(k)}) - f(x^*) \le \frac{2C_f}{k+2}.\]

Proof: By the definition of \(C_f\), we have

\[f(x^{(k+1)})\le f(x^{(k)}) + h_k\langle \nabla f(x^{(k)}), y^{(k)}-x^{(k)}\rangle + \frac{1}{2}C_f h_k^2.\]

Note that from the convexity of \(f\) and the definition of \(y^{(k)}\), we have

\[f(x^*) \ge f(x^{(k)}) + \langle \nabla f(x^{(k)}), x^* - x^{(k)}\rangle \ge f(x^{(k)})+\langle \nabla f(x^{(k)}), y^{(k)} - x^{(k)}\rangle.\]

Hence, we have

\[f(x^{(k+1)})\le f(x^{(k)}) - h_k(f(x^{(k)}) - f(x^*)) + \frac{1}{2}C_f h_k^2.\]

Letting \(\epsilon_k = f(x^{(k)}) - f(x^*)\), we have

\[\epsilon_{k+1} \le (1-h_k)\epsilon_k + \frac{1}{2}C_f h_k^2.\]

Since \(h_0 = 1\), the recursion gives \(\epsilon_1 \le \frac{1}{2}C_f \le \frac{2C_f}{3}\), and the theorem follows by induction on \(k\).
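
For completeness, the induction step is the following computation: if \(\epsilon_k \le \frac{2C_f}{k+2}\), then

\[\epsilon_{k+1} \le \left(1-\frac{2}{k+2}\right)\frac{2C_f}{k+2} + \frac{2C_f}{(k+2)^2} = \frac{2C_f(k+1)}{(k+2)^2} \le \frac{2C_f}{k+3},\]

where the last inequality uses \((k+1)(k+3)\le (k+2)^2\).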

Note that if \(\nabla f\) is \(L\)-Lipschitz with respect to \(||\cdot||\) over the domain \(\mathcal D\), then \(C_f \le L\cdot diam_{||\cdot||}(\mathcal D)^2\).
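
This follows from the standard descent lemma for functions whose gradient is \(L\)-Lipschitz: writing \(y' = (1-h)x + hy\), so that \(y'-x = h(y-x)\),

\[f(y') \le f(x) + \langle\nabla f(x), y'-x\rangle + \frac{L}{2}||y'-x||^2 = f(x) + h\langle\nabla f(x), y-x\rangle + \frac{L}{2}h^2||y-x||^2,\]

and bounding \(||y-x||\) by \(diam_{||\cdot||}(\mathcal D)\) gives the claim.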

Sparsity Analysis

When the domain \(\mathcal D\) is a simplex, the point \(y^{(k)}\) returned by the linear minimization step is always a vertex of the simplex, so each Frank-Wolfe step increases the support size (the number of nonzero coordinates) of the iterate by at most \(1\). In this sense, the convergence result of Frank-Wolfe shows that the optimization problem admits approximately optimal sparse solutions.
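
Concretely, over the probability simplex the linear objective \(\langle y, \nabla f(x^{(k)})\rangle\) is minimized at a standard basis vector,

\[y^{(k)} = e_{i_k},\qquad i_k = \arg\min_i \big[\nabla f(x^{(k)})\big]_i,\]

and since \(h_0 = 1\), the iterate \(x^{(k)}\) is a convex combination of \(y^{(0)},\dots,y^{(k-1)}\) and hence has at most \(k\) nonzero coordinates.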

Consider a polytope \(\mathcal P = conv(v_1,\dots,v_m)\) that lies inside the unit ball. For any \(u\in\mathcal P\), there are \(k = O(\frac{1}{\epsilon^2})\) vertices \(v_{i_1},\dots, v_{i_k}\) of \(\mathcal P\) such that

\[||\sum_{j=1}^k\lambda_j v_{i_j} -u||_2 \le \epsilon,\]

for some \(\lambda_j \ge 0\) with \(\sum_j \lambda_j = 1\).

Proof: Run the Frank-Wolfe algorithm on \(\mathcal P\) with \(f(x) = ||x-u||_2^2\). Note that \(\nabla f(x) = 2(x-u)\) is \(2\)-Lipschitz with respect to \(||\cdot||_2\) and that the diameter of \(\mathcal P\) is bounded by \(2\), since \(\mathcal P\) lies inside the unit ball. Hence, we have \(C_f \le 2\cdot 2^2 = 8\).

Therefore, the theorem above shows that

\[f(x^{(k)}) - f(x^*) \le \frac{16}{k+2}.\]

Since \(u\in\mathcal P\) and \(f\ge 0\), we have \(f(x^*) = f(u) = 0\), so \(||x^{(k)} - u||_2^2 \le \frac{16}{k+2}\). Taking \(k = O(\frac{1}{\epsilon^2})\) makes the right-hand side at most \(\epsilon^2\); since \(h_0 = 1\), the iterate \(x^{(k)}\) is a convex combination of the \(k\) vertices \(y^{(0)},\dots,y^{(k-1)}\), which gives the result.
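
As a numerical sanity check of this argument (a hypothetical setup, not from the original post), one can draw random unit vectors as vertices, pick \(u\) in their convex hull, and run the same recursion with the vertex set as the linear oracle:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: m vertices on the unit sphere in R^d, and u in their convex hull.
d, m = 20, 200
V = rng.normal(size=(m, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)
w = rng.random(m)
u = (w / w.sum()) @ V

x = V[0].copy()                          # start at a vertex
support = {0}
for k in range(400):
    g = 2 * (x - u)                      # gradient of f(x) = ||x - u||^2
    i = int(np.argmin(V @ g))            # linear oracle: best vertex of the polytope
    h = 2.0 / (k + 2)
    x = (1 - h) * x + h * V[i]
    support.add(i)

# Few distinct vertices are used, and the error decays like O(1/sqrt(k)).
print(len(support), np.linalg.norm(x - u))
```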