Theory of Optimization: Frank-Wolfe Algorithm

In this post, we describe a new, geometry-dependent algorithm that relies on a different set of assumptions. The algorithm is called conditional gradient descent, also known as Frank-Wolfe.

Frank-Wolfe

Algorithm

The Frank-Wolfe algorithm solves the following convex optimization problem

\[\min_{x\in\mathcal D}f(x),\]

for \(f\) such that \(\nabla f(x)\) is "Lipschitz" in a certain sense, made precise in the analysis below. The algorithm reduces the problem to a sequence of linear optimization problems over \(\mathcal D\).

The Frank-Wolfe algorithm proceeds as follows (a minimal code sketch appears after the list):

  1. Initial point \(x^{(0)}\in\mathcal D\), step sizes \(h_k\in(0,1]\).
  2. For \(k=0,1,\dots\) do
  3. Compute \(y^{(k)} = \arg\min_{y\in\mathcal D}\langle y, \nabla f(x^{(k)})\rangle\).
  4. \(x^{(k+1)} \leftarrow (1-h_k)x^{(k)} + h_ky^{(k)}\) with \(h_k = \frac{2}{k+2}\).
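
To make the procedure concrete, here is a minimal Python sketch (not part of the original post) for the special case where \(\mathcal D\) is the probability simplex; in that case the linear minimization in step 3 simply picks the coordinate of the gradient with the smallest value:

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, num_iters=500):
    """Frank-Wolfe over the probability simplex {x >= 0, sum(x) = 1}.

    grad: function returning the gradient of f at a point.
    The linear step argmin_{y in simplex} <y, g> is attained at the
    vertex e_i with i = argmin_i g_i.
    """
    x = x0.copy()
    for k in range(num_iters):
        g = grad(x)
        y = np.zeros_like(x)
        y[np.argmin(g)] = 1.0          # vertex returned by the linear oracle
        h = 2.0 / (k + 2)              # step size h_k = 2/(k+2)
        x = (1 - h) * x + h * y        # convex combination stays in the simplex
    return x

# Example (hypothetical): minimize f(x) = ||x - u||^2 over the simplex.
u = np.array([0.6, 0.1, 0.5])
x = frank_wolfe_simplex(lambda x: 2 * (x - u), np.ones(3) / 3)
```

The same template works for any domain over which the linear minimization step can be solved efficiently; only the oracle changes.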

Analysis

We have the following theorem for the Frank-Wolfe algorithm.

Given a convex function \(f\) on a convex set \(\mathcal D\) and a constant \(C_f\) such that

\[f((1-h)x + hy)\le f(x) + h\langle \nabla f(x), y-x\rangle + \frac{1}{2}C_f h^2\]

for any \(x,y\in\mathcal D\) and \(h\in [0,1]\), we have, for every \(k\ge 1\),

\[f(x^{(k)}) - f(x^*) \le \frac{2C_f}{k+2}.\]

Proof: By the definition of \(C_f\), we have

\[f(x^{(k+1)})\le f(x^{(k)}) + h_k\langle \nabla f(x^{(k)}), y^{(k)}-x^{(k)}\rangle + \frac{1}{2}C_f h_k^2.\]

Note that from the convexity of \(f\) and the definition of \(y^{(k)}\), we have

\[f(x^*) \ge f(x^{(k)}) + \langle \nabla f(x^{(k)}), x^* - x^{(k)}\rangle \ge f(x^{(k)})+\langle \nabla f(x^{(k)}), y^{(k)} - x^{(k)}\rangle.\]

Hence, we have

\[f(x^{(k+1)})\le f(x^{(k)}) - h_k(f(x^{(k)}) - f(x^*)) + \frac{1}{2}C_f h_k^2.\]

Letting \(\epsilon_k = f(x^{(k)}) - f(x^*)\), we have

\[\epsilon_{k+1} \le (1-h_k)\epsilon_k + \frac{1}{2}C_f h_k^2.\]

Since \(h_0 = 1\), the recursion gives \(\epsilon_1 \le \frac{1}{2}C_f \le \frac{2C_f}{3}\), and the theorem follows by induction on \(k\).
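
For completeness, the induction step is the following computation: if \(\epsilon_k \le \frac{2C_f}{k+2}\), then

\[\epsilon_{k+1} \le \left(1-\frac{2}{k+2}\right)\frac{2C_f}{k+2} + \frac{2C_f}{(k+2)^2} = \frac{2C_f(k+1)}{(k+2)^2} \le \frac{2C_f}{k+3},\]

where the last inequality uses \((k+1)(k+3)\le (k+2)^2\).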

Note that if \(\nabla f\) is \(L\)-Lipschitz with respect to \(||\cdot||\) over the domain \(\mathcal D\), then \(C_f \le L\cdot diam_{||\cdot||}(\mathcal D)^2\).
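
This follows from the standard descent lemma for functions whose gradient is \(L\)-Lipschitz: writing \(y' = (1-h)x + hy\), so that \(y'-x = h(y-x)\),

\[f(y') \le f(x) + \langle\nabla f(x), y'-x\rangle + \frac{L}{2}||y'-x||^2 = f(x) + h\langle\nabla f(x), y-x\rangle + \frac{L}{2}h^2||y-x||^2,\]

and bounding \(||y-x||\) by \(diam_{||\cdot||}(\mathcal D)\) gives the claim.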

Sparsity Analysis

When the domain \(\mathcal D\) is a simplex, the point \(y^{(k)}\) returned by the linear minimization step is always a vertex of the simplex, so each Frank-Wolfe step increases the support size (the number of nonzero coordinates) of the iterate by at most \(1\). In this sense, the convergence result of Frank-Wolfe shows that the optimization problem admits approximately optimal sparse solutions.
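
Concretely, over the probability simplex the linear objective \(\langle y, \nabla f(x^{(k)})\rangle\) is minimized at a standard basis vector,

\[y^{(k)} = e_{i_k},\qquad i_k = \arg\min_i \big[\nabla f(x^{(k)})\big]_i,\]

and since \(h_0 = 1\), the iterate \(x^{(k)}\) is a convex combination of \(y^{(0)},\dots,y^{(k-1)}\) and hence has at most \(k\) nonzero coordinates.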

Consider a polytope \(\mathcal P = conv(v_1,\dots,v_m)\) that lies inside the unit ball. For any \(u\in\mathcal P\), there are \(k = O(\frac{1}{\epsilon^2})\) vertices \(v_{i_1},\dots, v_{i_k}\) of \(\mathcal P\) such that

\[||\sum_{j=1}^k\lambda_j v_{i_j} -u||_2 \le \epsilon,\]

for some \(\lambda_j \ge 0\) with \(\sum_j \lambda_j = 1\).

Proof: Run the Frank-Wolfe algorithm on \(\mathcal P\) with \(f(x) = ||x-u||_2^2\). Note that \(\nabla f(x) = 2(x-u)\) is \(2\)-Lipschitz with respect to \(||\cdot||_2\) and that the diameter of \(\mathcal P\) is bounded by \(2\), since \(\mathcal P\) lies inside the unit ball. Hence, we have \(C_f \le 2\cdot 2^2 = 8\).

Therefore, the theorem above shows that

\[f(x^{(k)}) - f(x^*) \le \frac{16}{k+2}.\]

Since \(u\in\mathcal P\) and \(f\ge 0\), we have \(f(x^*) = f(u) = 0\), so \(||x^{(k)} - u||_2^2 \le \frac{16}{k+2}\). Taking \(k = O(\frac{1}{\epsilon^2})\) makes the right-hand side at most \(\epsilon^2\); since \(h_0 = 1\), the iterate \(x^{(k)}\) is a convex combination of the \(k\) vertices \(y^{(0)},\dots,y^{(k-1)}\), which gives the result.
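
As a numerical sanity check of this argument (a hypothetical setup, not from the original post), one can draw random unit vectors as vertices, pick \(u\) in their convex hull, and run the same recursion with the vertex set as the linear oracle:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: m vertices on the unit sphere in R^d, and u in their convex hull.
d, m = 20, 200
V = rng.normal(size=(m, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)
w = rng.random(m)
u = (w / w.sum()) @ V

x = V[0].copy()                          # start at a vertex
support = {0}
for k in range(400):
    g = 2 * (x - u)                      # gradient of f(x) = ||x - u||^2
    i = int(np.argmin(V @ g))            # linear oracle: best vertex of the polytope
    h = 2.0 / (k + 2)
    x = (1 - h) * x + h * V[i]
    support.add(i)

# Few distinct vertices are used, and the error decays like O(1/sqrt(k)).
print(len(support), np.linalg.norm(x - u))
```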