Theory of Optimization: Frank-Wolfe Algorithm


Published: February 13, 2019

In this post, we describe a new geometry-dependent algorithm that relies on a different set of assumptions. The algorithm is called conditional gradient descent, also known as Frank-Wolfe.

Frank-Wolfe

Algorithm

The Frank-Wolfe algorithm solves the following convex optimization problem

$\min_{x\in\mathcal D}f(x),$

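for a convex function $f$ over a compact convex set $\mathcal D$, where $\nabla f$ is "Lipschitz" in a certain sense (made precise by the constant $C_f$ in the analysis). The algorithm reduces the problem to a sequence of linear optimization problems over $\mathcal D$.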

The Frank-Wolfe algorithm proceeds as follows:

Initial point $x^{(0)}\in\mathcal D$, step sizes $h_k = \frac{2}{k+2}$.
For $k=0,1,\dots$ do
Compute $y^{(k)} = \arg\min_{y\in\mathcal D}\langle y, \nabla f(x^{(k)})\rangle$.
Set $x^{(k+1)} \leftarrow (1-h_k)x^{(k)} + h_ky^{(k)}$.
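The procedure above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the names `frank_wolfe` and `simplex_lmo` are hypothetical, and the example assumes the domain is the probability simplex and the objective is $f(x) = ||x-u||_2^2$, neither of which is fixed by the algorithm itself.

```python
from typing import Callable, List

Vector = List[float]


def frank_wolfe(grad: Callable[[Vector], Vector],
                lmo: Callable[[Vector], Vector],
                x0: Vector,
                num_iters: int) -> Vector:
    """Run num_iters Frank-Wolfe steps with step size h_k = 2 / (k + 2)."""
    x = list(x0)
    for k in range(num_iters):
        g = grad(x)            # gradient at the current iterate
        y = lmo(g)             # y = argmin over y in D of <y, g>
        h = 2.0 / (k + 2)
        x = [(1 - h) * xi + h * yi for xi, yi in zip(x, y)]
    return x


def simplex_lmo(g: Vector) -> Vector:
    """Linear minimization over the probability simplex (assumed domain):
    return the vertex e_i whose gradient coordinate g_i is smallest."""
    i = min(range(len(g)), key=lambda j: g[j])
    return [1.0 if j == i else 0.0 for j in range(len(g))]


# Hypothetical example: recover a point u of the simplex by minimizing ||x - u||^2.
u = [0.2, 0.3, 0.5]
x = frank_wolfe(lambda p: [2 * (pi - ui) for pi, ui in zip(p, u)],
                simplex_lmo, [1.0, 0.0, 0.0], num_iters=200)
dist2 = sum((xi - ui) ** 2 for xi, ui in zip(x, u))
```

Note that the only access to $\mathcal D$ is through the linear minimization oracle `lmo`, so no projection step is needed.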

Analysis

We have the following theorem for the Frank-Wolfe algorithm.

Let $f$ be a convex function on a compact convex set $\mathcal D$, and let $C_f$ be a constant such that
$f((1-h)x + hy)\le f(x) + h\langle \nabla f(x), y-x\rangle + \frac{1}{2}C_f h^2$
for any $x,y\in\mathcal D$ and $h\in [0,1]$. Then, for a minimizer $x^*$ of $f$ over $\mathcal D$ and every $k\ge 1$, the iterates of Frank-Wolfe satisfy
$f(x^{(k)}) - f(x^*) \le \frac{2C_f}{k+2}.$

Proof: By the definition of $C_f$ with $x = x^{(k)}$, $y = y^{(k)}$ and $h = h_k$, we have

$f(x^{(k+1)})\le f(x^{(k)}) + h_k\langle \nabla f(x^{(k)}), y^{(k)}-x^{(k)}\rangle + \frac{1}{2}C_f h_k^2.$

By the convexity of $f$ and the fact that $y^{(k)}$ minimizes $\langle y, \nabla f(x^{(k)})\rangle$ over $\mathcal D$, which contains $x^*$, we have

$f(x^*) \ge f(x^{(k)}) + \langle \nabla f(x^{(k)}), x^* - x^{(k)}\rangle \ge f(x^{(k)})+\langle \nabla f(x^{(k)}), y^{(k)} - x^{(k)}\rangle.$

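Hence $\langle \nabla f(x^{(k)}), y^{(k)} - x^{(k)}\rangle \le -(f(x^{(k)}) - f(x^*))$, and substituting this into the first inequality gives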

$f(x^{(k+1)})\le f(x^{(k)}) - h_k(f(x^{(k)}) - f(x^*)) + \frac{1}{2}C_f h_k^2.$

Letting $\epsilon_k = f(x^{(k)}) - f(x^*)$, we obtain

$\epsilon_{k+1} \le (1-h_k)\epsilon_k + \frac{1}{2}C_f h_k^2.$

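Since $h_0 = 1$, the recursion at $k=0$ gives $\epsilon_1 \le \frac{1}{2}C_f \le \frac{2C_f}{3}$, whatever the value of $\epsilon_0 = f(x^{(0)}) - f(x^*)$. The theorem then follows by induction on $k$.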
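The induction step can be written out explicitly. As a sketch, assume $\epsilon_k \le \frac{2C_f}{k+2}$ for some $k$ and plug $h_k = \frac{2}{k+2}$ into the recursion:

```latex
\begin{align*}
\epsilon_{k+1}
  &\le \Bigl(1-\frac{2}{k+2}\Bigr)\frac{2C_f}{k+2} + \frac{1}{2}C_f\,\frac{4}{(k+2)^2} \\
  &= \frac{2C_f\,(k+1)}{(k+2)^2} \\
  &\le \frac{2C_f}{k+3}.
\end{align*}
```

The last inequality uses $(k+1)(k+3) = (k+2)^2 - 1 \le (k+2)^2$.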

⬜

Note that if $\nabla f$ is $L$-Lipschitz with respect to a norm $||\cdot||$ over the domain $\mathcal D$, then $C_f \le L\cdot \mathrm{diam}_{||\cdot||}(\mathcal D)^2$.
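As a sanity check on this bound, the curvature constant can be estimated numerically by sampling. The sketch below is an assumption-laden illustration: the helper `estimate_curvature` is hypothetical, the domain is taken to be the probability simplex in $\mathbb R^3$ (squared $\ell_2$ diameter $2$), and $f(x) = ||x-u||_2^2$, whose gradient $2(x-u)$ is $2$-Lipschitz. Sampling only produces a lower estimate of $C_f$, which should not exceed $L\cdot \mathrm{diam}^2 = 4$.

```python
import random


def estimate_curvature(f, grad, sample, trials=2000, seed=0):
    """Lower-estimate C_f by sampling x, y in D and h in [0.1, 1].

    Rearranging the defining inequality, C_f must be at least
    2 * (f((1-h)x + hy) - f(x) - h * <grad f(x), y - x>) / h^2.
    """
    rng = random.Random(seed)
    best = 0.0
    for _ in range(trials):
        x, y = sample(rng), sample(rng)
        h = rng.uniform(0.1, 1.0)
        z = [(1 - h) * a + h * b for a, b in zip(x, y)]
        slope = sum(g * (b - a) for g, a, b in zip(grad(x), x, y))
        best = max(best, 2.0 * (f(z) - f(x) - h * slope) / h ** 2)
    return best


def random_vertex(rng, n=3):
    """Sample a vertex of the probability simplex (assumed domain)."""
    i = rng.randrange(n)
    return [1.0 if j == i else 0.0 for j in range(n)]


u = [0.2, 0.3, 0.5]
f = lambda x: sum((xi - ui) ** 2 for xi, ui in zip(x, u))
grad = lambda x: [2 * (xi - ui) for xi, ui in zip(x, u)]

estimate = estimate_curvature(f, grad, random_vertex)
bound = 2.0 * 2.0  # L = 2 times diam^2 = 2
```

For this particular quadratic the defect equals $h^2||y-x||_2^2$ exactly, so pairs of distinct vertices attain the bound up to rounding; for a general $f$ the sampled estimate may be strictly smaller.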

Sparsity Analysis

When the domain $\mathcal D$ is a simplex, the point $y^{(k)}$ can always be chosen to be a vertex of the simplex, and hence each step of Frank-Wolfe increases the number of nonzero coordinates of the iterate by at most $1$. We can view the convergence result of Frank-Wolfe as a proof that the optimization problem has an approximately optimal sparse solution.
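The growth of the support can be observed directly. The following sketch assumes the probability simplex as the domain, $f(x) = ||x-u||_2^2$ as the objective, and a start at the first vertex; the function name `support_sizes` is hypothetical.

```python
def support_sizes(u, num_iters):
    """Run Frank-Wolfe on the simplex and record the number of nonzero
    coordinates of x^(1), x^(2), ..., x^(num_iters)."""
    n = len(u)
    x = [1.0] + [0.0] * (n - 1)                     # start at the vertex e_1
    sizes = []
    for k in range(num_iters):
        g = [2 * (xi - ui) for xi, ui in zip(x, u)]
        i = min(range(n), key=lambda j: g[j])       # vertex e_i minimizes <y, g>
        h = 2.0 / (k + 2)
        x = [(1 - h) * xi for xi in x]              # shrink the old iterate
        x[i] += h                                   # move mass toward e_i
        sizes.append(sum(1 for xi in x if xi != 0.0))
    return sizes


n = 20
sizes = support_sizes([1.0 / n] * n, num_iters=10)
```

Because $h_0 = 1$, the first step lands exactly on a vertex, so `sizes[k - 1]`, the support size of $x^{(k)}$, is at most $k$.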

Let $\mathcal P = conv(v_i)$ be a polytope contained in the unit ball. For any $u\in\mathcal P$ and any $\epsilon > 0$, there are $k = O(\frac{1}{\epsilon^2})$ vertices $v_1,\dots, v_k$ of $\mathcal P$ such that
$||\sum_{i=1}^k\lambda_i v_i -u||_2 \le \epsilon,$
for some weights $\lambda_i \ge 0$ with $\sum_i \lambda_i = 1$.

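Proof: Run the Frank-Wolfe algorithm on $\mathcal P$ with $f(x) = ||x-u||_2^2$, choosing each $y^{(k)}$ to be a vertex of $\mathcal P$. Note that $\nabla f(x) = 2(x-u)$ is $2$-Lipschitz with respect to $||\cdot||_2$ and that the diameter of $\mathcal P$ is bounded by $2$. Hence, we have $C_f \le 2\cdot 2^2 = 8$.

Therefore, the Frank-Wolfe algorithm shows that

$f(x^{(k)}) - f(x^*) \le \frac{16}{k+2}.$

Since $u\in\mathcal P$, the minimizer is $x^* = u$ with $f(x^*) = 0$, so $||x^{(k)} - u||_2^2 \le \frac{16}{k+2}$, which is at most $\epsilon^2$ once $k \ge 16/\epsilon^2$. Because $h_0 = 1$, we have $x^{(1)} = y^{(0)}$, so $x^{(k)}$ is a convex combination of the at most $k$ vertices $y^{(0)},\dots,y^{(k-1)}$, and this gives the result.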

⬜
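The construction in the proof can be tried numerically. The sketch below is illustrative only: `sparse_approximation` is a hypothetical helper, the polytope is assumed to be the convex hull of 30 random points on the unit sphere in $\mathbb R^5$, and $u$ is taken to be their average. For $f(x) = ||x-u||_2^2$ one has $C_f = 2\,\mathrm{diam}(\mathcal P)^2 \le 8$ inside a ball of radius $1$, so the theorem guarantees the conservative bound $||x^{(k)}-u||_2^2 \le \frac{16}{k+2}$ used for comparison here.

```python
import math
import random


def sparse_approximation(points, u, num_iters):
    """Frank-Wolfe on f(x) = ||x - u||^2 over conv(points).

    Returns (weights, x): a dict {point index: lambda_i} with nonnegative
    weights summing to 1, and the iterate x = sum_i lambda_i * points[i].
    """
    x = list(points[0])
    weights = {0: 1.0}
    for k in range(num_iters):
        g = [2 * (xi - ui) for xi, ui in zip(x, u)]
        # A linear function over conv(points) is minimized at one of the points.
        i = min(range(len(points)),
                key=lambda j: sum(p * gj for p, gj in zip(points[j], g)))
        h = 2.0 / (k + 2)
        weights = {j: (1 - h) * w for j, w in weights.items() if (1 - h) * w > 0.0}
        weights[i] = weights.get(i, 0.0) + h
        x = [(1 - h) * xi + h * pi for xi, pi in zip(x, points[i])]
    return weights, x


rng = random.Random(0)
points = []
for _ in range(30):
    v = [rng.gauss(0.0, 1.0) for _ in range(5)]
    norm = math.sqrt(sum(c * c for c in v))
    points.append([c / norm for c in v])          # random point on the unit sphere
u = [sum(p[d] for p in points) / len(points) for d in range(5)]

k = 64
weights, x = sparse_approximation(points, u, k)
err2 = sum((xi - ui) ** 2 for xi, ui in zip(x, u))
```

After $k$ iterations, `weights` involves at most $k$ distinct points, while `err2` is bounded by $\frac{16}{k+2}$ independently of the dimension and of the number of points.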


Haoyu Zhao
