Theory of Optimization: Frank-Wolfe Algorithm
Published:
In this post, we describe a new geometry dependent algorithm that relies on different set of assumptions. The algorithm is called conditional gradient descent, aka Frank-Wolfe.
Frank-Wolfe
Algorithm
Frank-Wolfe algorithm solves the following convex optimization problem
minx∈Df(x),for f such that ∇f(x) is `Lipschitz’ in a certain sense. The algorithm reduces the problem into a sequence of linear optimization problems.
Frank-Wolfe algorithm has the following procedure:
- Initial point x(0)∈Rn, step size h>0.
- For k=0,1,… do
- Compute y(k)=argminy∈D⟨y,∇f(x(k))⟩.
- x(k+1)←(1−hk)x(k)+hky(k) with hk=2k+2.
Analysis
We have the following theorem for Frank-Wolfe algorithm.
Given a convex function f on a convex set D and a constant Cf such that
f((1−h)x+hy)≤f(x)+h⟨∇f(x),y−x⟩+12Cfh2for any x,y∈D and h∈[0,1], we have
f(x(k))−f(x∗)≤2Cfk+2.
Proof: By the definition of Cf, we have
f(x(k+1))≤f(x(k))+hk⟨∇f(x(k)),y(k)−x(k)⟩+12Cfh2k.Note that from the convexity of f and the definition of y(k), we have
f(x∗)≥f(x(k))+⟨∇f(x(k)),x∗−x(k)⟩≥f(x(k))+⟨∇f(x(k)),y(k)−x(k)⟩.Hence, we have
f(x(k+1))≤f(x(k))−hk(f(x(k))−f(x∗))+12Cfh2k.Let ϵk=f(x(k))−f(x∗), we have
ϵk+1≤(1−hk)ϵk+12Cfh2k.Note that ϵ0=f(0)−f(x∗)≤12Cf, we can prove the theorem by induction.
⬜
Note that if ∇f(x) is L-Lipschitz with respect to ||⋅||, over the domain D, then Cf≤L⋅diam||⋅||(D)2.
Sparsity Analysis
For the domain D is a simplex, the step y(k) is always a vertex of the simplex, and hence, each step of Frank-Wolfe increases the sparsity by at most 1. We can think that the convergence result of Frank Wolfe proves that there is an approximate sparse solution of the optimization problem.
Given a polytope P=conv(vi) lies inside an unit ball. For any u∈P, there are k=O(1ϵ2) vertices v1,…,vk of P such that
||k∑i=1λivi−u||2≤ϵ,for some ∑iλi=1,λi≥0.
Proof: Run Frank-Wolfe algorithm with f(x)=||x−u||22. Note that f is 1-Lipschitz with respect to ||x||2 and that the diameter of P is bounded by 1. Hence, we have Cf≤1.
Therefore, Frank-Wolfe algorithm shows that
f(x(k))−f(x∗)≤2k+2.Since f(u)=0, we have ||x(k)−u||2≤4k+2, and this gives the result.
⬜