15.1. Gradient-based Optimization#

While there are so-called zeroth-order methods which can optimize a function without its gradient, most applications use first-order methods, which require the gradient. We will also show an example of a second-order method, Newton’s method, which requires the Hessian matrix (that is, the matrix of second derivatives).

In our examples, we will optimize the following function of the vector \(x\) with two components:

\[ f(x) = -\sin\left(\frac{x_1^2}{2} - \frac{x_2^2}{4} + 3 \right) \cos\left(2x_1 + 1 -e^{x_2}\right) \]

Since \(f\) only depends on two degrees of freedom, we can easily visualize it in the plane. Below we define the function in Julia, and create a helper function to plot a 2D function using contours.

using PyPlot

f(x) = -sin(x[1]^2/2 - x[2]^2/4 + 3) * cos(2x[1] + 1 - exp(x[2]))

function fplot(f)
    # Evaluate f on a grid covering the plotting region
    x = range(-2.5, stop=2.5, length=200)
    y = range(-1.5, stop=1.5, length=100)
    u = [f([x,y]) for y in y, x in x]

    # Filled contours with thin black contour lines on top
    gcf().set_size_inches(12,5)
    contour(x, y, u, 25, colors="k", linewidths=0.5)
    contourf(x, y, u, 25)
    axis("equal")
    colorbar()
    return
end

fplot(f)
[Figure: contour plot of \(f(x)\)]

Note that although the function is only two-dimensional, it is fairly complex with strong nonlinearities and multiple local extrema.

15.1.1. Gradient function#

Since our methods will be gradient based, we also need to differentiate \(f(x)\) to produce the gradient \(\nabla f(x)\). Since deriving and implementing the gradient by hand can be difficult, or at least time consuming, we will later explore alternatives such as numerical differentiation and automatic or symbolic differentiation. But to begin with it is convenient to have a function for the gradient, which we define below.
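For reference, writing \(a_1 = x_1^2/2 - x_2^2/4 + 3\) and \(a_2 = 2x_1 + 1 - e^{x_2}\), the product and chain rules give

\[ \nabla f(x) = -\begin{pmatrix} x_1 \cos a_1 \cos a_2 - 2 \sin a_1 \sin a_2 \\ -\frac{x_2}{2} \cos a_1 \cos a_2 + e^{x_2} \sin a_1 \sin a_2 \end{pmatrix} \]

which is exactly what the function evaluates, with \(b_1 = \cos a_1 \cos a_2\) and \(b_2 = \sin a_1 \sin a_2\).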

function df(x)
    # Hand-coded gradient of f, using the chain and product rules
    a1 = x[1]^2/2 - x[2]^2/4 + 3
    a2 = 2x[1] + 1 - exp(x[2])
    b1 = cos(a1)*cos(a2)
    b2 = sin(a1)*sin(a2)
    return -[x[1]*b1 - 2b2, -x[2]/2*b1 + exp(x[2])*b2]
end
df (generic function with 1 method)

15.1.2. Gradient descent#

The gradient descent method, or steepest descent, is based on the observation that for any given value of \(x\), the negative gradient \(-\nabla f(x)\) gives the direction of fastest decrease of \(f(x)\). This means that, provided \(\nabla f(x) \neq 0\), there is a (sufficiently small) scalar \(\alpha > 0\) such that

\[ f(x - \alpha\nabla f(x)) < f(x) \]
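To see why, a first-order Taylor expansion around \(x\) gives, for small \(\alpha > 0\),

\[ f(x - \alpha\nabla f(x)) \approx f(x) - \alpha \|\nabla f(x)\|_2^2 < f(x) \]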

Assuming such an \(\alpha\) can be found, the method then simply starts at some initial guess \(x_0\) and iterates

\[ x_{k+1} = x_k - \alpha_k \nabla f(x_k) \]

until some appropriate termination criterion is satisfied. With some assumptions, the method can be shown to converge to a local minimum.

In our first implementation below, we simply set \(\alpha_k\) to a specified constant. We use \(\|\nabla f(x_k)\|_2\) for the termination criterion (it goes to zero at a local minimum). In order to study the method’s properties, we return all of the iterates \(x_k\) in a vector.

function gradient_descent(f, df, x0, α=0.1; tol=1e-4, maxiter=500)
    x = x0
    xs = [x0]
    for i = 1:maxiter
        gradient = df(x)
        # Terminate when the gradient norm is small enough
        if sqrt(sum(gradient.^2)) < tol
            break
        end
        # Take a fixed step in the negative gradient direction
        x -= α*gradient
        push!(xs, x)
    end
    return xs
end
gradient_descent (generic function with 2 methods)

We can now run the method on our test problem. We first define a helper function to plot the “path” of the gradient descent method:

function plot_path(xs)
    # Draw the iterates as white dots connected by a thin red line
    plot(first.(xs), last.(xs), "w.", markersize=6)
    plot(first.(xs), last.(xs), "r", linewidth=1)
end
plot_path (generic function with 1 method)

Next we define three initial guesses \(x_0\), which are close to different local minima:

x0s = [[-2,.5], [0,.5], [2.2,-0.5]];

Finally we run the code for each \(x_0\) and plot the paths. We also output the length of the paths, and the gradient norm at the final iteration.

fplot(f)
title("Gradient descent, fixed \$\\alpha\$")
for x0 in x0s
    xs = gradient_descent(f, df, x0, 0.3)
    plot_path(xs)
    println("Path length = $(length(xs)), ||gradient|| = $(sqrt(sum(df(xs[end]).^2)))")
end
Path length = 501, ||gradient|| = 1.4715843307921153
Path length = 67, ||gradient|| = 8.918594790414185e-5
Path length = 501, ||gradient|| = 1.757580215668913
[Figure: contour plot of \(f(x)\) with the three gradient descent paths for fixed \(\alpha\)]

We note that the method can indeed find minima, but suffers from a few problems:

  • The selection of \(\alpha\) is manual and non-obvious. If it is too large, the steps overshoot and the iterates may never settle near the minimum (as in the first and third paths above, which hit the iteration limit). If it is too small, a large number of iterations is needed; the short sketch after this list illustrates the trade-off. We will solve this below by introducing so-called line searches.

  • The method “zig-zags”, in particular if \(\alpha\) is too large. This is a fundamental problem with the gradient descent method, and the reason that we will look at better search directions (such as Newton’s method).
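To make the first point concrete, here is a small illustrative sketch (not part of the original runs) that repeats the fixed-step method from the second starting point with a few different values of \(\alpha\):

# Effect of the fixed step size α (illustrative; uses gradient_descent, f, df from above)
for α in (0.05, 0.3, 0.6)
    xs = gradient_descent(f, df, [0, .5], α)
    println("α = $α: path length = $(length(xs)), ||gradient|| = $(sqrt(sum(df(xs[end]).^2)))")
end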

15.1.2.1. Line searches#

One way to improve the gradient descent method is to use a line search to determine \(\alpha_k\). The idea is simple: instead of using a fixed \(\alpha_k\), we approximately minimize the one-dimensional function \(\phi(\alpha) = f(x_k - \alpha\nabla f(x_k))\). This can be done in many ways, but here we use a simple strategy of successively increasing \(\alpha\) (by factors of \(2\)) as long as \(f\) decreases.

function line_search(f, direction, x, αmin=1/2^20, αmax=2^20)
    α = αmin
    fold = f(x)
    while true
        if α ≥ αmax
            return α
        end
        fnew = f(x - α*direction)
        # Stop doubling as soon as f no longer decreases
        if fnew ≥ fold
            return α/2
        else
            fold = fnew
        end
        α *= 2
    end
end
line_search (generic function with 3 methods)
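As a quick usage example (added for illustration; not in the original), we can call line_search directly with the gradient direction at one of the starting points and check that the function value decreases:

x = [0, .5]
d = df(x)
α = line_search(f, d, x)
println("α = $α, f(x) = $(f(x)), f(x - α*d) = $(f(x - α*d))")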

Next we write a new gradient descent method which uses line searches instead of a fixed \(\alpha_k\). It is a very minor change to the previous function, and we could easily have written one general function to handle both cases, but for simplicity we create a new one.

function gradient_descent_line_search(f, df, x0; tol=1e-4, maxiter=500)
    x = x0
    xs = [x0]
    for i = 1:maxiter
        gradient = df(x)
        if sqrt(sum(gradient.^2)) < tol
            break
        end
        # Choose the step size by a line search along the negative gradient
        α = line_search(f, gradient, x)
        x -= α*gradient
        push!(xs, x)
    end
    return xs
end
gradient_descent_line_search (generic function with 1 method)

Running the same test case as before with this function, we see that it automatically determines appropriate \(\alpha_k\) values and converges for all three initial guesses.

fplot(f)
title("Gradient descent, line searches for \$\\alpha\$")
for x0 in x0s
    xs = gradient_descent_line_search(f, df, x0)
    plot_path(xs)
    println("Path length = $(length(xs)), ||gradient|| = $(sqrt(sum(df(xs[end]).^2)))")
end
Path length = 47, ||gradient|| = 7.870693264979224e-5
Path length = 19, ||gradient|| = 2.7076595197618347e-5
Path length = 34, ||gradient|| = 6.996634619625892e-5
[Figure: contour plot of \(f(x)\) with the three gradient descent paths using line searches for \(\alpha\)]

15.1.2.2. Numerical gradients#

As an alternative to implementing the gradient function df by hand as above, the gradient can be computed numerically using finite differences:

\[ \left(\nabla f(x)\right)_k \approx \frac{f(x + \epsilon d^k) - f(x - \epsilon d^k)}{2\epsilon} \]

Here, \(\epsilon\) is a (small) step size parameter, and the vector \(d^k\) is defined by \((d^k)_j = \delta_{jk}\) (that is, a zero vector with a single 1 at position \(k\)).

function finite_difference_gradient(f, x, ϵ=1e-5)
    n = length(x)
    df = zeros(n)
    for k = 1:n
        # Central difference in coordinate k
        x1 = copy(x)
        x1[k] += ϵ
        fP = f(x1)
        x1[k] -= 2ϵ
        fM = f(x1)
        df[k] = (fP - fM) / 2ϵ
    end
    return df
end
finite_difference_gradient (generic function with 2 methods)
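As a quick sanity check (an added illustration, not part of the original text), the numerical gradient can be compared with the hand-coded df at an arbitrary test point; the two should agree to several digits:

x_test = [1.0, 0.5]
println("numerical: ", finite_difference_gradient(f, x_test))
println("analytic:  ", df(x_test))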

We can now run the previous test case, without having to compute df:

fplot(f)
title("Gradient descent, line searches for \$\\alpha\$, numerical gradients")
for x0 in x0s
    num_df(x) = finite_difference_gradient(f, x)
    xs = gradient_descent_line_search(f, num_df, x0)
    plot_path(xs)
    println("Path length = $(length(xs)), ||gradient|| = $(sqrt(sum(df(xs[end]).^2)))")
end
Path length = 47, ||gradient|| = 7.870695958738454e-5
Path length = 19, ||gradient|| = 2.7076776490056162e-5
Path length = 34, ||gradient|| = 6.996618216493087e-5
[Figure: contour plot of \(f(x)\) with the three gradient descent paths using line searches and numerical gradients]

15.1.3. Newton’s method#

Even with line searches, the gradient descent method still suffers from the zig-zag behavior and slow convergence. If second derivatives can be obtained, Newton’s method can converge much faster. A simple way to describe the method is that we change the search direction in gradient descent to \(H(x_k)^{-1}\nabla f(x_k)\) instead of just \(\nabla f(x_k)\), where \(H(x_k)\) is the Hessian matrix.
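In other words, the iteration becomes

\[ x_{k+1} = x_k - \alpha_k H(x_k)^{-1} \nabla f(x_k) \]

where the classical Newton step corresponds to \(\alpha_k = 1\); here we keep the line search to choose \(\alpha_k\).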

We compute \(H(x)\) numerically using central difference expressions similar to the one above, but for second derivatives.
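Concretely, with the unit vectors \(d^i\) as before, the entries are approximated by

\[ H_{ii}(x) \approx \frac{f(x + \epsilon d^i) - 2f(x) + f(x - \epsilon d^i)}{\epsilon^2} \]

\[ H_{ij}(x) \approx \frac{f(x + \epsilon d^i + \epsilon d^j) - f(x - \epsilon d^i + \epsilon d^j) - f(x + \epsilon d^i - \epsilon d^j) + f(x - \epsilon d^i - \epsilon d^j)}{4\epsilon^2}, \quad i \neq j \]

with \(H_{ji} = H_{ij}\) by symmetry. This is what the function below implements.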

function finite_difference_hessian(f, x, ϵ=1e-5)
    n = length(x)
    ddf = zeros(n,n)
    f0 = f(x)
    # Diagonal entries: second derivative in each coordinate
    for i = 1:n
        x1 = copy(x)
        x1[i] += ϵ
        fP = f(x1)
        x1[i] -= 2ϵ
        fM = f(x1)
        ddf[i,i] = (fP - 2*f0 + fM) / ϵ^2
    end
    # Off-diagonal entries: mixed second derivatives (symmetric)
    for i = 1:n, j = i+1:n
        x1 = copy(x)
        x1[i] += ϵ
        x1[j] += ϵ
        fPP = f(x1)
        x1[i] -= 2ϵ
        fMP = f(x1)
        x1[j] -= 2ϵ
        fMM = f(x1)
        x1[i] += 2ϵ
        fPM = f(x1)
        ddf[i,j] = (fPP - fMP - fPM + fMM) / 4ϵ^2
        ddf[j,i] = ddf[i,j]
    end
    return ddf
end
finite_difference_hessian (generic function with 2 methods)

We implement Newton’s method by re-using our gradient descent method, passing the Newton search direction in place of the gradient.
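Concretely, the only change is to supply a function that returns \(H(x)^{-1}\nabla f(x)\). A minimal sketch of such a helper (the name newton_direction is ours; the loop below builds the same expression inline as search_dir) is:

# Newton search direction: solve H(x) p = ∇f(x) for p using the numerical
# Hessian and gradient (sketch; equivalent to search_dir in the loop below)
newton_direction(x) = finite_difference_hessian(f, x) \ finite_difference_gradient(f, x)

It could then be passed to gradient_descent_line_search in place of df.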

Newton’s method is more sensitive to the initial guess \(x_0\), and in order to converge we have to move two of our initial guesses closer to their respective minima. However, when the method converges it is very fast.

fplot(f)
title("Newton's method, line searches for \$\\alpha\$, numerical gradients and Hessians")
x0s = [[-2,0.0], [0,.5], [2.,0.]];
for x0 in x0s
    search_dir(x) = finite_difference_hessian(f,x) \ finite_difference_gradient(f, x)
    xs = gradient_descent_line_search(f, search_dir, x0)
    plot_path(xs)
    println("Path length = $(length(xs)), ||gradient|| = $(sqrt(sum(df(xs[end]).^2)))")
end
Path length = 5, ||gradient|| = 2.459525335329783e-6
Path length = 4, ||gradient|| = 1.222661468371642e-5
Path length = 6, ||gradient|| = 1.7631665079368057e-8
[Figure: contour plot of \(f(x)\) with the three Newton paths using line searches, numerical gradients, and Hessians]