17.2. The Optim.jl Package#

While building our own gradient descent and Newton methods is great for learning, practical applications require professional, optimized tools. In Julia, the Optim.jl package provides several high-performance optimizers.

It saves us from having to manually implement:

  • Robust line searches (e.g., Wolfe conditions)

  • Hessian approximations (quasi-Newton methods)

  • Trust-region methods

  • Stopping criteria (convergence checks)

We can simply provide our objective function f and, optionally, our gradient g and Hessian h. Optim.jl will then select an appropriately powerful algorithm to find the minimum.
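As a quick sketch of the calling pattern used throughout this section, consider a toy quadratic (the names fq, gq, hq, and x0q below are purely illustrative and are not the f from the previous section). The same optimize entry point accepts progressively more derivative information, and stopping criteria live in Optim.Options:

using Optim

# Toy quadratic with minimum at (1, 2) -- illustrative only
fq(x) = (x[1] - 1)^2 + 2 * (x[2] - 2)^2
gq(x) = [2 * (x[1] - 1), 4 * (x[2] - 2)]   # gradient
hq(x) = [2.0 0.0; 0.0 4.0]                 # Hessian (constant for a quadratic)

x0q = [0.0, 0.0]

optimize(fq, x0q)                          # derivative-free (Nelder-Mead)
optimize(fq, gq, x0q; inplace=false)       # first-order (L-BFGS)
optimize(fq, gq, hq, x0q; inplace=false)   # second-order (Newton)

# Stopping criteria and other settings are passed via Optim.Options
optimize(fq, x0q, NelderMead(), Optim.Options(iterations=1_000, g_tol=1e-10))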

using Optim, Plots

# First include all functions from the previous section
using NBInclude
@nbinclude("Gradient_Based_Optimization.ipynb");
Path 1: Length=501, ||gradient|| = 1.4715843307921153
Path 2: Length=67, ||gradient|| = 8.918594790414185e-5
Path 3: Length=501, ||gradient|| = 1.7575802156689104
Path 1: Length=47, ||gradient|| = 7.870693265014514e-5
Path 2: Length=19, ||gradient|| = 2.707659519760321e-5
Path 3: Length=34, ||gradient|| = 6.996634619625889e-5
Path 1: Length=47, ||gradient|| = 7.870696143799702e-5
Path 2: Length=19, ||gradient|| = 2.7076774319819133e-5
Path 3: Length=34, ||gradient|| = 6.996618451764213e-5
Path 1: Length=5, ||gradient|| = 2.4554302167287822e-6
Path 2: Length=4, ||gradient|| = 1.2227294235742997e-5
Path 3: Length=6, ||gradient|| = 1.75484871960624e-8

17.2.1. Method 1: Derivative-Free Optimization#

In the simplest case, we can provide only the function f and a starting point x0.

Optim.jl will automatically select a zeroth-order (derivative-free) method, like Nelder-Mead. This is a great option when your function’s gradient is unknown, complex, or undefined.

# Our starting point
x0 = [0.0, 1.0]

# Call optimize with only the function and start point
result_nm = optimize(f, x0)

# The output shows it found a minimum using the Nelder-Mead algorithm.
 * Status: success

 * Candidate solution
    Final objective value:     -2.072854e-01

 * Found with
    Algorithm:     Nelder-Mead

 * Convergence measures
    √(Σ(yᵢ-ȳ)²)/n ≤ 1.0e-08

 * Work counters
    Seconds run:   0  (vs limit Inf)
    Iterations:    34
    f(x) calls:    66
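The result object can also be queried programmatically instead of reading the printed summary; the accessors below are part of Optim.jl's exported API:

# Query the result object directly
Optim.minimizer(result_nm)    # the candidate point x*
Optim.minimum(result_nm)      # the objective value f(x*)
Optim.converged(result_nm)    # true if a convergence criterion was met
Optim.iterations(result_nm)   # number of iterations performed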

17.2.2. Method 2: First-Order (Gradient) Optimization#

If we provide the gradient, Optim.jl can use a much more powerful first-order method. It will automatically select L-BFGS, a famous quasi-Newton method. This is like our gradient descent, but it cleverly builds an approximation of the Hessian using only gradient information, making it converge much faster.

# Define our gradient function (using our numerical version)
g(x) = finite_difference_gradient(f, x)

# Call optimize with f, g, and x0
result_lbfgs = optimize(f, g, x0; inplace=false)

# The output shows it used L-BFGS and took far fewer iterations than Nelder-Mead.
 * Status: success

 * Candidate solution
    Final objective value:     -1.000000e+00

 * Found with
    Algorithm:     L-BFGS

 * Convergence measures
    |x - x'|               = 4.31e-06 ≰ 0.0e+00
    |x - x'|/|x'|          = 8.04e-07 ≰ 0.0e+00
    |f(x) - f(x')|         = 1.36e-11 ≰ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 1.36e-11 ≰ 0.0e+00
    |g(x)|                 = 6.88e-10 ≤ 1.0e-08

 * Work counters
    Seconds run:   0  (vs limit Inf)
    Iterations:    7
    f(x) calls:    20
    ∇f(x) calls:   20
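As an aside, the inplace=false keyword is needed because our g returns a freshly allocated vector. Optim.jl's default convention is a mutating gradient g!(G, x) that writes into a preallocated array G, which avoids an allocation on every call. A minimal sketch, reusing the numerical gradient from the previous section:

# In-place gradient: write ∇f(x) into the preallocated array G
function g!(G, x)
    G .= finite_difference_gradient(f, x)
    return G
end

# With a mutating gradient, inplace=false is no longer needed
result_lbfgs_inplace = optimize(f, g!, x0, LBFGS())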

17.2.3. Method 3: Second-Order (Hessian) Optimization#

If we can also provide the Hessian, Optim.jl will use a full second-order method; here it selects Newton’s method (a more robust trust-region variant, NewtonTrustRegion, is also available). This is a polished version of the Newton’s method we built ourselves: it uses the full curvature information in the Hessian to take very direct, fast steps toward the minimum.

# Define our Hessian function (using our numerical version)
h(x) = finite_difference_hessian(f, x)

# Call optimize with f, g, h, and x0
result_newton = optimize(f, g, h, x0; inplace=false)

# This is very fast, converging in only a few steps.
 * Status: success

 * Candidate solution
    Final objective value:     -2.072854e-01

 * Found with
    Algorithm:     Newton's Method

 * Convergence measures
    |x - x'|               = 2.51e-05 ≰ 0.0e+00
    |x - x'|/|x'|          = 2.96e-05 ≰ 0.0e+00
    |f(x) - f(x')|         = 2.16e-10 ≰ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 1.04e-09 ≰ 0.0e+00
    |g(x)|                 = 7.38e-10 ≤ 1.0e-08

 * Work counters
    Seconds run:   0  (vs limit Inf)
    Iterations:    5
    f(x) calls:    18
    ∇f(x) calls:   18
    ∇²f(x) calls:  5
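The Hessian can be supplied in the same mutating style, h!(H, x), together with the in-place g! sketched above, again removing the need for inplace=false:

# In-place Hessian: write ∇²f(x) into the preallocated matrix H
function h!(H, x)
    H .= finite_difference_hessian(f, x)
    return H
end

result_newton_inplace = optimize(f, g!, h!, x0, Newton())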

17.2.4. Method 4: Automatic Differentiation (The Best of All Worlds)#

Manually writing g and h (even our numerical versions) is work. The best solution is often to use Automatic Differentiation (AD).

We can pass autodiff=:forward to optimize, and Optim.jl will use forward-mode automatic differentiation (via ForwardDiff.jl) to compute exact gradients and Hessians directly from our code for f, at machine speed. One caveat: if we pass only f and x0 without naming a method, Optim.jl still defaults to the derivative-free Nelder-Mead, so the autodiff keyword has no effect, as the output below shows. The payoff comes when we combine it with a derivative-based method such as Newton() or BFGS().

# Call optimize using forward-mode automatic differentiation
# We ONLY provide f and x0!
result_ad = optimize(f, x0, autodiff=:forward)

# Look at the output: because we did not request a derivative-based method,
# Optim.jl still fell back to Nelder-Mead, and the autodiff keyword had no effect.
# To get the power of Newton's method with this simplicity, we must name the method, as below.
 * Status: success

 * Candidate solution
    Final objective value:     -2.072854e-01

 * Found with
    Algorithm:     Nelder-Mead

 * Convergence measures
    √(Σ(yᵢ-ȳ)²)/n ≤ 1.0e-08

 * Work counters
    Seconds run:   0  (vs limit Inf)
    Iterations:    34
    f(x) calls:    66
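Under the hood, autodiff=:forward relies on ForwardDiff.jl, which we can also call directly, for example to check that the AD derivatives agree with our finite-difference versions (a small sketch, assuming ForwardDiff.jl is installed):

using ForwardDiff

# Exact (to floating-point precision) derivatives of f at x0 via forward-mode AD
grad_ad = ForwardDiff.gradient(f, x0)
hess_ad = ForwardDiff.hessian(f, x0)

# These should match the finite-difference versions up to truncation error
grad_ad .- g(x0)
hess_ad .- h(x0)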

To actually exploit AD, we name the method. Passing Newton() together with autodiff=:forward lets Optim.jl generate both the gradient and the Hessian for us. Note the extremely fast convergence (small number of iterations):

result_newton_ad = optimize(f, x0, Newton(), autodiff=:forward)
 * Status: success

 * Candidate solution
    Final objective value:     -2.072854e-01

 * Found with
    Algorithm:     Newton's Method

 * Convergence measures
    |x - x'|               = 2.50e-05 ≰ 0.0e+00
    |x - x'|/|x'|          = 2.95e-05 ≰ 0.0e+00
    |f(x) - f(x')|         = 2.14e-10 ≰ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 1.03e-09 ≰ 0.0e+00
    |g(x)|                 = 7.31e-10 ≤ 1.0e-08

 * Work counters
    Seconds run:   0  (vs limit Inf)
    Iterations:    5
    f(x) calls:    18
    ∇f(x) calls:   18
    ∇²f(x) calls:  5

Finally, we run the BFGS solver with automatic differentiation. This is a widely used method, since it achieves convergence comparable to Newton’s method without requiring explicit Hessian matrices:

result_bfgs_ad = optimize(f, x0, BFGS(), autodiff=:forward)
 * Status: success

 * Candidate solution
    Final objective value:     -1.000000e+00

 * Found with
    Algorithm:     BFGS

 * Convergence measures
    |x - x'|               = 1.70e-07 ≰ 0.0e+00
    |x - x'|/|x'|          = 3.18e-08 ≰ 0.0e+00
    |f(x) - f(x')|         = 2.12e-14 ≰ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 2.12e-14 ≰ 0.0e+00
    |g(x)|                 = 1.18e-12 ≤ 1.0e-08

 * Work counters
    Seconds run:   0  (vs limit Inf)
    Iterations:    7
    f(x) calls:    22
    ∇f(x) calls:   22
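With all of these result objects in hand, we can compare the solvers programmatically using the same accessor functions as before (note that some runs converged to a different local minimum):

# Compare iteration counts, function calls, and final objective values
for (name, res) in [("Nelder-Mead", result_nm),
                    ("L-BFGS",      result_lbfgs),
                    ("Newton",      result_newton),
                    ("Newton + AD", result_newton_ad),
                    ("BFGS + AD",   result_bfgs_ad)]
    println(rpad(name, 12), ": iterations = ", Optim.iterations(res),
            ", f calls = ", Optim.f_calls(res),
            ", minimum = ", Optim.minimum(res))
end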

17.2.5. Visualizing Optim’s Iterations#

Just like with our manual optimizers, Optim.jl will find different local minima depending on the starting point. We can pass store_trace=true (and extended_trace=true) so the solver records its path, and then plot that path on our contour map.

# Note: The optimizer will jump outside our initial domain
# Plot a larger domain to see these new minima
fplot(f, xlim=(-5,5), ylim=(-6,1.5), levels=10)
plot!(title="Optim.jl (L-BFGS) from Different Starts")

# The same starting points from our manual notebook
x0s = [[-2.0, -0.5], [0.0, 1.0], [2.0, -1.0]]

for (i, x0) in enumerate(x0s)
    # We'll explicitly choose L-BFGS and ask it to save the trace
    res = optimize(f, g, x0,
                   method=LBFGS(),
                   store_trace=true,
                   extended_trace=true,
                   inplace=false)
    
    # Extract the path from the trace object
    # The trace is a vector of states; we get the position 'x' from each state's metadata
    xs = [s.metadata["x"] for s in res.trace]
    
    # Plot the path
    plot_path(xs, label="Path $i")
    
    # Print the final position (the 'minimizer')
    println("Path $i: Length=$(length(xs)), ||gradient|| = $(sqrt(sum(∇f(xs[end]).^2)))")
end
plot!()

# Note that Paths 2 and 3 "jumped" to different minima
Path 1: Length=8, ||gradient|| = 3.287736589951257e-10
Path 2: Length=8, ||gradient|| = 9.328012128983552e-10
Path 3: Length=8, ||gradient|| = 1.6690603046630815e-10
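When extended_trace=true is set, Optim.jl also offers convenience accessors for the stored trace, which can replace the manual metadata lookup above (treat the exact accessor names as an assumption to verify against your installed version):

# Re-run one solve with the trace stored, then use the trace accessors
res1 = optimize(f, g, x0s[1];
                method=LBFGS(),
                store_trace=true,
                extended_trace=true,
                inplace=false)

Optim.x_trace(res1)       # iterates xₖ along the path
Optim.f_trace(res1)       # objective values f(xₖ)
Optim.g_norm_trace(res1)  # gradient norms ‖∇f(xₖ)‖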