Differentiation is a key concept in machine learning, especially when optimizing functions like loss functions in neural networks. It helps us find the minimum of these functions, which is crucial for tasks like training a model. But have you ever wondered how popular libraries like TensorFlow and PyTorch perform differentiation? Let’s break it down!

1. Manual Differentiation: The Old-School Method

In school, we learn how to compute derivatives by hand using calculus: you apply a set of rules to a function to find how it changes with respect to each of its inputs. For example, given a simple function like:

f(x, y) = x²y + y + 2

we can compute the partial derivatives with respect to x and y as follows:

∂f/∂x = 2xy
∂f/∂y = x² + 1

This method works well for simple functions, but as functions become more complex, the process of differentiation becomes tedious and prone to errors. It’s not scalable for large models, especially those used in machine learning.
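As a quick sanity check, here is a purely illustrative snippet (the helper names df_dx and df_dy are made up for this example) that hard-codes the hand-derived formulas and evaluates them at (x, y) = (3, 4), the point used throughout the rest of this article:

# Hand-derived partial derivatives of f(x, y) = x^2 * y + y + 2
def df_dx(x, y):
    return 2 * x * y    # d/dx (x^2 * y + y + 2) = 2xy

def df_dy(x, y):
    return x**2 + 1     # d/dy (x^2 * y + y + 2) = x^2 + 1

print(df_dx(3, 4))  # 24
print(df_dy(3, 4))  # 10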

2. Finite Difference Approximation: A Simpler, But Less Accurate Method

Finite difference approximation is a numerical method for calculating derivatives without explicit formulas. It approximates the derivative as the slope of a secant line between two nearby points on the function. The derivative at a point x₀ is defined as the limit:

f′(x₀) = lim(ε→0) [f(x₀ + ε) − f(x₀)] / ε

The finite difference approximation simply drops the limit and uses a small but finite value ε (e.g., 10^(-5)):

f′(x₀) ≈ [f(x₀ + ε) − f(x₀)] / ε

Example with Python Code

# f(x, y) = x^2 y + y + 2, evaluated at (x, y) = (3, 4)
def f(x, y):
    return x**2 * y + y + 2

# Approximate a partial derivative by nudging one input by a tiny epsilon.
# The other epsilon is passed as 0, so the denominator is just the nudge size.
def derivative(f, x, y, x_eps, y_eps):
    return (f(x + x_eps, y + y_eps) - f(x, y)) / (x_eps + y_eps)

df_dx = derivative(f, 3, 4, 0.00001, 0)  # nudge x only
df_dy = derivative(f, 3, 4, 0, 0.00001)  # nudge y only

print(f"df_dx: {df_dx}")  # 24.000039999805264 (exact value: 2xy = 24)
print(f"df_dy: {df_dy}")  # 10.000000000331966 (exact value: x^2 + 1 = 10)

Limitations

While simple, finite difference approximation is imprecise (the result depends on the choice of ε and is affected by floating-point round-off) and becomes inefficient as the number of parameters grows. If there were 1000 parameters, we would need to call f() at least 1001 times: once at the base point and once more per parameter. For large neural networks with millions of parameters, this makes finite difference approximation far too inefficient.
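To make that cost explicit, here is a minimal sketch (the numeric_gradient helper and the quadratic example are illustrative, not taken from any library): it calls f once at the base point and once more per parameter.

import numpy as np

def numeric_gradient(f, params, eps=1e-5):
    # Forward-difference gradient of f at `params`.
    # Requires len(params) + 1 calls to f: one base call plus one per parameter.
    base = f(params)                      # 1 call
    grad = np.zeros_like(params, dtype=float)
    for i in range(len(params)):          # len(params) more calls
        nudged = params.copy()
        nudged[i] += eps
        grad[i] = (f(nudged) - base) / eps
    return grad

# Example: a simple quadratic "loss" with 1000 parameters
theta = np.arange(1000, dtype=float)
loss = lambda p: np.sum(p**2)
print(numeric_gradient(loss, theta)[:3])  # approximately [0., 2., 4.], i.e. 2 * theta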

3. Forward-Mode Autodiff

The figure below demonstrates forward-mode autodiff applied to the simple function 𝑔(𝑥,𝑦) = 5 + 𝑥𝑦. The left graph shows the function, while the right graph represents the partial derivative ∂𝑔/∂𝑥 = 0 + (0⋅𝑥 + 𝑦⋅1) = 𝑦. A similar process can be used to obtain the partial derivative with respect to 𝑦.

[Figure: computation graph of 𝑔(𝑥,𝑦) = 5 + 𝑥𝑦 (left) and the forward-mode autodiff graph for ∂𝑔/∂𝑥 (right)]

The forward-mode autodiff algorithm works by traversing the computation graph from the inputs to the output. It starts by computing the partial derivative of each leaf node with respect to 𝑥. For instance, the constant node 5 returns 0 (since the derivative of a constant is 0), 𝑥 returns 1 (since ∂𝑥/∂𝑥=1), and 𝑦 returns 0 (since ∂𝑦/∂𝑥=0).

Next, we move to the multiplication node, where the product rule is applied: ∂(𝑢⋅𝑣)/∂𝑥=∂𝑣/∂𝑥⋅𝑢+𝑣⋅∂𝑢/∂𝑥. This gives the expression 0⋅𝑥+𝑦⋅1.

Finally, at the addition node, the derivative of a sum is the sum of the derivatives, so we combine the parts to get ∂𝑔/∂𝑥=0+(0⋅𝑥+𝑦⋅1).

The expression can of course be simplified further to ∂𝑔/∂𝑥 = 𝑦, but imagine our function also had a variable 𝑧: we would need to traverse the entire graph again to obtain ∂𝑔/∂𝑧. In real-world scenarios, functions have many more input variables, and since forward-mode autodiff requires one full pass through the graph per input, the differentiation process quickly becomes time-consuming.
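A common way to implement forward-mode autodiff in code is with dual numbers, which carry a value and a derivative through every operation. The Dual class below is a toy sketch of this idea, assuming only addition and multiplication are needed; it is not how any particular library implements it.

class Dual:
    """A number paired with its derivative with respect to one chosen input."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (u * v)' = u' * v + u * v'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

def g(x, y):
    return 5 + x * y

# To get dg/dx at (x, y) = (3, 4), seed x with derivative 1 and y with 0.
res = g(Dual(3.0, 1.0), Dual(4.0, 0.0))
print(res.value, res.deriv)  # 17.0 4.0  ->  g(3, 4) = 17, dg/dx = y = 4

Note that computing ∂𝑔/∂𝑦 requires a second call with the derivative seeds swapped (𝑥 seeded with 0 and 𝑦 with 1), which is exactly the one-pass-per-input limitation described above.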

4. Reverse-Mode Autodiff

Reverse-mode autodiff is a powerful technique commonly used in machine learning. It involves two passes through the computation graph. The first pass computes the value of each node from inputs to output. The second pass works in reverse (from output to inputs) to compute all partial derivatives. This process is known as “reverse mode” because gradients flow backward through the graph.

The graph below illustrates the second pass. During the first pass, all node values are computed starting from 𝑥 = 3 and 𝑦 = 4, with the result shown at the bottom of each node (e.g., 𝑥 × 𝑥 = 9). The output node, 𝑓(3,4), evaluates to 42.

[Figure: reverse-mode autodiff on 𝑓(𝑥,𝑦) = 𝑥²𝑦 + 𝑦 + 2 at 𝑥 = 3, 𝑦 = 4, showing node values from the forward pass and partial derivatives from the reverse pass]

The reverse pass applies the chain rule to compute the partial derivatives. The chain rule is given by:

∂f/∂x = ∂f/∂n𝑖 ⋅ ∂n𝑖/∂x

where each n𝑖 represents an intermediate node in the graph.

Example of Calculating Partial Derivatives

Let’s go through the reverse pass step by step:

1. At n7 (the output node):

Since n7 is the output node, f = n7, so the derivative of f with respect to n7 is simply:

∂f/∂n7 = 1

2. At n5:

Since n7 = n6 + n5, we have ∂n7/∂n5 = 1. Applying the chain rule:

∂f/∂n5 = ∂f/∂n7 ⋅ ∂n7/∂n5 = 1 × 1 = 1

3. At n4:

Since n5 = n4 × n2, we have ∂n5/∂n4 = n2. Applying the chain rule:

∂f/∂n4 = ∂f/∂n5 ⋅ ∂n5/∂n4 = 1 × n2 = 4

This process continues for all the remaining nodes in the graph. In the end, we have all the partial derivatives of 𝑓(𝑥,𝑦) at 𝑥 = 3 and 𝑦 = 4. In this example, we find:

∂f/∂x = 24 and ∂f/∂y = 10

which matches the values we obtained earlier with finite difference approximation.
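To make the two passes concrete, here is a toy reverse-mode sketch built around a small Node class (an illustration of the idea, not how TensorFlow or PyTorch actually implement their autograd engines): the forward pass records each node's parents and the local derivatives, and the reverse pass pushes gradients back along those edges with the chain rule.

class Node:
    """A scalar value in the computation graph, plus its accumulated gradient."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # list of (parent_node, local_derivative) pairs
        self.grad = 0.0

def add(a, b):
    # d(a + b)/da = 1 and d(a + b)/db = 1
    return Node(a.value + b.value, [(a, 1.0), (b, 1.0)])

def mul(a, b):
    # d(a * b)/da = b and d(a * b)/db = a
    return Node(a.value * b.value, [(a, b.value), (b, a.value)])

def backward(output):
    # Reverse pass: send each gradient contribution back along every edge.
    # (Real frameworks traverse the graph in reverse topological order instead.)
    stack = [(output, 1.0)]  # df/df = 1 at the output node
    while stack:
        node, grad = stack.pop()
        node.grad += grad
        for parent, local_deriv in node.parents:
            stack.append((parent, grad * local_deriv))  # chain rule

# Forward pass: f(x, y) = x^2 * y + y + 2 at (x, y) = (3, 4)
x, y, two = Node(3.0), Node(4.0), Node(2.0)
f = add(add(mul(mul(x, x), y), y), two)
print(f.value)         # 42.0

# Reverse pass
backward(f)
print(x.grad, y.grad)  # 24.0 10.0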

Key Benefits of Reverse-Mode Autodiff

Reverse-mode autodiff is particularly efficient when there are many inputs and fewer outputs, as it only requires one forward pass and one reverse pass per output. This makes it ideal for machine learning tasks, especially when training models like neural networks where there is typically only one output (e.g., the loss). This means only two passes through the graph are needed to compute all the gradients with respect to the model parameters.
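This is also the workflow exposed by deep learning frameworks. For example, PyTorch's autograd engine uses reverse-mode autodiff, so the running example looks like this (shown purely as an illustration):

import torch

# f(x, y) = x^2 * y + y + 2 at (x, y) = (3, 4)
x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)

f = x**2 * y + y + 2   # forward pass builds the computation graph
f.backward()           # reverse pass computes all gradients

print(f.item())                      # 42.0
print(x.grad.item(), y.grad.item())  # 24.0 10.0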

Reverse-mode autodiff can also handle functions that are not differentiable everywhere, as long as the partial derivatives are only evaluated at points where the function is differentiable.

Conclusion

Having explored how computers perform differentiation, it’s clear why reverse-mode autodiff is the dominant solution in machine learning. The key advantage of reverse-mode is that it only requires two passes—one forward and one reverse—per output, regardless of the number of inputs. This makes it highly efficient, especially for large models where there are many parameters but typically only a single output, like the loss function in neural networks. Reverse-mode autodiff allows for fast, scalable optimization, making it the primary approach in modern machine learning frameworks.
