<p>Based on this graphical representation, there are two major ways to apply the chain rule: the forward differentiation mode and the reverse differentiation mode (not “backward differentiation”, which is a method for solving ordinary differential equations). Next, we introduce these two modes.</p>
<section class="level3" id="forward-mode">
<h3>Forward Mode</h3>
<p>Our target is to calculate <span class="math inline">\(\frac{\partial~y}{\partial~x_0}\)</span> (the partial derivative with regard to <span class="math inline">\(x_1\)</span> can be obtained in the same way). But hold your horses: let’s start with some earlier intermediate results that might be helpful. For example, what is <span class="math inline">\(\frac{\partial~x_0}{\partial~x_1}\)</span>? It’s 0. Also, <span class="math inline">\(\frac{\partial~x_1}{\partial~x_1} = 1\)</span>. Now, things get a bit trickier: what is <span class="math inline">\(\frac{\partial~v_3}{\partial~x_0}\)</span>? It is a good time to use the chain rule:</p>
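<p>The concrete expansion depends on how <span class="math inline">\(v_3\)</span> is computed in the graph above, which is not reproduced here; in general, for an intermediate node <span class="math inline">\(v_i\)</span> whose direct inputs are the nodes <span class="math inline">\(v_j\)</span>, the chain rule reads:</p>
<p><span class="math display">\[\frac{\partial~v_i}{\partial~x_0} = \sum_{v_j~\in~\textrm{inputs}(v_i)} \frac{\partial~v_i}{\partial~v_j}~\frac{\partial~v_j}{\partial~x_0}.\]</span></p>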
<p>After calculating <span class="math inline">\(\frac{\partial~v_3}{\partial~x_0}\)</span>, we can then proceed to the derivatives of <span class="math inline">\(v_5\)</span>, <span class="math inline">\(v_6\)</span>, all the way to that of <span class="math inline">\(v_9\)</span>, which is also the output <span class="math inline">\(y\)</span> we are looking for. This process starts with the input variables and ends with the output variable; therefore, it is called <em>forward differentiation</em>. We can simplify the notation in this process by letting <span class="math inline">\(\dot{v_i}=\frac{\partial~(v_i)}{\partial~x_0}\)</span>. The <span class="math inline">\(\dot{v_i}\)</span> here is called the <em>tangent</em> of the function <span class="math inline">\(v_i(x_0, x_1, \ldots, x_n)\)</span> with regard to the input variable <span class="math inline">\(x_0\)</span>, and the original computation results at each intermediate point are called <em>primal</em> values. The forward differentiation mode is sometimes also called the “tangent linear” mode.</p>
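<p>To make the idea concrete, below is a small self-contained sketch of forward mode using dual numbers, where every value carries its primal together with its tangent. This is an illustration only, not how Owl’s AD module is implemented; the operator set and the test expression <code>sin (x0 * x1) + x0</code> are chosen arbitrarily for this example.</p>
<div class="highlight">
<pre><code class="language-ocaml">(* forward mode with dual numbers: p is the primal, t is the tangent *)
type dual = { p : float; t : float }

let add a b = { p = a.p +. b.p; t = a.t +. b.t }
let mul a b = { p = a.p *. b.p; t = a.t *. b.p +. a.p *. b.t }
let sin_d a = { p = sin a.p; t = a.t *. cos a.p }

(* differentiate y = sin (x0 * x1) + x0 with regard to x0 at (1, 1):
   seed x0 with tangent 1 and x1 with tangent 0 *)
let () =
  let x0 = { p = 1.; t = 1. } and x1 = { p = 1.; t = 0. } in
  let y = add (sin_d (mul x0 x1)) x0 in
  Printf.printf "y = %g, dy/dx0 = %g\n" y.p y.t
</code></pre>
</div>
<p>Each arithmetic rule simply encodes the derivative of that primitive, so the tangent of the output is available as soon as the primal computation finishes, in a single forward sweep.</p>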
<p>Now we can present the full forward differentiation calculation process, as shown in tbl. 2. Two simultaneous computing processes take place, shown as two separate columns: the left side is the computation procedure specified by eq. 5; the right side shows the computation of the derivative of each intermediate variable with regard to <span class="math inline">\(x_0\)</span>. Let’s find out <span class="math inline">\(\dot{y}\)</span> when setting <span class="math inline">\(x_0 = 1\)</span> and <span class="math inline">\(x_1 = 1\)</span>.</p>
<p>Let <span class="math inline">\(\bar{v_i} = \frac{\partial~y}{\partial~v_i}\)</span> be the derivative of the output variable <span class="math inline">\(y\)</span> with regard to the intermediate node <span class="math inline">\(v_i\)</span>. It is called the <em>adjoint</em> of variable <span class="math inline">\(v_i\)</span> with respect to the output variable <span class="math inline">\(y\)</span>. Using this notation, eq. 6 can be expressed as:</p>
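<p>The exact right-hand side of eq. 6 depends on the computation graph, but the general pattern it follows is that the adjoint of a node accumulates one contribution from every node that directly consumes it:</p>
<p><span class="math display">\[\bar{v_i} = \sum_{v_j~\in~\textrm{consumers}(v_i)} \bar{v_j}~\frac{\partial~v_j}{\partial~v_i}.\]</span></p>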
<p>Note the difference between tangent and adjoint. In the forward mode, we know <span class="math inline">\(\dot{v_0}\)</span> and <span class="math inline">\(\dot{v_1}\)</span>, then we calculate <span class="math inline">\(\dot{v_2}\)</span>, <span class="math inline">\(\dot{v_3}\)</span>, …, and finally we have <span class="math inline">\(\dot{v_9}\)</span>, which is the target. In the reverse mode, we start by knowing <span class="math inline">\(\bar{v_9} = 1\)</span>, then we calculate <span class="math inline">\(\bar{v_8}\)</span>, <span class="math inline">\(\bar{v_7}\)</span>, …, and finally we have <span class="math inline">\(\bar{v_0} = \frac{\partial~y}{\partial~v_0} = \frac{\partial~y}{\partial~x_0}\)</span>, which is also exactly our target. Again, <span class="math inline">\(\dot{v_9} = \bar{v_0}\)</span> in this example, given that <span class="math inline">\(\dot{v_9}\)</span> denotes the derivative with regard to <span class="math inline">\(x_0\)</span>. Because of this line of calculation, the reverse differentiation mode is also called the <em>adjoint mode</em>.</p>
<p>With that in mind, let’s see the full steps of performing reverse differentiation. First, we need to perform a forward pass to compute the required intermediate values, as shown in tbl. 3.</p>
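<p>As with the forward mode above, a tiny hand-rolled sketch may help fix the mechanics in mind. It is not Owl’s implementation: every node stores its primal value, an adjoint accumulator, and a closure that pushes its adjoint back to its inputs. For simplicity it assumes each intermediate value is consumed exactly once; a real implementation would order the backward sweep topologically (e.g. with a tape).</p>
<div class="highlight">
<pre><code class="language-ocaml">(* reverse mode: v is the primal, adj accumulates dy/dv,
   push propagates the adjoint to the node's inputs *)
type node = { v : float; adj : float ref; push : unit -> unit }

let leaf v = { v; adj = ref 0.; push = (fun () -> ()) }

let mul a b =
  let adj = ref 0. in
  { v = a.v *. b.v; adj;
    push = (fun () ->
      a.adj := !(a.adj) +. !adj *. b.v;   (* d(a*b)/da = b *)
      b.adj := !(b.adj) +. !adj *. a.v;   (* d(a*b)/db = a *)
      a.push (); b.push ()) }

let sin_n a =
  let adj = ref 0. in
  { v = sin a.v; adj;
    push = (fun () ->
      a.adj := !(a.adj) +. !adj *. cos a.v;
      a.push ()) }

(* y = sin (x0 * x1): seed the output adjoint with 1, then sweep backwards *)
let () =
  let x0 = leaf 1. and x1 = leaf 1. in
  let y = sin_n (mul x0 x1) in
  y.adj := 1.;
  y.push ();
  Printf.printf "dy/dx0 = %g, dy/dx1 = %g\n" !(x0.adj) !(x1.adj)
</code></pre>
</div>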
<p>What we have seen so far are the basics of the AD module. There might be cases where you do need to operate these low-level functions to build your own applications (e.g., implementing a neural network), and then knowing the mechanisms behind the scenes is definitely a big plus. However, using these complex low-level functions hinders the daily use of algorithmic differentiation in numerical computing tasks. In reality, you don’t really need to worry about the forward or reverse mode if you simply use high-level APIs such as <code>diff</code>, <code>grad</code>, <code>hessian</code>, etc. They are all built on the forward and reverse modes we have seen, but provide clean interfaces that hide most of the details from users. In this section we introduce how to use these high-level APIs.</p>
<p>The most basic and commonly used differentiation function calculates the <em>derivative</em> of a function. The AD module provides the <code>diff</code> function for this task. Given a function <code>f</code> that takes a scalar as input and returns a scalar value, we can calculate its derivative at a point <code>x</code> by <code>diff f x</code>, as shown in its function signature.</p>
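<p>As a quick illustration, assuming the double-precision <code>Algodiff.D</code> module (with its <code>F</code> constructor and <code>unpack_flt</code> helper for boxing and unboxing scalar values), the derivative of <code>sin</code> at <code>1.</code> can be computed as follows; the test function is arbitrary.</p>
<div class="highlight">
<pre><code class="language-ocaml">open Owl
open Algodiff.D

(* f takes and returns an AD scalar *)
let f x = Maths.sin x

let () =
  let d = diff f (F 1.) |> unpack_flt in
  Printf.printf "diff sin at 1 = %g (cos 1 = %g)\n" d (cos 1.)
</code></pre>
</div>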
<p>Another way to extend the gradient is to find the second order derivatives of a multivariate function which takes <span class="math inline">\(n\)</span> input variables and outputs a scalar. Its second order derivatives can be organised as a matrix:</p>
<p><span class="math display">\[ \mathbf{H}(y) = \left[ \begin{matrix} \frac{\partial^2~y}{\partial~x_1^2} & \frac{\partial^2~y}{\partial~x_1~x_2} & \ldots & \frac{\partial^2~y}{\partial~x_1~x_n} \\ \frac{\partial^2~y}{\partial~x_2~x_1} & \frac{\partial^2~y}{\partial~x_2^2} & \ldots & \frac{\partial^2~y}{\partial~x_2~x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2~y}{\partial~x_n~x_1} & \frac{\partial^2~y}{\partial~x_n~x_2} & \ldots & \frac{\partial^2~y}{\partial~x_n^2} \end{matrix} \right]\]</span></p>
<p>This matrix is called the <em>Hessian matrix</em>. As an example of using it, consider <em>Newton’s method</em>, which is also used for solving optimisation problems, i.e. finding the minimum of a function. Instead of following the direction of the gradient alone, Newton’s method combines the gradient with second order derivatives, using the update term <span class="math inline">\((\nabla^{2}~f(x_n))^{-1}\nabla~f(x_n)\)</span>. Specifically, starting from a random position <span class="math inline">\(x_0\)</span>, the estimate is iteratively updated by repeating this procedure until convergence, as shown in eq. 8.</p>
<p>This process can be easily represented using the <code>Algodiff.D.hessian</code> function.</p>
<div class="highlight">
<pre><code class="language-ocaml">open Algodiff.D
</section>
<section class="level3" id="other-apis">
<h3>Other APIs</h3>
<p>Besides, there are also many helper functions, such as <code>jacobianv</code> for calculating the Jacobian-vector product, and <code>diff'</code> for calculating both <code>f x</code> and <code>diff f x</code> in one call. They come in handy in certain cases, as the short example below shows. Beyond the functions we have already introduced, the complete list of APIs can be found in the table that follows.</p>
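<p>For instance, <code>diff'</code> returns the primal value and the derivative in one pass, which avoids evaluating the function twice. A small sketch, again assuming the <code>Algodiff.D</code> module and an arbitrary test function:</p>
<div class="highlight">
<pre><code class="language-ocaml">open Owl
open Algodiff.D

let f x = Maths.(sin x * x)

let () =
  (* y is f 1. and dy is its derivative at the same point *)
  let y, dy = diff' f (F 1.) in
  Printf.printf "f 1 = %g, f' 1 = %g\n" (unpack_flt y) (unpack_flt dy)
</code></pre>
</div>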
<div id="tbl:algodiff:apis">
<table>
<caption>Table 5: List of other APIs in the AD module of Owl</caption>