2 Differential of Multivariable Functions
2.1 Derivatives, Differentials, and Directional Derivatives
Definition. We call $\mathbf{x}_0$ an interior point of a set $E \subseteq \mathbb{R}^m$ (we also say $E$ is a neighborhood of $\mathbf{x}_0$ in $\mathbb{R}^m$) if there exists a positive number $\delta_{\mathbf{x}_0} > 0$ such that
\begin{align*}
B(\mathbf{x}_0, \delta_{\mathbf{x}_0}) \subseteq E.
\end{align*}
In other words, $\mathbf{x}_0$ and all points in its immediate vicinity are contained in $E$. The set of all interior points of $E$ is called the interior of $E$, denoted $\operatorname{int} E$.
We call $E$ an open set if $E = \operatorname{int} E$. This means that every point of $E$ is an interior point, or equivalently that $E$ is a neighborhood of each of its points.
Differential
Let $\mathbf{x}_0$ be an interior point of a set $E \subseteq \mathbb{R}^m$. We say that $f: E \to \mathbb{R}^n$ is differentiable at $\mathbf{x}_0$ if there exists a linear mapping $A: \mathbb{R}^m \to \mathbb{R}^n$ such that
\begin{align*}
f(\mathbf{x}_0 + \mathbf{v}) = f(\mathbf{x}_0) + A\mathbf{v} + o(\|\mathbf{v}\|), \quad \mathbf{v} \to \mathbf{0}.
\end{align*}
Equivalently,
\begin{align*}
\|f(\mathbf{x}_0 + \mathbf{v}) - f(\mathbf{x}_0) - A\mathbf{v}\| = o(\|\mathbf{v}\|), \quad \mathbf{v} \to \mathbf{0}.
\end{align*}
In this case, we call the linear mapping $A$ the differential of $f$ at $\mathbf{x}_0$, denoted
\begin{align*}
df(\mathbf{x}_0): \mathbb{R}^m \to \mathbb{R}^n, \quad df(\mathbf{x}_0)(\mathbf{v}) = A\mathbf{v}.
\end{align*}
Theorem 2.1. If $f$ is differentiable at $\mathbf{x}_0$, then $f$ is continuous at $\mathbf{x}_0$.
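As a quick numerical sketch of the definition (with an arbitrarily chosen example map, not part of the text): for $f(x, y) = (x^2 y,\; x + y)$, the linear map $A$ is given by the matrix of partial derivatives, and the ratio $\|f(\mathbf{x}_0+\mathbf{v}) - f(\mathbf{x}_0) - A\mathbf{v}\| / \|\mathbf{v}\|$ should shrink to $0$ as $\mathbf{v} \to \mathbf{0}$:

```python
import math

def f(x, y):
    # example map f: R^2 -> R^2, f(x, y) = (x^2 y, x + y)
    return (x * x * y, x + y)

def df(x, y, v):
    # its differential at (x, y): the linear map with matrix
    # [[2xy, x^2], [1, 1]] applied to v
    return (2 * x * y * v[0] + x * x * v[1], v[0] + v[1])

x0, y0 = 1.0, 2.0
ratios = []
for t in (1e-1, 1e-2, 1e-3):
    v = (0.3 * t, -0.7 * t)               # shrink v toward 0
    fv = f(x0 + v[0], y0 + v[1])
    f0 = f(x0, y0)
    lin = df(x0, y0, v)
    err = math.hypot(fv[0] - f0[0] - lin[0], fv[1] - f0[1] - lin[1])
    ratios.append(err / math.hypot(*v))   # should tend to 0
print(ratios)
```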
Directional Derivative
Let $f: E \to \mathbb{R}^n$, $\mathbf{x}_0 \in E$, and a vector $\mathbf{v} \in \mathbb{R}^m$ satisfy the following: there exists $\delta > 0$ such that for any $0 \le t < \delta$, we have $\mathbf{x}_0 + t\mathbf{v} \in E$.
If the limit
\begin{align*}
\left. \frac{d}{dt} \bigl(f(\mathbf{x}_0 + t\mathbf{v})\bigr) \right|_{t=0} = \lim_{t \to 0} \frac{f(\mathbf{x}_0 + t\mathbf{v}) - f(\mathbf{x}_0)}{t}
\end{align*}
exists, then we denote it by $\frac{\partial f}{\partial \mathbf{v}}(\mathbf{x}_0)$ or $\partial_{\mathbf{v}} f(\mathbf{x}_0)$, and call it the derivative of $f$ at $\mathbf{x}_0$ along the vector $\mathbf{v}$.
In particular, when $\mathbf{v} \in \mathbb{R}^m$ is a unit vector (i.e., $\|\mathbf{v}\| = 1$), we call it the directional derivative of $f$ at $\mathbf{x}_0$ along the direction $\mathbf{v}$. In classical notation,
\begin{align*}
\frac{df}{ds} = \lim_{\Delta s \to 0} \frac{\Delta f}{\Delta s}.
\end{align*}
Theorem 2.2. If $f$ is differentiable at $\mathbf{x}_0$, then $f$ has a derivative at $\mathbf{x}_0$ along every vector $\mathbf{v}$, and
\begin{align*}
df(\mathbf{x}_0)(\mathbf{v}) = \frac{\partial f}{\partial \mathbf{v}}(\mathbf{x}_0).
\end{align*}
Furthermore, in this case the derivative $\frac{\partial f}{\partial \mathbf{v}}(\mathbf{x}_0)$ is linear in the vector $\mathbf{v}$.
Theorem 2.3.
Any bilinear mapping $B: \mathbb{R}^m \times \mathbb{R}^n \to \mathbb{R}^p$ is differentiable. For any point $(\mathbf{x}_0, \mathbf{y}_0)$ and increment $(\mathbf{u}, \mathbf{v})$, the differential is
\begin{align*}
dB(\mathbf{x}_0, \mathbf{y}_0)(\mathbf{u}, \mathbf{v}) = B(\mathbf{x}_0, \mathbf{v}) + B(\mathbf{u}, \mathbf{y}_0).
\end{align*}
Any multilinear mapping $L: \mathbb{R}^{m_1} \times \mathbb{R}^{m_2} \times \dots \times \mathbb{R}^{m_k} \to \mathbb{R}^p$ is differentiable, with differential
\begin{align*}
dL(\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_k)(\mathbf{u}_1, \mathbf{u}_2, \dots, \mathbf{u}_k) = \sum_{j=1}^k L(\mathbf{x}_1, \dots, \mathbf{u}_j, \dots, \mathbf{x}_k).
\end{align*}
The determinant $\det A$ of an $n \times n$ square matrix $A$ is a multilinear function of its column vectors $\mathbf{a}_1, \dots, \mathbf{a}_n$. Therefore the determinant is a differentiable function:
\begin{align*}
d\det(A)(B) = \sum_{j=1}^n \det(\mathbf{a}_1, \dots, \mathbf{b}_j, \dots, \mathbf{a}_n) = \sum_{j=1}^n \sum_{i=1}^n b^i_j A^{*i}_j = \operatorname{tr}(A^{*T} B),
\end{align*}
where $A^{*i}_j$ is the cofactor of the $(i, j)$ entry of $A$, so that $A^{*T}$ is the adjugate of $A$.
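The formula $d\det(A)(B) = \operatorname{tr}(A^{*T}B)$ can be checked numerically in the $2 \times 2$ case, where the adjugate of $\begin{pmatrix} a & b \\ c & d \end{pmatrix}$ is explicitly $\begin{pmatrix} d & -b \\ -c & a \end{pmatrix}$ (a finite-difference sketch with an arbitrary test matrix):

```python
def det2(M):
    # determinant of a 2x2 matrix [[a, b], [c, d]]
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def ddet2(A, B):
    # d det(A)(B) = tr(A^{*T} B), with A^{*T} the adjugate [[d, -b], [-c, a]]
    adj = [[A[1][1], -A[0][1]], [-A[1][0], A[0][0]]]
    return sum(adj[i][k] * B[k][i] for i in range(2) for k in range(2))

A = [[2.0, 1.0], [0.5, 3.0]]
B = [[0.3, -1.0], [2.0, 0.7]]
t = 1e-6
At = [[A[i][j] + t * B[i][j] for j in range(2)] for i in range(2)]
# finite-difference directional derivative of det at A along B
num = (det2(At) - det2(A)) / t
print(num, ddet2(A, B))
```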
Theorem 2.4. Let $A_0 \in \mathcal{L}(\mathbb{R}^n, \mathbb{R}^n)$ be an invertible matrix, and define the open set
\begin{align*}
U = \left\{ A \in \mathcal{L}(\mathbb{R}^n, \mathbb{R}^n) \;\middle|\; \|A - A_0\| < \frac{1}{\|A_0^{-1}\|} \right\}.
\end{align*}
Then the inversion mapping
\begin{align*}
f: U \to \mathcal{L}(\mathbb{R}^n, \mathbb{R}^n), \quad f(A) = A^{-1},
\end{align*}
is differentiable on $U$, with
\begin{align*}
df(A)(B) = -A^{-1} B A^{-1}.
\end{align*}
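A finite-difference sketch of Theorem 2.4 for an arbitrary $2 \times 2$ example: the difference quotient $\bigl(f(A + tB) - f(A)\bigr)/t$ should approach $-A^{-1} B A^{-1}$ entrywise as $t \to 0$:

```python
def inv2(M):
    # inverse of a 2x2 matrix via the adjugate formula
    a, b = M[0]
    c, d = M[1]
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = [[3.0, 1.0], [1.0, 2.0]]
B = [[0.5, -1.0], [2.0, 0.25]]
t = 1e-6
At = [[A[i][j] + t * B[i][j] for j in range(2)] for i in range(2)]
Ainv, Atinv = inv2(A), inv2(At)
# predicted derivative: df(A)(B) = -A^{-1} B A^{-1}
pred = [[-x for x in row] for row in matmul(matmul(Ainv, B), Ainv)]
num = [[(Atinv[i][j] - Ainv[i][j]) / t for j in range(2)] for i in range(2)]
err = max(abs(num[i][j] - pred[i][j]) for i in range(2) for j in range(2))
print(err)
```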
Theorem 2.5 (Chain Rule for Derivatives of Composite Functions). Let $f$ be differentiable at $\mathbf{x}_0$, and let $g$ be differentiable at $\mathbf{y}_0 = f(\mathbf{x}_0)$. Then the composite function $g \circ f$ is differentiable at $\mathbf{x}_0$, and the differential of the composite is the composite of the differentials:
\begin{align*}
d(g \circ f)(\mathbf{x}_0) = dg(\mathbf{y}_0) \circ df(\mathbf{x}_0) = dg(f(\mathbf{x}_0)) \circ df(\mathbf{x}_0).
\end{align*}
2.2 Coordinate Systems and Partial Derivatives
Partial Derivative
Let $(x^1, x^2, \dots, x^m)$ be the coordinates of a point in $E$. Suppose that at an interior point $\mathbf{x}_0 = (a^1, \dots, a^m)$ of $E$, the $m$-variable function $f: E \to \mathbb{R}$ has the limit
\begin{align*}
\lim_{t \to 0} \frac{f(a^1, \dots, a^k + t, \dots, a^m) - f(a^1, \dots, a^k, \dots, a^m)}{t}.
\end{align*}
Then the value of this limit is denoted by $\frac{\partial f}{\partial x^k}(\mathbf{x}_0)$, $f_{x^k}(\mathbf{x}_0)$ (written $f'_{x^k}(\mathbf{x}_0)$ in classical textbooks), $\partial_{x^k} f(\mathbf{x}_0)$, or $\partial_k f(\mathbf{x}_0)$, and is called the partial derivative of $f$ with respect to the coordinate $x^k$.
Theorem 2.6. Let the $m$-variable function $f$ be differentiable at $\mathbf{x}_0$. Then for any $\mathbf{v} = (\xi^1, \dots, \xi^m)^T \in \mathbb{R}^m$,
\begin{align*}
df(\mathbf{x}_0)(\mathbf{v}) &= \frac{\partial f}{\partial \mathbf{v}}(\mathbf{x}_0) = \xi^1 \frac{\partial f}{\partial x^1}(\mathbf{x}_0) + \dots + \xi^m \frac{\partial f}{\partial x^m}(\mathbf{x}_0) \\
&= \left( \frac{\partial f}{\partial x^1}(\mathbf{x}_0), \dots, \frac{\partial f}{\partial x^m}(\mathbf{x}_0) \right) \begin{pmatrix} \xi^1 \\ \vdots \\ \xi^m \end{pmatrix}.
\end{align*}
Therefore, the differential of the function $f$ is
\begin{align*}
df(\mathbf{x}_0) = \frac{\partial f}{\partial x^1}(\mathbf{x}_0)\, dx^1 + \dots + \frac{\partial f}{\partial x^m}(\mathbf{x}_0)\, dx^m,
\end{align*}
where
\begin{align*}
dx^i : \mathbb{R}^m \to \mathbb{R}, \quad dx^i(\mathbf{v}) = \xi^i,
\end{align*}
is the differential of the coordinate function $x^i : \mathbb{R}^m \to \mathbb{R}$, $(x^1, \dots, x^m) \mapsto x^i$; the functions $dx^i$ read off the coordinates of the vector $\mathbf{v}$.
Jacobian Matrix
If the $m$-variable mapping $F: E \to \mathbb{R}^n$,
\begin{align*}
F(x^1, \dots, x^m) = (f^1(x^1, \dots, x^m), \dots, f^n(x^1, \dots, x^m))^T,
\end{align*}
is differentiable at $\mathbf{x}_0$, then for any $\mathbf{v} = (\xi^1, \dots, \xi^m)^T \in \mathbb{R}^m$,
\begin{align*}
dF(\mathbf{x}_0)(\mathbf{v}) = \begin{pmatrix}
\frac{\partial f^1}{\partial x^1}(\mathbf{x}_0) & \cdots & \frac{\partial f^1}{\partial x^m}(\mathbf{x}_0) \\
\vdots & \ddots & \vdots \\
\frac{\partial f^n}{\partial x^1}(\mathbf{x}_0) & \cdots & \frac{\partial f^n}{\partial x^m}(\mathbf{x}_0)
\end{pmatrix}
\begin{pmatrix}
\xi^1 \\
\vdots \\
\xi^m
\end{pmatrix}.
\end{align*}
Therefore, the coordinate representation of the differential of the mapping $F$ is
\begin{align*}
\begin{pmatrix}
\frac{\partial f^1}{\partial x^1}(\mathbf{x}_0) & \cdots & \frac{\partial f^1}{\partial x^m}(\mathbf{x}_0) \\
\vdots & \ddots & \vdots \\
\frac{\partial f^n}{\partial x^1}(\mathbf{x}_0) & \cdots & \frac{\partial f^n}{\partial x^m}(\mathbf{x}_0)
\end{pmatrix}.
\end{align*}
Denote this matrix by $JF(\mathbf{x}_0) = \left. \frac{\partial(y^1, \dots, y^n)}{\partial(x^1, \dots, x^m)} \right|_{\mathbf{x}_0}$; it is called the Jacobian matrix of $F$ at $\mathbf{x}_0$.
The determinant $\det JF(\mathbf{x}_0)$ is called the Jacobian determinant of $F$ at $\mathbf{x}_0$.
Theorem 2.7. Let $G$ be differentiable at $\mathbf{x}_0$, and let $F$ be differentiable at $\mathbf{y}_0 = G(\mathbf{x}_0)$. Then
\begin{align*}
J(F \circ G)(\mathbf{x}_0) = JF(\mathbf{y}_0) \cdot JG(\mathbf{x}_0).
\end{align*}
If the inverse mapping $F^{-1}$ of a differentiable mapping $F$ is also differentiable, then
\begin{align*}
J(F^{-1})(\mathbf{y}_0) = (JF(\mathbf{x}_0))^{-1}.
\end{align*}
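Theorem 2.7 can be illustrated numerically with the polar-coordinate map $G(r, \theta) = (r\cos\theta, r\sin\theta)$ and an arbitrary outer map $F$ (both chosen here purely for illustration): the finite-difference Jacobian of $F \circ G$ should match the product $JF(\mathbf{y}_0) \cdot JG(\mathbf{x}_0)$:

```python
import math

def G(r, th):
    # polar-to-Cartesian map
    return (r * math.cos(th), r * math.sin(th))

def F(x, y):
    # arbitrary outer map for the composition
    return (x * x + y * y, x * y)

def JG(r, th):
    return [[math.cos(th), -r * math.sin(th)],
            [math.sin(th),  r * math.cos(th)]]

def JF(x, y):
    return [[2 * x, 2 * y],
            [y, x]]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

r0, th0 = 2.0, 0.6
chain = matmul(JF(*G(r0, th0)), JG(r0, th0))   # JF(y0) . JG(x0)

def comp(r, th):
    return F(*G(r, th))

h = 1e-6
base = comp(r0, th0)
num = [[(comp(r0 + h, th0)[i] - base[i]) / h,   # column: d/dr
        (comp(r0, th0 + h)[i] - base[i]) / h]   # column: d/dtheta
       for i in range(2)]
err = max(abs(num[i][j] - chain[i][j]) for i in range(2) for j in range(2))
print(err)
```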
Theorem 2.8. If all first-order partial derivatives $\frac{\partial f}{\partial x^1}(\mathbf{x}), \dots, \frac{\partial f}{\partial x^m}(\mathbf{x})$ are continuous, then $f$ is differentiable. (This condition is sufficient but not necessary.)
2.3 Gradient and Directional Derivative
Theorem 2.9. Let $\langle \cdot, \cdot \rangle$ be an inner product on $\mathbb{R}^m$. Then for any linear function $L: \mathbb{R}^m \to \mathbb{R}$, there exists a unique vector $\nabla L \in \mathbb{R}^m$ such that
\begin{align*}
L(\mathbf{v}) = \langle \mathbf{v}, \nabla L \rangle, \quad \forall \mathbf{v} \in \mathbb{R}^m.
\end{align*}
Consequently, $\nabla L$ is orthogonal to $\operatorname{Ker} L$, and $\|\nabla L\| = \|L\| = \max_{\|\mathbf{v}\|=1} L(\mathbf{v})$. A unit vector $\mathbf{v}$ satisfies $L(\mathbf{v}) = \|\nabla L\|$ if and only if $\mathbf{v}$ is the unit vector in the direction of $\nabla L$. This unique vector $\nabla L$ is called the gradient vector of the linear function $L$.
Gradient Vector
Let $E$ be a subset of an inner product space, and let the function $f : E \to \mathbb{R}$ be differentiable at $\mathbf{x}_0$. The gradient vector of the linear function $df(\mathbf{x}_0)$ is denoted $\operatorname{grad} f(\mathbf{x}_0)$ or $\nabla f(\mathbf{x}_0)$, and is called the gradient vector of $f$ at $\mathbf{x}_0$. Thus
\begin{align*}
df(\mathbf{x}_0)(\mathbf{v}) = \langle \mathbf{v}, \nabla f(\mathbf{x}_0) \rangle.
\end{align*}
Theorem 2.10. Let $f$ be differentiable at $\mathbf{x}_0$. Then for any vector $\mathbf{v} \in \mathbb{R}^m$,
\begin{align*}
\frac{\partial f}{\partial \mathbf{v}}(\mathbf{x}_0) = df(\mathbf{x}_0)(\mathbf{v}) = \langle \mathbf{v}, \nabla f(\mathbf{x}_0) \rangle.
\end{align*}
Consequently, for any unit vector $\mathbf{v}$,
\begin{align*}
\frac{\partial f}{\partial \mathbf{v}}(\mathbf{x}_0) \le \|\nabla f(\mathbf{x}_0)\|,
\end{align*}
with equality if and only if $\mathbf{v}$ points in the direction of the gradient of $f$ at $\mathbf{x}_0$. Therefore, the gradient direction of $f$ is the direction in which $f$ increases most rapidly.
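A numerical sketch of this steepest-ascent property, using an arbitrary example function: among sampled unit directions, the finite-difference directional derivative stays below $\|\nabla f(\mathbf{x}_0)\|$ and is largest in the gradient direction:

```python
import math

def f(x, y):
    # arbitrary smooth example function
    return x * x + math.sin(y)

x0, y0 = 1.0, 0.5
grad = (2 * x0, math.cos(y0))      # analytic gradient at (x0, y0)
gnorm = math.hypot(*grad)
unit_grad = (grad[0] / gnorm, grad[1] / gnorm)

h = 1e-6
best, best_dir = -float("inf"), None
for k in range(360):               # sample unit directions around the circle
    a = 2 * math.pi * k / 360
    v = (math.cos(a), math.sin(a))
    # finite-difference directional derivative along v
    dd = (f(x0 + h * v[0], y0 + h * v[1]) - f(x0, y0)) / h
    if dd > best:
        best, best_dir = dd, v
print(best, gnorm, best_dir, unit_grad)
```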
Theorem 2.11. In any coordinate system $(x^1, \dots, x^m)$, the differential of a differentiable function $f$ is represented by the row vector
\begin{align*}
df(\mathbf{x}) = \left( \frac{\partial f}{\partial x^1}(\mathbf{x}), \frac{\partial f}{\partial x^2}(\mathbf{x}), \dots, \frac{\partial f}{\partial x^m}(\mathbf{x}) \right).
\end{align*}
In a Cartesian coordinate system $(x^1, \dots, x^m)$, the gradient of a differentiable function $f$ is the column vector
\begin{align*}
\nabla f(\mathbf{x}) = \begin{pmatrix}
\frac{\partial f}{\partial x^1}(\mathbf{x}) \\
\frac{\partial f}{\partial x^2}(\mathbf{x}) \\
\vdots \\
\frac{\partial f}{\partial x^m}(\mathbf{x})
\end{pmatrix}.
\end{align*}
2.4 Higher-Order Partial Derivatives and Taylor Expansion
Higher-Order Partial Derivative
For $i_1, \dots, i_k \in \{1, 2, \dots, m\}$, denote
\begin{align*}
\frac{\partial^k f}{\partial x^{i_k} \cdots \partial x^{i_1}}(\mathbf{x}_0) = \frac{\partial}{\partial x^{i_k}} \left( \cdots \frac{\partial f}{\partial x^{i_1}} \right)(\mathbf{x}_0).
\end{align*}
This is called a $k$-th order partial derivative of $f$ at $\mathbf{x}_0$.
Higher-order partial derivatives are sometimes also denoted by symbols such as $\partial^k_{x^{i_k}, \dots, x^{i_1}} f$, $\partial^k_{i_k, \dots, i_2, i_1} f$, $f^{(k)}_{x^{i_1}, \dots, x^{i_k}}$, or even $f_{x^{i_1}, \dots, x^{i_k}}$.
If for any $i_1, \dots, i_k \in \{1, 2, \dots, m\}$ the partial derivative functions $\frac{\partial^k f}{\partial x^{i_k} \cdots \partial x^{i_1}}(\mathbf{x})$ are all continuous, that is, every $k$-th order partial derivative of $f$ is continuous, then $f$ is called a $\mathscr{C}^k$ function, written $f \in \mathscr{C}^k$.
If $f \in \mathscr{C}^k$ for every positive integer $k$, we write $f \in \mathscr{C}^\infty$.
For a multivariate mapping $F(x^1, \dots, x^m) = (f^1(x^1, \dots, x^m), \dots, f^n(x^1, \dots, x^m))^T$, we say $F$ is $\mathscr{C}^k$ (or $\mathscr{C}^\infty$) if each of its component functions $f^i$ is $\mathscr{C}^k$ (or $\mathscr{C}^\infty$).
Theorem 2.12.
(1) If the functions $f, g$ are both $\mathscr{C}^k$, then $f + g$ and $fg$ are also $\mathscr{C}^k$.
(2) If $f$ and $g$ in the composite mapping $g \circ f$ are both $\mathscr{C}^k$, then $g \circ f$ is also $\mathscr{C}^k$.
(3) If the functions $f, g$ are both $\mathscr{C}^k$, and $g(x^1, \dots, x^m) \neq 0$, then $f/g$ is also $\mathscr{C}^k$.
Theorem 2.13. For any $\mathscr{C}^k$ function $f$, any $i_1, \dots, i_k \in \{1, 2, \dots, m\}$, and any permutation $\pi$ of $1, 2, \dots, k$,
\begin{align*}
\frac{\partial^k f}{\partial x^{i_{\pi(k)}} \cdots \partial x^{i_{\pi(1)}}}(\mathbf{x}) = \frac{\partial^k f}{\partial x^{i_k} \cdots \partial x^{i_1}}(\mathbf{x}).
\end{align*}
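Theorem 2.13 for $k = 2$ (equality of mixed partials) can be illustrated by nesting finite differences in the two possible orders, using an arbitrary smooth example function:

```python
import math

def f(x, y):
    # arbitrary C^infinity function of two variables
    return math.exp(x * y) + x * math.sin(y)

def d_dx(g, x, y, h=1e-5):
    # central difference in x
    return (g(x + h, y) - g(x - h, y)) / (2 * h)

def d_dy(g, x, y, h=1e-5):
    # central difference in y
    return (g(x, y + h) - g(x, y - h)) / (2 * h)

x0, y0 = 0.7, -0.3
fxy = d_dy(lambda x, y: d_dx(f, x, y), x0, y0)   # x first, then y
fyx = d_dx(lambda x, y: d_dy(f, x, y), x0, y0)   # y first, then x
# analytic value: d/dy d/dx (e^{xy} + x sin y) = e^{xy}(1 + xy) + cos y
exact = math.exp(x0 * y0) * (1 + x0 * y0) + math.cos(y0)
print(fxy, fyx, exact)
```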
Higher-Order Differentials
The first-order differential $\mathrm{d}f(\mathbf{x})$ of a function $f$ at $\mathbf{x}$ is a linear function of a vector $\mathbf{v}$:
\begin{align*}
\mathrm{d}f(\mathbf{x})(\mathbf{v}) = \frac{\partial f}{\partial \mathbf{v}}(\mathbf{x}) = \sum_i \frac{\partial f}{\partial x^i}(\mathbf{x})\, v^i.
\end{align*}
Differentiating it again with respect to $\mathbf{x}$ yields the second-order differential of $f$, which is a bilinear function of a pair of vectors $\mathbf{v}, \mathbf{w}$:
\begin{align*}
\mathrm{d}^2 f(\mathbf{x})(\mathbf{v}, \mathbf{w}) = \frac{\partial}{\partial \mathbf{w}} \frac{\partial f}{\partial \mathbf{v}}(\mathbf{x}) = \sum_i \left( \sum_j \frac{\partial^2 f}{\partial x^j \partial x^i}(\mathbf{x})\, w^j \right) v^i = \mathbf{w}^T H_f(\mathbf{x})\, \mathbf{v},
\end{align*}
Hessian Matrix
where the coefficient matrix
\begin{align*}
H_f(\mathbf{x}) = \begin{pmatrix}
\frac{\partial^2 f}{\partial (x^1)^2}(\mathbf{x}) & \frac{\partial^2 f}{\partial x^1 \partial x^2}(\mathbf{x}) & \cdots & \frac{\partial^2 f}{\partial x^1 \partial x^m}(\mathbf{x}) \\
\frac{\partial^2 f}{\partial x^2 \partial x^1}(\mathbf{x}) & \frac{\partial^2 f}{\partial (x^2)^2}(\mathbf{x}) & \cdots & \frac{\partial^2 f}{\partial x^2 \partial x^m}(\mathbf{x}) \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x^m \partial x^1}(\mathbf{x}) & \frac{\partial^2 f}{\partial x^m \partial x^2}(\mathbf{x}) & \cdots & \frac{\partial^2 f}{\partial (x^m)^2}(\mathbf{x})
\end{pmatrix},
\end{align*}
is called the Hessian matrix of the function $f$ at $\mathbf{x}$.
In general, the $k$-th order differential of the function $f$ is a $k$-multilinear function of $k$ vectors $\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_k$:
\begin{align*}
\mathrm{d}^k f(\mathbf{x})(\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_k) &= \sum_{i_1, i_2, \dots, i_k} \frac{\partial^k f}{\partial x^{i_k} \cdots \partial x^{i_2} \partial x^{i_1}}(\mathbf{x})\, v_1^{i_1} v_2^{i_2} \cdots v_k^{i_k} \\
&= \frac{\partial^k f}{\partial \mathbf{v}_k \cdots \partial \mathbf{v}_1}(\mathbf{x}).
\end{align*}
When $f \in \mathscr{C}^k$, $\mathrm{d}^k f(\mathbf{x})$ is a symmetric $k$-multilinear function.
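A sketch checking the identity $\mathrm{d}^2 f(\mathbf{x})(\mathbf{v}, \mathbf{w}) = \mathbf{w}^T H_f(\mathbf{x})\,\mathbf{v}$ on an arbitrary polynomial example, comparing the analytic Hessian against a nested directional finite difference $\frac{\partial}{\partial \mathbf{w}} \frac{\partial f}{\partial \mathbf{v}}$:

```python
def f(x, y):
    # arbitrary polynomial example
    return x ** 3 + x * y * y

def hess(x, y):
    # analytic Hessian of f: [[6x, 2y], [2y, 2x]]
    return [[6 * x, 2 * y], [2 * y, 2 * x]]

x0, y0 = 1.2, -0.4
v = (0.3, 0.8)
w = (-0.5, 0.6)

H = hess(x0, y0)
bilinear = sum(w[i] * H[i][j] * v[j] for i in range(2) for j in range(2))

h = 1e-4
def dv(x, y):
    # directional derivative along v (central difference)
    return (f(x + h * v[0], y + h * v[1]) - f(x - h * v[0], y - h * v[1])) / (2 * h)

# then differentiate dv along w
num = (dv(x0 + h * w[0], y0 + h * w[1]) - dv(x0 - h * w[0], y0 - h * w[1])) / (2 * h)
print(num, bilinear)
```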
Taylor Expansion
For a $\mathscr{C}^k$ function $f$ and a vector $\mathbf{v} = (\xi^1, \dots, \xi^m)^T$, consider $g(t) = f(\mathbf{x}_0 + t\mathbf{v})$. Then $g$ is $\mathscr{C}^k$, and by the chain rule
\begin{align*}
g'(t) = \sum_{i=1}^m \xi^i \frac{\partial f}{\partial x^i}(\mathbf{x}_0 + t\mathbf{v}).
\end{align*}
In general,
\begin{align*}
g^{(k)}(t) = \sum_{1 \le i_1, \dots, i_k \le m} \xi^{i_1} \cdots \xi^{i_k} \frac{\partial^k f}{\partial x^{i_1} \cdots \partial x^{i_k}}(\mathbf{x}_0 + t\mathbf{v}) = \left( \sum_{i=1}^m \xi^i \frac{\partial}{\partial x^i} \right)^k f(\mathbf{x}_0 + t\mathbf{v}),
\end{align*}
where
\begin{align*}
\left( \sum_{i=1}^m \xi^i \frac{\partial}{\partial x^i} \right)^k = \sum_{i_1, \dots, i_k \in \{1, 2, \dots, m\}} \xi^{i_1} \cdots \xi^{i_k} \frac{\partial^k}{\partial x^{i_1} \cdots \partial x^{i_k}}.
\end{align*}
Thus, by the one-variable Taylor formula with integral remainder,
\begin{align*}
f(\mathbf{x}_0 + \mathbf{v}) = g(1) &= \sum_{j=0}^{k-1} \frac{g^{(j)}(0)}{j!} + \int_0^1 \frac{g^{(k)}(s)}{(k-1)!}(1 - s)^{k-1}\,\mathrm{d}s \\
&= \sum_{j=0}^{k-1} \frac{1}{j!}\left( \sum_{i=1}^m \xi^i \frac{\partial}{\partial x^i} \right)^j f(\mathbf{x}_0) + \int_0^1 \frac{(1 - s)^{k-1}}{(k-1)!}\left( \sum_{i=1}^m \xi^i \frac{\partial}{\partial x^i} \right)^k f(\mathbf{x}_0 + s\mathbf{v})\,\mathrm{d}s.
\end{align*}
With the Lagrange form of the remainder,
\begin{align*}
f(\mathbf{x}_0 + \mathbf{v}) &= \sum_{j=0}^{k-1} \frac{1}{j!} \sum_{1 \le i_1, \dots, i_j \le m} \frac{\partial^j f}{\partial x^{i_1} \cdots \partial x^{i_j}}(\mathbf{x}_0)\,\xi^{i_1} \cdots \xi^{i_j} \\
&\quad + \frac{1}{k!} \sum_{1 \le i_1, \dots, i_k \le m} \frac{\partial^k f}{\partial x^{i_1} \cdots \partial x^{i_k}}(\mathbf{x}_0 + \theta \mathbf{v})\,\xi^{i_1} \cdots \xi^{i_k}, \quad 0 < \theta < 1,
\end{align*}
and with the Peano form,
\begin{align*}
f(\mathbf{x}_0 + \mathbf{v}) &= \sum_{j=0}^{k} \frac{1}{j!} \sum_{1 \le i_1, \dots, i_j \le m} \frac{\partial^j f}{\partial x^{i_1} \cdots \partial x^{i_j}}(\mathbf{x}_0)\,\xi^{i_1} \cdots \xi^{i_j} + o(\|\mathbf{v}\|^k) \\
&= \sum_{j=0}^{k} \sum_{\alpha_1 + \dots + \alpha_m = j} \frac{\partial^j f}{\partial (x^1)^{\alpha_1} \cdots \partial (x^m)^{\alpha_m}}(\mathbf{x}_0)\, \frac{(\xi^1)^{\alpha_1} \cdots (\xi^m)^{\alpha_m}}{\alpha_1! \cdots \alpha_m!} + o(\|\mathbf{v}\|^k).
\end{align*}
Theorem 2.14. Let $f$ be $k$ times continuously differentiable in a neighborhood of $\mathbf{x}_0$. Then a polynomial $P$ of degree at most $k$ satisfies
\begin{align*}
f(\mathbf{x}_0 + \mathbf{v}) = P(\mathbf{v}) + o(\|\mathbf{v}\|^k), \quad \mathbf{v} \to \mathbf{0},
\end{align*}
if and only if $P$ is the $k$-th degree Taylor polynomial of $f$ at $\mathbf{x}_0$, that is,
\begin{align*}
T_k f(\mathbf{x}_0)(\mathbf{v}) = \sum_{j=0}^{k} \sum_{\alpha_1 + \dots + \alpha_m = j} \frac{\partial^j f}{\partial (x^1)^{\alpha_1} \cdots \partial (x^m)^{\alpha_m}}(\mathbf{x}_0)\, \frac{(\xi^1)^{\alpha_1} \cdots (\xi^m)^{\alpha_m}}{\alpha_1! \cdots \alpha_m!},
\end{align*}
where $\mathbf{v} = (\xi^1, \dots, \xi^m)^T$.
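A numerical sketch of the Peano-remainder statement for $k = 2$: for $f(x, y) = e^x \cos y$ at the origin (an example chosen for its simple Taylor coefficients $f = 1$, $f_x = 1$, $f_y = 0$, $f_{xx} = 1$, $f_{xy} = 0$, $f_{yy} = -1$), the error of the degree-2 Taylor polynomial divided by $\|\mathbf{v}\|^2$ should tend to $0$:

```python
import math

def f(x, y):
    return math.exp(x) * math.cos(y)

def T2(x, y):
    # degree-2 Taylor polynomial of f at (0, 0):
    # 1 + x + x^2/2 - y^2/2
    return 1 + x + x * x / 2 - y * y / 2

ratios = []
for t in (1e-1, 1e-2, 1e-3):
    v = (0.6 * t, 0.8 * t)         # ||v|| = t
    err = abs(f(*v) - T2(*v))
    ratios.append(err / t ** 2)    # should tend to 0: err is o(||v||^2)
print(ratios)
```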
2.5 Extrema and Convexity of Multivariate Functions
Definition. $\mathbf{x}_0$ is called a local minimum (local maximum) point of $f: E \to \mathbb{R}$ if there exists a neighborhood $U$ of $\mathbf{x}_0$ such that for any $\mathbf{x} \in E \cap U$, $f(\mathbf{x}) \ge f(\mathbf{x}_0)$ (respectively $f(\mathbf{x}) \le f(\mathbf{x}_0)$).
$\mathbf{x}_0$ is called a critical point of the function $f$ if $\frac{\partial f}{\partial \mathbf{v}}(\mathbf{x}_0) = 0$ for every $\mathbf{v} \in \mathbb{R}^m$.
$\mathbf{x}_0$ is called a non-degenerate critical point of a $\mathscr{C}^2$ function $f$ if $\mathbf{x}_0$ is a critical point of $f$ and the Hessian matrix $H_f(\mathbf{x}_0)$ is invertible.
Theorem 2.15.
If $f$ is differentiable at an extremum point $\mathbf{x}_0$, then $\mathrm{d}f(\mathbf{x}_0) = 0$, i.e., $\mathbf{x}_0$ is a critical point of $f$.
If $f$ is twice continuously differentiable at a local minimum point $\mathbf{x}_0$, then $H_f(\mathbf{x}_0)$ is positive semi-definite.
If $f$ is twice continuously differentiable at a local maximum point $\mathbf{x}_0$, then $H_f(\mathbf{x}_0)$ is negative semi-definite.
If $H_f(\mathbf{x}_0)$ at a critical point $\mathbf{x}_0$ has both positive and negative eigenvalues, then $\mathbf{x}_0$ is not an extremum point. If $H_f(\mathbf{x}_0)$ is invertible and is neither positive definite nor negative definite, then $\mathbf{x}_0$ is called a saddle point of $f$. A saddle point is not an extremum point.
Theorem 2.16. Let $f$ be twice continuously differentiable at a critical point $\mathbf{x}_0$. Then:
(1) If $H_f(\mathbf{x}_0)$ is positive definite, then $\mathbf{x}_0$ is a local minimum point of $f$;
(2) If $H_f(\mathbf{x}_0)$ is negative definite, then $\mathbf{x}_0$ is a local maximum point of $f$.
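Theorems 2.15 and 2.16 in action on the example $f(x, y) = x^3 - 3x + y^2$ (chosen for illustration): its critical points are $(\pm 1, 0)$, and the Hessian $\operatorname{diag}(6x, 2)$ classifies them via Sylvester's criterion for $2 \times 2$ symmetric matrices:

```python
def grad(x, y):
    # gradient of f(x, y) = x^3 - 3x + y^2
    return (3 * x * x - 3, 2 * y)

def hess(x, y):
    return [[6 * x, 0.0], [0.0, 2.0]]

def classify(H):
    # Sylvester's criterion for a symmetric 2x2 matrix
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    if det > 0 and H[0][0] > 0:
        return "local minimum"
    if det > 0 and H[0][0] < 0:
        return "local maximum"
    if det < 0:
        return "saddle point"
    return "degenerate"

results = {}
for p in [(1.0, 0.0), (-1.0, 0.0)]:
    assert grad(*p) == (0.0, 0.0)   # both points are critical
    results[p] = classify(hess(*p))
print(results)
```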
Convex Function
A set $C \subseteq \mathbb{R}^m$ is called a convex set if for any $\mathbf{x}, \mathbf{y} \in C$ and any $0 \le t \le 1$, $(1 - t)\mathbf{x} + t\mathbf{y} \in C$.
Let $C \subseteq \mathbb{R}^m$ be a convex set. A function $f: C \to \mathbb{R}$ is called a convex function if for any $\mathbf{x}, \mathbf{y} \in C$ and any $0 \le t \le 1$,
\begin{align*}
f((1 - t)\mathbf{x} + t\mathbf{y}) \le (1 - t)f(\mathbf{x}) + t f(\mathbf{y}).
\end{align*}
A function $f: C \to \mathbb{R}$ is called a concave function if $-f$ is a convex function.
Theorem 2.17 (Second-order Taylor expansion). Let $f$ be twice continuously differentiable at $\mathbf{x}_0$. Then for $\mathbf{v} = (\xi^1, \dots, \xi^m)^T$,
\begin{align*}
f(\mathbf{x}_0 + \mathbf{v}) = f(\mathbf{x}_0) + \mathrm{d}f(\mathbf{x}_0)(\mathbf{v}) + \frac{1}{2}\mathbf{v}^T H_f(\mathbf{x}_0)\mathbf{v} + o(\|\mathbf{v}\|^2), \quad \|\mathbf{v}\| \to 0.
\end{align*}
Moreover, as long as $f$ is defined and twice continuously differentiable on the line segment connecting $\mathbf{x}_0$ and $\mathbf{x}_0 + \mathbf{v}$, there exists $0 < \theta < 1$ such that
\begin{align*}
f(\mathbf{x}_0 + \mathbf{v}) = f(\mathbf{x}_0) + \mathrm{d}f(\mathbf{x}_0)(\mathbf{v}) + \frac{1}{2}\mathbf{v}^T H_f(\mathbf{x}_0 + \theta\mathbf{v})\mathbf{v}.
\end{align*}
Theorem 2.18. If $\mathbf{x}_0$ is a local minimum point of a strictly convex function $f$, then $\mathbf{x}_0$ is the unique global minimum point of $f$. Likewise, if $\mathbf{x}_0$ is a local maximum point of a strictly concave function $f$, then $\mathbf{x}_0$ is the unique global maximum point of $f$.
Theorem 2.19. A $\mathscr{C}^2$ function $f$ is convex (concave) if and only if the Hessian matrix of $f$ is positive semi-definite (negative semi-definite) at every point.
If the Hessian matrix of $f$ is positive definite (negative definite) at every point, then $f$ is strictly convex (strictly concave).