
2 Differential of Multivariable Functions

2.1 Derivatives, Differentials, and Directional Derivatives

Definition. We call $\mathbf{x}_0$ an interior point of a set $E \subseteq \mathbb{R}^m$ (we also say $E$ is a neighborhood of $\mathbf{x}_0$ in $\mathbb{R}^m$) if there exists a positive number $\delta_{\mathbf{x}_0} > 0$ such that

\begin{align*} B(\mathbf{x}_0, \delta_{\mathbf{x}_0}) \subseteq E. \end{align*}

In other words, $\mathbf{x}_0$ and all points in its immediate vicinity are contained within $E$. The set formed by all interior points of $E$ is called the interior of $E$, denoted $\operatorname{int} E$.

We call $E$ an open set if $E = \operatorname{int} E$. This means that every point of $E$ is an interior point, i.e., $E$ is a neighborhood of each of its members.

Differential

Let $\mathbf{x}_0$ be an interior point of a set $E \subseteq \mathbb{R}^m$. We say that $f: E \to \mathbb{R}^n$ is differentiable at $\mathbf{x}_0$ if there exists a linear mapping $A: \mathbb{R}^m \to \mathbb{R}^n$ such that

\begin{align*} f(\mathbf{x}_0 + \mathbf{v}) = f(\mathbf{x}_0) + A\mathbf{v} + o(\|\mathbf{v}\|), \quad \mathbf{v} \to \mathbf{0}. \end{align*}

Equivalently,

\begin{align*} \|f(\mathbf{x}_0 + \mathbf{v}) - f(\mathbf{x}_0) - A\mathbf{v}\| = o(\|\mathbf{v}\|), \quad \mathbf{v} \to \mathbf{0}. \end{align*}

In this case, we call the linear mapping $A$ the differential of $f$ at $\mathbf{x}_0$, denoted

\begin{align*} df(\mathbf{x}_0): \mathbb{R}^m \to \mathbb{R}^n, \quad df(\mathbf{x}_0)(\mathbf{v}) = A\mathbf{v}. \end{align*}

Theorem 2.1. If $f$ is differentiable at $\mathbf{x}_0$, then $f$ is continuous at $\mathbf{x}_0$.
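The defining condition — the remainder shrinking faster than $\|\mathbf{v}\|$ — can be checked numerically. A minimal sketch, using an illustrative map not taken from the text, $f(x, y) = (x^2, xy)$ at $(1, 2)$, with candidate linear map $A$ given by its partial derivatives there:

```python
import math

# Illustrative example (assumed, not from the text): f(x, y) = (x^2, x*y),
# a map R^2 -> R^2, tested at x0 = (1, 2).
def f(x, y):
    return (x * x, x * y)

def A(v1, v2):
    # Candidate differential at (1, 2): matrix ((2, 0), (2, 1)) applied to v.
    return (2.0 * v1, 2.0 * v1 + 1.0 * v2)

x0, y0 = 1.0, 2.0
fx = f(x0, y0)

# The ratio ||f(x0 + v) - f(x0) - A v|| / ||v|| should tend to 0 as v -> 0.
ratios = []
for k in range(1, 6):
    h = 10.0 ** (-k)
    fv = f(x0 + h, y0 + h)
    av = A(h, h)
    remainder = math.hypot(fv[0] - fx[0] - av[0], fv[1] - fx[1] - av[1])
    ratios.append(remainder / math.hypot(h, h))

print(ratios)  # shrinks roughly in proportion to ||v||
```

For this quadratic example the remainder is exactly $(h^2, h^2)$, so the ratio decays linearly in $h$, consistent with the $o(\|\mathbf{v}\|)$ requirement.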

Directional Derivative

Let $f: E \to \mathbb{R}^n$, $\mathbf{x}_0 \in E$, and let a vector $\mathbf{v} \in \mathbb{R}^m$ satisfy the following: there exists $\delta > 0$ such that for any $0 \le t < \delta$, we have $\mathbf{x}_0 + t\mathbf{v} \in E$.

If the limit

\begin{align*} \left. \frac{d}{dt} f(\mathbf{x}_0 + t\mathbf{v}) \right|_{t=0} = \lim_{t \to 0} \frac{f(\mathbf{x}_0 + t\mathbf{v}) - f(\mathbf{x}_0)}{t} \end{align*}

exists, we denote it by $\frac{\partial f}{\partial \mathbf{v}}(\mathbf{x}_0)$ or $\partial_{\mathbf{v}}f(\mathbf{x}_0)$, and call it the derivative of $f$ at $\mathbf{x}_0$ along the vector $\mathbf{v}$.

In particular, when $\mathbf{v} \in \mathbb{R}^m$ is a unit vector (i.e., $\|\mathbf{v}\| = 1$), we call it the directional derivative of $f$ at $\mathbf{x}_0$ along the direction $\mathbf{v}$. In classical notation, with $\Delta s$ denoting displacement along the direction, this is written

\begin{align*} \frac{df}{ds} = \lim_{\Delta s \to 0} \frac{\Delta f}{\Delta s}. \end{align*}
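The limit can be approximated directly by a difference quotient. A minimal sketch with an illustrative function not from the text, $f(x, y) = x^2 y$ at $(1, 1)$ along the unit vector $(0.6, 0.8)$:

```python
# Illustrative check (example function assumed, not from the text):
# f(x, y) = x^2 * y at x0 = (1, 1), along the unit vector v = (0.6, 0.8).
def f(x, y):
    return x * x * y

x0, y0 = 1.0, 1.0
v = (0.6, 0.8)

# Exact directional derivative: <grad f, v> = (2xy, x^2) . v = 2*0.6 + 1*0.8 = 2.0
exact = 2.0

t = 1e-6
quotient = (f(x0 + t * v[0], y0 + t * v[1]) - f(x0, y0)) / t
print(quotient)  # close to 2.0
```

The difference quotient at $t = 10^{-6}$ agrees with the exact value to several digits, as the definition predicts for a differentiable $f$.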

Theorem 2.2. If $f$ is differentiable at $\mathbf{x}_0$, then $f$ has a derivative at $\mathbf{x}_0$ along every vector $\mathbf{v}$, and

\begin{align*} df(\mathbf{x}_0)(\mathbf{v}) = \frac{\partial f}{\partial \mathbf{v}}(\mathbf{x}_0). \end{align*}

Furthermore, in this case the derivative $\frac{\partial f}{\partial \mathbf{v}}(\mathbf{x}_0)$ is linear in the vector $\mathbf{v}$.

Theorem 2.3.

  1. Any bilinear mapping $B: \mathbb{R}^m \times \mathbb{R}^n \to \mathbb{R}^p$ is differentiable. For any point $(\mathbf{x}_0, \mathbf{y}_0)$ and increment $(\mathbf{u}, \mathbf{v})$, the differential is
\begin{align*} dB(\mathbf{x}_0, \mathbf{y}_0)(\mathbf{u}, \mathbf{v}) = B(\mathbf{x}_0, \mathbf{v}) + B(\mathbf{u}, \mathbf{y}_0). \end{align*}
  2. Any multilinear mapping $L: \mathbb{R}^{m_1} \times \mathbb{R}^{m_2} \times \dots \times \mathbb{R}^{m_k} \to \mathbb{R}^p$ is differentiable, with the differential given by
\begin{align*} dL(\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_k)(\mathbf{u}_1, \mathbf{u}_2, \dots, \mathbf{u}_k) = \sum_{j=1}^k L(\mathbf{x}_1, \dots, \mathbf{u}_j, \dots, \mathbf{x}_k). \end{align*}
  3. The determinant $\det A$ of an $n \times n$ square matrix $A$ is a multilinear function of its column vectors $\mathbf{a}_1, \dots, \mathbf{a}_n$. Therefore the determinant is a differentiable function, and, writing $A^*$ for the cofactor matrix of $A$,
\begin{align*} d\det(A)(B) = \sum_{j=1}^n \det(\mathbf{a}_1, \dots, \mathbf{b}_j, \dots, \mathbf{a}_n) = \sum_{j=1}^n \sum_{i=1}^n b_j^i A^{*i}_j = \operatorname{tr}(A^{*T} B). \end{align*}
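The trace formula in item 3 can be verified numerically: $A^{*T}$ is the adjugate, so $\frac{d}{dt}\det(A + tB)\big|_{t=0} = \operatorname{tr}(\operatorname{adj}(A)\,B)$. A minimal sketch for $2 \times 2$ matrices, with arbitrary illustrative entries:

```python
# Numerical check of d det(A)(B) = tr(adj(A) B) for 2x2 matrices;
# the matrices below are arbitrary illustrative choices.
def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def adjugate2(M):
    # adj(M) = transpose of the cofactor matrix; for 2x2: ((d, -b), (-c, a))
    return ((M[1][1], -M[0][1]), (-M[1][0], M[0][0]))

def trace_prod(P, Q):
    # tr(P Q) for 2x2 matrices
    return sum(P[i][k] * Q[k][i] for i in range(2) for k in range(2))

A = ((2.0, 1.0), (0.5, 3.0))
B = ((1.0, -1.0), (2.0, 0.5))

t = 1e-6
A_tB = tuple(tuple(A[i][j] + t * B[i][j] for j in range(2)) for i in range(2))
numeric = (det2(A_tB) - det2(A)) / t      # difference quotient of det along B
exact = trace_prod(adjugate2(A), B)       # the claimed differential
print(numeric, exact)
```

For these matrices the exact value is $2.5$, and the difference quotient matches it to within $O(t)$.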

Theorem 2.4. Let $A_0 \in \mathcal{L}(\mathbb{R}^n, \mathbb{R}^n)$ be an invertible matrix, and define the open set

\begin{align*} U = \left\{ A \in \mathcal{L}(\mathbb{R}^n, \mathbb{R}^n) \mid \|A - A_0\| < \frac{1}{\|A_0^{-1}\|} \right\}. \end{align*}

Then the inversion mapping

\begin{align*} f: U \to \mathcal{L}(\mathbb{R}^n, \mathbb{R}^n), \quad f(A) = A^{-1}, \end{align*}

is differentiable on $U$, and

\begin{align*} df(A)(B) = -A^{-1} B A^{-1}. \end{align*}
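This differential, too, can be checked against a difference quotient. A minimal sketch with an arbitrary invertible $2 \times 2$ matrix (the entries are illustrative choices, not from the text):

```python
# Numerical check of d(A^{-1})(B) = -A^{-1} B A^{-1} for a 2x2 example.
def inv2(M):
    d = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return ((M[1][1] / d, -M[0][1] / d), (-M[1][0] / d, M[0][0] / d))

def matmul2(P, Q):
    return tuple(tuple(sum(P[i][k] * Q[k][j] for k in range(2))
                       for j in range(2)) for i in range(2))

A = ((2.0, 1.0), (0.0, 3.0))
B = ((1.0, 0.5), (-1.0, 2.0))

t = 1e-6
A_tB = tuple(tuple(A[i][j] + t * B[i][j] for j in range(2)) for i in range(2))

Ainv = inv2(A)
numeric = inv2(A_tB)                      # (A + tB)^{-1}
exact = matmul2(matmul2(Ainv, B), Ainv)   # A^{-1} B A^{-1}

# Entry-wise, ((A + tB)^{-1} - A^{-1}) / t should approximate -A^{-1} B A^{-1}.
errors = [abs((numeric[i][j] - Ainv[i][j]) / t + exact[i][j])
          for i in range(2) for j in range(2)]
print(max(errors))  # small
```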

Theorem 2.5 (Chain Rule for Derivatives of Composite Functions). Let $f$ be differentiable at $\mathbf{x}_0$, and let $g$ be differentiable at $\mathbf{y}_0 = f(\mathbf{x}_0)$. Then the composite function $g \circ f$ is differentiable at $\mathbf{x}_0$, and the differential of the composite is equal to the composite of the differentials:

\begin{align*} d(g \circ f)(\mathbf{x}_0) = dg(\mathbf{y}_0) \circ df(\mathbf{x}_0) = dg(f(\mathbf{x}_0)) \circ df(\mathbf{x}_0). \end{align*}

2.2 Coordinate Systems and Partial Derivatives

Partial Derivative

Let $(x^1, x^2, \dots, x^m)$ be the coordinates of a point in $E$. Suppose that at an interior point $\mathbf{x}_0 = (a^1, \dots, a^m)$ of $E$, the $m$-variable function $f: E \to \mathbb{R}$ has the limit

\begin{align*} \lim_{t \to 0} \frac{f(a^1, \dots, a^k + t, \dots, a^m) - f(a^1, \dots, a^k, \dots, a^m)}{t}. \end{align*}

Then the value of this limit is denoted by $\frac{\partial f}{\partial x^k}(\mathbf{x}_0)$, $f_{x^k}(\mathbf{x}_0)$ (written $f'_{x^k}(\mathbf{x}_0)$ in classical textbooks), $\partial_{x^k} f(\mathbf{x}_0)$, or $\partial_k f(\mathbf{x}_0)$, and is called the partial derivative of $f$ with respect to the coordinate $x^k$.

Theorem 2.6. Let the $m$-variable function $f$ be differentiable at $\mathbf{x}_0$. Then for any $\mathbf{v} = (\xi^1, \dots, \xi^m)^T \in \mathbb{R}^m$,

\begin{align*} df(\mathbf{x}_0)(\mathbf{v}) &= \frac{\partial f}{\partial \mathbf{v}}(\mathbf{x}_0) = \xi^1 \frac{\partial f}{\partial x^1}(\mathbf{x}_0) + \dots + \xi^m \frac{\partial f}{\partial x^m}(\mathbf{x}_0) \\ &= \left( \frac{\partial f}{\partial x^1}(\mathbf{x}_0), \dots, \frac{\partial f}{\partial x^m}(\mathbf{x}_0) \right) \begin{pmatrix} \xi^1 \\ \vdots \\ \xi^m \end{pmatrix}. \end{align*}

Therefore, the differential of the function $f$ is

\begin{align*} df(\mathbf{x}_0) = \frac{\partial f}{\partial x^1}(\mathbf{x}_0)\, dx^1 + \dots + \frac{\partial f}{\partial x^m}(\mathbf{x}_0)\, dx^m, \end{align*}

where

\begin{align*} dx^i : \mathbb{R}^m \to \mathbb{R}, \quad dx^i(\mathbf{v}) = \xi^i, \end{align*}

is the differential of the coordinate function $x^i : \mathbb{R}^m \to \mathbb{R}$, $(x^1, \dots, x^m) \mapsto x^i$; the $dx^i$ are the coordinate functions of the vector.

Jacobian Matrix

If the $m$-variable mapping $F: E \to \mathbb{R}^n$,

\begin{align*} F(x^1, \dots, x^m) = (f^1(x^1, \dots, x^m), \dots, f^n(x^1, \dots, x^m))^T, \end{align*}

is differentiable at $\mathbf{x}_0$, then for any $\mathbf{v} = (\xi^1, \dots, \xi^m)^T \in \mathbb{R}^m$,

\begin{align*} dF(\mathbf{x}_0)(\mathbf{v}) = \begin{pmatrix} \frac{\partial f^1}{\partial x^1}(\mathbf{x}_0) & \cdots & \frac{\partial f^1}{\partial x^m}(\mathbf{x}_0) \\ \vdots & \ddots & \vdots \\ \frac{\partial f^n}{\partial x^1}(\mathbf{x}_0) & \cdots & \frac{\partial f^n}{\partial x^m}(\mathbf{x}_0) \end{pmatrix} \begin{pmatrix} \xi^1 \\ \vdots \\ \xi^m \end{pmatrix}. \end{align*}

Therefore, the coordinate representation of the differential of the mapping $F$ is the matrix

\begin{align*} \begin{pmatrix} \frac{\partial f^1}{\partial x^1}(\mathbf{x}_0) & \cdots & \frac{\partial f^1}{\partial x^m}(\mathbf{x}_0) \\ \vdots & \ddots & \vdots \\ \frac{\partial f^n}{\partial x^1}(\mathbf{x}_0) & \cdots & \frac{\partial f^n}{\partial x^m}(\mathbf{x}_0) \end{pmatrix}. \end{align*}

Denote this matrix by $JF(\mathbf{x}_0) = \left. \frac{\partial(y^1, \dots, y^n)}{\partial(x^1, \dots, x^m)} \right|_{\mathbf{x}_0}$; it is called the Jacobian matrix of $F$ at $\mathbf{x}_0$. Its determinant, $\det JF(\mathbf{x}_0)$, is called the Jacobian determinant of $F$ at $\mathbf{x}_0$.

Theorem 2.7. Let $G$ be differentiable at $\mathbf{x}_0$, and let $F$ be differentiable at $\mathbf{y}_0 = G(\mathbf{x}_0)$. Then

\begin{align*} J(F \circ G)(\mathbf{x}_0) = JF(\mathbf{y}_0) \cdot JG(\mathbf{x}_0). \end{align*}

If the inverse mapping $F^{-1}$ of a differentiable mapping $F$ is also differentiable, then

\begin{align*} J(F^{-1})(\mathbf{y}_0) = (JF(\mathbf{x}_0))^{-1}. \end{align*}
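The matrix form of the chain rule can be tested with finite differences. A minimal sketch, with illustrative maps (not from the text) $G(x, y) = (x^2, x + y)$ and $F(u, v) = (uv, u + v^2)$ at $\mathbf{x}_0 = (1, 2)$:

```python
# Numerical check of J(F o G)(x0) = JF(G(x0)) . JG(x0); G and F are
# illustrative choices, not from the text.
def G(x, y):
    return (x * x, x + y)

def F(u, v):
    return (u * v, u + v * v)

def H(x, y):          # the composite F o G
    return F(*G(x, y))

def jacobian(phi, x, y, h=1e-6):
    # Central-difference 2x2 Jacobian of phi: R^2 -> R^2.
    fx1, fx0 = phi(x + h, y), phi(x - h, y)
    fy1, fy0 = phi(x, y + h), phi(x, y - h)
    return (((fx1[0] - fx0[0]) / (2 * h), (fy1[0] - fy0[0]) / (2 * h)),
            ((fx1[1] - fx0[1]) / (2 * h), (fy1[1] - fy0[1]) / (2 * h)))

x0, y0 = 1.0, 2.0
JG = jacobian(G, x0, y0)
JF = jacobian(F, *G(x0, y0))
JH = jacobian(H, x0, y0)

product = tuple(tuple(sum(JF[i][k] * JG[k][j] for k in range(2))
                      for j in range(2)) for i in range(2))
err = max(abs(JH[i][j] - product[i][j]) for i in range(2) for j in range(2))
print(err)  # tiny: the two Jacobians agree entry-wise
```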

Theorem 2.8. If all first-order partial derivatives $\frac{\partial f}{\partial x^1}(\mathbf{x}), \dots, \frac{\partial f}{\partial x^m}(\mathbf{x})$ are continuous, then $f$ is differentiable. (Continuity of the partials is sufficient but not necessary.)

2.3 Gradient and Directional Derivative

Theorem 2.9. Let $\langle \cdot, \cdot \rangle$ be an inner product on $\mathbb{R}^m$. Then for any linear function $L: \mathbb{R}^m \to \mathbb{R}$ on $\mathbb{R}^m$, there exists a unique vector $\nabla L \in \mathbb{R}^m$ such that

\begin{align*} L(\mathbf{v}) = \langle \mathbf{v}, \nabla L \rangle, \quad \forall \mathbf{v} \in \mathbb{R}^m. \end{align*}

Consequently, $\nabla L$ is orthogonal to $\operatorname{Ker} L$, and $\|\nabla L\| = \|L\| = \max_{\|\mathbf{v}\|=1} L(\mathbf{v})$. A unit vector $\mathbf{v}$ satisfies $L(\mathbf{v}) = \|\nabla L\|$ if and only if $\mathbf{v}$ is the unit vector in the direction of $\nabla L$. This unique vector $\nabla L$ is called the gradient vector of the linear function $L$.

Gradient Vector

Let $E$ be a subset of an inner product space, and let the function $f : E \to \mathbb{R}$ be differentiable at $\mathbf{x}_0$. Denote the gradient vector of $df(\mathbf{x}_0)$ by $\operatorname{grad} f(\mathbf{x}_0)$ or $\nabla f(\mathbf{x}_0)$; it is called the gradient vector of $f$ at $\mathbf{x}_0$. Thus,

\begin{align*} df(\mathbf{x}_0)(\mathbf{v}) = \langle \mathbf{v}, \nabla f(\mathbf{x}_0) \rangle. \end{align*}

Theorem 2.10. Let $f$ be differentiable at $\mathbf{x}_0$. Then for any vector $\mathbf{v} \in \mathbb{R}^m$,

\begin{align*} \frac{\partial f}{\partial \mathbf{v}}(\mathbf{x}_0) = df(\mathbf{x}_0)(\mathbf{v}) = \langle \mathbf{v}, \nabla f(\mathbf{x}_0) \rangle. \end{align*}

Consequently, for any unit vector $\mathbf{v}$,

\begin{align*} \frac{\partial f}{\partial \mathbf{v}}(\mathbf{x}_0) \le \|\nabla f(\mathbf{x}_0)\|, \end{align*}

with equality if and only if $\mathbf{v}$ points in the direction of the gradient of $f$ at $\mathbf{x}_0$. Therefore, the gradient direction of $f$ is the direction in which $f$ increases most rapidly.
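The steepest-ascent property can be observed numerically by scanning unit directions. A minimal sketch with the illustrative function $f(x, y) = x^2 + 3y$ at $(1, 1)$, whose gradient there is $(2, 3)$:

```python
import math

# Illustrative check (function assumed, not from the text): among unit vectors,
# the directional derivative of f(x, y) = x^2 + 3*y at (1, 1) is maximized in
# the gradient direction, with maximum value ||grad f|| = sqrt(13).
def f(x, y):
    return x * x + 3.0 * y

x0, y0 = 1.0, 1.0
grad = (2.0 * x0, 3.0)            # exact gradient (2, 3)
grad_norm = math.hypot(*grad)

t = 1e-6
best_val, best_dir = -float("inf"), None
for k in range(360):              # sample unit directions one degree apart
    a = 2 * math.pi * k / 360
    v = (math.cos(a), math.sin(a))
    dd = (f(x0 + t * v[0], y0 + t * v[1]) - f(x0, y0)) / t
    if dd > best_val:
        best_val, best_dir = dd, v

print(best_val, grad_norm)  # best sampled value is close to sqrt(13)
```

The best sampled direction lies within the grid spacing of $\nabla f / \|\nabla f\|$, and the best value approaches $\|\nabla f(\mathbf{x}_0)\|$, as Theorem 2.10 asserts.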

Theorem 2.11. In any coordinate system $(x^1, \dots, x^m)$, the differential of a differentiable function $f$ is the row vector

\begin{align*} df(\mathbf{x}) = \left( \frac{\partial f}{\partial x^1}(\mathbf{x}), \frac{\partial f}{\partial x^2}(\mathbf{x}), \dots, \frac{\partial f}{\partial x^m}(\mathbf{x}) \right). \end{align*}

In a Cartesian coordinate system $(x^1, \dots, x^m)$, the gradient of a differentiable function $f$ is the column vector

\begin{align*} \nabla f(\mathbf{x}) = \begin{pmatrix} \frac{\partial f}{\partial x^1}(\mathbf{x}) \\ \frac{\partial f}{\partial x^2}(\mathbf{x}) \\ \vdots \\ \frac{\partial f}{\partial x^m}(\mathbf{x}) \end{pmatrix}. \end{align*}

2.4 Higher-Order Partial Derivatives and Taylor Expansion

Higher-Order Partial Derivative

For $i_1, \dots, i_k \in \{1, 2, \dots, m\}$, denote

\begin{align*} \frac{\partial^k f}{\partial x^{i_k} \cdots \partial x^{i_1}}(\mathbf{x}_0) = \frac{\partial}{\partial x^{i_k}} \left( \cdots \frac{\partial f}{\partial x^{i_1}} \right) (\mathbf{x}_0). \end{align*}

This is called a $k$-th order partial derivative of $f$ at $\mathbf{x}_0$.

Higher-order partial derivatives are sometimes also denoted by symbols such as $\partial^k_{x^{i_k}, \dots, x^{i_1}} f$, $\partial^k_{i_k, \dots, i_2, i_1} f$, $f^{(k)}_{x^{i_1}, \dots, x^{i_k}}$, or even $f_{x^{i_1}, \dots, x^{i_k}}$.

If for any $i_1, \dots, i_k \in \{1, 2, \dots, m\}$ the partial derivative functions $\frac{\partial^k f}{\partial x^{i_k} \cdots \partial x^{i_1}}(\mathbf{x})$ are all continuous, that is, every $k$-th order partial derivative of $f$ is continuous, then $f$ is called a $\mathscr{C}^k$ function, written $f \in \mathscr{C}^k$.

If $f \in \mathscr{C}^k$ for every positive integer $k$, we write $f \in \mathscr{C}^\infty$.

For a multivariate mapping $F(x^1, \dots, x^m) = (f^1(x^1, \dots, x^m), \dots, f^n(x^1, \dots, x^m))^T$, $F$ is said to be $\mathscr{C}^k$ (or $\mathscr{C}^\infty$) if each of its component functions $f^i$ is $\mathscr{C}^k$ (or $\mathscr{C}^\infty$).

Theorem 2.12.

(1) If the functions $f, g$ are both $\mathscr{C}^k$, then $f + g$ and $fg$ are also $\mathscr{C}^k$.

(2) If $f$ and $g$ in the composite mapping $g \circ f$ are both $\mathscr{C}^k$, then $g \circ f$ is also $\mathscr{C}^k$.

(3) If the functions $f, g$ are both $\mathscr{C}^k$ and $g(x^1, \dots, x^m) \neq 0$, then $f/g$ is also $\mathscr{C}^k$.

Theorem 2.13. For any $\mathscr{C}^k$ function $f$, any $i_1, \dots, i_k \in \{1, 2, \dots, m\}$, and any permutation $\pi$ of $1, 2, \dots, k$,

\begin{align*} \frac{\partial^k f}{\partial x^{i_{\pi(k)}} \cdots \partial x^{i_{\pi(1)}}}(\mathbf{x}) = \frac{\partial^k f}{\partial x^{i_k} \cdots \partial x^{i_1}}(\mathbf{x}). \end{align*}
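The $k = 2$ case of Theorem 2.13 (equality of mixed partials) is easy to test with nested central differences. A minimal sketch with the illustrative $\mathscr{C}^2$ function $f(x, y) = x^3 y^2 + \sin(xy)$:

```python
import math

# Numerical check of the symmetry of mixed partials for the illustrative
# C^2 function f(x, y) = x^3 * y^2 + sin(x * y) (example assumed, not from the text).
def f(x, y):
    return x ** 3 * y ** 2 + math.sin(x * y)

h = 1e-4

def fx(x, y):
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def fy(x, y):
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

x0, y0 = 1.0, 2.0
fxy = (fx(x0, y0 + h) - fx(x0, y0 - h)) / (2 * h)  # d/dy of f_x
fyx = (fy(x0 + h, y0) - fy(x0 - h, y0)) / (2 * h)  # d/dx of f_y
print(fxy, fyx)  # both approximate 6*x^2*y + cos(xy) - xy*sin(xy) ~ 9.765
```

The two orders of differentiation agree to many digits, matching the exact value $6x^2y + \cos(xy) - xy\sin(xy)$ at $(1, 2)$.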

Higher-Order Differentials

The first-order differential $\mathrm{d}f(\mathbf{x})$ of a function $f$ at $\mathbf{x}$ is a linear function of a vector $\mathbf{v}$:

\begin{align*} \mathrm{d}f(\mathbf{x})(\mathbf{v}) = \frac{\partial f}{\partial \mathbf{v}}(\mathbf{x}) = \sum_i \frac{\partial f}{\partial x^i}(\mathbf{x})\, v^i. \end{align*}

Differentiating it again with respect to $\mathbf{x}$ yields the second-order differential of $f$, which is a bilinear function of a pair of vectors $\mathbf{v}, \mathbf{w}$:

\begin{align*} \mathrm{d}^2 f(\mathbf{x})(\mathbf{v}, \mathbf{w}) = \frac{\partial}{\partial \mathbf{w}} \frac{\partial f}{\partial \mathbf{v}}(\mathbf{x}) = \sum_i \left( \sum_j \frac{\partial^2 f}{\partial x^j \partial x^i}(\mathbf{x})\, w^j \right) v^i = \mathbf{w}^T H_f(\mathbf{x})\, \mathbf{v}, \end{align*}

Hessian Matrix

where the coefficient matrix

\begin{align*} H_f(\mathbf{x}) = \begin{pmatrix} \frac{\partial^2 f}{\partial(x^1)^2}(\mathbf{x}) & \frac{\partial^2 f}{\partial x^1 \partial x^2}(\mathbf{x}) & \cdots & \frac{\partial^2 f}{\partial x^1 \partial x^m}(\mathbf{x}) \\ \frac{\partial^2 f}{\partial x^2 \partial x^1}(\mathbf{x}) & \frac{\partial^2 f}{\partial(x^2)^2}(\mathbf{x}) & \cdots & \frac{\partial^2 f}{\partial x^2 \partial x^m}(\mathbf{x}) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x^m \partial x^1}(\mathbf{x}) & \frac{\partial^2 f}{\partial x^m \partial x^2}(\mathbf{x}) & \cdots & \frac{\partial^2 f}{\partial(x^m)^2}(\mathbf{x}) \end{pmatrix} \end{align*}

is called the Hessian matrix of the function $f$ at $\mathbf{x}$.

In general, the $k$-th order differential of the function $f$ is a $k$-multilinear function of $k$ vectors $\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_k$:

\begin{align*} \mathrm{d}^k f(\mathbf{x})(\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_k) &= \sum_{i_1, i_2, \dots, i_k} \frac{\partial^k f}{\partial x^{i_k} \cdots \partial x^{i_2} \partial x^{i_1}}(\mathbf{x})\, v_1^{i_1} v_2^{i_2} \cdots v_k^{i_k} \\ &= \frac{\partial^k f}{\partial \mathbf{v}_k \cdots \partial \mathbf{v}_1}(\mathbf{x}). \end{align*}

When $f \in \mathscr{C}^k$, $\mathrm{d}^k f(\mathbf{x})$ is a symmetric $k$-multilinear function.

Taylor Expansion

For a $\mathscr{C}^k$ function $f$, consider $g(t) = f(\mathbf{x}_0 + t\mathbf{v})$ with $\mathbf{v} = (\xi^1, \dots, \xi^m)^T$. Then $g$ is $\mathscr{C}^k$.

In general,

\begin{align*} g^{(k)}(t) = \sum_{1 \le i_1, \dots, i_k \le m} \xi^{i_1} \cdots \xi^{i_k} \frac{\partial^k f}{\partial x^{i_1} \cdots \partial x^{i_k}}(\mathbf{x}_0 + t\mathbf{v}) = \left( \sum_{i=1}^m \xi^i \frac{\partial}{\partial x^i} \right)^k f(\mathbf{x}_0 + t\mathbf{v}), \end{align*}

where

\begin{align*} \left( \sum_{i=1}^m \xi^i \frac{\partial}{\partial x^i} \right)^k = \sum_{i_1, \dots, i_k \in \{1, 2, \dots, m\}} \xi^{i_1} \cdots \xi^{i_k} \frac{\partial^k}{\partial x^{i_1} \cdots \partial x^{i_k}}. \end{align*}

Thus, by the one-variable Taylor formula with integral remainder,

\begin{align*} f(\mathbf{x}_0 + \mathbf{v}) = g(1) &= \sum_{j=0}^{k-1} \frac{g^{(j)}(0)}{j!} + \int_0^1 \frac{g^{(k)}(s)}{(k-1)!}(1 - s)^{k-1}\,\mathrm{d}s \\ &= \sum_{j=0}^{k-1} \frac{1}{j!}\left( \sum_{i=1}^m \xi^i \frac{\partial}{\partial x^i} \right)^j f(\mathbf{x}_0) + \int_0^1 \frac{(1 - s)^{k-1}}{(k-1)!}\left( \sum_{i=1}^m \xi^i \frac{\partial}{\partial x^i} \right)^k f(\mathbf{x}_0 + s\mathbf{v})\,\mathrm{d}s. \end{align*}

With the Lagrange form of the remainder,

\begin{align*} f(\mathbf{x}_0 + \mathbf{v}) &= \sum_{j=0}^{k-1} \frac{1}{j!} \sum_{1 \le i_1, \dots, i_j \le m} \frac{\partial^j f}{\partial x^{i_1} \cdots \partial x^{i_j}}(\mathbf{x}_0)\, \xi^{i_1} \cdots \xi^{i_j} \\ &\quad + \frac{1}{k!} \sum_{1 \le i_1, \dots, i_k \le m} \frac{\partial^k f}{\partial x^{i_1} \cdots \partial x^{i_k}}(\mathbf{x}_0 + \theta \mathbf{v})\, \xi^{i_1} \cdots \xi^{i_k}, \quad 0 < \theta < 1, \end{align*}

and with the Peano form,

\begin{align*} f(\mathbf{x}_0 + \mathbf{v}) &= \sum_{j=0}^{k} \frac{1}{j!} \sum_{1 \le i_1, \dots, i_j \le m} \frac{\partial^j f}{\partial x^{i_1} \cdots \partial x^{i_j}}(\mathbf{x}_0)\, \xi^{i_1} \cdots \xi^{i_j} + o(\|\mathbf{v}\|^k) \\ &= \sum_{j=0}^{k} \sum_{\alpha_1 + \dots + \alpha_m = j} \frac{\partial^j f}{\partial (x^1)^{\alpha_1} \cdots \partial (x^m)^{\alpha_m}}(\mathbf{x}_0)\, \frac{(\xi^1)^{\alpha_1} \cdots (\xi^m)^{\alpha_m}}{\alpha_1! \cdots \alpha_m!} + o(\|\mathbf{v}\|^k). \end{align*}

Theorem 2.14. Let $f$ be $k$ times continuously differentiable in a neighborhood of $\mathbf{x}_0$. Then a polynomial $P$ of degree at most $k$ satisfies

\begin{align*} f(\mathbf{x}_0 + \mathbf{v}) = P(\mathbf{v}) + o(\|\mathbf{v}\|^k), \quad \mathbf{v} \to \mathbf{0}, \end{align*}

if and only if $P$ is the $k$-th degree Taylor polynomial of $f$ at $\mathbf{x}_0$, that is,

\begin{align*} T_k f(\mathbf{x}_0)(\mathbf{v}) = \sum_{j=0}^{k} \sum_{\alpha_1 + \dots + \alpha_m = j} \frac{\partial^j f}{\partial (x^1)^{\alpha_1} \cdots \partial (x^m)^{\alpha_m}}(\mathbf{x}_0)\, \frac{(\xi^1)^{\alpha_1} \cdots (\xi^m)^{\alpha_m}}{\alpha_1! \cdots \alpha_m!}, \end{align*}

where $\mathbf{v} = (\xi^1, \dots, \xi^m)^T$.
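The Peano-remainder characterization can be observed numerically: for the degree-2 Taylor polynomial, the error divided by $\|\mathbf{v}\|^2$ should vanish as $\mathbf{v} \to \mathbf{0}$. A minimal sketch with the illustrative function $f(x, y) = e^{x + 2y}$ at the origin:

```python
import math

# Illustrative Peano-remainder check (example assumed, not from the text):
# for f(x, y) = exp(x + 2*y) at 0, the degree-2 Taylor polynomial is
# T2(v) = 1 + s + s^2/2 with s = xi1 + 2*xi2, and
# |f(v) - T2(v)| / ||v||^2 should tend to 0 as v -> 0.
def f(x, y):
    return math.exp(x + 2.0 * y)

def T2(x, y):
    s = x + 2.0 * y
    return 1.0 + s + s * s / 2.0

ratios = []
for k in range(1, 6):
    h = 10.0 ** (-k)
    ratios.append(abs(f(h, h) - T2(h, h)) / (2.0 * h * h))  # ||v||^2 = 2h^2

print(ratios)  # decreasing toward 0, consistent with o(||v||^2)
```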

2.5 Extrema and Convexity of Multivariate Functions

Definition. $\mathbf{x}_0$ is called a local minimum (local maximum) point of $f: E \to \mathbb{R}$ if there exists a neighborhood $U$ of $\mathbf{x}_0$ such that for any $\mathbf{x} \in E \cap U$, $f(\mathbf{x}) \ge f(\mathbf{x}_0)$ (respectively $f(\mathbf{x}) \le f(\mathbf{x}_0)$).

$\mathbf{x}_0$ is called a critical point of the function $f$ if for any $\mathbf{v} \in \mathbb{R}^m$, $\frac{\partial f}{\partial \mathbf{v}}(\mathbf{x}_0) = 0$.

$\mathbf{x}_0$ is called a non-degenerate critical point of a $\mathscr{C}^2$ function $f$ if $\mathbf{x}_0$ is a critical point of $f$ and the Hessian matrix $H_f(\mathbf{x}_0)$ is invertible.

Theorem 2.15.

  1. If $f$ is differentiable at an extremum point $\mathbf{x}_0$, then $\mathrm{d}f(\mathbf{x}_0) = 0$, i.e., $\mathbf{x}_0$ is a critical point of $f$.
  2. If $f$ is twice continuously differentiable at a local minimum point $\mathbf{x}_0$, then $H_f(\mathbf{x}_0)$ is positive semi-definite.
  3. If $f$ is twice continuously differentiable at a local maximum point $\mathbf{x}_0$, then $H_f(\mathbf{x}_0)$ is negative semi-definite.
  4. If $H_f(\mathbf{x}_0)$ at a critical point $\mathbf{x}_0$ has both positive and negative eigenvalues, then $\mathbf{x}_0$ is not an extremum point. If $H_f(\mathbf{x}_0)$ is invertible and neither positive definite nor negative definite, then $\mathbf{x}_0$ is called a saddle point of $f$. A saddle point is not an extremum point.

Theorem 2.16. Let $f$ be twice continuously differentiable at a critical point $\mathbf{x}_0$. Then:

(1) If $H_f(\mathbf{x}_0)$ is positive definite, then $\mathbf{x}_0$ is a local minimum point of $f$;

(2) If $H_f(\mathbf{x}_0)$ is negative definite, then $\mathbf{x}_0$ is a local maximum point of $f$.
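Both cases of the classification can be illustrated numerically at the origin with two standard example functions (choices assumed here, not from the text): $f(x, y) = x^2 + 3y^2$ has Hessian $\operatorname{diag}(2, 6)$, positive definite, while $g(x, y) = x^2 - y^2$ has Hessian $\operatorname{diag}(2, -2)$, indefinite:

```python
import math

# f has a positive-definite Hessian at (0, 0): expect a local minimum.
# g has an indefinite Hessian at (0, 0): expect a saddle, not an extremum.
def f(x, y):
    return x * x + 3.0 * y * y

def g(x, y):
    return x * x - y * y

r = 1e-3
f_vals, g_vals = [], []
for k in range(36):                     # sample a small circle around the origin
    a = 2 * math.pi * k / 36
    x, y = r * math.cos(a), r * math.sin(a)
    f_vals.append(f(x, y))
    g_vals.append(g(x, y))

f_is_min = all(v > f(0.0, 0.0) for v in f_vals)
g_is_saddle = (any(v > g(0.0, 0.0) for v in g_vals)
               and any(v < g(0.0, 0.0) for v in g_vals))
print(f_is_min, g_is_saddle)  # True True
```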

Convex Function

A set $C \subseteq \mathbb{R}^m$ is called a convex set if for any $\mathbf{x}, \mathbf{y} \in C$ and any $0 \le t \le 1$, $(1 - t)\mathbf{x} + t\mathbf{y} \in C$.

Let $C \subseteq \mathbb{R}^m$ be a convex set. A function $f: C \to \mathbb{R}$ is called a convex function if for any $\mathbf{x}, \mathbf{y} \in C$ and any $0 \le t \le 1$,

\begin{align*} f((1 - t)\mathbf{x} + t\mathbf{y}) \le (1 - t)f(\mathbf{x}) + t f(\mathbf{y}). \end{align*}

A function $f: C \to \mathbb{R}$ is called a concave function if $-f$ is a convex function.

Theorem 2.17 (Second-order Taylor expansion). Let $f$ be twice continuously differentiable at $\mathbf{x}_0$. Then for $\mathbf{v} = (\xi^1, \dots, \xi^m)^T$,

\begin{align*} f(\mathbf{x}_0 + \mathbf{v}) = f(\mathbf{x}_0) + \mathrm{d}f(\mathbf{x}_0)(\mathbf{v}) + \frac{1}{2}\mathbf{v}^T H_f(\mathbf{x}_0)\,\mathbf{v} + o(\|\mathbf{v}\|^2), \quad \|\mathbf{v}\| \to 0. \end{align*}

In addition, as long as $f$ is defined and twice continuously differentiable on the line segment connecting $\mathbf{x}_0$ and $\mathbf{x}_0 + \mathbf{v}$, there exists $0 < \theta < 1$ such that

\begin{align*} f(\mathbf{x}_0 + \mathbf{v}) = f(\mathbf{x}_0) + \mathrm{d}f(\mathbf{x}_0)(\mathbf{v}) + \frac{1}{2}\mathbf{v}^T H_f(\mathbf{x}_0 + \theta\mathbf{v})\,\mathbf{v}. \end{align*}

Theorem 2.18. If $\mathbf{x}_0$ is a local minimum of a strictly convex function $f$, then $\mathbf{x}_0$ is the global minimum of $f$, and it is the unique global minimum. If $\mathbf{x}_0$ is a local maximum of a strictly concave function $f$, then $\mathbf{x}_0$ is the global maximum of $f$, and it is the unique global maximum.

Theorem 2.19. A $\mathscr{C}^2$ function $f$ is a convex (concave) function if and only if the Hessian matrix of $f$ is positive semi-definite (negative semi-definite) at every point.

If the Hessian matrix of $f$ is positive definite (negative definite) at every point, then $f$ is a strictly convex (strictly concave) function.
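The "positive definite Hessian everywhere $\Rightarrow$ convex" direction can be spot-checked against the defining inequality. A minimal sketch with the illustrative function $f(x, y) = x^2 + y^2 + e^x$, whose Hessian $\operatorname{diag}(2 + e^x,\, 2)$ is positive definite at every point:

```python
import math
import random

# Illustrative check of Theorem 2.19 (example assumed, not from the text):
# f(x, y) = x^2 + y^2 + exp(x) has Hessian diag(2 + exp(x), 2), positive
# definite everywhere, so f should satisfy the convexity inequality
# f((1-t)p + t q) <= (1-t) f(p) + t f(q) for all p, q, t.
def f(x, y):
    return x * x + y * y + math.exp(x)

random.seed(0)
ok = True
for _ in range(100):
    p = (random.uniform(-2, 2), random.uniform(-2, 2))
    q = (random.uniform(-2, 2), random.uniform(-2, 2))
    t = random.random()
    mid = ((1 - t) * p[0] + t * q[0], (1 - t) * p[1] + t * q[1])
    if f(*mid) > (1 - t) * f(*p) + t * f(*q) + 1e-12:  # tolerance for rounding
        ok = False

print(ok)  # True: no random pair violates the inequality
```

A random spot-check of course does not prove convexity; it merely illustrates the inequality that Theorem 2.19 guarantees.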