METHODS FOR NON-LINEAR LEAST SQUARES PROBLEMS
2nd Edition, April 2004
K. Madsen, H.B. Nielsen, O. Tingleff
IMM, Informatics and Mathematical Modelling
Technical University of Denmark
CONTENTS
1. INTRODUCTION AND DEFINITIONS
2. DESCENT METHODS
   2.1. The Steepest Descent Method
   2.2. Newton’s Method
   2.3. Line Search
   2.4. Trust Region and Damped Methods
3. NON-LINEAR LEAST SQUARES PROBLEMS
   3.1. The Gauss–Newton Method
   3.2. The Levenberg–Marquardt Method
   3.3. Powell’s Dog Leg Method
   3.4. A Hybrid Method: L–M and Quasi-Newton
   3.5. A Secant Version of the L–M Method
   3.6. A Secant Version of the Dog Leg Method
   3.7. Final Remarks
APPENDIX
REFERENCES
INDEX
1. INTRODUCTION AND DEFINITIONS
In this booklet we consider the following problem.

Definition 1.1. Least Squares Problem.
Find x*, a local minimizer for¹⁾

    F(x) = ½ ∑_{i=1}^{m} (f_i(x))² ,

where f_i : ℝⁿ → ℝ, i = 1,…,m are given functions, and m ≥ n.
Example 1.1. An important source of least squares problems is data fitting. As an example consider the data points (t₁, y₁), …, (t_m, y_m) shown below.

Figure 1.1. Data points {(t_i, y_i)} (marked by +) and model M(x, t) (marked by full line).

Further, we are given a fitting model,

    M(x, t) = x₃ e^{x₁ t} + x₄ e^{x₂ t} .
1) The factor ½ in the definition of F(x) has no effect on x*. It is introduced for convenience; see page 18.
The model depends on the parameters x = [x₁, x₂, x₃, x₄]^T. We assume that there exists an x† so that

    y_i = M(x†, t_i) + ε_i ,

where the {ε_i} are (measurement) errors on the data ordinates, assumed to behave like “white noise”.

For any choice of x we can compute the residuals

    f_i(x) = y_i − M(x, t_i)
           = y_i − x₃ e^{x₁ t_i} − x₄ e^{x₂ t_i} ,   i = 1,…,m .

For a least squares fit the parameters are determined as the minimizer x* of the sum of squared residuals. This is seen to be a problem of the form in Definition 1.1 with n = 4. The graph of M(x*, t) is shown by full line in Figure 1.1.
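The residuals and the objective of Example 1.1 can be sketched in code as follows; the data arrays and the parameter values below are hypothetical, made up only to exercise the functions:

```python
import numpy as np

def residuals(x, t, y):
    """f_i(x) = y_i - M(x, t_i) for the model M(x,t) = x3*exp(x1*t) + x4*exp(x2*t)."""
    x1, x2, x3, x4 = x
    return y - x3 * np.exp(x1 * t) - x4 * np.exp(x2 * t)

def F(x, t, y):
    """Objective F(x) = 1/2 * sum of squared residuals (Definition 1.1)."""
    f = residuals(x, t, y)
    return 0.5 * f @ f

# Hypothetical data generated from known parameters plus white noise
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 20)
x_true = np.array([-4.0, -5.0, 4.0, -4.0])
y = x_true[2] * np.exp(x_true[0] * t) + x_true[3] * np.exp(x_true[1] * t)
y += 0.01 * rng.standard_normal(t.size)

print(F(x_true, t, y))  # small: only the noise contributes
```

A least squares solver then varies x so as to drive this F towards its minimum.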
A least squares problem is a special variant of the more general problem: Given a function F : ℝⁿ → ℝ, find an argument of F that gives the minimum value of this so-called objective function or cost function.

Definition 1.2. Global Minimizer.
Given F : ℝⁿ → ℝ. Find

    x⁺ = argmin_x { F(x) } .

This problem is very hard to solve in general, and we only present methods for solving the simpler problem of finding a local minimizer for F, an argument vector which gives a minimum value of F inside a certain region whose size is given by δ, where δ is a small, positive number.

Definition 1.3. Local Minimizer.
Given F : ℝⁿ → ℝ. Find x* so that

    F(x*) ≤ F(x)   for ‖x − x*‖ < δ .
In the remainder of this introduction we shall discuss some basic concepts in
optimization, and Chapter 2 is a brief review of methods for finding a local
minimizer for general cost functions. For more details we refer to Frandsen
et al (2004). In Chapter 3 we give methods that are specially tuned for least
squares problems.
We assume that the cost function F is differentiable and so smooth that the following Taylor expansion is valid,²⁾

    F(x + h) = F(x) + h^T g + ½ h^T H h + O(‖h‖³) ,   (1.4a)

where g is the gradient,

    g ≡ F′(x) = [ ∂F/∂x₁ (x), …, ∂F/∂x_n (x) ]^T ,   (1.4b)

and H is the Hessian,

    H ≡ F″(x) = [ ∂²F/(∂x_i ∂x_j) (x) ] .   (1.4c)
If x* is a local minimizer and ‖h‖ is sufficiently small, then we cannot find a point x* + h with a smaller F-value. Combining this observation with (1.4a) we get

Theorem 1.5. Necessary condition for a local minimizer.
If x* is a local minimizer, then

    g* ≡ F′(x*) = 0 .

We use a special name for arguments that satisfy the necessary condition:

Definition 1.6. Stationary point.
If

    g_s ≡ F′(x_s) = 0 ,

then x_s is said to be a stationary point for F.
2) Unless otherwise specified, ‖·‖ denotes the 2-norm, ‖h‖ = √(h₁² + ⋯ + h_n²).
Thus, a local minimizer is also a stationary point, but so is a local maximizer. A stationary point which is neither a local maximizer nor a local minimizer is called a saddle point. In order to determine whether a given stationary point is a local minimizer or not, we need to include the second order term in the Taylor series (1.4a). Inserting x_s we see that

    F(x_s + h) = F(x_s) + ½ h^T H_s h + O(‖h‖³)   with H_s = F″(x_s) .   (1.7)

From definition (1.4c) of the Hessian it follows that any H is a symmetric matrix. If we request that H_s is positive definite, then its eigenvalues are greater than some number δ > 0 (see Appendix A), and

    h^T H_s h > δ ‖h‖² .

This shows that for ‖h‖ sufficiently small the third term on the right-hand side of (1.7) will be dominated by the second. This term is positive, so that we get

Theorem 1.8. Sufficient condition for a local minimizer.
Assume that x_s is a stationary point and that F″(x_s) is positive definite. Then x_s is a local minimizer.

If H_s is negative definite, then x_s is a local maximizer. If H_s is indefinite (ie it has both positive and negative eigenvalues), then x_s is a saddle point.
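The eigenvalue-based classification just described can be sketched as follows (the sample Hessians are hypothetical):

```python
import numpy as np

def classify_stationary_point(H, tol=1e-10):
    """Classify a stationary point from the symmetric Hessian H = F''(x_s)."""
    eigvals = np.linalg.eigvalsh(H)  # eigvalsh: eigenvalues of a symmetric matrix
    if np.all(eigvals > tol):
        return "local minimizer"     # H positive definite (Theorem 1.8)
    if np.all(eigvals < -tol):
        return "local maximizer"     # H negative definite
    if eigvals.min() < -tol and eigvals.max() > tol:
        return "saddle point"        # indefinite
    return "inconclusive"            # some eigenvalue is (numerically) zero

print(classify_stationary_point(np.array([[2.0, 0.0], [0.0, 3.0]])))   # local minimizer
print(classify_stationary_point(np.array([[2.0, 0.0], [0.0, -3.0]])))  # saddle point
```

When some eigenvalue is zero the second order term gives no verdict, which is why the sketch returns "inconclusive" in that case.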
2. DESCENT METHODS
All methods for non-linear optimization are iterative: From a starting point x₀ the method produces a series of vectors x₁, x₂, …, which (hopefully) converges to x*, a local minimizer for the given function, see Definition 1.3. Most methods have measures which enforce the descending condition

    F(x_{k+1}) < F(x_k) .   (2.1)

This prevents convergence to a maximizer and also makes it less probable that we converge towards a saddle point. If the given function has several minimizers the result will depend on the starting point x₀. We do not know which of the minimizers will be found; it is not necessarily the minimizer closest to x₀.

In many cases the method produces vectors which converge towards the minimizer in two clearly different stages. When x₀ is far from the solution we want the method to produce iterates which move steadily towards x*. In this “global stage” of the iteration we are satisfied if the errors do not increase except in the very first steps, ie

    ‖e_{k+1}‖ < ‖e_k‖   for k > K ,

where e_k denotes the current error,

    e_k = x_k − x* .   (2.2)
In the final stage of the iteration, where x_k is close to x*, we want faster convergence. We distinguish between

Linear convergence:

    ‖e_{k+1}‖ ≤ a ‖e_k‖   when ‖e_k‖ is small; 0 < a < 1 ,   (2.3a)

Quadratic convergence:

    ‖e_{k+1}‖ = O(‖e_k‖²)   when ‖e_k‖ is small ,   (2.3b)

Superlinear convergence:

    ‖e_{k+1}‖ / ‖e_k‖ → 0   for k → ∞ .   (2.3c)
The methods presented in this lecture note are descent methods which satisfy the descending condition (2.1) in each step of the iteration. One step from the current iterate consists in

1. Find a descent direction h_d (discussed below), and
2. find a step length giving a good decrease in the F-value.

Thus an outline of a descent method is

Algorithm 2.4. Descent method
    begin
        k := 0;  x := x₀;  found := false              {starting point}
        while (not found) and (k < k_max)
            h_d := search_direction(x)                 {from x and downhill}
            if (no such h_d exists)
                found := true                          {x is stationary}
            else
                α := step_length(x, h_d)               {from x in direction h_d}
                x := x + α h_d;  k := k + 1            {next iterate}
    end

Consider the variation of the F-value along the half line starting at x and with direction h. From the Taylor expansion (1.4a) we see that

    F(x + αh) = F(x) + α h^T F′(x) + O(α²)
              ≈ F(x) + α h^T F′(x)   for α sufficiently small.   (2.5)

We say that h is a descent direction if F(x + αh) is a decreasing function of α at α = 0. This leads to the following definition.
Definition 2.6. Descent direction.
h is a descent direction for F at x if

    h^T F′(x) < 0 .

If no such h exists, then F′(x) = 0, showing that in this case x is stationary. Otherwise, we have to choose α, ie how far we should go from x in the direction given by h_d, so that we get a decrease in the value of the objective function. One way of doing this is to find (an approximation to)

    α_e = argmin_{α>0} { F(x + αh) } .   (2.7)

The process is called line search, and is discussed in Section 2.3. First, however, we shall introduce two methods for computing a descent direction.
2.1. The Steepest Descent Method
From (2.5) we see that when we perform a step αh with positive α, then the relative gain in function value satisfies

    lim_{α→0} (F(x) − F(x + αh)) / (α‖h‖) = −(1/‖h‖) h^T F′(x) = −‖F′(x)‖ cos θ ,

where θ is the angle between the vectors h and F′(x). This shows that we get the greatest gain rate if θ = π, ie if we use the steepest descent direction h_sd given by

    h_sd = −F′(x) .   (2.8)

The method based on (2.8) (ie h_d = h_sd in Algorithm 2.4) is called the steepest descent method or gradient method. The choice of descent direction is “the best” (locally) and we could combine it with an exact line search (2.7). A method like this converges, but the final convergence is linear and often very slow. Examples in Frandsen et al (2004) show how the steepest descent method with exact line search and finite computer precision can fail to find the minimizer of a second degree polynomial. For many problems, however, the method has quite good performance in the initial stage of the iterative process.
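As an illustration, the steepest descent method can be sketched with a crude step-halving strategy standing in for the line search of Section 2.3; the quadratic test function below is hypothetical:

```python
import numpy as np

def steepest_descent(F, gradF, x0, tol=1e-6, k_max=10000):
    """Steepest descent (h_sd = -F'(x)) with simple backtracking in place of
    a proper line search. F, gradF and the start point x0 are supplied by the caller."""
    x = np.asarray(x0, dtype=float)
    for _ in range(k_max):
        g = gradF(x)
        if np.linalg.norm(g) < tol:       # x is (nearly) stationary
            break
        h = -g                            # steepest descent direction (2.8)
        alpha = 1.0
        # Halve alpha until the descending condition (2.1) holds
        while F(x + alpha * h) >= F(x) and alpha > 1e-12:
            alpha *= 0.5
        x = x + alpha * h
    return x

# Hypothetical test problem: F(x) = 1/2 x^T A x with minimizer [0, 0]
A = np.array([[3.0, 1.0], [1.0, 2.0]])
F = lambda x: 0.5 * x @ A @ x
gradF = lambda x: A @ x
print(steepest_descent(F, gradF, [4.0, -3.0]))  # approaches [0, 0]
```

On this well-conditioned quadratic the iteration converges quickly; the slow linear convergence mentioned above shows up when the Hessian is badly conditioned.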
Considerations like this have led to the so-called hybrid methods, which, as the name suggests, are based on two different methods: one which is good in the initial stage, like the gradient method, and another which is good in the final stage, like Newton’s method; see the next section. A major problem with a hybrid method is the mechanism which switches between the two methods when appropriate.
2.2. Newton’s Method
We can derive this method from the condition that x* is a stationary point. According to Definition 1.6 it satisfies F′(x*) = 0. This is a nonlinear system of equations, and from the Taylor expansion

    F′(x + h) = F′(x) + F″(x)h + O(‖h‖²)
              ≈ F′(x) + F″(x)h   for ‖h‖ sufficiently small

we derive Newton’s method: Find h_n as the solution to

    H h_n = −F′(x)   with H = F″(x) ,   (2.9a)

and compute the next iterate by

    x := x + h_n .   (2.9b)

Suppose that H is positive definite. Then it is nonsingular (implying that (2.9a) has a unique solution), and u^T H u > 0 for all nonzero u. Thus, by multiplying with h_n^T on both sides of (2.9a) we get

    0 < h_n^T H h_n = −h_n^T F′(x) ,   (2.10)

showing that h_n is a descent direction: it satisfies the condition in Definition 2.6.
Newton’s method is very good in the final stage of the iteration, where x is close to x*. One can show (see Frandsen et al (2004)) that if the Hessian at the solution is positive definite (the sufficient condition in Theorem 1.8 is satisfied) and if we are at a position inside the region around x* where
F″(x) is positive definite, then we get quadratic convergence (defined in (2.3)). On the other hand, if x is in a region where F″(x) is negative definite everywhere, and where there is a stationary point, the basic Newton method (2.9) would converge (quadratically) towards this stationary point, which is a maximizer. We can avoid this by requiring that all steps taken are in descent directions.

We can build a hybrid method, based on Newton’s method and the steepest descent method. According to (2.10) the Newton step is guaranteed to be downhill if F″(x) is positive definite, so a sketch of the central section of this hybrid algorithm could be

    if F″(x) is positive definite
        h := h_n
    else
        h := h_sd
    x := x + αh
    (2.11)

Here, h_sd is the steepest descent direction and α is found by line search; see Section 2.3. A good tool for checking a matrix for positive definiteness is Cholesky’s method (see Appendix A) which, when successful, is also used for solving the linear system in question. Thus, the check for definiteness is almost for free.
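A sketch of this Cholesky-based switch, under the assumption that NumPy’s `cholesky` (which raises `LinAlgError` precisely when the matrix is not positive definite) is an acceptable stand-in for the factorization of Appendix A:

```python
import numpy as np

def hybrid_direction(g, H):
    """Central section of the hybrid method (2.11): attempt a Cholesky
    factorization of H = F''(x); if it succeeds, H is positive definite and
    the factor is reused to solve H h_n = -g; otherwise fall back to the
    steepest descent direction h_sd = -g."""
    try:
        L = np.linalg.cholesky(H)        # fails iff H is not positive definite
        # Solve H h = -g via the factorization: L (L^T h) = -g
        z = np.linalg.solve(L, -g)       # forward substitution (generic solver for brevity)
        return np.linalg.solve(L.T, z)   # back substitution: Newton direction h_n
    except np.linalg.LinAlgError:
        return -g                        # steepest descent direction h_sd

# Positive definite Hessian gives the Newton step; indefinite gives the gradient step
g = np.array([1.0, 2.0])
print(hybrid_direction(g, np.array([[2.0, 0.0], [0.0, 4.0]])))   # [-0.5, -0.5]
print(hybrid_direction(g, np.array([[1.0, 0.0], [0.0, -1.0]])))  # [-1.0, -2.0]
```

Both returned directions satisfy h^T F′(x) < 0, so either branch yields a descent direction in the sense of Definition 2.6.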
In Section 2.4 we introduce some methods, where the computation of the search direction h_d and step length α is done simultaneously, and give a version of (2.11) without line search. Such hybrid methods can be very efficient, but they are hardly ever used. The reason is that they need an implementation of F″(x), and for complicated application problems this is not available. Instead we can use a so-called Quasi-Newton method, based on series of matrices which gradually approach H* = F″(x*). In Section 3.4 we present such a method. Also see Chapter 5 in Frandsen et al (2004).
2.3. Line Search
Given a point x and a descent direction h. The next iteration step is a move from x in direction h. To find out how far to move, we study the variation of the given function along the half line from x in the direction h,

    φ(α) = F(x + αh) ,   x and h fixed, α ≥ 0 .   (2.12)

An example of the behaviour of φ(α) is shown in Figure 2.1.

Figure 2.1. Variation of the cost function along the search line, y = φ(α).

Our h being a descent direction ensures that

    φ′(0) = h^T F′(x) < 0 ,

indicating that if α is sufficiently small, we satisfy the descending condition (2.1), which is equivalent to

    φ(α) < φ(0) .

Often, we are given an initial guess on α, eg α = 1 with Newton’s method. Figure 2.1 illustrates that three different situations can arise

1° α is so small that the gain in value of the objective function is very small. α should be increased.
2° α is too large: φ(α) ≥ φ(0). Decrease α in order to satisfy the descent condition (2.1).
3° α is close to the minimizer¹⁾ of φ(α). Accept this α-value.

1) More precisely: the smallest local minimizer of φ. If we increase α beyond the interval shown in Figure 2.1, it may well happen that we get close to another local minimum for F.
An exact line search is an iterative process producing a series α₁, α₂, … . The aim is to find the true minimizer α_e defined in (2.7), and the algorithm stops when the iterate α_s satisfies

    |φ′(α_s)| ≤ τ |φ′(0)| ,

where τ is a small, positive number. In the iteration we can use approximations to the variation of φ(α) based on the computed values of

    φ(α_k) = F(x + α_k h)   and   φ′(α_k) = h^T F′(x + α_k h) .

See Sections 2.5–2.6 in Frandsen et al (2004) for details.

Exact line search can waste much computing time: When x is far from x* the search direction h may be far from the direction x* − x, and there is no need to find the true minimum of φ very accurately. This is the background for the so-called soft line search, where we accept an α-value if it does not fall in the categories 1° or 2° listed above. We use a stricter version of the descending condition (2.1), viz

    φ(α_s) ≤ φ(0) + γ₁ · φ′(0) · α   with 0 < γ₁ < 1 .   (2.13a)

This ensures that we are not in case 2°. Case 1° corresponds to the point (α, φ(α)) being too close to the starting tangent, and we supplement with the condition

    φ′(α_s) ≥ γ₂ · φ′(0)   with γ₁ < γ₂ < 1 .   (2.13b)

If the starting guess on α satisfies both these criteria, then we accept it as α_s. Otherwise, we have to iterate as sketched for exact line search. Details can be seen in Section 2.5 of Frandsen et al (2004).
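The acceptance test of the soft line search, conditions (2.13a) and (2.13b), can be sketched as a small predicate; the one-dimensional φ below and the parameter values are hypothetical:

```python
def soft_accept(phi, dphi, alpha, gamma1=1e-4, gamma2=0.9):
    """Check the soft line search criteria (2.13a)-(2.13b) for a step alpha.
    phi(a) = F(x + a*h), dphi(a) = phi'(a); requires 0 < gamma1 < gamma2 < 1."""
    sufficient_decrease = phi(alpha) <= phi(0) + gamma1 * dphi(0) * alpha  # (2.13a), rules out case 2
    not_too_short = dphi(alpha) >= gamma2 * dphi(0)                        # (2.13b), rules out case 1
    return sufficient_decrease and not_too_short

# Hypothetical example: phi(a) = (a - 1)^2, so phi'(0) = -2 < 0 (a descent direction)
phi = lambda a: (a - 1.0) ** 2
dphi = lambda a: 2.0 * (a - 1.0)
print(soft_accept(phi, dphi, 1.0))    # near the minimizer of phi: accepted
print(soft_accept(phi, dphi, 0.01))   # case 1: step too short, rejected
```

A full soft line search would wrap such a test in a bracketing and interpolation loop, as described in Frandsen et al (2004).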
2.4. Trust Region and Damped Methods
Assume that we have a model L of the behaviour of F in the neighbourhood of the current iterate x,

    F(x + h) ≈ L(h) ≡ F(x) + h^T c + ½ h^T B h ,   (2.14)

where c ∈ ℝⁿ and the matrix B ∈ ℝⁿˣⁿ is symmetric. The basic ideas of this section may be generalized to other forms of the model, but in this booklet we only need the form of L given in (2.14). Typically, the model is a second order Taylor expansion of F around x, like the first three terms in the right-hand side of (1.4a), or L(h) may be an approximation to this expansion. It is generally true that such a model is good only when h is sufficiently small. We shall introduce two methods that include this aspect in the determination of a step h, which is a descent direction and which can be used with α = 1 in Algorithm 2.4.

In a trust region method we assume that we know a positive number Δ such that the model is sufficiently accurate inside a ball with radius Δ, centered at x, and determine the step as

    h = h_tr ≡ argmin_{‖h‖≤Δ} { L(h) } .   (2.15)

In a damped method the step is determined as

    h = h_dm ≡ argmin_h { L(h) + ½ μ h^T h } ,   (2.16)

where the damping parameter μ ≥ 0. The term ½ μ h^T h = ½ μ‖h‖² is seen to penalize large steps.
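Setting the gradient of L(h) + ½ μ h^T h to zero gives (B + μI) h_dm = −c, so for given c, B and μ the damped step can be sketched as follows (the sample B and c are hypothetical):

```python
import numpy as np

def damped_step(c, B, mu):
    """Damped step (2.16): minimize L(h) + 1/2*mu*h^T h.
    The stationarity condition is (B + mu*I) h = -c; for mu large enough the
    coefficient matrix is positive definite and the minimizer is unique."""
    n = c.size
    return np.linalg.solve(B + mu * np.eye(n), -c)

B = np.array([[2.0, 0.0], [0.0, -1.0]])   # indefinite model Hessian
c = np.array([1.0, 1.0])
print(damped_step(c, B, mu=2.0))          # mu = 2 makes B + mu*I positive definite
```

Note how the damping also regularizes the problem: even though B itself is indefinite, B + μI is positive definite once μ exceeds the magnitude of B’s most negative eigenvalue.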
The central part of Algorithm 2.4 based on one of these methods has the form

    Compute h by (2.15) or (2.16)
    if F(x + h) < F(x)
        x := x + h
    Update Δ or μ
    (2.17)

This corresponds to α = 1 if the step h satisfies the descending condition (2.1). Otherwise, α = 0, ie we do not move.²⁾ However, we are not stuck

2) There are versions of these methods that include a proper line search to find a point x + αh with smaller F-value, and information gathered during the line search is used in the updating of Δ or μ. For many problems such versions use fewer iteration steps but a larger accumulated number of function values.
at x (unless x = x*): by a proper modification of Δ or μ we aim at having better luck in the next iteration step.

Since L(h) is assumed to be a good approximation to F(x + h) for h sufficiently small, the reason why the step failed is that h was too large, and should be reduced. Further, if the step is accepted, it may be possible to use a larger step from the new iterate and thereby reduce the number of steps needed before we reach x*.

The quality of the model with the computed step can be evaluated by the so-called gain ratio

    ϱ = (F(x) − F(x + h)) / (L(0) − L(h)) ,   (2.18)

ie the ratio between the actual and predicted decrease in function value. By construction the denominator is positive, and the numerator is negative if the step was not downhill: it was too large and should be reduced.
With a trust region method we monitor the step length by the size of the radius Δ. The following updating strategy is widely used,

    if ϱ < 0.25
        Δ := Δ/2
    elseif ϱ > 0.75
        Δ := max{Δ, 3·‖h‖}
    (2.19)
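The gain ratio (2.18) and the updating strategy (2.19) can be sketched together as follows; the numerical values below are hypothetical:

```python
import numpy as np

def gain_ratio(F_x, F_xh, L0, Lh):
    """Gain ratio (2.18): actual decrease divided by predicted decrease."""
    return (F_x - F_xh) / (L0 - Lh)

def update_radius(delta, rho, h, p1=2.0, p2=3.0):
    """Trust region radius update (2.19): shrink on a poor gain ratio,
    allow growth when the model predicts the decrease well."""
    if rho < 0.25:
        return delta / p1              # step too large: use smaller steps
    if rho > 0.75:
        return max(delta, p2 * np.linalg.norm(h))
    return delta                       # otherwise keep the radius unchanged

rho = gain_ratio(F_x=10.0, F_xh=9.0, L0=10.0, Lh=8.0)   # rho = 0.5
print(update_radius(1.0, rho, h=np.array([0.6, 0.8])))  # radius unchanged: 1.0
print(update_radius(1.0, 0.9, h=np.array([0.6, 0.8])))  # rho > 0.75: max(1, 3*1) = 3.0
```

Taking the maximum in the growth branch ensures the radius never shrinks on a successful step, which matters when the constrained minimizer h_tr lies well inside the trust region.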
Thus, if ϱ < 1/4, we decide to use smaller steps, while ϱ > 3/4 indicates that it may be possible to use larger steps. A trust region algorithm is not sensitive to minor changes in the thresholds 0.25 and 0.75, the divisor p₁ = 2 or the factor p₂ = 3, but it is important that the numbers p₁ and p₂ are chosen so that the Δ-va