8132019 num chap 1
httpslidepdfcomreaderfullnumchap1 138
Chapter 1
Mathematical Review and Computer Arithmetic
1.1 Mathematical Review
The tools of scientific, engineering, and operations research computing are firmly based in the calculus. In particular, formulating and solving mathematical models in these areas involves approximating quantities first seen in a calculus course, such as integrals, derivatives, solutions to differential equations, and solutions to systems of equations. Indeed, techniques from such a course are the basis of much of scientific computation. We review these techniques here, with particular emphasis on how we will use them.
In addition to basic calculus techniques, scientific computing involves approximating the real number system by decimal numbers with a fixed number of digits in their representation. Except for certain research-oriented systems, computer number systems used for this purpose today are floating point systems, and almost all such floating point systems in use today adhere to the IEEE 754-2008 floating point standard. We describe floating point numbers and the floating point standard in this chapter, paying particular attention to the consequences and pitfalls of its use.
Third, programming and software tools are used in scientific computing. Considering how commonly it is used, its ease of programming and debugging, its documentation, and the packages accessible from it, we have elected to use matlab throughout this book. We introduce the basics of matlab in this chapter.
1.1.1 Intermediate Value Theorem, Mean Value Theorems, and Taylor's Theorem
Throughout, $C^n[a,b]$ will denote the set of real-valued functions $f$ defined on the interval $[a,b]$ such that $f$ and its derivatives, up to and including its $n$th derivative $f^{(n)}$, are continuous on $[a,b]$.
THEOREM 1.1
(Intermediate value theorem) If $f \in C[a,b]$ and $k$ is any number between $m = \min_{a \le x \le b} f(x)$ and $M = \max_{a \le x \le b} f(x)$, then there exists a number $c$ in $[a,b]$
2 Applied Numerical Methods
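for which $f(c) = k$ (Figure 1.1).
[Figure: graph of $y = f(x)$ over $[a,b]$, with $m$, $k$, and $M$ marked on the $y$-axis and $a$, $c$, and $b$ marked on the $x$-axis.]
FIGURE 1.1: Illustration of the Intermediate Value Theorem.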
Example 1.1
Consider $f(x) = e^x - x - 2$. Using a computational device (such as a calculator) on which we trust the approximation of $e^x$ to be accurate, we compute $f(0) = -1$ and $f(2) \approx 3.3891$. We know $f$ is continuous, since it is a sum of continuous functions. Since 0 is between $f(0)$ and $f(2)$, the Intermediate Value Theorem tells us there is a point $c \in [0,2]$ such that $f(c) = 0$. At such a $c$, $e^c = c + 2$.
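The Intermediate Value Theorem only asserts that such a $c$ exists. As a rough illustration, the following Python sketch narrows it down by repeatedly halving $[0,2]$ (bisection, a technique not developed in this section). The helper name `bisect_root` and the tolerance are assumptions made for this example.

```python
import math


def f(x):
    """The function from Example 1.1."""
    return math.exp(x) - x - 2.0


def bisect_root(func, a, b, tol=1e-12):
    """Shrink [a, b] while keeping a sign change of func inside it."""
    fa, fb = func(a), func(b)
    if fa * fb > 0:
        raise ValueError("func(a) and func(b) must have opposite signs")
    while b - a > tol:
        mid = 0.5 * (a + b)
        fm = func(mid)
        if fa * fm <= 0:      # sign change lies in [a, mid]
            b, fb = mid, fm
        else:                 # sign change lies in [mid, b]
            a, fa = mid, fm
    return 0.5 * (a + b)


c = bisect_root(f, 0.0, 2.0)
print(f(0.0), f(2.0))           # opposite signs, so a root lies in [0, 2]
print(c, math.exp(c), c + 2.0)  # at the root, e^c and c + 2 nearly agree
```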
THEOREM 1.2
(Mean value theorem for integrals) Let $f$ be continuous and $w$ be Riemann integrable$^1$ on $[a,b]$, and suppose that $w(x) \ge 0$ for $x \in [a,b]$. Then there exists a point $c$ in $[a,b]$ such that
\[ \int_a^b w(x) f(x)\,dx = f(c) \int_a^b w(x)\,dx. \]
Example 1.2
Suppose we want bounds on
\[ \int_0^1 x^2 e^{-x^2}\,dx. \]
With $w(x) = x^2$ and $f(x) = e^{-x^2}$, the Mean Value Theorem for integrals tells us that
\[ \int_0^1 x^2 e^{-x^2}\,dx = e^{-c^2} \int_0^1 x^2\,dx \]
for some $c \in [0,1]$, so
\[ \frac{1}{3e} \le \int_0^1 x^2 e^{-x^2}\,dx = e^{-c^2} \int_0^1 x^2\,dx \le \frac{1}{3}. \]
$^1$ This means that the limit of the Riemann sums exists. For example, $w$ may be continuous, or $w$ may have a finite number of breaks.
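The bounds in Example 1.2 can be checked numerically. The sketch below approximates the integral with a composite Simpson rule, which is an assumption of this illustration (the text has not introduced any quadrature rule), and compares the result with $1/(3e)$ and $1/3$.

```python
import math


def simpson(func, a, b, n=1000):
    """Composite Simpson rule with n (even) subintervals."""
    h = (b - a) / n
    total = func(a) + func(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * func(a + i * h)
    return total * h / 3.0


integral = simpson(lambda x: x**2 * math.exp(-x**2), 0.0, 1.0)
lower, upper = 1.0 / (3.0 * math.e), 1.0 / 3.0
print(lower, integral, upper)   # the integral should lie between the bounds
```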
The following is extremely important in scientific and engineering computing.
THEOREM 1.3
(Taylor's theorem) Suppose that $f \in C^{n+1}[a,b]$. Let $x_0 \in [a,b]$. Then, for any $x \in [a,b]$, $f(x) = P_n(x) + R_n(x)$, where
\[ P_n(x) = f(x_0) + f'(x_0)(x - x_0) + \cdots + \frac{f^{(n)}(x_0)(x - x_0)^n}{n!} = \sum_{k=0}^{n} \frac{1}{k!} f^{(k)}(x_0)(x - x_0)^k, \]
and
\[ R_n(x) = \frac{1}{n!} \int_{x_0}^{x} f^{(n+1)}(t)(x - t)^n\,dt \quad \text{(integral form of remainder)}. \]
Furthermore, there is a $\xi = \xi(x)$ between $x_0$ and $x$ with
\[ R_n(x) = \frac{f^{(n+1)}(\xi(x))(x - x_0)^{n+1}}{(n+1)!} \quad \text{(Lagrange form of remainder)}. \]
PROOF Recall the integration by parts formula $\int u\,dv = uv - \int v\,du$. Thus,
\[
\begin{aligned}
f(x) - f(x_0) &= \int_{x_0}^{x} f'(t)\,dt \qquad (\text{let } u = f'(t),\ v = t - x,\ dv = dt) \\
&= f'(x_0)(x - x_0) + \int_{x_0}^{x} (x - t) f''(t)\,dt \qquad (\text{let } u = f''(t),\ dv = (x - t)\,dt) \\
&= f'(x_0)(x - x_0) - \left. \frac{(x - t)^2}{2} f''(t) \right|_{x_0}^{x} + \int_{x_0}^{x} \frac{(x - t)^2}{2} f'''(t)\,dt \\
&= f'(x_0)(x - x_0) + \frac{(x - x_0)^2}{2} f''(x_0) + \int_{x_0}^{x} \frac{(x - t)^2}{2} f'''(t)\,dt.
\end{aligned}
\]
Continuing this procedure,
\[
\begin{aligned}
f(x) &= f(x_0) + f'(x_0)(x - x_0) + \frac{(x - x_0)^2}{2} f''(x_0) + \cdots + \frac{(x - x_0)^n}{n!} f^{(n)}(x_0) + \int_{x_0}^{x} \frac{(x - t)^n}{n!} f^{(n+1)}(t)\,dt \\
&= P_n(x) + R_n(x).
\end{aligned}
\]
Now consider $R_n(x) = \int_{x_0}^{x} \frac{(x - t)^n}{n!} f^{(n+1)}(t)\,dt$, and assume that $x_0 < x$ (same argument if $x_0 > x$). Then, by Theorem 1.2,
\[ R_n(x) = f^{(n+1)}(\xi(x)) \int_{x_0}^{x} \frac{(x - t)^n}{n!}\,dt = f^{(n+1)}(\xi(x)) \frac{(x - x_0)^{n+1}}{(n+1)!}, \]
where $\xi$ is between $x_0$ and $x$, and thus $\xi = \xi(x)$.
Example 1.3
Approximate $\sin(x)$ by a polynomial $p(x)$ such that $|\sin(x) - p(x)| \le 10^{-16}$ for $-0.1 \le x \le 0.1$.
For Example 1.3, Taylor polynomials about $x_0 = 0$ are appropriate, since that is the center of the interval about which we wish to approximate. We observe that the terms of even degree in such a polynomial are absent, so, for $n$ even, Taylor's theorem gives

n    $P_n$                                          $R_n$
2    $x$                                            $-\frac{x^3}{3!}\cos(c_2)$
4    $x - \frac{x^3}{3!}$                           $\frac{x^5}{5!}\cos(c_4)$
6    $x - \frac{x^3}{3!} + \frac{x^5}{5!}$          $-\frac{x^7}{7!}\cos(c_6)$
...
n    --                                             $(-1)^{n/2}\frac{x^{n+1}}{(n+1)!}\cos(c_n)$

Observing that $|\cos(c_n)| \le 1$, we see that
\[ |R_n(x)| \le \frac{|x|^{n+1}}{(n+1)!}. \]
We may thus form the following table:

n     bound on error $R_n$
2     $1.67 \times 10^{-4}$
4     $8.33 \times 10^{-8}$
6     $1.98 \times 10^{-11}$
8     $2.76 \times 10^{-15}$
10    $2.51 \times 10^{-19}$

Thus, a polynomial with the required accuracy for $x \in [-0.1, 0.1]$ is
\[ p(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \frac{x^9}{9!}. \]
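The error-bound table in Example 1.3 can be reproduced directly from the bound $|x|^{n+1}/(n+1)!$ with $|x| \le 0.1$. The following Python sketch does so and picks the smallest even $n$ whose bound meets the tolerance; the function names are chosen for this illustration only.

```python
import math


def remainder_bound(n, x_max=0.1):
    """Bound |x|^(n+1) / (n+1)! on |R_n(x)| for |x| <= x_max."""
    return x_max ** (n + 1) / math.factorial(n + 1)


def p(x):
    """Degree-9 Taylor polynomial of sin about 0."""
    return x - x**3 / 6 + x**5 / 120 - x**7 / 5040 + x**9 / 362880


bounds = {n: remainder_bound(n) for n in (2, 4, 6, 8, 10)}
n_needed = min(n for n, bound in bounds.items() if bound <= 1e-16)

for n, bound in bounds.items():
    print(n, f"{bound:.2e}")
print("smallest even n meeting the tolerance:", n_needed)
print(abs(math.sin(0.1) - p(0.1)))   # limited by double precision rounding
```

In double precision the observed difference cannot show the full $10^{-19}$ accuracy, since only about 16 digits are carried.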
An important special case of Taylor's theorem is obtained with $n = 0$ (that is, directly from the Fundamental Theorem of Calculus).
THEOREM 1.4
(Mean value theorem) Suppose $f \in C^1[a,b]$, $x \in [a,b]$, and $y \in [a,b]$ (and, without loss of generality, $x \le y$). Then there is a $c \in [x,y] \subseteq [a,b]$ such that
\[ f(y) - f(x) = f'(c)(y - x). \]
Example 1.4
Suppose $f(1) = 1$ and $|f'(x)| \le 2$ for $x \in [1,2]$. What are an upper bound and a lower bound on $f(2)$?
The mean value theorem tells us that
\[ f(2) = f(1) + f'(c)(2 - 1) = f(1) + f'(c) \]
for some $c \in (1,2)$. Furthermore, the fact $|f'(x)| \le 2$ is equivalent to $-2 \le f'(x) \le 2$. Combining these facts gives
\[ 1 - 2 = -1 \le f(2) \le 1 + 2 = 3. \]
1.1.2 Big "O" Notation
We study "rates of growth" and "rates of decrease" of errors. For example, if we approximate $e^h$ by a first degree Taylor polynomial about $x = 0$, we get
\[ e^h - (1 + h) = \frac{1}{2} h^2 e^{\xi}, \]
where $\xi$ is some unknown quantity between 0 and $h$. Although we don't know exactly what $e^{\xi}$ is, we know that it is nearly constant (in this case, approximately 1) for $h$ near 0, so the error $e^h - (1 + h)$ is roughly proportional to $h^2$ for $h$ small. This approximate proportionality is often more important to know than the slowly-varying constant $e^{\xi}$. The big "O" and little "o" notation are used to describe and keep track of this approximate proportionality.
DEFINITION 1.1 Let $E(h)$ be an expression that depends on a small quantity $h$. We say that $E(h) = O(h^k)$ if there are an $\epsilon$ and $C$ such that
\[ |E(h)| \le C |h|^k \]
for all $|h| \le \epsilon$.
The "O" denotes "order." For example, if $f(h) = O(h^2)$, we say that "$f$ exhibits order 2 convergence to 0 as $h$ tends to 0."
Example 1.5
$E(h) = e^h - h - 1$. Then $E(h) = O(h^2)$.
PROOF By Taylor's Theorem,
\[ e^h = e^0 + e^0 (h - 0) + \frac{h^2}{2} e^{\xi} \]
for some $\xi$ between 0 and $h$. Thus,
\[ E(h) = e^h - 1 - h \le h^2 \frac{e^1}{2} \quad \text{and} \quad E(h) \ge 0 \]
for $|h| \le 1$; that is, $\epsilon = 1$ and $C = e/2$ work.
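As a quick numerical check of Example 1.5, the sketch below tabulates $E(h)/h^2$ for decreasing $h$. The ratios should stay below $C = e/2$ and approach $1/2$. The use of `math.expm1` (which evaluates $e^h - 1$ accurately for small $h$) is a choice made for this illustration.

```python
import math


def E(h):
    """E(h) = e^h - h - 1, written with expm1 to limit cancellation."""
    return math.expm1(h) - h


hs = [10.0 ** (-k) for k in range(0, 5)]   # 1, 0.1, ..., 1e-4
ratios = [E(h) / h**2 for h in hs]

for h, r in zip(hs, ratios):
    print(h, r)   # ratios stay below e/2 and drift toward 1/2
```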
Example 1.6
Show that $\left| \frac{f(x+h) - f(x)}{h} - f'(x) \right| = O(h)$ for $x, x+h \in [a,b]$, assuming that $f$ has two continuous derivatives at each point in $[a,b]$.
PROOF
\[
\begin{aligned}
\left| \frac{f(x+h) - f(x)}{h} - f'(x) \right|
&= \left| \frac{f(x) + f'(x) h + \int_x^{x+h} (x + h - t) f''(t)\,dt - f(x)}{h} - f'(x) \right| \\
&= \frac{1}{h} \left| \int_x^{x+h} (x + h - t) f''(t)\,dt \right|
\le \max_{a \le t \le b} |f''(t)| \, \frac{h}{2} = c h.
\end{aligned}
\]
1.1.3 Convergence Rates
DEFINITION 1.2 Let $\{x_k\}$ be a sequence with limit $x^*$. If there are constants $C$ and $\alpha$ and an integer $N$ such that $|x_{k+1} - x^*| \le C |x_k - x^*|^{\alpha}$ for $k \ge N$, we say that the rate of convergence is of order at least $\alpha$. If $\alpha = 1$ (with $C < 1$), the rate is said to be linear. If $\alpha = 2$, the rate is said to be quadratic.
Example 1.7
A sequence sometimes learned in elementary classes for computing the square root of a number $a$ is
\[ x_{k+1} = \frac{x_k}{2} + \frac{a}{2 x_k}. \]
We have
\[
\begin{aligned}
x_{k+1} - \sqrt{a} &= \frac{x_k}{2} + \frac{a}{2 x_k} - \sqrt{a} \\
&= x_k - \frac{x_k^2 - a}{2 x_k} - \sqrt{a} \\
&= (x_k - \sqrt{a}) - (x_k - \sqrt{a}) \frac{x_k + \sqrt{a}}{2 x_k} \\
&= (x_k - \sqrt{a}) \left( 1 - \frac{x_k + \sqrt{a}}{2 x_k} \right) \\
&= (x_k - \sqrt{a}) \frac{x_k - \sqrt{a}}{2 x_k} \\
&= \frac{1}{2 x_k} (x_k - \sqrt{a})^2 \\
&\approx \frac{1}{2 \sqrt{a}} (x_k - \sqrt{a})^2
\end{aligned}
\]
for $x_k$ near $\sqrt{a}$, thus showing that the convergence rate is quadratic.
Quadratic convergence is very fast. We can think of quadratic convergence with $C \approx 1$ as doubling the number of significant figures on each iteration. (In contrast, linear convergence with $C = 0.1$ adds one decimal digit of accuracy to the approximation on each iteration.) For example, if we use the square root computation from Example 1.7 with $a = 2$, and starting with $x_0 = 2$, we obtain the following table:

k    $x_k$                 $|x_k - \sqrt{2}|$          $|x_k - \sqrt{2}| / (x_{k-1} - \sqrt{2})^2$
0    2                     $0.5858 \times 10^{0}$      --
1    1.5                   $0.8579 \times 10^{-1}$     0.2500
2    1.416666666666667     $0.2453 \times 10^{-2}$     0.3333
3    1.414215686274510     $0.2123 \times 10^{-5}$     0.3529
4    1.414213562374690     $0.1594 \times 10^{-11}$    0.3535
5    1.414213562373095     $0.2204 \times 10^{-15}$    --

In this table, the correct digits are underlined. This table illustrates that the total number of correct digits more than doubles on each iteration. In fact, the multiplying factor $C$ for the quadratic convergence appears to be approaching 0.3535. (The last error ratio is not meaningful in this sense, because only roughly 16 digits were carried in the computation.) Based on our analysis, the limiting value of $C$ should be about $1/(2\sqrt{2}) \approx 0.353553390593274$. (We explain how we computed the table at the end of this chapter.)
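The iteration of Example 1.7 is short enough to try directly. The Python sketch below (not the computation used for the table in the text, which is described at the end of the chapter) runs five steps from $x_0 = 2$ in double precision and prints the error ratios $|x_k - \sqrt{2}| / (x_{k-1} - \sqrt{2})^2$.

```python
import math

a = 2.0
root = math.sqrt(a)
x = 2.0                      # starting value x_0 from the text
errors = [abs(x - root)]
for k in range(1, 6):
    x = x / 2.0 + a / (2.0 * x)      # x_{k+1} = x_k/2 + a/(2 x_k)
    errors.append(abs(x - root))

# ratio |x_k - sqrt(a)| / (x_{k-1} - sqrt(a))^2 for k = 1..4
ratios = [errors[k] / errors[k - 1] ** 2 for k in range(1, 5)]
for k, r in enumerate(ratios, start=1):
    print(k, errors[k], r)
print("predicted limit:", 1.0 / (2.0 * root))
```

The ratio for $k = 5$ is omitted because the error there is at the level of the rounding error.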
Example 1.8
As an example of linear convergence, consider the iteration
\[ x_{k+1} = x_k - \frac{x_k^2}{3.5} + \frac{2}{3.5}, \]
which converges to $\sqrt{2}$. We obtain the following table:

k     $x_k$                 $|x_k - \sqrt{2}|$          $|x_k - \sqrt{2}| / |x_{k-1} - \sqrt{2}|$
0     2                     $0.5858 \times 10^{0}$      --
1     1.428571428571429     $0.1436 \times 10^{-1}$     $0.2451 \times 10^{-1}$
2     1.416909620991254     $0.2696 \times 10^{-2}$     0.1878
3     1.414728799831946     $0.5152 \times 10^{-3}$     0.1911
4     1.414312349239392     $0.9879 \times 10^{-4}$     0.1917
5     1.414232514607664     $0.1895 \times 10^{-4}$     0.1918
6     1.414217198786659     $0.3636 \times 10^{-5}$     0.1919
7     1.414214260116949     $0.6977 \times 10^{-6}$     0.1919
8     1.414213696254626     $0.1339 \times 10^{-6}$     0.1919
19    1.414213562373097     $0.1554 \times 10^{-14}$    --

Here, the constant $C$ in the linear convergence, to four significant digits, appears to be $0.1919 \approx 1/5$. That is, the error is reduced by approximately a factor of 5 each iteration. We can think of this as obtaining one more correct base-5 digit on each iteration.
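The linear convergence in Example 1.8 can be observed with a few lines of Python. In this sketch, the printed "predicted limit" $|1 - 2\sqrt{2}/3.5|$ is the derivative of the iteration map at $\sqrt{2}$; that prediction comes from standard fixed-point analysis and is an addition of this illustration, not something derived in this section.

```python
import math

root = math.sqrt(2.0)
x = 2.0
errors = [abs(x - root)]
for k in range(1, 20):
    x = x - x * x / 3.5 + 2.0 / 3.5     # iteration from Example 1.8
    errors.append(abs(x - root))

ratios = [errors[k] / errors[k - 1] for k in range(1, 9)]
for k, r in enumerate(ratios, start=1):
    print(k, errors[k], r)
print("predicted limit:", abs(1.0 - 2.0 * root / 3.5))
```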
1.2 Computer Arithmetic
In numerical solution of mathematical problems, two common types of error are:
1. Method (algorithm or truncation) error. This is the error due to approximations made in the numerical method.
2. Rounding error. This is the error made due to the finite number of digits available on a computer.
Example 1.9
By the mean value theorem for integrals (Theorem 1.2, as in Example 1.6), if $f \in C^2[a,b]$, then
\[ f'(x) = \frac{f(x+h) - f(x)}{h} - \frac{1}{h} \int_x^{x+h} f''(t)(x + h - t)\,dt \]
and
\[ \left| \frac{1}{h} \int_x^{x+h} f''(t)(x + h - t)\,dt \right| \le c h. \]
Thus, $f'(x) \approx (f(x+h) - f(x))/h$, and the error is $O(h)$. We will call this the method error or truncation error, as opposed to roundoff errors due to using machine approximations.
Now consider $f(x) = \ln x$, and approximate $f'(3) \approx \frac{\ln(3+h) - \ln 3}{h}$ for $h$ small, using a calculator having 11 digits. The following results were obtained.

$h$           $\frac{\ln(3+h) - \ln(3)}{h}$    Error $= \frac{1}{3} - \frac{\ln(3+h) - \ln(3)}{h} = O(h)$
$10^{-1}$     0.3278982                        $5.44 \times 10^{-3}$
$10^{-2}$     0.332779                         $5.54 \times 10^{-4}$
$10^{-3}$     0.3332778                        $5.55 \times 10^{-5}$
$10^{-4}$     0.333328                         $5.33 \times 10^{-6}$
$10^{-5}$     0.333330                         $3.33 \times 10^{-6}$
$10^{-6}$     0.333300                         $3.33 \times 10^{-5}$
$10^{-7}$     0.333                            $3.33 \times 10^{-4}$
$10^{-8}$     0.33                             $3.33 \times 10^{-3}$
$10^{-9}$     0.3                              $3.33 \times 10^{-2}$
$10^{-10}$    0.0                              $3.33 \times 10^{-1}$

One sees that, in the first four steps, the error decreases by a factor of 10 as $h$ is decreased by a factor of 10. (That is, the method error dominates.) However, starting with $h = 0.00001$, the error increases. (The error due to a finite number of digits, i.e., roundoff error, dominates.)
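The same experiment can be repeated in IEEE double precision. The Python sketch below is only an illustration: because about 16 digits are carried instead of 11, the error should keep decreasing to a smaller $h$ than in the calculator table before roundoff takes over, and the individual digits will differ from those shown above.

```python
import math

x = 3.0
exact = 1.0 / x                      # derivative of ln at x = 3

errors = {}
for k in range(1, 15):
    h = 10.0 ** (-k)
    approx = (math.log(x + h) - math.log(x)) / h   # forward difference
    errors[k] = abs(exact - approx)
    print(f"h = 1e-{k:02d}  approx = {approx:.10f}  error = {errors[k]:.3e}")
```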
There are two possible ways to reduce rounding error:
1. The method error can be reduced by using a more accurate method. This allows larger $h$ to be used, thus avoiding roundoff error. Consider
\[ f'(x) = \frac{f(x+h) - f(x-h)}{2h} + \text{error}, \quad \text{where error is } O(h^2). \]

$h$      $\frac{\ln(3+h) - \ln(3-h)}{2h}$    error
0.1      0.3334568                           $1.24 \times 10^{-4}$
0.01     0.3333345                           $1.23 \times 10^{-6}$
0.001    0.3333333                           $1.91 \times 10^{-8}$

The error decreases by a factor of 100 as $h$ is decreased by a factor of 10.
2. Rounding error can be reduced by using more digits of accuracy, such as using double precision (or multiple precision) arithmetic.
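The central difference in item 1 can be tried in the same way. This Python sketch uses double precision, so for $h = 0.001$ the error it prints reflects the $O(h^2)$ method error rather than the rounding of an 11-digit calculator.

```python
import math

x = 3.0
exact = 1.0 / x

central_errors = {}
for h in (0.1, 0.01, 0.001):
    approx = (math.log(x + h) - math.log(x - h)) / (2.0 * h)   # central difference
    central_errors[h] = abs(exact - approx)
    print(h, approx, central_errors[h])
```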
To fully understand and avoid roundoff error, we should study some details of how computers and calculators represent and work with approximate numbers.
1.2.1 Floating Point Arithmetic and Rounding Error
Let $\beta$ be a positive integer, the base of the computer system. (Usually, $\beta = 2$ (binary) or $\beta = 16$ (hexadecimal).) Suppose a number $x$ has the exact base-$\beta$ representation
\[ x = (\pm 0.\alpha_1 \alpha_2 \alpha_3 \cdots \alpha_t \alpha_{t+1} \cdots) \beta^m = \pm q \beta^m, \]
where $q$ is the mantissa, $\beta$ is the base, $m$ is the exponent, $1 \le \alpha_1 \le \beta - 1$, and $0 \le \alpha_i \le \beta - 1$ for $i > 1$.
On a computer, we are restricted to a finite set of floating-point numbers $F = F(\beta, t, L, U)$ of the form $x^* = (\pm 0.a_1 a_2 \cdots a_t) \beta^m$, where $1 \le a_1 \le \beta - 1$, $0 \le a_i \le \beta - 1$ for $2 \le i \le t$, $L \le m \le U$, and $t$ is the number of digits. (In most floating point systems, $L$ is about $-64$ to $-1000$ and $U$ is about 64 to 1000.)
Example 1.10
(binary) $\beta = 2$:
\[ x^* = (0.1011)_2 \, 2^3 = \left( 1 \times \frac{1}{2} + 0 \times \frac{1}{4} + 1 \times \frac{1}{8} + 1 \times \frac{1}{16} \right) \times 8 = \frac{11}{2} = 5.5 \ \text{(decimal)}. \]
REMARK 1.1 Most numbers cannot be exactly represented on a computer. Consider $x = 10.1 = 1010.0001\,1001\,1001\ldots$ ($\beta = 2$). If $L = -127$, $U = 127$, $t = 24$, and $\beta = 2$, then $x \approx x^* = (0.10100001\,1001\,1001\,1001\,1001) 2^4$.
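Both observations can be checked with exact rational arithmetic. In the Python sketch below, `from_digits` is a hypothetical helper written for this illustration; `Fraction(10.1)` returns the exact value of the IEEE double nearest to 10.1, which shows that 10.1 itself is not stored exactly.

```python
from fractions import Fraction


def from_digits(digits, base, exponent):
    """Exact value of (0.d1 d2 ... dt)_base * base**exponent."""
    mantissa = sum(Fraction(d, base ** (i + 1)) for i, d in enumerate(digits))
    return mantissa * Fraction(base) ** exponent


x_star = from_digits([1, 0, 1, 1], base=2, exponent=3)
print(x_star, float(x_star))             # 11/2 5.5

stored = Fraction(10.1)                  # exact value of the double nearest 10.1
print(stored == Fraction(101, 10))       # False: 10.1 is not a binary fraction
print(float(abs(stored - Fraction(101, 10))))
```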
Question: Given a real number $x$, how do we define a floating point number $fl(x)$ in $F$ such that $fl(x)$ is close to $x$?
On modern machines, one of the following four ways is used to approximate a real number $x$ by a machine-representable number $fl(x)$.
round down: $fl(x) = x\!\downarrow$, the nearest machine representable number to the real number $x$ that is less than or equal to $x$.
round up: $fl(x) = x\!\uparrow$, the nearest machine number to the real number $x$ that is greater than or equal to $x$.
round to nearest: $fl(x)$ is the nearest machine number to the real number $x$.
round to zero, or "chopping": $fl(x)$ is the nearest machine number to the real number $x$ that is no farther from 0 than $x$. The term "chopping" is because we simply "chop" the expansion of the real number; that is, we simply ignore the digits in the expansion of $x$ beyond the $t$-th one.
The default on modern systems is usually round to nearest, although chopping is faster or requires less circuitry. Round down and round up may be used, but with care, to produce results from a string of computations that are guaranteed to be less than or greater than the exact result.
Example 1.11
$\beta = 10$, $t = 5$, $x = 0.12345666\cdots \times 10^7$. Then
$fl(x) = 0.12345 \times 10^7$ (chopping),
$fl(x) = 0.12346 \times 10^7$ (rounded to nearest).
(In this case, round down corresponds to chopping and round up corresponds to round to nearest.)
See Figure 1.2 for an example with $\beta = 10$ and $t = 1$. In that figure, the exhibited floating point numbers are $(0.1) \times 10^1$, $(0.2) \times 10^1$, $\ldots$, $(0.9) \times 10^1$, $0.1 \times 10^2$.
[Figure: number line from $\beta^{m-1} = 1$ to $\beta^m = 10^1$, with tick marks at the successive floating point numbers, spaced $\beta^{m-t} = 10^0 = 1$ apart.]
FIGURE 1.2: An example floating point system: $\beta = 10$, $t = 1$, and $m = 1$.
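The four rounding modes can be emulated for decimal arithmetic with Python's `decimal` module. This sketch rounds the number of Example 1.11 (truncated to finitely many digits for the demonstration) to $t = 5$ significant digits. The mapping from the names in the text to `decimal` rounding constants, and the helper `fl`, are choices made for this illustration.

```python
from decimal import (Context, Decimal, ROUND_CEILING, ROUND_DOWN,
                     ROUND_FLOOR, ROUND_HALF_EVEN)

x = Decimal("1234566.6")      # 0.12345666 x 10^7, truncated for the demo

MODES = {
    "round down": ROUND_FLOOR,
    "round up": ROUND_CEILING,
    "round to nearest": ROUND_HALF_EVEN,
    "chopping": ROUND_DOWN,
}


def fl(value, t, mode):
    """Round value to t significant decimal digits in the given mode."""
    return Context(prec=t, rounding=MODES[mode]).plus(value)


for mode in MODES:
    print(mode, fl(x, 5, mode))
```

For a negative argument, round down and chopping differ: round down moves away from zero, while chopping moves toward it.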
Example 1.12
Let $a = 0.410$, $b = 0.000135$, and $c = 0.000431$. Assuming 3-digit decimal computer arithmetic with rounding to nearest, does $a + (b + c) = (a + b) + c$ when using this arithmetic?
Following the "rounding to nearest" definition of $fl$, we emulate the operations a machine would do, as follows:
\[ a \leftarrow 0.410 \times 10^0, \quad b \leftarrow 0.135 \times 10^{-3}, \quad c \leftarrow 0.431 \times 10^{-3}, \]
and
\[
\begin{aligned}
fl(b + c) &= fl(0.135 \times 10^{-3} + 0.431 \times 10^{-3}) \\
&= fl(0.566 \times 10^{-3}) \\
&= 0.566 \times 10^{-3},
\end{aligned}
\]
so
\[
\begin{aligned}
fl(a + 0.566 \times 10^{-3}) &= fl(0.410 \times 10^0 + 0.566 \times 10^{-3}) \\
&= fl(0.410 \times 10^0 + 0.000566 \times 10^0) \\
&= fl(0.410566 \times 10^0) \\
&= 0.411 \times 10^0.
\end{aligned}
\]
On the other hand,
\[
\begin{aligned}
fl(a + b) &= fl(0.410 \times 10^0 + 0.135 \times 10^{-3}) \\
&= fl(0.410000 \times 10^0 + 0.000135 \times 10^0) \\
&= fl(0.410135 \times 10^0) \\
&= 0.410 \times 10^0,
\end{aligned}
\]
so
\[
\begin{aligned}
fl(0.410 \times 10^0 + c) &= fl(0.410 \times 10^0 + 0.431 \times 10^{-3}) \\
&= fl(0.410 \times 10^0 + 0.000431 \times 10^0) \\
&= fl(0.410431 \times 10^0) \\
&= 0.410 \times 10^0 \ne 0.411 \times 10^0.
\end{aligned}
\]
Thus, the associative law does not hold for floating point arithmetic with "round to nearest." Furthermore, this illustrates that accuracy is improved if numbers of like magnitude are added first in a sum.
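Example 1.12 can be replayed with Python's `decimal` module, using a context with 3 significant digits and round to nearest as a stand-in for the 3-digit machine (an emulation chosen for this illustration).

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

ctx = Context(prec=3, rounding=ROUND_HALF_EVEN)   # 3-digit decimal arithmetic

a, b, c = Decimal("0.410"), Decimal("0.000135"), Decimal("0.000431")

left = ctx.add(a, ctx.add(b, c))     # a + (b + c)
right = ctx.add(ctx.add(a, b), c)    # (a + b) + c

print(left, right, left == right)    # 0.411 0.410 False
```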
The following error bound is useful in some analyses.
THEOREM 1.5
\[ |x - fl(x)| \le \frac{1}{2} |x| \beta^{1-t} p, \]
where $p = 1$ for rounding and $p = 2$ for chopping.
DEFINITION 1.3 $\delta = \frac{p}{2} \beta^{1-t}$ is called the unit roundoff error.
Let $\epsilon = \frac{fl(x) - x}{x}$. Then $fl(x) = (1 + \epsilon) x$, where $|\epsilon| \le \delta$. With this, we have the following.
THEOREM 1.6
Let $\odot$ denote the operation $+$, $-$, $\times$, or $\div$, and let $x$ and $y$ be machine numbers. Then
\[ fl(x \odot y) = (x \odot y)(1 + \epsilon), \quad \text{where } |\epsilon| \le \delta = \frac{p}{2} \beta^{1-t}. \]
Roundoff error that accumulates as a result of a sequence of arithmetic operations can be analyzed using this theorem. Such an analysis is called forward error analysis.
Because of the properties of floating point arithmetic, it is unreasonable to demand strict tolerances when the exact result is too large.
Example 1.13
Suppose $\beta = 10$ and $t = 3$ (3-digit decimal arithmetic), and suppose we wish to compute $10^4 \pi$ with a computed value $x$ such that $|10^4 \pi - x| < 10^{-2}$. The closest floating point number in our system to $10^4 \pi$ is $x = 0.314 \times 10^5 = 31400$. However, $|10^4 \pi - x| = 15.926\ldots$. Hence, it is impossible to find a number $x$ in the system with $|10^4 \pi - x| < 10^{-2}$.
The error $|10^4 \pi - x|$ in this example is called the absolute error in approximating $10^4 \pi$. We see that absolute error is not an appropriate measure of error when using floating point arithmetic. For this reason, we use relative error:
DEFINITION 1.4 Let $x^*$ be an approximation to $x$. Then $|x - x^*|$ is called the absolute error, and $\left| \frac{x - x^*}{x} \right|$ is called the relative error.
For example,
\[ \left| \frac{x - fl(x)}{x} \right| \le \delta = \frac{p}{2} \beta^{1-t} \quad \text{(unit roundoff error)}. \]
1.2.1.1 Expression Evaluation and Condition Numbers
We now examine some common situations in which roundoff error can become large, and explain how to avoid many of these situations.
Example 1.14
$\beta = 10$, $t = 4$, $p = 1$. (Thus, $\delta = \frac{1}{2} 10^{-3} = 0.0005$.) Let $x = 0.5795 \times 10^5$, $y = 0.6399 \times 10^5$. Then
$fl(x + y) = 0.1219 \times 10^6 = (x + y)(1 + \epsilon_1)$, $\epsilon_1 \approx -3.28 \times 10^{-4}$, $|\epsilon_1| < \delta$, and
$fl(xy) = 0.3708 \times 10^{10} = (xy)(1 + \epsilon_2)$, $\epsilon_2 \approx -5.95 \times 10^{-5}$, $|\epsilon_2| < \delta$.
(Note: $x + y = 0.12194 \times 10^6$, $xy = 0.37082205 \times 10^{10}$.)
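The relative errors in Example 1.14 can be recomputed with a 4-digit `decimal` context standing in for the machine; the exact values are formed at Python's default 28-digit decimal precision. The variable names are assumptions of this sketch.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

ctx = Context(prec=4, rounding=ROUND_HALF_EVEN)   # beta = 10, t = 4, p = 1
delta = Decimal("0.0005")                          # unit roundoff (1/2) * 10^(1-4)

x, y = Decimal("57950"), Decimal("63990")

fl_sum, fl_prod = ctx.add(x, y), ctx.multiply(x, y)
eps1 = (fl_sum - (x + y)) / (x + y)      # exact arithmetic at default precision
eps2 = (fl_prod - x * y) / (x * y)

print(fl_sum, eps1)
print(fl_prod, eps2)
```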
Example 1.15
Suppose $\beta = 10$ and $t = 4$ (4 digit arithmetic), and suppose $x_1 = 10000$ and $x_2 = x_3 = \cdots = x_{1001} = 1$. Then
\[
\begin{aligned}
fl(x_1 + x_2) &= 10000, \\
fl(x_1 + x_2 + x_3) &= 10000, \\
&\;\;\vdots \\
fl\left( \sum_{i=1}^{1001} x_i \right) &= 10000
\end{aligned}
\]
when we sum forward from $x_1$. But going backwards,
\[
\begin{aligned}
fl(x_{1001} + x_{1000}) &= 2, \\
fl(x_{1001} + x_{1000} + x_{999}) &= 3, \\
&\;\;\vdots \\
fl\left( \sum_{i=1001}^{1} x_i \right) &= 11000,
\end{aligned}
\]
which is the correct sum.
This example illustrates the point that large relative errors occur when a large number of small numbers is added to a large number, or when a very large number of small, almost-equal numbers is added. To avoid such large relative errors, one can sum from the smallest number to the largest number. However, this will not work if the numbers are all approximately equal. In such cases, one possibility is to group the numbers into sets of two adjacent numbers, summing two almost equal numbers together. One then groups those results into sets of two and sums these together, continuing until the total sum is reached. In this scheme, two almost equal numbers are always being summed, and the large relative error from repeatedly summing a small number to a large number is avoided.
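The following Python sketch replays Example 1.15 in emulated 4-digit decimal arithmetic and then tries the pairing scheme described above on 20000 ones. The helper names `fl_sum` and `pairwise_sum`, and the choice of 20000 terms, are assumptions of this illustration.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

ctx = Context(prec=4, rounding=ROUND_HALF_EVEN)   # 4-digit decimal arithmetic


def fl_sum(values):
    """Left-to-right sum, rounding after every addition."""
    total = Decimal(0)
    for v in values:
        total = ctx.add(total, v)
    return total


def pairwise_sum(values):
    """Repeatedly add adjacent pairs until one number remains."""
    values = list(values)
    while len(values) > 1:
        paired = [ctx.add(values[i], values[i + 1])
                  for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:              # carry an unpaired last element along
            paired.append(values[-1])
        values = paired
    return values[0]


xs = [Decimal(10000)] + [Decimal(1)] * 1000
forward = fl_sum(xs)                 # stalls at 10000
backward = fl_sum(reversed(xs))      # small terms first gives 11000
print(forward, backward)

ones = [Decimal(1)] * 20000
print(fl_sum(ones), pairwise_sum(ones))   # naive sum stalls; pairing stays close to 20000
```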
Example 1.16
$x_1 = 15.314768$, $x_2 = 15.314899$, $\beta = 10$, $t = 6$ (6-digit decimal accuracy). Then $x_2 - x_1 \approx fl(x_2) - fl(x_1) = 15.3149 - 15.3148 = 0.0001$. Thus,
\[ \left| \frac{x_2 - x_1 - (fl(x_2) - fl(x_1))}{x_2 - x_1} \right| = \frac{0.000131 - 0.0001}{0.000131} = 0.237, \]
that is, a relative error of 23.7%.
This example illustrates that large relative errors can occur when two nearly equal numbers are subtracted on a computer. Sometimes, an algorithm can be modified to reduce rounding error occurring from this source, as the following example illustrates.
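The cancellation in Example 1.16 can be reproduced with a 6-digit `decimal` context (an emulation chosen for this illustration): each operand is first rounded to 6 digits, and only then are the two subtracted.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

ctx = Context(prec=6, rounding=ROUND_HALF_EVEN)   # 6-digit decimal arithmetic

x1, x2 = Decimal("15.314768"), Decimal("15.314899")
computed = ctx.subtract(ctx.plus(x2), ctx.plus(x1))   # fl(x2) - fl(x1)
exact = x2 - x1

rel_err = abs((exact - computed) / exact)
print(computed, exact, rel_err)
```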
Example 1.17
Consider finding the roots of $ax^2 + bx + c = 0$, where $b^2$ is large compared with $4ac$. The most common formula for the roots is
\[ x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}. \]
Consider $x^2 + 100x + 1 = 0$, $\beta = 10$, $t = 4$, $p = 2$, and 4-digit chopped arithmetic. Then
\[ x_1 = \frac{-100 + \sqrt{9996}}{2}, \quad x_2 = \frac{-100 - \sqrt{9996}}{2}, \]
but $\sqrt{9996} \approx 99.97$ (4 digit arithmetic, chopped). Thus,
\[ x_1 \approx \frac{-100 + 99.97}{2}, \quad x_2 \approx \frac{-100 - 99.97}{2}. \]
Hence, $x_1 \approx -0.015$, $x_2 \approx -99.98$, but $x_1 = -0.010001\ldots$ and $x_2 = -99.989999\ldots$, so the relative errors in $x_1$ and $x_2$ are 50% and 0.01%, respectively.
Let's change the algorithm. Assume $b \ge 0$ (can always make $b \ge 0$). Then
\[
x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \cdot \frac{-b - \sqrt{b^2 - 4ac}}{-b - \sqrt{b^2 - 4ac}}
= \frac{4ac}{2a(-b - \sqrt{b^2 - 4ac})}
= \frac{-2c}{b + \sqrt{b^2 - 4ac}},
\]
and
\[ x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a} \quad \text{(the same as before)}. \]
Then, for the above values,
\[ x_1 = \frac{-2(1)}{100 + \sqrt{9996}} \approx \frac{-2}{100 + 99.97} = -0.0100. \]
Now, the relative error in $x_1$ is also 0.01%.
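The rearranged formula carries over directly to double precision. In the Python sketch below, the function names are assumptions of this illustration, and the second test case ($b = 10^8$) is an added example chosen because, with about 16 digits, the cancellation in the usual formula only becomes visible for a much larger $b$ than in the 4-digit example.

```python
import math


def roots_textbook(a, b, c):
    """Both roots from the usual quadratic formula."""
    d = math.sqrt(b * b - 4.0 * a * c)
    return (-b + d) / (2.0 * a), (-b - d) / (2.0 * a)


def roots_rearranged(a, b, c):
    """Rearranged formula for x1; assumes b >= 0 and real roots."""
    d = math.sqrt(b * b - 4.0 * a * c)
    return -2.0 * c / (b + d), (-b - d) / (2.0 * a)


print(roots_rearranged(1.0, 100.0, 1.0))

# In double precision the cancellation only shows for a much larger b.
naive_x1, _ = roots_textbook(1.0, 1e8, 1.0)
stable_x1, _ = roots_rearranged(1.0, 1e8, 1.0)
print(naive_x1, stable_x1)     # the true small root is close to -1e-8
```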
Let us now consider error in function evaluation. Consider a single-valued function $f(x)$, and let $x^* = fl(x)$ be the floating point approximation of $x$. Therefore, the machine evaluates $f(x^*) = f(fl(x))$, which is an approximation to $f(x)$. Then, the perturbation in $f(x)$ for small perturbations in $x$ can be computed via Taylor's formula. This is illustrated in the next theorem.
THEOREM 1.7 The relative error in functional evaluation is
\[ \left| \frac{f(x) - f(x^*)}{f(x)} \right| \approx \left| \frac{x f'(x)}{f(x)} \right| \left| \frac{x - x^*}{x} \right|. \]
PROOF The linear Taylor approximation of $f(x^*)$ about $x$, for small values of $|x - x^*|$, is given by $f(x^*) \approx f(x) + f'(x)(x^* - x)$. Rearranging the terms immediately yields the result.
This leads us to the following definition.
DEFINITION 1.5 The condition number of a function $f(x)$ is
\[ \kappa_f(x) = \left| \frac{x f'(x)}{f(x)} \right|. \]
The condition number describes how large the relative error in function evaluation is with respect to the relative error in the machine representation of $x$. In other words, $\kappa_f(x)$ is a measure of the degree of sensitivity of the function at $x$.
Example 1.18
Let $f(x) = \sqrt{x}$. The condition number of $f(x)$ about $x$ is
\[ \kappa_f(x) = \left| \frac{x \, \frac{1}{2\sqrt{x}}}{\sqrt{x}} \right| = \frac{1}{2}. \]
This suggests that $f(x)$ is well-conditioned.
Example 1.19
Let $f(x) = \sqrt{x - 2}$. The condition number of $f(x)$ about $x$ is
\[ \kappa_f(x) = \left| \frac{x}{2(x - 2)} \right|. \]
This is not defined at $x^* = 2$. Hence, the function $f(x)$ is numerically unstable and ill-conditioned for values of $x$ close to 2.
REMARK 1.2 If $x = f(x) = 0$, then the condition number is simply $|f'(x)|$. If $x = 0$, $f(x) \ne 0$ (or $f(x) = 0$, $x \ne 0$), then it is more useful to consider the relation between absolute errors than relative errors. The condition number then becomes $|f'(x)/f(x)|$.
REMARK 1.3 Generally, if a numerical approximation $\tilde{z}$ to a quantity $z$ is computed, the relative error is related to the number of digits after the decimal point that are correct. For example, if $z = 0.0000123453$ and $\tilde{z} = 0.00001234543$, we say that $\tilde{z}$ is correct to 5 significant digits. Expressing $z$ as $0.123453 \times 10^{-4}$ and $\tilde{z}$ as $0.123454 \times 10^{-4}$, we see that if we round $\tilde{z}$ to the nearest number with five digits in its mantissa, all of those digits are correct, whereas, if we do the same with six digits, the sixth digit is not correct. Significant digits is the more logical way to talk about accuracy in a floating point computation, where we are interested in relative error, rather than "number of digits after the decimal point," which can have a different meaning. (Here, one might say that $\tilde{z}$ is correct to 9 digits after the decimal point.)
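The condition numbers of Examples 1.18 and 1.19 are easy to evaluate, and the estimate of Theorem 1.7 can be compared with an actual perturbation. In this Python sketch, the helper `condition_number`, the sample points, and the relative perturbation $10^{-8}$ are assumptions made for the illustration.

```python
import math


def condition_number(f, fprime, x):
    """kappa_f(x) = |x f'(x) / f(x)| from Definition 1.5."""
    return abs(x * fprime(x) / f(x))


kappa_sqrt = condition_number(math.sqrt, lambda x: 0.5 / math.sqrt(x), 9.0)

g = lambda x: math.sqrt(x - 2.0)
gprime = lambda x: 0.5 / math.sqrt(x - 2.0)
kappa_near_2 = condition_number(g, gprime, 2.001)

# Compare the actual relative error in g with the estimate of Theorem 1.7.
x, x_star = 2.001, 2.001 * (1.0 + 1e-8)
actual = abs((g(x) - g(x_star)) / g(x))
estimate = kappa_near_2 * abs((x - x_star) / x)
print(kappa_sqrt, kappa_near_2, actual, estimate)
```

Near $x = 2$ the condition number is about 1000, so a relative perturbation of $10^{-8}$ in $x$ is magnified to roughly $10^{-5}$ in $f(x)$.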
1.2.2 Practicalities and the IEEE Floating Point Standard
Prior to 1985, different machines used different word lengths and different bases, and different machines rounded, chopped, or did something else to form the internal representation $fl(x)$ for real numbers $x$. For example, IBM mainframes generally used hexadecimal arithmetic ($\beta = 16$), with 8 hexadecimal digits total (for the base, sign, and exponent) in "single precision" numbers and 16 hexadecimal digits total in "double precision" numbers. Machines such as the Univac 1108 and Honeywell Multics systems used base $\beta = 2$ and 36 binary digits (or "bits") total in single precision numbers and 72 binary digits total in double precision numbers. An unusual machine designed at Moscow State University from 1955-1965, the "Setun," even used base-3 ($\beta = 3$, or "ternary") numbers. Some computers had 32 bits total in single precision numbers and 64 bits total in double precision numbers, while some "supercomputers" (such as the Cray-1) had 64 bits total in single precision numbers and 128 bits total in double precision numbers.
Some hand-held calculators in existence today (such as some Texas Instruments calculators) can be viewed as implementing decimal (base 10, $\beta = 10$) arithmetic, say, with $L = -999$ and $U = 999$, and $t = 14$ digits in the mantissa.
Except for the Setun (the value of whose ternary digits corresponded to "positive," "negative," and "neutral" in circuit elements or switches), digital computers are mostly based on binary switches or circuit elements (that is, "on" or "off"), so the base $\beta$ is usually 2 or a power of 2. For example, the IBM hexadecimal digit could be viewed as a group of 4 binary digits.$^2$
Older floating point implementations did not even always fit exactly into the model we have previously described. For example, if $x$ is a number in the system, then $-x$ may not have been a number in the system, or, if $x$ were a number in the system, then $1/x$ may have been too large to be representable in the system.
To promote predictability, portability, reliability, and rigorous error bounding in floating point computations, the Institute of Electrical and Electronics Engineers (IEEE) and American National Standards Institute (ANSI) published a standard for binary floating point arithmetic in 1985: IEEE/ANSI 754-1985: Standard for Binary Floating Point Arithmetic, often referenced as "IEEE-754," or simply "the IEEE standard."$^3$ Almost all computers in existence today, including personal computers and workstations based on Intel, AMD, Motorola, etc. chips, implement most of the IEEE standard.
In this standard, $\beta = 2$, 32 bits total are used in a single precision number (an "IEEE single"), and 64 bits total are used for a double precision number ("IEEE double"). In a single precision number, 1 bit is used for the sign, 8 bits are used for the exponent, and $t = 23$ bits are used for the mantissa. In double precision numbers, 1 bit is used for the sign, 11 bits are used for the exponent, and 52 bits are used for the mantissa. Thus, for single precision numbers, the exponent is between 0 and $(11111111)_2 = 255$, and 127 is subtracted from this, to get an exponent between $-127$ and 128. In IEEE numbers, the minimum and maximum exponent are used to denote special symbols (such as infinity and "unnormalized" numbers), so the exponent in single precision represents magnitudes between $2^{-126} \approx 10^{-38}$ and $2^{127} \approx 10^{38}$. The mantissa for single precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{23} 2^{-i} = 2(1 - 2^{-24}) \approx 2$. Similarly, the exponent for double precision numbers is effectively between $2^{-1022} \approx 10^{-308}$ and $2^{1023} \approx 10^{308}$, while the mantissa for double precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{52} 2^{-i} \approx 2$.
Summarizing, the parameters for IEEE arithmetic appear in Table 1.1.
In many numerical computations, such as solving the large linear systems arising from partial differential equation models, more digits or a larger exponent range is required than is available with IEEE single precision. For this reason, many numerical analysts at present have adopted IEEE double precision as the default precision. For example, underlying computations in the popular computational environment matlab are done in IEEE double precision.

TABLE 1.1: Parameters for IEEE arithmetic
precision    $\beta$    $L$        $U$     $t$
single       2          $-126$     127     24
double       2          $-1022$    1023    53

$^2$ An exception is in some systems for business calculations, where base 10 is implemented.
$^3$ An update to the 1985 standard was made in 2008. This update gives clarifications of certain ambiguous points, provides certain extensions, and specifies a standard for decimal arithmetic.
IEEE arithmetic provides four ways of defining $fl(x)$, that is, four "rounding modes," namely "round down," "round up," "round to nearest," and "round to zero." These are specified as follows: the four elementary operations $+$, $-$, $\times$, and $\div$ must be such that $fl(x \odot y)$ is implemented for all four rounding modes, for $\odot \in \{-, +, \times, \div, \sqrt{\cdot}\}$.

The default mode (if the rounding mode is not explicitly set) is normally "round to nearest," to give an approximation after a long string of computations that is hopefully near the exact value. If the mode is set to "round down" and a string of computations is done, then the result is less than or equal to the exact result. Similarly, if the mode is set to "round up," then the result of a string of computations is greater than or equal to the exact result. In this way, mathematically rigorous bounds on an exact result can be obtained. (This technique must be used astutely, since naive use could result in bounds that are too large to be meaningful.)
Several parameters more directly related to numerical computations than $L$, $U$, and $t$ are associated with any floating point number system. These are

HUGE: the largest representable number in the floating point system,

TINY: the smallest positive (normalized) representable number in the floating point system,

$\epsilon_m$: the machine epsilon, the smallest positive number which, when added to 1, gives something other than 1 when using the rounding mode "round to nearest."

These so-called "machine constants" appear in Table 1.2 for the IEEE single and IEEE double precision number systems.

TABLE 1.2: Machine constants for IEEE arithmetic

Precision    HUGE                                                          TINY                                          $\epsilon_m$
single       $(2 - 2^{-23}) \cdot 2^{127} \approx 3.40 \cdot 10^{38}$      $2^{-126} \approx 1.18 \cdot 10^{-38}$        $2^{-24} + 2^{-47} \approx 5.96 \cdot 10^{-8}$
double       $(2 - 2^{-52}) \cdot 2^{1023} \approx 1.80 \cdot 10^{308}$    $2^{-1022} \approx 2.23 \cdot 10^{-308}$      $2^{-53} + 2^{-105} \approx 1.11 \cdot 10^{-16}$

For IEEE arithmetic, 1/TINY < HUGE, but 1/HUGE < TINY. This brings up the question of what happens when the result of a computation has absolute value less than the smallest number representable in the system, or has absolute value greater than the largest number representable in the system. In the first case, an underflow occurs, while in the second case, an overflow occurs. In floating point computations, it is usually (but not always) reasonable to replace the result of an underflow by 0, but it is usually more problematical when an overflow occurs. Many systems prior to the IEEE standard replaced an underflow by 0, but stopped when an overflow occurred.
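The double precision machine constants can be checked directly in matlab. The following is a minimal sketch, not part of the original text; the variable names are ours. Note that matlab's eps(1) returns the spacing $2^{-52}$ between 1 and the next larger double, which is roughly twice the $\epsilon_m$ defined above.

```matlab
% Machine constants for IEEE double precision, checked numerically.
HUGE = realmax;              % largest finite double
TINY = realmin;              % smallest positive normalized double
tie  = 2^-53;                % exactly halfway between 1 and the next double
epsm = 2^-53 + 2^-105;       % smallest double larger than 2^-53

is_one_at_tie  = (1 + tie == 1);    % the tie rounds to the even neighbor, 1
moves_off_one  = (1 + epsm > 1);    % anything above the tie rounds up
spacing_at_one = eps(1);            % gap from 1 to the next double, 2^-52

fprintf('1/TINY < HUGE: %d,  1/HUGE < TINY: %d\n', 1/TINY < HUGE, 1/HUGE < TINY);
fprintf('1 + 2^-53 == 1: %d,  1 + epsm > 1: %d\n', is_one_at_tie, moves_off_one);
```

The tie case shows why $\epsilon_m$ is slightly larger than $2^{-53}$: under "round to nearest," a sum that falls exactly halfway is rounded to the neighbor with an even last bit, which here is 1 itself.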
The IEEE standard specifies representations for special numbers $\infty$, $-\infty$, $+0$, $-0$, and NaN, where the latter represents "not a number." The standard specifies that computations do not stop when an overflow or underflow occurs, or when quantities such as $\sqrt{-1}$, $1/0$, $-1/0$, etc. are encountered (although many programming languages by default or optionally do stop). For example, the result of an overflow is set to $\infty$, whereas the result of $\sqrt{-1}$ is set to NaN, and computation continues. The standard also specifies "gradual underflow," that is, setting the result to a "denormalized" number, or a number in the floating point format whose first digit in the mantissa is equal to 0. Computation rules for these special numbers, such as NaN $\times$ any number = NaN, $\infty$ $\times$ any positive normalized number = $\infty$, allow such "nonstop" arithmetic.

Although the IEEE nonstop arithmetic is useful in many contexts, the numerical analyst should be aware of it and be cautious in interpreting results. In particular, algorithms may not behave as expected if many intermediate results contain $\infty$ or NaN, and the accuracy is less than expected when denormalized numbers are used. In fact, many programming languages, by default or with a controllable option, stop if $\infty$ or NaN occurs, but implement IEEE nonstop arithmetic with an option.
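As an illustration, the following sketch (ours, not from the original text) exercises the nonstop rules in matlab double precision. It uses 0/0 rather than $\sqrt{-1}$ to produce a NaN, because matlab's sqrt returns a complex number for negative arguments.

```matlab
% IEEE "nonstop" arithmetic in matlab: none of these lines raises an error.
over    = realmax * 10;      % overflow        -> Inf
undef   = 0 / 0;             % undefined       -> NaN
spread  = undef * 3 + 1;     % NaN propagates through later operations
scaled  = Inf * 2;           % Inf times a positive number stays Inf
denorm  = realmin / 4;       % gradual underflow: nonzero, but below realmin
flushed = realmin * eps / 2; % below the smallest denormalized number -> 0

disp([over, undef, spread, scaled, denorm, flushed]);
```

Because a single NaN or Inf silently contaminates everything computed from it, a check such as any(isnan(v)) on intermediate results is often worthwhile when debugging.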
Example 1.20

IEEE double precision floating point arithmetic underlies most computations in matlab. (This is true even if only four decimal digits are displayed.) One obtains the machine epsilon with the function eps, one obtains TINY with the function realmin, and one obtains HUGE with the function realmax. Observe the following matlab dialog:

    >> epsm = eps(1d0)
    epsm = 2.2204e-016
    >> TINY = realmin
    TINY = 2.2251e-308
    >> HUGE = realmax
    HUGE = 1.7977e+308
    >> 1/TINY
    ans = 4.4942e+307
    >> 1/HUGE
    ans = 5.5627e-309
    >> HUGE^2
    ans = Inf
    >> TINY^2
    ans = 0
    >> new_val = 1+epsm
    new_val = 1.0000
    >> new_val - 1
    ans = 2.2204e-016
    >> too_small = epsm/2
    too_small = 1.1102e-016
    >> not_new = 1+too_small
    not_new = 1
    >> not_new - 1
    ans = 0
    >>
Example 1.21

(Illustration of underflow and overflow) Suppose, for the purposes of illustration, we have a system with $\beta = 10$, $t = 2$, and one digit in the exponent, so that the positive numbers in the system range from $0.10 \times 10^{-9}$ to $0.99 \times 10^{9}$, and suppose we wish to compute $N = \sqrt{x_1^2 + x_2^2}$, where $x_1 = x_2 = 10^6$. Then both $x_1$ and $x_2$ are exactly represented in the system, and the nearest floating point number in the system to $N$ is $0.14 \times 10^7$, well within range. However, $x_1^2 = 10^{12}$, larger than the maximum floating point number in the system.

In older systems, an overflow usually would result in stopping the computation, while in IEEE arithmetic, the result would be assigned the symbol "Infinity." The result of adding "Infinity" to "Infinity," then taking the square root, would be "Infinity," so that $N$ would be assigned "Infinity." Similarly, if $x_1 = x_2 = 10^{-6}$, then $x_1^2 = 10^{-12}$, smaller than the smallest representable machine number, causing an "underflow." On older systems, the result is usually set to 0. On IEEE systems, if "gradual underflow" is switched on, the result either becomes a denormalized number, with less than full accuracy, or is set to 0; without gradual underflow on IEEE systems, the result is set to 0. When the result is set to 0, a value of 0 is stored in $N$, whereas the closest floating point number in the system is $0.14 \times 10^{-5}$, well within range. To avoid this type of catastrophic underflow and overflow in the computation of $N$, we may use the following scheme.

1. $s \leftarrow \max\{|x_1|, |x_2|\}$.

2. $\eta_1 \leftarrow x_1/s$; $\eta_2 \leftarrow x_2/s$.

3. $N \leftarrow s\sqrt{\eta_1^2 + \eta_2^2}$.
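The three-step scheme can be sketched in matlab as follows. This is our illustration in IEEE double precision rather than the two-digit decimal system of the example, so the inputs are chosen as $10^{200}$ to force an overflow; the guard for $s = 0$ is our addition.

```matlab
% Scaled evaluation of N = sqrt(x1^2 + x2^2), following steps 1-3 above.
x1 = 1e200;  x2 = 1e200;          % squares overflow in double precision

naive = sqrt(x1^2 + x2^2);        % x1^2 overflows to Inf, so naive is Inf

s = max(abs(x1), abs(x2));        % step 1
if s == 0
    N = 0;                        % guard (our addition): avoids 0/0
else
    eta1 = x1 / s;                % step 2: both ratios lie in [-1, 1]
    eta2 = x2 / s;
    N = s * sqrt(eta1^2 + eta2^2);    % step 3
end

fprintf('naive = %g, scaled = %g\n', naive, N);
```

The matlab built-in hypot(x1, x2) is designed to avoid the same overflow and underflow, and is usually preferable to hand-written scaling in production code.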
1.2.2.1 Input and Output

For examining the output to large numerical computations arising from mathematical models, plots, graphs, and movies comprised of such plots and graphs are often preferred over tables of values. However, to develop such models and study numerical algorithms, it is necessary to examine individual numbers. Because humans are trained to comprehend decimal numbers more easily than binary numbers, the binary format used in the machine is usually converted to a decimal format for display or printing. In many programming languages and environments (such as all versions of Fortran, C, C++, and in matlab), the format is of a form similar to $\pm d_1.d_2d_3 \ldots d_m \mathrm{e} \pm \delta_1\delta_2\delta_3$ or $\pm d_1.d_2d_3 \ldots d_m \mathrm{E} \pm \delta_1\delta_2\delta_3$, where the "e" or "E" denotes the "exponent" of 10. For example, -1.00e+003 denotes $-1 \times 10^3 = -1000$. Numbers are usually also input either in a standard decimal form (such as 0.001) or in this exponential format (such as 1.0e-3). (This notation originates from the earliest computers, where the only output was a printer, and the printer could only print numerical digits and the 26 upper case letters in the Roman alphabet.)

Thus, for input, a decimal fraction needs to be converted to a binary floating point number, while for output, a binary floating point number needs to be converted to a decimal fraction. This conversion necessarily is inexact. For example, the exact decimal fraction 0.1 converts to the infinitely repeating binary expansion $(0.0\overline{0011})_2$, which needs to be rounded into the binary floating point system. The IEEE 754 standard specifies that the result of a decimal to binary conversion, within a specified range of input formats, be the nearest floating point number to the exact result, over a specified range, and that, within a specified range of formats, a binary to decimal conversion be the nearest number in the specified format (which depends on the number $m$ of decimal digits requested to be printed).
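A short matlab sketch (ours, with illustrative variable names) makes the consequence of inexact decimal to binary conversion visible: the decimal identity 0.1 + 0.2 = 0.3 fails for the stored binary values.

```matlab
% The decimal literals below are rounded to the nearest double on input,
% so identities that hold for exact decimals can fail in the machine.
a = 0.1;  b = 0.2;  c = 0.3;

same = (a + b == c);              % false: the rounded values do not add up
gap  = (a + b) - c;               % one unit in the last place of 0.3

fprintf('a + b == c is %d, difference %g\n', same, gap);
fprintf('0.1 printed with 20 significant digits: %.20g\n', a);
```

For this reason, comparisons of computed floating point values are usually written with a tolerance, for example abs((a + b) - c) <= 4*eps(c), rather than with ==.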
Thus, the number that one sees as output is usually not exactly the number that is represented in the computer. Furthermore, while the floating point operations on binary numbers are usually implemented in hardware or "firmware" independently of the software system, the decimal to binary and binary to decimal conversions are usually implemented separately as part of the programming language (such as Fortran, C, C++, Java, etc.) or software system (such as matlab). The individual standards for these languages, if there are any, may not specify accuracy for such conversions, and the languages sometimes do not conform to the IEEE standard. That is, the number that one sees printed may not even be the closest number in that format to the actual number.

This inexactness in conversion usually does not cause a problem, but may cause much confusion in certain instances. In those instances (such as in "debugging," or finding programming blunders), one may need to examine the binary numbers directly. One way of doing this is in an "octal," or base-8, format, in which each digit (between 0 and 7) is interpreted as a group of three binary digits, or in hexadecimal format (where the digits are 0-9, A, B, C, D, E, F), in which each digit corresponds to a group of four binary digits.
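In matlab, the hexadecimal form of a stored double can be examined with num2hex. The sketch below is ours; the splitting of the 16 hexadecimal digits into fields assumes the IEEE double layout of 1 sign bit, 11 exponent bits, and 52 stored mantissa bits.

```matlab
% Inspect the stored bit pattern of a double as 16 hexadecimal digits.
h = num2hex(0.1);                 % character vector, 4 bits per digit

sign_and_exponent = h(1:3);       % 1 sign bit + 11 exponent bits
mantissa_digits   = h(4:end);     % 52 stored mantissa bits

bits = dec2bin(hex2dec(h(4)), 4); % expand one hex digit into 4 binary digits

fprintf('0.1 is stored as %s (first mantissa digit %s = %s)\n', h, h(4), bits);
```

The repeating digit 9 (binary 1001) in the mantissa reflects the repeating binary expansion of 0.1, and the final digit differs from 9 because the expansion had to be rounded.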
1.2.2.2 Standard Functions

To enable accurate computation of elementary functions such as sin, cos, and exp, IEEE 754 specifies that a "long" 80-bit register (with "guard digits") be available for intermediate computations. Furthermore, IEEE 754-2008, an official update to IEEE 754-1985, provides a list of functions it recommends be implemented, and specifies accuracy requirements (in terms of correct rounding) for those functions a programming language elects to implement.

REMARK 1.4 Alternative number systems, such as variable precision arithmetic, multiple precision arithmetic, rational arithmetic, and combinations of approximate and symbolic arithmetic, have been investigated and implemented. These have various advantages over the traditional floating point arithmetic we have been discussing, but also have disadvantages, and usually require more time, more circuitry, or both. Eventually, with the advance of computer hardware and better understanding of these alternative systems, their use may become more ubiquitous. However, for the foreseeable future, traditional floating point number systems will be the primary tool in numerical computations.
1.3 Interval Computations

Interval computations are useful for two main purposes:

• to use floating point computations to compute mathematically rigorous bounds on an exact result (and hence to rigorously bound roundoff error);

• to use floating point computations to compute mathematically rigorous bounds on the ranges of functions over boxes.

In complicated traditional floating point algorithms, naive arrangement of interval computations usually gives bounds that are too wide to be of practical use. For this reason, interval computations have been ignored by many. However, used cleverly and where appropriate, interval computations are powerful, and provide rigor and validation when other techniques cannot.

Interval computations are based on interval arithmetic.
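1.3.1 Interval Arithmetic

In interval arithmetic, we define operations on intervals, which can be considered as ordered pairs of real numbers. We can think of each interval as representing the range of possible values of a quantity. The result of an operation is then an interval that represents the range of possible results of the operation, as the first argument ranges over all points in the first interval and the second argument ranges over all values in the second interval. To state this symbolically, let $\boldsymbol{x} = [\underline{x}, \overline{x}]$ and $\boldsymbol{y} = [\underline{y}, \overline{y}]$, and define the four elementary operations by

$$\boldsymbol{x} \odot \boldsymbol{y} = \{ x \odot y \mid x \in \boldsymbol{x} \text{ and } y \in \boldsymbol{y} \} \quad \text{for } \odot \in \{+, -, \times, \div\}. \tag{1.1}$$

Interval arithmetic's usefulness derives from the fact that the mathematical characterization in Equation (1.1) is equivalent to the following operational definitions:

$$\begin{aligned}
\boldsymbol{x} + \boldsymbol{y} &= [\underline{x} + \underline{y},\ \overline{x} + \overline{y}], \\
\boldsymbol{x} - \boldsymbol{y} &= [\underline{x} - \overline{y},\ \overline{x} - \underline{y}], \\
\boldsymbol{x} \times \boldsymbol{y} &= [\min\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\},\ \max\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\}], \\
\frac{1}{\boldsymbol{x}} &= \left[\frac{1}{\overline{x}},\ \frac{1}{\underline{x}}\right] \quad \text{if } \underline{x} > 0 \text{ or } \overline{x} < 0, \\
\boldsymbol{x} \div \boldsymbol{y} &= \boldsymbol{x} \times \frac{1}{\boldsymbol{y}}.
\end{aligned} \tag{1.2}$$

The ranges of the four elementary interval arithmetic operations are exactly the ranges of the corresponding real operations, but, if such operations are composed, bounds on the ranges of real functions can be obtained. For example, if

$$f(x) = (x + 1)(x - 1), \tag{1.3}$$

then

$$f([-2, 2]) = \big([-2, 2] + 1\big)\big([-2, 2] - 1\big) = [-1, 3][-3, 1] = [-9, 3],$$

which contains the exact range $[-1, 3]$.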
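The operational definitions in (1.2) translate almost directly into matlab. The following is a minimal sketch of ours, assuming that an interval is stored as a two-element row vector [lower, upper] and that roundoff is ignored; the helper names iadd, isub, and imul are hypothetical and not part of any toolbox.

```matlab
% Intervals are stored as 1-by-2 row vectors [lower, upper].
% Roundoff is ignored here, so these are the real-arithmetic rules of (1.2).
iadd = @(x, y) [x(1) + y(1), x(2) + y(2)];
isub = @(x, y) [x(1) - y(2), x(2) - y(1)];
imul = @(x, y) [min([x(1)*y(1), x(1)*y(2), x(2)*y(1), x(2)*y(2)]), ...
                max([x(1)*y(1), x(1)*y(2), x(2)*y(1), x(2)*y(2)])];

X   = [-2, 2];
one = [1, 1];                          % the number 1 as a degenerate interval

F = imul(iadd(X, one), isub(X, one));  % (x + 1)(x - 1) evaluated over X
disp(F);
```

The computed enclosure agrees with the hand computation $[-1, 3][-3, 1] = [-9, 3]$ above.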
REMARK 1.5 In some definitions of interval arithmetic, division by intervals containing 0 is defined, consistent with (1.1). For example,

$$\frac{[1, 2]}{[-3, 4]} = \left[-\infty, -\frac{1}{3}\right] \cup \left[\frac{1}{4}, \infty\right] = \mathbb{R}^* \setminus \left(-\frac{1}{3}, \frac{1}{4}\right),$$

where $\mathbb{R}^*$ is the extended real number system,$^4$ consisting of the real numbers with the two additional numbers $-\infty$ and $\infty$. This extended interval arithmetic$^5$ was originally invented by William Kahan$^6$ for computations with continued fractions, but has wider use than that. Although a closed system can be defined for the sets arising from this extended arithmetic, typically, the complements of intervals (i.e., the unions of two semi-infinite intervals) are immediately intersected with intervals, to obtain zero, one, or two intervals. Interval arithmetic can then proceed using (1.2).

4 also known as the two-point compactification of the real numbers
The power of interval arithmetic lies in its implementation on computers. In particular, outwardly rounded interval arithmetic allows rigorous enclosures for the ranges of operations and functions. This makes a qualitative difference in scientific computations, since the results are now intervals in which the exact result must lie. It also enables use of floating point computations for automated theorem proving.

Outward rounding can be implemented on any machine that has downward rounding and upward rounding, such as any machine that complies with the IEEE 754 standard. For example, take $\boldsymbol{x} + \boldsymbol{y} = [\underline{x} + \underline{y}, \overline{x} + \overline{y}]$. If $\underline{x} + \underline{y}$ is computed with downward rounding, and $\overline{x} + \overline{y}$ is computed with upward rounding, then the resulting interval $\boldsymbol{z} = [\underline{z}, \overline{z}]$ that is represented in the machine must contain the exact range of $x + y$ for $x \in \boldsymbol{x}$ and $y \in \boldsymbol{y}$. We call the expansion of the interval from rounding the lower end point down and the upper end point up roundout error.
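Switching the hardware rounding mode is not something we rely on here. As a substitute technique, and only as a sketch under the stated assumption, one can round to nearest and then move each end point outward by one floating point spacing; this gives a valid but slightly wider enclosure than true directed rounding would.

```matlab
% Outwardly rounded interval addition WITHOUT switching the rounding mode.
% Assumption: each sum is rounded to nearest, so the exact sum lies within
% half a unit in the last place of the computed one; stepping one full
% spacing eps(.) outward therefore gives a valid (slightly wide) enclosure.
x = [0.1, 0.1];                        % degenerate intervals holding doubles
y = [0.2, 0.2];

lo = x(1) + y(1);
hi = x(2) + y(2);
z  = [lo - eps(lo), hi + eps(hi)];     % push each end point outward

fprintf('z = [%.17g, %.17g], width %g\n', z(1), z(2), z(2) - z(1));
```

The width of z is the roundout error for this operation. Dedicated interval packages keep it as small as possible by using genuine downward and upward rounding.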
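Interval arithmetic is only subdistributive. That is, if $\boldsymbol{x}$, $\boldsymbol{y}$, and $\boldsymbol{z}$ are intervals, then

$$\boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) \subseteq \boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z}, \quad \text{but} \quad \boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) \neq \boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z} \text{ in general.} \tag{1.4}$$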
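A concrete instance of strict containment in (1.4) can be checked numerically. The intervals below are our choice, and the helpers iadd and imul are hypothetical names for the rules of (1.2), with roundoff ignored.

```matlab
% Subdistributivity on a concrete example (roundoff ignored).
iadd = @(x, y) [x(1) + y(1), x(2) + y(2)];
imul = @(x, y) [min([x(1)*y(1), x(1)*y(2), x(2)*y(1), x(2)*y(2)]), ...
                max([x(1)*y(1), x(1)*y(2), x(2)*y(1), x(2)*y(2)])];

x = [-1, 1];  y = [1, 1];  z = [-1, -1];

lhs = imul(x, iadd(y, z));             % x(y + z) = [-1, 1] * [0, 0]
rhs = iadd(imul(x, y), imul(x, z));    % xy + xz  = [-1, 1] + [-1, 1]

contained = (rhs(1) <= lhs(1)) && (lhs(2) <= rhs(2));
disp([lhs; rhs]);
```

Here $\boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) = [0, 0]$ while $\boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z} = [-2, 2]$: the expanded form treats the two occurrences of $\boldsymbol{x}$ as if they varied independently.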
As a result, algebraic expressions that would be equivalent if real values are substituted for the variables are not equivalent if interval values are used. For example, instead of writing $(x - 1)(x + 1)$ for $f(x)$ in (1.3), suppose we write

$$f(x) = x^2 - 1, \tag{1.5}$$

and suppose we provide a routine that computes an enclosure for the range of $x^2$ that is the exact range to within roundoff error. Such a routine could be as follows:

ALGORITHM 1.1
(Computing an interval whose end points are machine numbers and which encloses the range of $x^2$.)

INPUT: $\boldsymbol{x} = [\underline{x}, \overline{x}]$.

OUTPUT: a machine-representable interval that contains the range of $x^2$ over $\boldsymbol{x}$.

IF $\underline{x} \ge 0$ THEN
    RETURN $[\underline{x}^2, \overline{x}^2]$, where $\underline{x}^2$ is computed with downward rounding and $\overline{x}^2$ is computed with upward rounding.
ELSE IF $\overline{x} \le 0$ THEN
    RETURN $[\overline{x}^2, \underline{x}^2]$, where $\overline{x}^2$ is computed with downward rounding and $\underline{x}^2$ is computed with upward rounding.
ELSE
    1. Compute $\underline{x}^2$ and $\overline{x}^2$ with both downward and upward rounding; that is, compute $\underline{x}^2_l$ and $\underline{x}^2_u$ such that $\underline{x}^2_l$ and $\underline{x}^2_u$ are machine representable numbers and $\underline{x}^2 \in [\underline{x}^2_l, \underline{x}^2_u]$, and compute $\overline{x}^2_l$ and $\overline{x}^2_u$ such that $\overline{x}^2_l$ and $\overline{x}^2_u$ are machine representable numbers and $\overline{x}^2 \in [\overline{x}^2_l, \overline{x}^2_u]$.
    2. RETURN $[0, \max\{\underline{x}^2_u, \overline{x}^2_u\}]$.
END IF

END ALGORITHM 1.1

5 There are small differences in current definitions of extended interval arithmetic. For example, in some systems, $-\infty$ and $\infty$ are not considered numbers, but just descriptive symbols. In those systems, $[1, 2]/[-3, 4] = (-\infty, -1/3] \cup [1/4, \infty) = \mathbb{R} \setminus (-1/3, 1/4)$. See [31] for a theoretical analysis of extended arithmetic.
6 who also was a major contributor to the IEEE 754 standard
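The branch structure of Algorithm 1.1 can be sketched in matlab as follows. This is our illustration only: directed rounding is left out, so the end points are correct only up to roundoff, and intervals are assumed to be stored as two-element row vectors [lower, upper].

```matlab
% Range of x^2 over an interval, mirroring the three branches of
% Algorithm 1.1. Directed rounding is omitted (an assumption of this
% sketch), so the end points are only correct up to roundoff.
cases   = {[1, 3], [-3, -1], [-2, 2]};
squares = cell(size(cases));
for k = 1:numel(cases)
    x = cases{k};
    if x(1) >= 0                       % interval lies in x >= 0
        squares{k} = [x(1)^2, x(2)^2];
    elseif x(2) <= 0                   % interval lies in x <= 0
        squares{k} = [x(2)^2, x(1)^2];
    else                               % interval straddles 0
        squares{k} = [0, max(x(1)^2, x(2)^2)];
    end
end

f_range = squares{3} - 1;              % f(x) = x^2 - 1 over [-2, 2]
disp(f_range);
```

Treating $x^2$ as a single operation, rather than as the product $\boldsymbol{x} \times \boldsymbol{x}$, is what keeps the lower bound at 0 when the interval straddles 0; the product rule in (1.2) would give $[-4, 4]$ for $[-2, 2]$.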
With Algorithm 1.1 and rewriting $f(x)$ from (1.3) as in (1.5), we obtain

$$f([-2, 2]) = [-2, 2]^2 - 1 = [0, 4] - 1 = [-1, 3],$$

which, in this case, is equal to the exact range of $f$ over $[-2, 2]$.

In fact, this illustrates a general principle: If each variable in the expression occurs only once, then interval arithmetic gives the exact range, to within roundout error. We state this formally as

THEOREM 1.8
(Fundamental theorem of interval arithmetic.) Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with interval arithmetic. The algebraic expression or computer program can contain the four elementary operations and operations such as $x^n$, $\sin(x)$, $\exp(x)$, and $\log(x)$, etc., as long as the interval values of these functions contain their range over the input intervals. Then

1. The interval value $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ contains the range of $f$ over the interval vector (or box) $(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$.

2. If the single functions (the elementary operations and functions $x^n$, etc.) have interval values that represent their exact ranges, and if each variable $x_i$, $1 \le i \le n$, occurs only once in the expression for $f$, then the values of $\boldsymbol{f}$ obtained by interval arithmetic represent the exact ranges of $f$ over the input intervals.
If the expression for $f$ contains one or more variables more than once, then overestimation of the range can occur due to interval dependency. For example, when we evaluate our example function $f([-2, 2])$ according to (1.3), the first factor, $[-1, 3]$, is the exact range of $x + 1$ for $x \in [-2, 2]$, while the second factor, $[-3, 1]$, is the exact range of $x - 1$ for $x \in [-2, 2]$. Thus, $[-9, 3]$ is the exact range of $f(x_1, x_2) = (x_1 + 1)(x_2 - 1)$ for $x_1$ and $x_2$ independent, $x_1 \in [-2, 2]$, $x_2 \in [-2, 2]$.

We now present some definitions and theorems to clarify the practical consequences of interval dependency.

DEFINITION 1.6 An expression for $f(x_1, \ldots, x_n)$ which is written so that each variable occurs only once is called a single use expression, or SUE.

Fortunately, we do not need to transform every expression into a single use expression for interval computations to be of value. In particular, the interval dependency becomes less as the widths of the input intervals become smaller. The following formal definition will help us to describe this precisely.

DEFINITION 1.7 Suppose an interval evaluation $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ gives $[a, b]$ as a result interval, but the exact range $\{f(x_1, \ldots, x_n) : x_i \in \boldsymbol{x}_i, 1 \le i \le n\}$ is $[c, d] \subseteq [a, b]$. We define the excess width $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ in the interval evaluation $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ by $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = (c - a) + (b - d)$.

For example, the excess width in evaluating $f(x)$ represented as $(x + 1)(x - 1)$ over $\boldsymbol{x} = [-2, 2]$ is $(-1 - (-9)) + (3 - 3) = 8$. In general, we have
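THEOREM 1.9
Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with interval arithmetic, as in Theorem 1.8, to obtain an interval enclosure $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ to the range of $f$ for $x_i \in \boldsymbol{x}_i$, $1 \le i \le n$. Then, if $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ is as in Definition 1.7, we have

$$E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = O\left(\max_{1 \le i \le n} w(\boldsymbol{x}_i)\right),$$

where $w(\boldsymbol{x})$ denotes the width of the interval $\boldsymbol{x}$.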
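The first-order behavior of the excess width can be observed numerically for the running example $f(x) = (x + 1)(x - 1)$. In the sketch below (ours, with hypothetical helper names), the input interval is $[-h, h]$, for which the exact range of $x^2 - 1$ is $[-1, h^2 - 1]$; roundoff is ignored, and every $h$ is a power of 2 so that the arithmetic is exact.

```matlab
% Excess width of (x + 1)(x - 1) over X = [-h, h] as h shrinks.
iadd = @(x, y) [x(1) + y(1), x(2) + y(2)];
isub = @(x, y) [x(1) - y(2), x(2) - y(1)];
imul = @(x, y) [min([x(1)*y(1), x(1)*y(2), x(2)*y(1), x(2)*y(2)]), ...
                max([x(1)*y(1), x(1)*y(2), x(2)*y(1), x(2)*y(2)])];

hs = [2, 1, 0.5, 0.25, 0.125];
E  = zeros(size(hs));
for k = 1:numel(hs)
    h = hs(k);
    X = [-h, h];
    F = imul(iadd(X, [1 1]), isub(X, [1 1]));     % interval evaluation [a, b]
    exact = [-1, h^2 - 1];                         % exact range [c, d]
    E(k) = (exact(1) - F(1)) + (F(2) - exact(2));  % (c - a) + (b - d)
end

disp([hs; 2*hs; E; E ./ (2*hs)]);   % rows: h, width w, excess E, ratio E/w
```

The first column reproduces the excess width 8 found above for $[-2, 2]$, and the ratio $E/w$ in the last row stays bounded as $w$ shrinks, as the theorem predicts.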
That is, the overestimation becomes less as the uncertainty in the arguments to the function becomes smaller.

Interval evaluations as in Theorem 1.9 are termed first-order interval extensions. It is not difficult to obtain second-order extensions, where required. (See Exercise below.)
1.3.2 Application of Interval Arithmetic: Examples

We give one such example here.

Example 1.22

Using 4-digit decimal floating point arithmetic, compute an interval enclosure for the first two digits of $e$, and prove that these two digits are correct.

Solution: The fifth degree Taylor polynomial representation for $e$ is

$$e = 1 + 1 + \frac{1}{2!} + \frac{1}{3!} + \frac{1}{4!} + \frac{1}{5!} + \frac{1}{6!} e^{\xi}$$

for some $\xi \in [0, 1]$. If we assume we know $e < 3$ and we assume we know $e^x$ is an increasing function of $x$, then the error term is bounded by

$$\left| \frac{1}{6!} e^{\xi} \right| \le \frac{3}{6!} < 0.005,$$

so this fifth-degree polynomial representation should be adequate. We will evaluate each term with interval arithmetic, and we will replace $e^{\xi}$ with $[1, 3]$. We obtain the following computation:

    [1.000, 1.000] + [1.000, 1.000]        →  [2.000, 2.000]
    [1.000, 1.000] / [2.000, 2.000]        →  [0.5000, 0.5000]
    [2.000, 2.000] + [0.5000, 0.5000]      →  [2.500, 2.500]
    [1.000, 1.000] / [6.000, 6.000]        →  [0.1666, 0.1667]
    [2.500, 2.500] + [0.1666, 0.1667]      →  [2.666, 2.667]
    [1.000, 1.000] / [24.00, 24.00]        →  [0.04166, 0.04167]
    [2.666, 2.667] + [0.04166, 0.04167]    →  [2.707, 2.709]
    [1.000, 1.000] / [120.0, 120.0]        →  [0.008333, 0.008334]
    [2.707, 2.709] + [0.008333, 0.008334]  →  [2.715, 2.718]
    [1.000, 1.000] / [720.0, 720.0]        →  [0.001388, 0.001389]
    [0.001388, 0.001389] × [1, 3]          →  [0.001388, 0.004167]
    [2.715, 2.718] + [0.001388, 0.004167]  →  [2.716, 2.723]

Since we used outward rounding in these computations, this constitutes a mathematical proof that $e \in [2.716, 2.723]$.

Note:

1. These computations can be done automatically on a computer, as simply as evaluating the function in floating point arithmetic. We will explain some programming techniques for this in Chapter 6, Section 6.2.

2. The solution is illustrative. More sophisticated methods, such as argument reduction, would be used in practice to bound values of $e^x$ more accurately and with fewer operations.
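The hand computation in Example 1.22 can be imitated in matlab by simulating 4-digit decimal outward rounding on top of binary doubles. The sketch below is ours and is an illustration rather than a rigorous implementation: the helpers are hypothetical, they assume positive operands, and the guard step, which removes binary representation noise before rounding, trades rigor for agreement with the decimal hand computation.

```matlab
% Simulated 4-digit decimal arithmetic with outward rounding.
% Assumptions: all operands are positive, and v is never zero.
scale = @(v) 10.^(3 - floor(log10(abs(v))));   % maps v into [1000, 10000)
guard = @(m) round(m * 1e8) / 1e8;              % strips binary noise first
rdown = @(v) floor(guard(v .* scale(v))) ./ scale(v);
rup   = @(v) ceil(guard(v .* scale(v))) ./ scale(v);

iadd = @(a, b) [rdown(a(1) + b(1)), rup(a(2) + b(2))];
imul = @(a, b) [rdown(a(1) * b(1)), rup(a(2) * b(2))];   % positive intervals
idiv = @(a, b) [rdown(a(1) / b(2)), rup(a(2) / b(1))];   % positive intervals

one = [1, 1];
s = iadd(one, one);                 % 1 + 1
fact = 1;
for k = 2:5
    fact = fact * k;                % k!
    s = iadd(s, idiv(one, [fact, fact]));
end
remainder = imul(idiv(one, [720, 720]), [1, 3]);   % (1/6!) * [1, 3]
enclosure = iadd(s, remainder);

fprintf('e is enclosed by [%.3f, %.3f]\n', enclosure(1), enclosure(2));
```

A rigorous version would use genuine directed rounding or exact decimal arithmetic, as provided by interval packages, instead of the floor and ceil simulation used here.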
Proofs of the theorems, as well as greater detail, appear in various texts on interval arithmetic. A good book on interval arithmetic is R. E. Moore's classic text [27], although numerous more recent monographs and reviews are available. A World Wide Web search on the term "interval computations" will lead to some of these.

A general introduction to interval computations is [26]. That work not only gives a complete introduction, with numerous examples and explanation of pitfalls, but also provides examples with intlab, a free matlab toolbox for interval computations, and reference material for intlab. If you have matlab available, we recommend intlab for the exercises in this book involving interval computations.
1.4 Programming Environments

Modern scientific computing (with floating point numbers) is usually done with high-level "imperative" (as opposed to "functional") programming languages. Common programming environments in use for general scientific computing today are Fortran (or FORmula TRANslation), C/C++, and matlab. Fortran is the original such language, with its origins in the late 1950's. There is a large body of high-quality publicly available software in Fortran for common computations in numerical analysis and scientific computing. Such software can be found, for example, on NETLIB, at

http://www.netlib.org/

Fortran has evolved over the years, becoming a modern, multi-faceted language with the Fortran 2003 standard. Throughout, the emphasis by both the standardization committee and suppliers of compilers for Fortran has been features that simplify programming of solutions to large problems in numerical analysis and scientific computing, and features that enable high performance, especially on computers that can process vectors and matrices efficiently.

The "C" language, originally developed in conjunction with the Unix operating system, was originally meant to be a higher-level language for designing and accessing the operating system, but has become more ubiquitous since then. C++, appearing in the late 1980's, was the first widely available language$^7$ to allow the object-oriented programming paradigm. In recent years, computer science departments have favored teaching C++ over teaching Fortran, and Fortran has fallen out of favor in relative terms. However, Fortran is still favored in certain large-scale applications such as fluid dynamics (e.g., in weather prediction and similar simulations), and some courses are still offered in it in engineering schools. Some people, however, still think of Fortran as the now somewhat rudimentary language known as FORTRAN 77.

Reasonably high-quality compilers for both Fortran and C/C++ are available free of charge with Linux operating systems. Fortran 2003, largely implemented in these compilers, has a standardized interface to C, so functions written in C can be called from Fortran programs, and vice versa. These compilers include interactive graphical-user-interface-oriented debuggers, such as "insight," available with the Linux operating system. Commercially available compilation and debugging systems are also available under Windows.
The matlab system has become increasingly popular over the last two decades or so. matlab (or MATrix LABoratory) began in the early 1980's as a National Science Foundation project, written by Cleve Moler in FORTRAN 66, to provide an interactive environment for computing with matrices and vectors, but has since evolved to be both an interactive environment and full-featured programming language. matlab is highly favored in courses such as this, because of the ease of programming, debugging, and general use (such as graphing), and because of the numerous toolboxes, supplied by both Mathworks (Cleve Moler's company) and others, for many basic computing tasks and applications. The main drawback to use of matlab in all scientific computations is that the language is interpretive, that is, matlab translates each line of a program to machine language each time that it executes the line. This makes complicated programs that involve nested iterations much slower (a factor of 60 or more) than comparable programs written in a compiled language such as C or Fortran. However, functions compiled from Fortran or C/C++ can be called from matlab. A common strategy has been to initially develop algorithms in matlab, then translate all or part of the program to a compilable language, as necessary for efficiency.

One perceived disadvantage of matlab is that it is proprietary. Undesirable possible consequences are that it is not free, and there is no official guarantee that it will be available forever, unchanged. However, its use has become so widespread in recent years that these concerns do not seem to be major. Several projects, including "Octave" and "Scilab," have produced free products that partially support the matlab programming language. The most widely distributed of these, "Octave," is integrated into Linux systems. However, the object-oriented features of Octave are rudimentary compared to those of matlab, and some toolboxes, such as intlab (which we will mention later), will not function with Octave.

7 with others, including Fortran, to follow
Alternative systems sometimes used for scientific computing are computer algebra systems. Perhaps the most common of these are Mathematica and Maple, while a free such system under development is "SAGE." These systems admit a different way of thinking about programming, termed functional programming, in which rules are defined and available at all times to automatically simplify expressions that are presented to the system. (In contrast, in imperative programming, a sequence of commands is executed one after the other.) Although these systems have become comprehensive, they are based in computations of a different character, rather than in the floating point computations and linear algebra involved in numerical analysis and scientific computing.

We will use matlab in this book to illustrate the concepts, techniques, and applications. With newer versions of matlab, a student can study how to use the system and make programs largely by using the matlab help system. The first place to turn will be the "Getting started" demos, which in newer versions are presented as videos. There are also many books devoted to use of matlab. Furthermore, we will be giving examples throughout this book.

matlab programs can be written as matlab scripts and matlab functions.
Example 1.23

The matlab script we used to produce the table following Example 1.7 (on page 8) is:

    a = 2;
    x = 2;
    xold = x;
    err_old = 1;
    for k = 0:10
        k
        x
        err = x - sqrt(2);
        err
        ratio = err/err_old^2
        err_old = err;
        x = x/2 + 1/x;
    end
Example 1.24

The matlab script we used to produce the table in Example 1.8 (on page 9) is:

    format long
    a = 2;
    x = 2;
    xold = x;
    err_old = 1;
    for k = 0:25
        k
        x
        err = x - sqrt(2);
        err
        ratio = err/err_old
        err_old = err;
        x = x - x^2/3.5 + 2/3.5;
    end
An excellent alternative text book that focuses on matlab functions is Cleve Moler's Numerical Computing with Matlab [25]. An on-line version, along with "m" files, etc., is currently available at http://www.mathworks.com/moler/chapters.html.
1.5 Applications

The purpose of the methods and techniques in this book ultimately is to provide both accurate predictions and insight into practical problems. This includes understanding and predicting, and managing or controlling, the evolution of ecological systems and epidemics; designing and constructing durable but inexpensive bridges, buildings, roads, and water control structures; understanding chemical and physical processes; designing chemical plants and electronic components and systems; minimizing costs or maximizing delivery of products or services within companies and governments; etc. To achieve these goals, the numerical methods are a small part of the overall modeling process, which can be viewed as consisting of the following steps.

Identify the problem: This is the first step in translating an often vague situation into a mathematical problem to be solved. What questions must be answered, and how can they be quantified?
Assumptions: Which factors are to be ignored, and which are important? The real world is usually significantly more complicated than mathematical models of it, and simplifications must be made, because some factors are poorly understood, because there isn't enough data to determine some minor factors, or because it is not practical to accurately solve the resulting equations unless the model is simplified. For example, the theory of relativity, and variations in the acceleration of gravity due to the fact that the earth is not exactly round and due to the fact that the density differs from point to point on the surface of the earth, in principle will affect the trajectory of a baseball as it leaves the bat. However, such effects can be ignored when we write down a model of the trajectory of the baseball. On the other hand, we need to include such effects if we are measuring the change in distance between two satellites in a tandem orbit to detect the location of mineral deposits on the surface of the earth.
Construction: In this step, we actually translate the problem into mathematical language.

Analysis: We solve the mathematical problem. Here is where the numerical techniques in this book come into play. With more complicated models, there is an interplay between the previous three steps and this solution process: We may need to simplify the process to enable practical solution. Also, presentation of the result is important here, to maximize the usefulness of the results. In the early days of scientific computing, printouts of numbers were used, but, increasingly, results are presented as two- and three-dimensional graphs and movies.

Interpretation: The numerical solution is compared to the original problem. If it does not make sense, go back and reformulate the assumptions.

Validation: Compare the model to real data. For example, in climate models, the model might be used to predict climate changes in past years, before it is used to predict future climate changes.
Note that there is an intrinsic error introduced in the modeling process (such as when certain phenomena are ignored) that is outside the scope of our study of numerical methods. Such error can only be measured indirectly, through the interpretation and validation steps. In the model solution process (the “analysis” step), errors are also introduced. We have seen that such error consists of approximation error and roundoff error. In a study of numerical methods and numerical analysis, we quantify and find bounds on such errors. Although this may not be the major source of error, it is important to know. Consequences of this type of error might be that a good model is rejected, that incorrect conclusions are deduced about the process being modeled, etc. The authors of this book have personal experience with these events.
Errors in the modeling process can sometimes be quantified in the solution process. If the model depends on parameters that are not known precisely, but bounds on those parameters are known, knowledge of these bounds can sometimes be incorporated into the mathematical equations, and the set of possible solutions can sometimes be computed or bounded. One tool that sometimes works is interval arithmetic. Other tools, less mathematically definite but
applicable in different situations, are statistical methods and computing solutions to the model for many different values of the parameter.

Throughout this book, we introduce applications from many areas.
Example 1.25
The formula for the net capacitance when two capacitors of values $x$ and $y$ are connected in series is
\[ z = \frac{xy}{x + y}. \]
Suppose the measured values of $x$ and $y$ are $x = 1$ and $y = 2$, respectively. Estimate the range of possible values of $z$, given that the true values of $x$ and $y$ are known to be within ±10% of the measured value.

In this example, the identification, assumptions, and construction have already been done. (It is well known how capacitances in a linear electrical circuit behave.) We are asked to analyze the error in the output of the computation, due to errors in the data. We may proceed using interval arithmetic, relying on the accuracy assumptions for the measured values. In particular, these assumptions imply that $x \in [0.9, 1.1]$ and $y \in [1.8, 2.2]$. We will plug these intervals into the expression for $z$, but we first use Theorem 1.8, part (2) as a guide to rewrite the expression for $z$ so $x$ and $y$ only occur once. (We do this so we obtain sharp bounds on the range, without overestimation.) Dividing the numerator and denominator for $z$ by $xy$, we obtain
\[ z = \frac{1}{\dfrac{1}{x} + \dfrac{1}{y}}. \]
We use the intlab toolbox⁸ for matlab to evaluate $z$. We have the following dialog in matlab's command window.

>> intvalinit('DisplayInfsup')
===> Default display of intervals by infimum/supremum
>> x = intval('[0.9,1.1]')
intval x =
[ 0.8999, 1.1001]
>> y = intval('[1.8,2.2]')
intval y =
[ 1.7999, 2.2001]
>> z = 1/(1/x + 1/y)
intval z =
[ 0.5999, 0.7334]
>> format long
>> z
intval z =
[ 0.59999999999999, 0.73333333333334]
>>

⁸If one has matlab, intlab is available free of charge for non-commercial use from http://www.ti3.tu-harburg.de/~rump/intlab/
Thus, the capacitance must lie between 0.5999 and 0.7334.

Note that $x$ and $y$ are input as strings. This is to assure that roundoff errors in converting the decimal expressions 0.9, 1.1, 1.8, and 2.2 into internal binary format are taken into account. See [26] for more examples of the use of intlab.
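The effect of rewriting $z$ so that each variable occurs once can also be sketched in plain matlab, without intlab. This is only an illustration under simplifying assumptions: intervals are stored as two-element vectors [lo hi], the helper names iadd, imul, idiv, and iinv are hypothetical, the formulas are valid only for intervals of positive numbers, and no outward rounding is done, so the bounds are not rigorous in the way intlab's are.

```matlab
% Intervals as [lo hi]; valid for positive intervals only, no outward rounding.
x = [0.9 1.1];
y = [1.8 2.2];

iadd = @(a, b) [a(1) + b(1), a(2) + b(2)];   % interval sum
imul = @(a, b) [a(1) * b(1), a(2) * b(2)];   % product of positive intervals
idiv = @(a, b) [a(1) / b(2), a(2) / b(1)];   % quotient of positive intervals
iinv = @(a) [1 / a(2), 1 / a(1)];            % reciprocal of a positive interval

z_naive = idiv(imul(x, y), iadd(x, y));      % x and y each occur twice
z_sharp = iinv(iadd(iinv(x), iinv(y)));      % x and y each occur once

fprintf('naive: [%.4f, %.4f]\n', z_naive(1), z_naive(2));
fprintf('sharp: [%.4f, %.4f]\n', z_sharp(1), z_sharp(2));
```

The first form gives a wider enclosure than the second, which is the overestimation the rewriting is meant to avoid.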
1.6 Exercises

1. Write down a polynomial $p(x)$ such that $|S(x) - p(x)| \le 10^{-10}$ for $-0.2 \le x \le 0.2$, where
\[ S(x) = \begin{cases} \dfrac{\sin(x)}{x} & \text{if } x \ne 0, \\ 1 & \text{if } x = 0. \end{cases} \]
Note: $\mathrm{sinc}(x) = S(\pi x) = \sin(\pi x)/(\pi x)$ is the “sinc” function (well-known in signal processing, etc.).
(a) Show that your polynomial $p$ satisfies the condition $|\mathrm{sinc}(x) - p(x)| \le 10^{-10}$ for $x \in [-0.2, 0.2]$.
Hint: You can obtain polynomial approximations with error terms for $\mathrm{sinc}(x)$ by writing down Taylor polynomials and corresponding error terms for $\sin(x)$, then dividing these by $x$. This can be easier than trying to differentiate $\mathrm{sinc}(x)$. For the proof part, you can use, for example, the Taylor polynomial remainder formula or the alternating series test, and you can use interval arithmetic to obtain bounds.
(b) Plot your polynomial approximation and $\mathrm{sinc}(x)$ on the same graph,
(i) over the interval $[-0.2, 0.2]$,
(ii) over the interval $[-3, 3]$,
(iii) over the interval $[-10, 10]$.

2. Suppose $f$ has a continuous third derivative. Show that
\[ \left| \frac{f(x + h) - f(x - h)}{2h} - f'(x) \right| = O(h^2). \]

3. Suppose $f$ has a continuous fourth derivative. Show that
\[ \left| \frac{f(x + h) - 2f(x) + f(x - h)}{h^2} - f''(x) \right| = O(h^2). \]
4. Let $a = 0.41$, $b = 0.36$, and $c = 0.7$. Assuming a 2-digit decimal computer arithmetic with rounding, show that $\dfrac{a - b}{c} \ne \dfrac{a}{c} - \dfrac{b}{c}$ when using this arithmetic.

5. Write down a formula relating the unit roundoff $\delta$ of Definition 1.3 (page 14) and the machine epsilon $\epsilon_m$ defined on page 20.

6. Store and run the following matlab script. What are your results? What does the script compute? Can you say anything about the computer arithmetic underlying matlab?

eps = 1;
x = 1+eps;
while(x~=1)
    eps = eps/2;
    x = 1+eps;
end
eps = eps+(2*eps)^2
y = 1+eps;
y-1

7. Suppose, for illustration, we have a system with base $\beta = 10$, $t = 3$ decimal digits in the mantissa, and $L = -9$, $U = 9$ for the exponent. For example, $0.123 \times 10^4$, that is, 1230, is a machine number in this system. Suppose also that “round to nearest” is used in this system.
(a) What is HUGE for this system?
(b) What is TINY for this system?
(c) What is the machine epsilon $\epsilon_m$ for this system?
(d) Let $f(x) = \sin(x) + 1$.
i. Write down $fl(f(0))$ and $fl(f(0.0008))$ in normalized format for this toy system.
ii. Compute $fl(fl(f(0.0008)) - fl(f(0)))$. On the other hand, what is the nearest machine number to the exact value of $f(0.0008) - f(0)$?
iii. Compute $fl(fl(f(0.0008)) - fl(f(0)))/fl(0.0008)$. Compare this to the nearest machine number to the exact value of $(f(0.0008) - f(0))/0.0008$ and to $f'(0)$.

8. Let $f(x) = \dfrac{\ln(x + 1) - \ln(x)}{2}$.
(a) Use four-digit decimal arithmetic with rounding to evaluate $f(100\,000)$.
(b) Use the Mean Value Theorem to approximate $f(x)$ in a form that avoids the loss of significant digits. Use this form to evaluate $f(x)$ for $x = 100\,000$ once again.
(c) Compare the relative errors for the answers obtained in (a) and (b).

9. Compute the condition number of $f(x) = e^{\sqrt{x^2 - 1}}$, $x > 1$, and discuss any possible ill-conditioning.

10. Let $f(x) = (\sin(x))^2 + x^2$. Use interval arithmetic to prove that there are no solutions to $f(x) = 0$ for $x \in [-1, -0.8]$.
for which $f(c) = k$ (Figure 1.1).

[Figure: graph of $y = f(x)$, with $a$, $c$, $b$ marked on the $x$-axis and $m$, $k$, $M$ marked on the $y$-axis.]
FIGURE 1.1: Illustration of the Intermediate Value Theorem.
Example 1.1
Consider $f(x) = e^x - x - 2$. Using a computational device (such as a calculator) on which we trust the approximation of $e^x$ to be accurate, we compute $f(0) = -1$ and $f(2) \approx 3.3891$. We know $f$ is continuous, since it is a sum of continuous functions. Since 0 is between $f(0)$ and $f(2)$, the Intermediate Value Theorem tells us there is a point $c \in [0, 2]$ such that $f(c) = 0$. At such a $c$, $e^c = c + 2$.
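The sign change in this example can be checked numerically. The sketch below evaluates $f$ at the endpoints and then, as an extra step not taken in the text, calls matlab's built-in root finder fzero with the bracket [0, 2] to locate one such $c$; the theorem itself only guarantees that $c$ exists.

```matlab
f = @(x) exp(x) - x - 2;

fa = f(0);                 % value at the left endpoint
fb = f(2);                 % value at the right endpoint
fprintf('f(0) = %.4f, f(2) = %.4f\n', fa, fb);

% fzero needs a bracket over which f changes sign, which the
% Intermediate Value Theorem turns into a guarantee that a root exists.
c = fzero(f, [0 2]);
residual = exp(c) - (c + 2);   % should be near zero, since e^c = c + 2
fprintf('c = %.6f, e^c - (c + 2) = %.2e\n', c, residual);
```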
THEOREM 1.2
(Mean value theorem for integrals) Let $f$ be continuous and $w$ be Riemann integrable¹ on $[a, b]$, and suppose that $w(x) \ge 0$ for $x \in [a, b]$. Then there exists a point $c$ in $[a, b]$ such that
\[ \int_a^b w(x) f(x)\,dx = f(c) \int_a^b w(x)\,dx. \]

¹This means that the limit of the Riemann sums exists. For example, $w$ may be continuous, or $w$ may have a finite number of breaks.

Example 1.2
Suppose we want bounds on
\[ \int_0^1 x^2 e^{-x^2}\,dx. \]
With $w(x) = x^2$ and $f(x) = e^{-x^2}$, the Mean Value Theorem for integrals tells us that
\[ \int_0^1 x^2 e^{-x^2}\,dx = e^{-c^2} \int_0^1 x^2\,dx \]
for some $c \in [0, 1]$, so
\[ \frac{1}{3e} \le \int_0^1 x^2 e^{-x^2}\,dx = e^{-c^2} \int_0^1 x^2\,dx \le \frac{1}{3}. \]
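As a sanity check on the bounds in Example 1.2, the integral can be approximated numerically. The sketch below uses matlab's adaptive quadrature routine quadgk (a choice made here, not something the text prescribes) and compares the result with $1/(3e)$ and $1/3$.

```matlab
g = @(x) x.^2 .* exp(-x.^2);   % integrand w(x) f(x)

I = quadgk(g, 0, 1);           % numerical value of the integral
lower = 1 / (3 * exp(1));      % bound from e^{-c^2} >= e^{-1}
upper = 1 / 3;                 % bound from e^{-c^2} <= 1

fprintf('%.6f <= %.6f <= %.6f\n', lower, I, upper);
```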
The following is extremely important in scientific and engineering computing.

THEOREM 1.3
(Taylor's theorem) Suppose that $f \in C^{n+1}[a, b]$. Let $x_0 \in [a, b]$. Then, for any $x \in [a, b]$, $f(x) = P_n(x) + R_n(x)$, where
\[ P_n(x) = f(x_0) + f'(x_0)(x - x_0) + \cdots + \frac{f^{(n)}(x_0)(x - x_0)^n}{n!} = \sum_{k=0}^{n} \frac{1}{k!} f^{(k)}(x_0)(x - x_0)^k, \]
and
\[ R_n(x) = \frac{1}{n!} \int_{x_0}^{x} f^{(n+1)}(t)(x - t)^n\,dt \quad \text{(integral form of remainder).} \]
Furthermore, there is a $\xi = \xi(x)$ between $x_0$ and $x$ with
\[ R_n(x) = \frac{f^{(n+1)}(\xi(x))(x - x_0)^{n+1}}{(n + 1)!} \quad \text{(Lagrange form of remainder).} \]
PROOF Recall the integration by parts formula $\int u\,dv = uv - \int v\,du$. Thus,
\[ \begin{aligned}
f(x) - f(x_0) &= \int_{x_0}^{x} f'(t)\,dt \quad (\text{let } u = f'(t),\ v = t - x,\ dv = dt) \\
&= f'(x_0)(x - x_0) + \int_{x_0}^{x} (x - t) f''(t)\,dt \quad (\text{let } u = f''(t),\ dv = (x - t)\,dt) \\
&= f'(x_0)(x - x_0) - \left. \frac{(x - t)^2}{2} f''(t) \right|_{x_0}^{x} + \int_{x_0}^{x} \frac{(x - t)^2}{2} f'''(t)\,dt \\
&= f'(x_0)(x - x_0) + \frac{(x - x_0)^2}{2} f''(x_0) + \int_{x_0}^{x} \frac{(x - t)^2}{2} f'''(t)\,dt.
\end{aligned} \]
Continuing this procedure,
\[ \begin{aligned}
f(x) &= f(x_0) + f'(x_0)(x - x_0) + \frac{(x - x_0)^2}{2} f''(x_0) + \cdots + \frac{(x - x_0)^n}{n!} f^{(n)}(x_0) + \int_{x_0}^{x} \frac{(x - t)^n}{n!} f^{(n+1)}(t)\,dt \\
&= P_n(x) + R_n(x).
\end{aligned} \]
Now consider $R_n(x) = \int_{x_0}^{x} \frac{(x - t)^n}{n!} f^{(n+1)}(t)\,dt$, and assume that $x_0 < x$ (same argument if $x_0 > x$). Then, by Theorem 1.2,
\[ R_n(x) = f^{(n+1)}(\xi(x)) \int_{x_0}^{x} \frac{(x - t)^n}{n!}\,dt = f^{(n+1)}(\xi(x)) \frac{(x - x_0)^{n+1}}{(n + 1)!}, \]
where $\xi$ is between $x_0$ and $x$, and thus $\xi = \xi(x)$.
Example 1.3
Approximate $\sin(x)$ by a polynomial $p(x)$ such that $|\sin(x) - p(x)| \le 10^{-16}$ for $-0.1 \le x \le 0.1$.

For Example 1.3, Taylor polynomials about $x_0 = 0$ are appropriate, since that is the center of the interval about which we wish to approximate. We observe that the terms of even degree in such a polynomial are absent, so, for $n$ even, Taylor's theorem gives

n    P_n                                              R_n
2    $x$                                              $-\frac{x^3}{3!} \cos(c_2)$
4    $x - \frac{x^3}{3!}$                             $\frac{x^5}{5!} \cos(c_4)$
6    $x - \frac{x^3}{3!} + \frac{x^5}{5!}$            $-\frac{x^7}{7!} \cos(c_6)$
...
n    -                                                $(-1)^{n/2} \frac{x^{n+1}}{(n + 1)!} \cos(c_n)$

Observing that $|\cos(c_n)| \le 1$, we see that
\[ |R_n(x)| \le \frac{|x|^{n+1}}{(n + 1)!}. \]
We may thus form the following table for $|x| \le 0.1$.

n     bound on error $R_n$
2     $1.67 \times 10^{-4}$
4     $8.33 \times 10^{-8}$
6     $1.98 \times 10^{-11}$
8     $2.76 \times 10^{-15}$
10    $2.51 \times 10^{-19}$

Thus, a polynomial with the required accuracy for $x \in [-0.1, 0.1]$ is
\[ p(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \frac{x^9}{9!}. \]
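The table of error bounds in Example 1.3 can be reproduced with a few lines of matlab. This is a sketch rather than the authors' own computation; the variable names are illustrative. Note that double precision arithmetic cannot resolve differences as small as $10^{-19}$, so the measured maximum error below mostly reflects roundoff, not the Taylor remainder.

```matlab
n = 2:2:10;
bound = 0.1 .^ (n + 1) ./ factorial(n + 1);   % |x|^(n+1)/(n+1)! at |x| = 0.1
for k = 1:numel(n)
    fprintf('n = %2d   bound = %.2e\n', n(k), bound(k));
end

% Degree-9 Taylor polynomial of sin about 0, as chosen in the example.
p = @(x) x - x.^3/factorial(3) + x.^5/factorial(5) ...
           - x.^7/factorial(7) + x.^9/factorial(9);
x = linspace(-0.1, 0.1, 2001);
max_err = max(abs(sin(x) - p(x)));            % limited by roundoff in doubles
fprintf('max observed |sin(x) - p(x)| = %.2e\n', max_err);
```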
An important special case of Taylor's theorem is obtained with $n = 0$ (that is, directly from the Fundamental Theorem of Calculus).

THEOREM 1.4
(Mean value theorem) Suppose $f \in C^1[a, b]$, $x \in [a, b]$, and $y \in [a, b]$ (and, without loss of generality, $x \le y$). Then there is a $c \in [x, y] \subseteq [a, b]$ such that
\[ f(y) - f(x) = f'(c)(y - x). \]

Example 1.4
Suppose $f(1) = 1$ and $|f'(x)| \le 2$ for $x \in [1, 2]$. What are an upper bound and a lower bound on $f(2)$?
The mean value theorem tells us that
\[ f(2) = f(1) + f'(c)(2 - 1) = f(1) + f'(c) \]
for some $c \in (1, 2)$. Furthermore, the fact $|f'(x)| \le 2$ is equivalent to $-2 \le f'(x) \le 2$. Combining these facts gives
\[ 1 - 2 = -1 \le f(2) \le 1 + 2 = 3. \]
1.1.2 Big “O” Notation

We study “rates of growth” and “rates of decrease” of errors. For example, if we approximate $e^h$ by a first degree Taylor polynomial about $x = 0$, we get
\[ e^h - (1 + h) = \frac{1}{2} h^2 e^{\xi}, \]
where $\xi$ is some unknown quantity between 0 and $h$. Although we don't know exactly what $e^{\xi}$ is, we know that it is nearly constant (in this case, approximately 1) for $h$ near 0, so the error $e^h - (1 + h)$ is roughly proportional to $h^2$ for $h$ small. This approximate proportionality is often more important to know than the slowly-varying constant $e^{\xi}$. The big “O” and little “o” notations are used to describe and keep track of this approximate proportionality.

DEFINITION 1.1 Let $E(h)$ be an expression that depends on a small quantity $h$. We say that $E(h) = O(h^k)$ if there are an $\epsilon$ and $C$ such that
\[ |E(h)| \le C h^k \]
for all $|h| \le \epsilon$.

The “O” denotes “order.” For example, if $f(h) = O(h^2)$, we say that “$f$ exhibits order 2 convergence to 0 as $h$ tends to 0.”
Example 1.5
$E(h) = e^h - h - 1$. Then $E(h) = O(h^2)$.

PROOF By Taylor's Theorem,
\[ e^h = e^0 + e^0 (h - 0) + \frac{h^2}{2} e^{\xi} \]
for some $\xi$ between 0 and $h$. Thus,
\[ E(h) = e^h - 1 - h \le h^2 \frac{e^1}{2} \quad \text{and} \quad E(h) \ge 0 \]
for $|h| \le 1$; that is, $\epsilon = 1$ and $C = e/2$ work.
Example 1.6
Show that $\left| \frac{f(x + h) - f(x)}{h} - f'(x) \right| = O(h)$ for $x, x + h \in [a, b]$, assuming that $f$ has two continuous derivatives at each point in $[a, b]$.

PROOF
\[ \begin{aligned}
\left| \frac{f(x + h) - f(x)}{h} - f'(x) \right|
&= \left| \frac{f(x) + f'(x) h + \int_x^{x+h} (x + h - t) f''(t)\,dt - f(x)}{h} - f'(x) \right| \\
&= \frac{1}{h} \left| \int_x^{x+h} (x + h - t) f''(t)\,dt \right|
\le \max_{a \le t \le b} |f''(t)| \, \frac{h}{2} = ch.
\end{aligned} \]
1.1.3 Convergence Rates

DEFINITION 1.2 Let $\{x_k\}$ be a sequence with limit $x^*$. If there are constants $C$ and $\alpha$ and an integer $N$ such that $|x_{k+1} - x^*| \le C |x_k - x^*|^{\alpha}$ for $k \ge N$, we say that the rate of convergence is of order at least $\alpha$. If $\alpha = 1$ (with $C < 1$), the rate is said to be linear. If $\alpha = 2$, the rate is said to be quadratic.

Example 1.7
A sequence sometimes learned in elementary classes for computing the square root of a number $a$ is
\[ x_{k+1} = \frac{x_k}{2} + \frac{a}{2 x_k}. \]
We have
\[ \begin{aligned}
x_{k+1} - \sqrt{a} &= \frac{x_k}{2} + \frac{a}{2 x_k} - \sqrt{a} \\
&= x_k - \frac{x_k^2 - a}{2 x_k} - \sqrt{a} \\
&= (x_k - \sqrt{a}) - (x_k - \sqrt{a}) \frac{x_k + \sqrt{a}}{2 x_k} \\
&= (x_k - \sqrt{a}) \left( 1 - \frac{x_k + \sqrt{a}}{2 x_k} \right) \\
&= (x_k - \sqrt{a}) \frac{x_k - \sqrt{a}}{2 x_k} \\
&= \frac{1}{2 x_k} (x_k - \sqrt{a})^2 \\
&\approx \frac{1}{2 \sqrt{a}} (x_k - \sqrt{a})^2
\end{aligned} \]
for $x_k$ near $\sqrt{a}$, thus showing that the convergence rate is quadratic.
Quadratic convergence is very fast. We can think of quadratic convergence, with $C \approx 1$, as doubling the number of significant figures on each iteration. (In contrast, linear convergence with $C = 0.1$ adds one decimal digit of accuracy to the approximation on each iteration.) For example, if we use the square root computation from Example 1.7 with $a = 2$, and starting with $x_0 = 2$, we obtain the following table.

k    x_k                  |x_k - √2|               |x_k - √2| / (x_{k-1} - √2)^2
0    2                    0.5858 × 10^0            -
1    1.5                  0.8579 × 10^{-1}         0.2500
2    1.416666666666667    0.2453 × 10^{-2}         0.3333
3    1.414215686274510    0.2124 × 10^{-5}         0.3529
4    1.414213562374690    0.1595 × 10^{-11}        0.3535
5    1.414213562373095    0.2220 × 10^{-15}        -

In this table, the correct digits are underlined. This table illustrates that the number of correct digits roughly doubles on each iteration. In fact, the multiplying factor $C$ for the quadratic convergence appears to be approaching 0.3535. (The last error ratio is not meaningful in this sense, because only roughly 16 digits were carried in the computation.) Based on our analysis, the limiting value of $C$ should be about $1/(2\sqrt{2}) \approx 0.353553390593274$. (We explain how we computed the table at the end of this chapter.)
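One way such a table might be produced is sketched below. This is an assumption about the approach, not the authors' script: the iteration of Example 1.7 is run in double precision, and matlab's sqrt(2) stands in for the exact limit.

```matlab
a = 2;
x = 2;                                  % starting value x_0
err_prev = abs(x - sqrt(a));
ratios = zeros(1, 5);                   % error ratios for k = 1..5
fprintf('%2d  %.15f  %.4e\n', 0, x, err_prev);
for k = 1:5
    x = x/2 + a/(2*x);                  % x_{k+1} = x_k/2 + a/(2 x_k)
    err = abs(x - sqrt(a));
    ratios(k) = err / err_prev^2;       % should approach 1/(2*sqrt(a))
    fprintf('%2d  %.15f  %.4e  %.4f\n', k, x, err, ratios(k));
    err_prev = err;
end
```

The last ratio printed is dominated by roundoff, since by then the error is at the level of the machine precision.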
Example 1.8
As an example of linear convergence, consider the iteration
\[ x_{k+1} = x_k - \frac{x_k^2}{3.5} + \frac{2}{3.5}, \]
which converges to $\sqrt{2}$. We obtain the following table.

k     x_k                  |x_k - √2|               |x_k - √2| / |x_{k-1} - √2|
0     2                    0.5858 × 10^0            -
1     1.428571428571429    0.1436 × 10^{-1}         0.2451 × 10^{-1}
2     1.416909620991254    0.2696 × 10^{-2}         0.1878
3     1.414728799831946    0.5152 × 10^{-3}         0.1911
4     1.414312349239392    0.9879 × 10^{-4}         0.1917
5     1.414232514607664    0.1895 × 10^{-4}         0.1918
6     1.414217198786659    0.3636 × 10^{-5}         0.1919
7     1.414214260116949    0.6977 × 10^{-6}         0.1919
8     1.414213696254626    0.1339 × 10^{-6}         0.1919
19    1.414213562373097    0.1554 × 10^{-14}        -

Here, the constant $C$ in the linear convergence, to four significant digits, appears to be $0.1919 \approx 1/5$. That is, the error is reduced by approximately a factor of 5 each iteration. We can think of this as obtaining one more correct base-5 digit on each iteration.
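The ratios in Example 1.8 can be checked with the short sketch below. The comparison value C_limit is an addition made here: it is $|g'(\sqrt{2})|$ for $g(x) = x - x^2/3.5 + 2/3.5$, the usual prediction of fixed point theory for the limiting ratio, which the text itself does not derive.

```matlab
x = 2;                                   % starting value x_0
err_prev = abs(x - sqrt(2));
ratio = zeros(1, 8);
for k = 1:8
    x = x - x^2/3.5 + 2/3.5;             % the iteration of Example 1.8
    err = abs(x - sqrt(2));
    ratio(k) = err / err_prev;           % successive error ratios
    fprintf('%2d  %.15f  %.4e  %.4f\n', k, x, err, ratio(k));
    err_prev = err;
end

C_limit = abs(1 - 2*sqrt(2)/3.5);        % |g'(sqrt(2))|, for comparison
fprintf('predicted limiting ratio: %.4f\n', C_limit);
```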
1.2 Computer Arithmetic

In numerical solution of mathematical problems, two common types of error are:
1. Method (algorithm or truncation) error. This is the error due to approximations made in the numerical method.
2. Rounding error. This is the error made due to the finite number of digits available on a computer.
Example 1.9
By the mean value theorem for integrals (Theorem 1.2, as in Example 1.6 on page 7), if $f \in C^2[a, b]$, then
\[ f'(x) = \frac{f(x + h) - f(x)}{h} - \frac{1}{h} \int_x^{x+h} f''(t)(x + h - t)\,dt \]
and
\[ \left| \frac{1}{h} \int_x^{x+h} f''(t)(x + h - t)\,dt \right| \le ch. \]
Thus, $f'(x) \approx (f(x + h) - f(x))/h$, and the error is $O(h)$. We will call this the method error or truncation error, as opposed to roundoff errors due to using machine approximations.

Now consider $f(x) = \ln x$, and approximate $f'(3) \approx \frac{\ln(3 + h) - \ln 3}{h}$ for $h$ small, using a calculator having 11 digits. The following results were obtained.

h           (ln(3 + h) - ln(3))/h    Error = 1/3 - (ln(3 + h) - ln(3))/h = O(h)
10^{-1}     0.3278982                5.44 × 10^{-3}
10^{-2}     0.332779                 5.54 × 10^{-4}
10^{-3}     0.3332778                5.55 × 10^{-5}
10^{-4}     0.333328                 5.33 × 10^{-6}
10^{-5}     0.333330                 3.33 × 10^{-6}
10^{-6}     0.333300                 3.33 × 10^{-5}
10^{-7}     0.333                    3.33 × 10^{-4}
10^{-8}     0.33                     3.33 × 10^{-3}
10^{-9}     0.3                      3.33 × 10^{-2}
10^{-10}    0.0                      3.33 × 10^{-1}

One sees that, in the first four steps, the error decreases by a factor of 10 as $h$ is decreased by a factor of 10. (That is, the method error dominates.) However, starting with $h = 0.00001$, the error increases. (The error due to a finite number of digits, i.e., roundoff error, dominates.)
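The same experiment can be repeated in matlab. The sketch below uses IEEE double precision (about 16 digits) rather than the 11-digit calculator of the example, so the numbers differ from the table: the error keeps decreasing for smaller $h$ before roundoff typically takes over.

```matlab
f = @(x) log(x);
h = 10 .^ (-(1:12));                      % h = 1e-1, 1e-2, ..., 1e-12
approx = (f(3 + h) - f(3)) ./ h;          % forward difference quotients
err = abs(1/3 - approx);                  % exact derivative is f'(3) = 1/3
for k = 1:numel(h)
    fprintf('h = %.0e   approx = %.10f   error = %.2e\n', ...
            h(k), approx(k), err(k));
end
```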
There are two possible ways to reduce rounding error:

1. The method error can be reduced by using a more accurate method. This allows larger $h$ to be used, thus avoiding roundoff error. Consider
\[ f'(x) = \frac{f(x + h) - f(x - h)}{2h} + \text{error, where error is } O(h^2). \]

h        (ln(3 + h) - ln(3 - h))/(2h)    error
0.1      0.3334568                        1.24 × 10^{-4}
0.01     0.3333345                        1.23 × 10^{-6}
0.001    0.3333333                        1.91 × 10^{-8}

The error decreases by a factor of 100 as $h$ is decreased by a factor of 10.

2. Rounding error can be reduced by using more digits of accuracy, such as using double precision (or multiple precision) arithmetic.
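The first of these two remedies can be illustrated with a short sketch that compares forward and central differences for $f(x) = \ln x$ at $x = 3$. It runs in double precision, so the last digits differ from the 11-digit calculator values in the table.

```matlab
f = @(x) log(x);
h = [1e-1 1e-2 1e-3];

fwd = (f(3 + h) - f(3)) ./ h;             % forward difference, error O(h)
ctr = (f(3 + h) - f(3 - h)) ./ (2*h);     % central difference, error O(h^2)

err_fwd = abs(1/3 - fwd);
err_ctr = abs(1/3 - ctr);
for k = 1:numel(h)
    fprintf('h = %.0e   forward error = %.2e   central error = %.2e\n', ...
            h(k), err_fwd(k), err_ctr(k));
end
```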
To fully understand and avoid roundoff error, we should study some details of how computers and calculators represent and work with approximate numbers.

1.2.1 Floating Point Arithmetic and Rounding Error

Let $\beta$ be a positive integer, the base of the computer system. (Usually, $\beta = 2$ (binary) or $\beta = 16$ (hexadecimal).) Suppose a number $x$ has the exact base representation
\[ x = (\pm 0.\alpha_1 \alpha_2 \alpha_3 \cdots \alpha_t \alpha_{t+1} \cdots)\,\beta^m = \pm q \beta^m, \]
where $q$ is the mantissa, $\beta$ is the base, $m$ is the exponent, $1 \le \alpha_1 \le \beta - 1$, and $0 \le \alpha_i \le \beta - 1$ for $i > 1$.

On a computer, we are restricted to a finite set of floating-point numbers $F = F(\beta, t, L, U)$ of the form $x^* = (\pm 0.a_1 a_2 \cdots a_t)\,\beta^m$, where $1 \le a_1 \le \beta - 1$, $0 \le a_i \le \beta - 1$ for $2 \le i \le t$, $L \le m \le U$, and $t$ is the number of digits. (In most floating point systems, $L$ is about $-64$ to $-1000$ and $U$ is about 64 to 1000.)
Example 1.10
(binary) $\beta = 2$
\[ x^* = (0.1011)_2\, 2^3 = \left( 1 \times \frac{1}{2} + 0 \times \frac{1}{4} + 1 \times \frac{1}{8} + 1 \times \frac{1}{16} \right) \times 8 = \frac{11}{2} = 5.5 \text{ (decimal)}. \]
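The conversion in Example 1.10 amounts to weighting the mantissa digits by negative powers of the base. A minimal sketch (the variable names d, base, m, and q are illustrative):

```matlab
d = [1 0 1 1];                                % mantissa digits a_1 ... a_t
base = 2;                                     % the base, beta in the text
m = 3;                                        % the exponent

q = sum(d .* base .^ (-(1:numel(d))));        % mantissa value 0.a_1 a_2 ... a_t
xstar = q * base^m;                           % value of the floating point number
fprintf('q = %.4f, x* = %.1f\n', q, xstar);
```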
REMARK 1.1 Most numbers cannot be exactly represented on a computer. Consider $x = 10.1 = 1010.0001\,1001\,1001 \ldots$ ($\beta = 2$). If $L = -127$, $U = 127$, $t = 24$, and $\beta = 2$, then $x \approx x^* = (0.10100001\,1001\,1001\,1001\,1001)_2\, 2^4$.
Question: Given a real number $x$, how do we define a floating point number $fl(x)$ in $F$, such that $fl(x)$ is close to $x$?

On modern machines, one of the following four ways is used to approximate a real number $x$ by a machine-representable number $fl(x)$.

round down: $fl(x) = x\!\downarrow$, the nearest machine representable number to the real number $x$ that is less than or equal to $x$.
round up: $fl(x) = x\!\uparrow$, the nearest machine number to the real number $x$ that is greater than or equal to $x$.
round to nearest: $fl(x)$ is the nearest machine number to the real number $x$.
round to zero, or “chopping”: $fl(x)$ is the nearest machine number to the real number $x$ that is closer to 0 than $x$. The term “chopping” is because we simply “chop” the expansion of the real number, that is, we simply ignore the digits in the expansion of $x$ beyond the $t$-th one.

The default on modern systems is usually round to nearest, although chopping is faster or requires less circuitry. Round down and round up may be used, but with care, to produce results from a string of computations that are guaranteed to be less than or greater than the exact result.
Example 1.11
$\beta = 10$, $t = 5$, $x = 0.12345666\cdots \times 10^7$. Then
$fl(x) = 0.12345 \times 10^7$ (chopping),
$fl(x) = 0.12346 \times 10^7$ (rounded to nearest).
(In this case, round down corresponds to chopping and round up corresponds to round to nearest.)
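Chopping and rounding to $t$ decimal digits can be emulated in matlab as sketched below. This is only an approximate emulation carried out in binary double precision, and the variable names are illustrative; for the value used here no borderline cases arise. Note that matlab's round resolves exact ties away from zero, which may differ from a given machine's tie-breaking rule.

```matlab
t = 5;                                % number of mantissa digits
x = 0.12345666e7;                     % the (truncated) value from the example

e = floor(log10(abs(x))) + 1;         % exponent with x = q * 10^e, 0.1 <= q < 1
scale = 10^(e - t);                   % weight of the t-th mantissa digit

fl_chop  = fix(x / scale) * scale;    % drop the digits beyond the t-th one
fl_round = round(x / scale) * scale;  % round to the nearest t-digit number
fprintf('chopped: %d   rounded: %d\n', fl_chop, fl_round);
```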
See Figure 1.2 for an example with $\beta = 10$ and $t = 1$. In that figure, the exhibited floating point numbers are $(0.1) \times 10^1$, $(0.2) \times 10^1$, ..., $(0.9) \times 10^1$, $0.1 \times 10^2$.

[Figure: number line from $\beta^{m-1} = 1$ to $\beta^m = 10^1$, with successive floating point numbers spaced $\beta^{m-t} = 10^0 = 1$ apart.]
FIGURE 1.2: An example floating point system: $\beta = 10$, $t = 1$, and $m = 1$.
Example 1.12
Let $a = 0.410$, $b = 0.000135$, and $c = 0.000431$. Assuming 3-digit decimal computer arithmetic with rounding to nearest, does $a + (b + c) = (a + b) + c$ when using this arithmetic?

Following the “rounding to nearest” definition of $fl$, we emulate the operations a machine would do, as follows:
\[ a \leftarrow 0.410 \times 10^0, \quad b \leftarrow 0.135 \times 10^{-3}, \quad c \leftarrow 0.431 \times 10^{-3}, \]
and
\[ \begin{aligned}
fl(b + c) &= fl(0.135 \times 10^{-3} + 0.431 \times 10^{-3}) \\
&= fl(0.566 \times 10^{-3}) \\
&= 0.566 \times 10^{-3},
\end{aligned} \]
so
\[ \begin{aligned}
fl(a + 0.566 \times 10^{-3}) &= fl(0.410 \times 10^0 + 0.566 \times 10^{-3}) \\
&= fl(0.410 \times 10^0 + 0.000566 \times 10^0) \\
&= fl(0.410566 \times 10^0) \\
&= 0.411 \times 10^0.
\end{aligned} \]
On the other hand,
\[ \begin{aligned}
fl(a + b) &= fl(0.410 \times 10^0 + 0.135 \times 10^{-3}) \\
&= fl(0.410000 \times 10^0 + 0.000135 \times 10^0) \\
&= fl(0.410135 \times 10^0) \\
&= 0.410 \times 10^0,
\end{aligned} \]
so
\[ \begin{aligned}
fl(0.410 \times 10^0 + c) &= fl(0.410 \times 10^0 + 0.431 \times 10^{-3}) \\
&= fl(0.410 \times 10^0 + 0.000431 \times 10^0) \\
&= fl(0.410431 \times 10^0) \\
&= 0.410 \times 10^0 \ne 0.411 \times 10^0.
\end{aligned} \]
Thus, the associative law does not hold for floating point addition with “round to nearest.” Furthermore, this illustrates that accuracy is improved if numbers of like magnitude are added first in a sum.
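The two orders of summation in Example 1.12 can be emulated with a small helper that rounds to $t$ significant decimal digits. The helper fl below is hypothetical and only approximates a decimal machine, since it is itself evaluated in binary double precision; it must not be called with a zero argument.

```matlab
t = 3;   % number of decimal digits carried
% Hypothetical helper: round x (nonzero) to t significant decimal digits.
fl = @(x) round(x .* 10.^(t - floor(log10(abs(x))) - 1)) ...
          ./ 10.^(t - floor(log10(abs(x))) - 1);

a = 0.410;  b = 0.000135;  c = 0.000431;

left  = fl(fl(a) + fl(fl(b) + fl(c)));    % a + (b + c)
right = fl(fl(fl(a) + fl(b)) + fl(c));    % (a + b) + c
fprintf('a + (b + c) = %.3f,  (a + b) + c = %.3f\n', left, right);
```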
The following error bound is useful in some analyses.

THEOREM 1.5
\[ |x - fl(x)| \le \frac{1}{2} |x| \beta^{1-t} p, \]
where $p = 1$ for rounding and $p = 2$ for chopping.
DEFINITION 1.3 $\delta = \frac{p}{2} \beta^{1-t}$ is called the unit roundoff error.

Let $\epsilon = \frac{fl(x) - x}{x}$. Then $fl(x) = (1 + \epsilon) x$, where $|\epsilon| \le \delta$. With this, we have the following.

THEOREM 1.6
Let $\odot$ denote the operation $+$, $-$, $\times$, or $\div$, and let $x$ and $y$ be machine numbers. Then
\[ fl(x \odot y) = (x \odot y)(1 + \epsilon), \quad \text{where } |\epsilon| \le \delta = \frac{p}{2} \beta^{1-t}. \]

Roundoff error that accumulates as a result of a sequence of arithmetic operations can be analyzed using this theorem. Such an analysis is called forward error analysis.

Because of the properties of floating point arithmetic, it is unreasonable to demand strict tolerances when the exact result is too large.

Example 1.13
Suppose $\beta = 10$ and $t = 3$ (3-digit decimal arithmetic), and suppose we wish to compute $10^4 \pi$ with a computed value $x$ such that $|10^4 \pi - x| < 10^{-2}$. The closest floating point number in our system to $10^4 \pi$ is $x = 0.314 \times 10^5 = 31400$. However, $|10^4 \pi - x| = 15.926\ldots$. Hence, it is impossible to find a number $x$ in the system with $|10^4 \pi - x| < 10^{-2}$.

The error $|10^4 \pi - x|$ in this example is called the absolute error in approximating $10^4 \pi$. We see that absolute error is not an appropriate measure of error when using floating point arithmetic. For this reason, we use relative error.

DEFINITION 1.4 Let $x^*$ be an approximation to $x$. Then $|x - x^*|$ is called the absolute error, and $\left| \frac{x - x^*}{x} \right|$ is called the relative error.

For example,
\[ \left| \frac{x - fl(x)}{x} \right| \le \delta = \frac{p}{2} \beta^{1-t} \quad \text{(unit roundoff error).} \]
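For a concrete instance of the unit roundoff, the sketch below evaluates $\delta = \frac{p}{2}\beta^{1-t}$ under the assumption that matlab runs IEEE double precision with round to nearest, which corresponds to $\beta = 2$, $t = 53$, $p = 1$ in the notation above. Here eps is matlab's built-in spacing between 1 and the next larger double.

```matlab
base = 2;  t = 53;  p = 1;            % assumed: IEEE double, round to nearest
delta = (p/2) * base^(1 - t);         % unit roundoff from Definition 1.3

fprintf('delta = %.4e,  eps/2 = %.4e\n', delta, eps/2);

% Adding delta to 1 is lost in rounding; adding 2*delta is not.
lost = (1 + delta == 1);
kept = (1 + 2*delta > 1);
fprintf('1 + delta == 1: %d,  1 + 2*delta > 1: %d\n', lost, kept);
```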
1.2.1.1 Expression Evaluation and Condition Numbers

We now examine some common situations in which roundoff error can become large, and explain how to avoid many of these situations.
Example 1.14
$\beta = 10$, $t = 4$, $p = 1$. (Thus, $\delta = \frac{1}{2} 10^{-3} = 0.0005$.) Let $x = 0.5795 \times 10^5$, $y = 0.6399 \times 10^5$. Then
$fl(x + y) = 0.1219 \times 10^6 = (x + y)(1 + \epsilon_1)$, $\epsilon_1 \approx -3.28 \times 10^{-4}$, $|\epsilon_1| < \delta$, and
$fl(xy) = 0.3708 \times 10^{10} = (xy)(1 + \epsilon_2)$, $\epsilon_2 \approx -5.95 \times 10^{-5}$, $|\epsilon_2| < \delta$.
(Note: $x + y = 0.12194 \times 10^6$, $xy = 0.37082205 \times 10^{10}$.)

Example 1.15
Suppose $\beta = 10$ and $t = 4$ (4 digit arithmetic), suppose $x_1 = 10000$ and $x_2 = x_3 = \cdots = x_{1001} = 1$. Then
\[ \begin{aligned}
fl(x_1 + x_2) &= 10000, \\
fl(x_1 + x_2 + x_3) &= 10000, \\
&\ \ \vdots \\
fl\left( \sum_{i=1}^{1001} x_i \right) &= 10000,
\end{aligned} \]
when we sum forward from $x_1$. But going backwards,
\[ \begin{aligned}
fl(x_{1001} + x_{1000}) &= 2, \\
fl(x_{1001} + x_{1000} + x_{999}) &= 3, \\
&\ \ \vdots \\
fl\left( \sum_{i=1001}^{1} x_i \right) &= 11000,
\end{aligned} \]
which is the correct sum.

This example illustrates the point that large relative errors occur when a large number of small numbers is added to a large number, or when a very large number of small, almost-equal numbers is added. To avoid such large relative errors, one can sum from the smallest number to the largest number. However, this will not work if the numbers are all approximately equal. In such cases, one possibility is to group the numbers into sets of two adjacent numbers, summing two almost equal numbers together. One then groups those results into sets of two and sums these together, continuing until the total sum is reached. In this scheme, two almost equal numbers are always being summed, and the large relative error from repeatedly summing a small number to a large number is avoided.
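The forward and backward sums of Example 1.15 can be emulated by rounding every partial sum to four significant decimal digits. The helper fl is hypothetical and approximates a decimal machine using binary double precision; it assumes its argument is nonzero.

```matlab
t = 4;   % number of decimal digits carried
% Hypothetical helper: round x (nonzero) to t significant decimal digits.
fl = @(x) round(x .* 10.^(t - floor(log10(abs(x))) - 1)) ...
          ./ 10.^(t - floor(log10(abs(x))) - 1);

x = [10000, ones(1, 1000)];           % x_1 = 10000, x_2 = ... = x_1001 = 1

s_fwd = x(1);                         % sum forward, largest number first
for i = 2:numel(x)
    s_fwd = fl(s_fwd + x(i));
end

s_bwd = x(end);                       % sum backward, small numbers first
for i = numel(x)-1:-1:1
    s_bwd = fl(s_bwd + x(i));
end
fprintf('forward: %g   backward: %g\n', s_fwd, s_bwd);
```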
Example 1.16
$x_1 = 15.314768$, $x_2 = 15.314899$, $\beta = 10$, $t = 6$ (6-digit decimal accuracy). Then $x_2 - x_1 \approx fl(x_2) - fl(x_1) = 15.3149 - 15.3148 = 0.0001$. Thus,
\[ \left| \frac{x_2 - x_1 - (fl(x_2) - fl(x_1))}{x_2 - x_1} \right| = \frac{0.000131 - 0.0001}{0.000131} = 0.237 = 23.7\% \text{ relative accuracy.} \]

This example illustrates that large relative errors can occur when two nearly equal numbers are subtracted on a computer. Sometimes, an algorithm can be modified to reduce rounding error occurring from this source, as the following example illustrates.
Example 1.17
Consider finding the roots of $ax^2 + bx + c = 0$, where $b^2$ is large compared with $|4ac|$. The most common formula for the roots is
\[ x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}. \]
Consider $x^2 + 100x + 1 = 0$, $\beta = 10$, $t = 4$, $p = 2$, and 4-digit chopped arithmetic. Then
\[ x_1 = \frac{-100 + \sqrt{9996}}{2}, \quad x_2 = \frac{-100 - \sqrt{9996}}{2}, \]
but $\sqrt{9996} \approx 99.97$ (4 digit arithmetic chopped). Thus,
\[ x_1 \approx \frac{-100 + 99.97}{2}, \quad x_2 \approx \frac{-100 - 99.97}{2}. \]
Hence, $x_1 \approx -0.015$, $x_2 \approx -99.98$, but $x_1 = -0.010001$ and $x_2 = -99.989999$, so the relative errors in $x_1$ and $x_2$ are 50% and 0.01%, respectively.

Let's change the algorithm. Assume $b \ge 0$ (can always make $b \ge 0$). Then
\[ x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \cdot \frac{-b - \sqrt{b^2 - 4ac}}{-b - \sqrt{b^2 - 4ac}} = \frac{4ac}{2a(-b - \sqrt{b^2 - 4ac})} = \frac{-2c}{b + \sqrt{b^2 - 4ac}}, \]
and
\[ x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a} \quad \text{(the same as before).} \]
Then, for the above values,
\[ x_1 = \frac{-2(1)}{100 + \sqrt{9996}} \approx \frac{-2}{100 + 99.97} = -0.0100. \]
Now, the relative error in $x_1$ is also 0.01%.
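The same effect appears in double precision when $b^2$ is much larger than $4ac$. The sketch below uses the coefficients $a = 1$, $b = 10^8$, $c = 1$, chosen here for illustration (they are not from the text), and compares both formulas for $x_1$ against the reference value $-10^{-8}$, which agrees with the exact small root to about 16 digits.

```matlab
a = 1;  b = 1e8;  c = 1;              % illustrative values with b^2 >> 4ac
d = sqrt(b^2 - 4*a*c);

x1_naive  = (-b + d) / (2*a);         % subtracts two nearly equal numbers
x1_stable = -2*c / (b + d);           % rewritten form, no cancellation for b >= 0
x2        = (-b - d) / (2*a);         % same formula as before

x1_ref = -1e-8;                       % small root, accurate to about 16 digits
rel_naive  = abs(x1_naive  - x1_ref) / abs(x1_ref);
rel_stable = abs(x1_stable - x1_ref) / abs(x1_ref);
fprintf('relative error: naive %.2e, rewritten %.2e\n', rel_naive, rel_stable);
```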
Let us now consider error in function evaluation. Consider a single valued function $f(x)$, and let $x^* = fl(x)$ be the floating point approximation of $x$. Therefore, the machine evaluates $f(x^*) = f(fl(x))$, which is an approximate value of $f(x)$ at $x = x^*$. Then the perturbation in $f(x)$ for small perturbations in $x$ can be computed via Taylor's formula. This is illustrated in the next theorem.

THEOREM 1.7 The relative error in functional evaluation is
\[ \left| \frac{f(x) - f(x^*)}{f(x)} \right| \approx \left| \frac{x f'(x)}{f(x)} \right| \left| \frac{x - x^*}{x} \right|. \]

PROOF The linear Taylor approximation of $f(x^*)$ about $x$ for small values of $|x - x^*|$ is given by $f(x^*) \approx f(x) + f'(x)(x^* - x)$. Rearranging the terms immediately yields the result.

This leads us to the following definition.

DEFINITION 1.5 The condition number of a function $f(x)$ is
\[ \kappa_f(x) = \left| \frac{x f'(x)}{f(x)} \right|. \]

The condition number describes how large the relative error in function evaluation is with respect to the relative error in the machine representation of $x$. In other words, $\kappa_f(x)$ is a measure of the degree of sensitivity of the function at $x$.

Example 1.18
Let $f(x) = \sqrt{x}$. The condition number of $f(x)$ about $x$ is
\[ \kappa_f(x) = \left| \frac{x \cdot \frac{1}{2\sqrt{x}}}{\sqrt{x}} \right| = \frac{1}{2}. \]
This suggests that $f(x)$ is well-conditioned.
Example 1.19
Let $f(x) = \sqrt{x - 2}$. The condition number of $f(x)$ about $x$ is
\[ \kappa_f(x) = \left| \frac{x}{2(x - 2)} \right|. \]
This is not defined at $x^* = 2$. Hence, the function $f(x)$ is numerically unstable and ill-conditioned for values of $x$ close to 2.
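The condition numbers of Examples 1.18 and 1.19 can be evaluated directly from Definition 1.5. In the sketch below, kappa is a hypothetical helper that takes a function handle, a handle for its derivative, and the points of evaluation; the sample points are chosen here for illustration.

```matlab
% Hypothetical helper implementing kappa_f(x) = |x f'(x) / f(x)|.
kappa = @(f, df, x) abs(x .* df(x) ./ f(x));

f1 = @(x) sqrt(x);       df1 = @(x) 1 ./ (2*sqrt(x));
f2 = @(x) sqrt(x - 2);   df2 = @(x) 1 ./ (2*sqrt(x - 2));

k1 = kappa(f1, df1, 7);                       % the same value for any x > 0
k2 = kappa(f2, df2, 2 + [1e-1 1e-3 1e-6]);    % grows as x approaches 2

fprintf('kappa for sqrt(x) at x = 7: %.4f\n', k1);
fprintf('kappa for sqrt(x-2) near 2: %.4g  %.4g  %.4g\n', k2);
```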
REMARK 1.2 If $x = f(x) = 0$, then the condition number is simply $|f'(x)|$. If $x = 0$, $f(x) \ne 0$ (or $f(x) = 0$, $x \ne 0$), then it is more useful to consider the relation between absolute errors than relative errors. The condition number then becomes $|f'(x)/f(x)|$.

REMARK 1.3 Generally, if a numerical approximation $\tilde{z}$ to a quantity $z$ is computed, the relative error is related to the number of significant digits that are correct. For example, if $z = 0.0000123453$ and $\tilde{z} = 0.00001234543$, we say that $\tilde{z}$ is correct to 5 significant digits. Expressing $z$ as $0.123453 \times 10^{-4}$ and $\tilde{z}$ as $0.123454 \times 10^{-4}$, we see that, if we round $\tilde{z}$ to the nearest number with five digits in its mantissa, all of those digits are correct, whereas, if we do the same with six digits, the sixth digit is not correct. Significant digits is the more logical way to talk about accuracy in a floating point computation, where we are interested in relative error, rather than “number of digits after the decimal point,” which can have a different meaning. (Here, one might say that $\tilde{z}$ is correct to 9 digits after the decimal point.)
122 Practicalities and the IEEE Floating Point Standard
Prior to 1985, different machines used different word lengths and different bases, and different machines rounded, chopped, or did something else to form the internal representation $fl(x)$ for real numbers $x$. For example, IBM mainframes generally used hexadecimal arithmetic ($\beta = 16$), with 8 hexadecimal digits total (for the mantissa, sign, and exponent) in "single precision" numbers and 16 hexadecimal digits total in "double precision" numbers. Machines such as the Univac 1108 and Honeywell Multics systems used base $\beta = 2$ and 36 binary digits (or "bits") total in single precision numbers and 72 binary digits total in double precision numbers. An unusual machine designed at Moscow State University from 1955 to 1965, the "Setun," even used base-3 ($\beta = 3$, or "ternary") numbers. Some computers had 32 bits total in single precision numbers and 64 bits total in double precision numbers, while some "supercomputers" (such as the Cray-1) had 64 bits total in single precision numbers and 128 bits total in double precision numbers.
Some hand-held calculators in existence today (such as some Texas Instruments calculators) can be viewed as implementing decimal (base 10, $\beta = 10$) arithmetic, say, with $L = -999$ and $U = 999$, and $t = 14$ digits in the mantissa.

Except for the Setun (the value of whose ternary digits corresponded to "positive," "negative," and "neutral" in circuit elements or switches), digital computers are mostly based on binary switches or circuit elements (that is, "on" or "off"), so the base $\beta$ is usually 2 or a power of 2. For example, the IBM hexadecimal digit could be viewed as a group of 4 binary digits.$^2$
Older floating point implementations did not even always fit exactly into the model we have previously described. For example, if $x$ is a number in the system, then $-x$ may not have been a number in the system, or, if $x$ were a number in the system, then $1/x$ may have been too large to be representable in the system.
To promote predictability, portability, reliability, and rigorous error bounding in floating point computations, the Institute of Electrical and Electronics Engineers (IEEE) and American National Standards Institute (ANSI) published a standard for binary floating point arithmetic in 1985: IEEE/ANSI 754-1985: Standard for Binary Floating Point Arithmetic, often referenced as "IEEE-754," or simply "the IEEE standard."$^3$ Almost all computers in existence today, including personal computers and workstations based on Intel, AMD, Motorola, etc. chips, implement most of the IEEE standard.
In this standard, $\beta = 2$, 32 bits total are used in a single precision number (an "IEEE single"), and 64 bits total are used for a double precision number ("IEEE double"). In a single precision number, 1 bit is used for the sign, 8 bits are used for the exponent, and $t = 23$ bits are used for the mantissa. In double precision numbers, 1 bit is used for the sign, 11 bits are used for the exponent, and 52 bits are used for the mantissa. Because the leading mantissa bit of a normalized binary number is always 1, it is not stored, so the effective mantissa lengths are 24 and 53 bits. Thus, for single precision numbers, the stored exponent is between 0 and $(11111111)_2 = 255$, and 127 is subtracted from this, to get an exponent between $-127$ and $128$. In IEEE numbers, the minimum and maximum exponent are used to denote special symbols (such as infinity and "unnormalized" numbers), so the exponent in single precision represents magnitudes between $2^{-126} \approx 10^{-38}$ and $2^{127} \approx 10^{38}$. The mantissa for single precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{23} 2^{-i} = 2(1 - 2^{-24}) \approx 2$. Similarly, the exponent for double precision numbers is effectively between $2^{-1022} \approx 10^{-308}$ and $2^{1023} \approx 10^{308}$, while the mantissa for double precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{52} 2^{-i} \approx 2$.
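The bit fields just described can be inspected directly. The following Python sketch is an illustration only; the function names `single_fields` and `double_fields` are ours. It uses the standard `struct` module to reinterpret a number's IEEE bit pattern as an integer and then splits off the sign, stored exponent, and stored mantissa bits.

```python
import struct

def double_fields(x):
    """Split an IEEE double into (sign, stored exponent, stored mantissa bits)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                       # 1 bit
    stored_exponent = (bits >> 52) & 0x7FF  # 11 bits
    fraction = bits & ((1 << 52) - 1)       # 52 bits
    return sign, stored_exponent, fraction

def single_fields(x):
    """Split an IEEE single into (sign, stored exponent, stored mantissa bits)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & ((1 << 23) - 1)

print(single_fields(1.0))      # stored exponent 127 represents 2**0
print(double_fields(1.0))      # stored exponent 1023 represents 2**0
print(single_fields(0.15625))  # 0.15625 = 1.25 * 2**-3
```

The stored exponent of 1.0 shows the offset that is subtracted in each format, and the largest stored exponent (255 or 2047) appears only for special values such as infinity.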
Summarizing, the parameters for IEEE arithmetic appear in Table 1.1.

TABLE 1.1: Parameters for IEEE arithmetic

precision   $\beta$   $L$       $U$      $t$
single      2         $-126$    127      24
double      2         $-1022$   1023     53

In many numerical computations, such as solving the large linear systems arising from partial differential equation models, more digits or a larger exponent range is required than is available with IEEE single precision. For this reason, many numerical analysts at present have adopted IEEE double precision as the default precision. For example, underlying computations in the popular computational environment matlab are done in IEEE double precision.

$^2$ An exception is in some systems for business calculations, where base 10 is implemented.
$^3$ An update to the 1985 standard was made in 2008. This update gives clarifications of certain ambiguous points, provides certain extensions, and specifies a standard for decimal arithmetic.
IEEE arithmetic provides four ways of defining $fl(x)$, that is, four "rounding modes," namely, "round down," "round up," "round to nearest," and "round to zero." The four elementary operations $+$, $-$, $\times$, and $/$, as well as the square root, must be such that $fl(x \circ y)$ is implemented for all four rounding modes, for $\circ \in \{-, +, \times, /, \sqrt{\cdot}\}$.

The default mode (if the rounding mode is not explicitly set) is normally "round to nearest," to give an approximation after a long string of computations that is hopefully near the exact value. If the mode is set to "round down" and a string of computations is done in which each result increases with its operands (for example, a sum), then the result is less than or equal to the exact result. Similarly, if the mode is set to "round up," then the result of such a string of computations is greater than or equal to the exact result. In this way, mathematically rigorous bounds on an exact result can be obtained. (This technique must be used astutely, since naive use could result in bounds that are too large to be meaningful.)
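The effect of the four rounding modes can be imitated in software. The sketch below is an illustration under stated assumptions, not IEEE binary arithmetic: it uses Python's `decimal` module with a 4-digit decimal mantissa, and the mapping from the names used above to the `decimal` rounding constants, as well as the helper names `fl_divide` and `harmonic_sum`, are ours.

```python
from decimal import (Context, Decimal, ROUND_CEILING, ROUND_DOWN,
                     ROUND_FLOOR, ROUND_HALF_EVEN)
from fractions import Fraction

MODES = {
    "round down": ROUND_FLOOR,            # toward -infinity
    "round up": ROUND_CEILING,            # toward +infinity
    "round to nearest": ROUND_HALF_EVEN,  # ties go to the even neighbor
    "round to zero": ROUND_DOWN,          # truncate
}

def fl_divide(p, q, mode, digits=4):
    """fl(p / q) in a decimal system with `digits` mantissa digits."""
    ctx = Context(prec=digits, rounding=MODES[mode])
    return ctx.divide(Decimal(p), Decimal(q))

def harmonic_sum(n, mode, digits=4):
    """1 + 1/2 + ... + 1/n with every operation rounded in the given mode."""
    ctx = Context(prec=digits, rounding=MODES[mode])
    total = Decimal(0)
    for k in range(1, n + 1):
        total = ctx.add(total, ctx.divide(Decimal(1), Decimal(k)))
    return total

for mode in MODES:
    print(mode, fl_divide(-2, 3, mode))

lower = harmonic_sum(10, "round down")
upper = harmonic_sum(10, "round up")
exact = sum(Fraction(1, k) for k in range(1, 11))
print(lower, float(exact), upper)
```

Because every term in the sum is positive, the "round down" and "round up" results bracket the exact rational value.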
Several parameters more directly related to numerical computations than $L$, $U$, and $t$ are associated with any floating point number system. These are

HUGE: the largest representable number in the floating point system,

TINY: the smallest positive representable number in the floating point system,

$\epsilon_m$: the machine epsilon, the smallest positive number which, when added to 1, gives something other than 1 when using the rounding mode "round to the nearest."

These so-called "machine constants" appear in Table 1.2 for the IEEE single and IEEE double precision number systems.
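For IEEE double precision, these definitions can be checked directly. The following is a minimal sketch in Python (whose `float` is an IEEE double on common platforms); it assumes Python 3.9 or later for `math.nextafter`, and the variable names are ours. Note that `sys.float_info.epsilon` is the gap between 1 and the next larger double, which is about twice the machine epsilon as defined above.

```python
import math
import sys

HUGE = sys.float_info.max         # largest finite double
TINY = sys.float_info.min         # smallest positive normalized double
spacing = sys.float_info.epsilon  # 2**-52, gap between 1 and the next double

half_spacing = 2.0 ** -53
eps_m = math.nextafter(half_spacing, 1.0)  # next double above 2**-53

print(1.0 + half_spacing == 1.0)  # an exact tie rounds back to 1
print(1.0 + eps_m > 1.0)          # slightly more than a tie rounds up
print(1.0 / TINY < HUGE, 1.0 / HUGE < TINY)
```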
For IEEE arithmetic, 1/TINY < HUGE, but 1/HUGE < TINY. This brings up the question of what happens when the result of a computation has absolute value less than the smallest number representable in the system, or has absolute value greater than the largest number representable in the system. In the first case, an underflow occurs, while in the second case, an overflow occurs. In floating point computations, it is usually (but not always) reasonable to replace the result of an underflow by 0, but it is usually more problematical when an overflow occurs. Many systems prior to the IEEE standard replaced an underflow by 0, but stopped when an overflow occurred.

TABLE 1.2: Machine constants for IEEE arithmetic

Precision   HUGE                                                         TINY                                        $\epsilon_m$
single      $(2 - 2^{-23}) \cdot 2^{127} \approx 3.40 \cdot 10^{38}$     $2^{-126} \approx 1.18 \cdot 10^{-38}$      $2^{-24} + 2^{-45} \approx 5.96 \cdot 10^{-8}$
double      $(2 - 2^{-52}) \cdot 2^{1023} \approx 1.79 \cdot 10^{308}$   $2^{-1022} \approx 2.23 \cdot 10^{-308}$    $2^{-53} + 2^{-105} \approx 1.11 \cdot 10^{-16}$
The IEEE standard specifies representations for special numbers $\infty$, $-\infty$, $+0$, $-0$, and NaN, where the latter represents "not a number." The standard specifies that computations do not stop when an overflow or underflow occurs, or when quantities such as $\sqrt{-1}$, $1/0$, $-1/0$, etc. are encountered (although many programming languages by default or optionally do stop). For example, the result of an overflow is set to $\infty$, whereas the result of $\sqrt{-1}$ is set to NaN, and computation continues. The standard also specifies "gradual underflow," that is, setting the result to a "denormalized" number, or a number in the floating point format whose first digit in the mantissa is equal to 0. Computation rules for these special numbers, such as NaN $\times$ any number = NaN, $\infty \times$ any positive normalized number = $\infty$, allow such "nonstop" arithmetic.
Although the IEEE nonstop arithmetic is useful in many contexts, the numerical analyst should be aware of it and be cautious in interpreting results. In particular, algorithms may not behave as expected if many intermediate results contain $\infty$ or NaN, and the accuracy is less than expected when denormalized numbers are used. In fact, many programming languages, by default or with a controllable option, stop if $\infty$ or NaN occurs, but implement IEEE nonstop arithmetic with an option.
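Both behaviors can be seen in a single language. The sketch below is an illustration in Python, where ordinary `float` multiplication and subtraction follow the nonstop rules, while the library function `math.sqrt` stops with an exception on a negative argument. The variable names are ours.

```python
import math
import sys

huge = sys.float_info.max
tiny = sys.float_info.min

overflowed = huge * 2.0                 # overflow: result is set to infinity
not_a_number = overflowed - overflowed  # infinity - infinity is NaN
propagated = not_a_number * 0.0         # NaN times any number is NaN
subnormal = tiny / 4.0                  # gradual underflow: denormalized result
flushed = tiny * tiny                   # far below the denormalized range: 0

try:
    math.sqrt(-1.0)
    sqrt_stopped = False
except ValueError:                      # this library call stops instead
    sqrt_stopped = True

print(overflowed, not_a_number, subnormal, flushed, sqrt_stopped)
```

Since NaN compares unequal to everything, including itself, tests such as `math.isnan` are the reliable way to detect it in intermediate results.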
Example 1.20
IEEE double precision floating point arithmetic underlies most computations in matlab. (This is true even if only four decimal digits are displayed.) One obtains the machine epsilon with the function eps, one obtains TINY with the function realmin, and one obtains HUGE with the function realmax. Observe the following matlab dialog:
>> epsm = eps(1d0)
epsm = 2.2204e-016
>> TINY = realmin
TINY = 2.2251e-308
>> HUGE = realmax
HUGE = 1.7977e+308
>> 1/TINY
ans = 4.4942e+307
>> 1/HUGE
ans = 5.5627e-309
>> HUGE^2
ans = Inf
>> TINY^2
ans = 0
>> new_val = 1+epsm
new_val = 1.0000
>> new_val - 1
ans = 2.2204e-016
>> too_small = epsm/2
too_small = 1.1102e-016
>> not_new = 1+too_small
not_new = 1
>> not_new - 1
ans = 0
>>
Example 1.21
(Illustration of underflow and overflow) Suppose, for the purposes of illustration, we have a system with $\beta = 10$, $t = 2$, and one digit in the exponent, so that the positive numbers in the system range from $0.10 \times 10^{-9}$ to $0.99 \times 10^{9}$, and suppose we wish to compute $N = \sqrt{x_1^2 + x_2^2}$, where $x_1 = x_2 = 10^6$. Then both $x_1$ and $x_2$ are exactly represented in the system, and the nearest floating point number in the system to $N$ is $0.14 \times 10^7$, well within range. However, $x_1^2 = 10^{12}$, larger than the maximum floating point number in the system. In older systems, an overflow usually would result in stopping the computation, while in IEEE arithmetic, the result would be assigned the symbol "Infinity." The result of adding "Infinity" to "Infinity" then taking the square root would be "Infinity," so that $N$ would be assigned "Infinity." Similarly, if $x_1 = x_2 = 10^{-6}$, then $x_1^2 = 10^{-12}$, smaller than the smallest representable machine number, causing an "underflow." On older systems, the result is usually set to 0. On IEEE systems, if "gradual underflow" is switched on, the result either becomes a denormalized number, with less than full accuracy, or is set to 0; without gradual underflow on IEEE systems, the result is set to 0. When the result is set to 0, a value of 0 is stored in $N$, whereas the closest floating point number in the system is $0.14 \times 10^{-5}$, well within range. To avoid this type of catastrophic underflow and overflow in the computation of $N$, we may use the following scheme.
1. $s \leftarrow \max\{|x_1|, |x_2|\}$.

2. $\eta_1 \leftarrow x_1/s$; $\eta_2 \leftarrow x_2/s$.

3. $N \leftarrow s\sqrt{\eta_1^2 + \eta_2^2}$.
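The three-step scheme above can be sketched in Python with IEEE doubles, using magnitudes near the ends of the double precision range in place of the toy decimal system. The function names are ours, and the guard for $s = 0$ is an addition of ours to avoid dividing by zero; it is not part of the scheme as stated.

```python
import math

def naive_norm(x1, x2):
    """sqrt(x1**2 + x2**2) computed directly; the squares may over- or underflow."""
    return math.sqrt(x1 * x1 + x2 * x2)

def scaled_norm(x1, x2):
    """The same quantity computed with the scaling scheme."""
    s = max(abs(x1), abs(x2))               # step 1
    if s == 0.0:                            # assumption: both inputs are zero
        return 0.0
    eta1, eta2 = x1 / s, x2 / s             # step 2: both now lie in [-1, 1]
    return s * math.sqrt(eta1 * eta1 + eta2 * eta2)  # step 3

print(naive_norm(1e200, 1e200), scaled_norm(1e200, 1e200))
print(naive_norm(1e-200, 1e-200), scaled_norm(1e-200, 1e-200))
```

In the first line the naive result overflows to infinity, and in the second it underflows to 0, while the scaled results are close to $\sqrt{2}$ times the inputs.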
1.2.2.1 Input and Output

For examining the output to large numerical computations arising from mathematical models, plots, graphs, and movies composed of such plots and graphs are often preferred over tables of values. However, to develop such models and study numerical algorithms, it is necessary to examine individual numbers. Because humans are trained to comprehend decimal numbers more easily than binary numbers, the binary format used in the machine is usually converted to a decimal format for display or printing. In many programming languages and environments (such as all versions of Fortran, C, C++, and in matlab), the format is of a form similar to $\pm d_1.d_2 d_3 \ldots d_m \mathrm{e} \pm \delta_1 \delta_2 \delta_3$, or $\pm d_1.d_2 d_3 \ldots d_m \mathrm{E} \pm \delta_1 \delta_2 \delta_3$, where the "e" or "E" denotes the "exponent" of 10. For example, -1.00e+003 denotes $-1 \times 10^3 = -1000$. Numbers are usually also input either in a standard decimal form (such as 0.001), or in this exponential format (such as 1.0e-3). (This notation originates from the earliest computers, where the only output was a printer, and the printer could only print numerical digits and the 26 upper case letters in the Roman alphabet.)
Thus, for input, a decimal fraction needs to be converted to a binary floating point number, while, for output, a binary floating point number needs to be converted to a decimal fraction. This conversion necessarily is inexact. For example, the exact decimal fraction 0.1 converts to the infinitely repeating binary expansion $(0.0\overline{0011})_2$, which needs to be rounded into the binary floating point system. The IEEE 754 standard specifies that the result of a decimal to binary conversion, within a specified range of input formats, be the nearest floating point number to the exact result, over a specified range, and that, within a specified range of formats, a binary to decimal conversion be the nearest number in the specified format (which depends on the number $m$ of decimal digits requested to be printed).
Thus, the number that one sees as output is usually not exactly the number that is represented in the computer. Furthermore, while the floating point operations on binary numbers are usually implemented in hardware or "firmware" independently of the software system, the decimal to binary and binary to decimal conversions are usually implemented separately as part of the programming language (such as Fortran, C, C++, Java, etc.) or software system (such as matlab). The individual standards for these languages, if there are any, may not specify accuracy for such conversions, and the languages sometimes do not conform to the IEEE standard. That is, the number that one sees printed may not even be the closest number in that format to the actual number.
This inexactness in conversion usually does not cause a problem, but may cause much confusion in certain instances. In those instances (such as in "debugging," or finding programming blunders), one may need to examine the binary numbers directly. One way of doing this is in an "octal," or base-8 format, in which each digit (between 0 and 7) is interpreted as a group of three binary digits, or in hexadecimal format (where the digits are 0-9, A, B, C, D, E, F), in which each digit corresponds to a group of four binary digits.
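As an illustration of the conversions discussed above, the following Python sketch looks at the double that is actually stored when the decimal fraction 0.1 is input. It uses only standard library features; the variable names are ours. `Decimal(0.1)` gives the exact decimal value of the stored binary number, and `float.hex` shows its mantissa in hexadecimal, four binary digits per hexadecimal digit.

```python
from decimal import Decimal

stored = Decimal(0.1)            # exact value of the double nearest to 0.1
hex_digits = (0.1).hex()         # hexadecimal mantissa and binary exponent
short = format(0.1, ".4e")       # what a 5-digit exponential printout shows
round_trip = float(repr(0.1)) == 0.1

print(stored)
print(hex_digits)
print(short)
print(0.1 + 0.2 == 0.3)          # inexact inputs, so the comparison fails
```

The repeating hexadecimal digit 9 is the group $(1001)_2$, which reflects the repeating binary expansion of 0.1, and the final digit shows where the expansion was rounded.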
1.2.2.2 Standard Functions

To enable accurate computation of elementary functions such as sin, cos, and exp, IEEE 754 specifies that a "long" 80-bit register (with "guard digits") be available for intermediate computations. Furthermore, IEEE 754-2008, an official update to IEEE 754-1985, provides a list of functions it recommends be implemented, and specifies accuracy requirements (in terms of correct rounding), for those functions a programming language elects to implement.
REMARK 1.4 Alternative number systems, such as variable precision arithmetic, multiple precision arithmetic, rational arithmetic, and combinations of approximate and symbolic arithmetic have been investigated and implemented. These have various advantages over the traditional floating point arithmetic we have been discussing, but also have disadvantages, and usually require more time, more circuitry, or both. Eventually, with the advance of computer hardware and better understanding of these alternative systems, their use may become more ubiquitous. However, for the foreseeable future, traditional floating point number systems will be the primary tool in numerical computations.
1.3 Interval Computations

Interval computations are useful for two main purposes:

• to use floating point computations to compute mathematically rigorous bounds on an exact result (and hence to rigorously bound roundoff error);

• to use floating point computations to compute mathematically rigorous bounds on the ranges of functions over boxes.

In complicated traditional floating point algorithms, naive arrangement of interval computations usually gives bounds that are too wide to be of practical use. For this reason, interval computations have been ignored by many. However, used cleverly and where appropriate, interval computations are powerful, and provide rigor and validation when other techniques cannot.

Interval computations are based on interval arithmetic.
1.3.1 Interval Arithmetic

In interval arithmetic, we define operations on intervals, which can be considered as ordered pairs of real numbers. We can think of each interval as representing the range of possible values of a quantity. The result of an operation is then an interval that represents the range of possible results of the operation as the first argument ranges over all points in the first interval and the second argument ranges over all values in the second interval. To state this symbolically, let $\boldsymbol{x} = [\underline{x}, \overline{x}]$ and $\boldsymbol{y} = [\underline{y}, \overline{y}]$, and define the four elementary operations by

$$\boldsymbol{x} \circ \boldsymbol{y} = \{ x \circ y \mid x \in \boldsymbol{x} \text{ and } y \in \boldsymbol{y} \} \quad \text{for } \circ \in \{+, -, \times, \div\}. \tag{1.1}$$
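Interval arithmetic's usefulness derives from the fact that the mathematical characterization in Equation (1.1) is equivalent to the following operational definitions:

$$\begin{aligned}
\boldsymbol{x} + \boldsymbol{y} &= [\underline{x} + \underline{y},\ \overline{x} + \overline{y}], \\
\boldsymbol{x} - \boldsymbol{y} &= [\underline{x} - \overline{y},\ \overline{x} - \underline{y}], \\
\boldsymbol{x} \times \boldsymbol{y} &= [\min\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\},\ \max\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\}], \\
\frac{1}{\boldsymbol{x}} &= \left[\frac{1}{\overline{x}},\ \frac{1}{\underline{x}}\right] \quad \text{if } \underline{x} > 0 \text{ or } \overline{x} < 0, \\
\boldsymbol{x} \div \boldsymbol{y} &= \boldsymbol{x} \times \frac{1}{\boldsymbol{y}}.
\end{aligned} \tag{1.2}$$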
The ranges of the four elementary interval arithmetic operations are exactly the ranges of the corresponding real operations, but, if such operations are composed, bounds on the ranges of real functions can be obtained. For example, if

$$f(x) = (x + 1)(x - 1), \tag{1.3}$$

then

$$\boldsymbol{f}([-2, 2]) = \big([-2, 2] + 1\big)\big([-2, 2] - 1\big) = [-1, 3][-3, 1] = [-9, 3],$$

which contains the exact range $[-1, 3]$.
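The operational definitions (1.2) translate almost line for line into code. The following Python class is a minimal sketch for illustration only: the class name `Interval` and its methods are ours, end points are computed in ordinary arithmetic with no outward rounding, and division is refused when the divisor contains 0.

```python
class Interval:
    """Closed interval [lo, hi]; end point arithmetic is not outwardly rounded."""

    def __init__(self, lo, hi=None):
        self.lo, self.hi = lo, (lo if hi is None else hi)

    @staticmethod
    def _wrap(v):
        # Treat a plain number c as the degenerate interval [c, c].
        return v if isinstance(v, Interval) else Interval(v)

    def __add__(self, other):
        o = Interval._wrap(other)
        return Interval(self.lo + o.lo, self.hi + o.hi)

    def __sub__(self, other):
        o = Interval._wrap(other)
        return Interval(self.lo - o.hi, self.hi - o.lo)

    def __mul__(self, other):
        o = Interval._wrap(other)
        p = (self.lo * o.lo, self.lo * o.hi, self.hi * o.lo, self.hi * o.hi)
        return Interval(min(p), max(p))

    def __truediv__(self, other):
        o = Interval._wrap(other)
        if o.lo <= 0 <= o.hi:
            raise ZeroDivisionError("divisor interval contains 0")
        return self * Interval(1 / o.hi, 1 / o.lo)

    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"

x = Interval(-2, 2)
f_of_x = (x + 1) * (x - 1)        # f from (1.3), evaluated over [-2, 2]
quotient = Interval(1, 2) / Interval(2, 4)
print(f_of_x, quotient)
```

Evaluating `(x + 1) * (x - 1)` reproduces the hand computation above.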
REMARK 1.5 In some definitions of interval arithmetic, division by intervals containing 0 is defined, consistent with (1.1). For example,

$$\frac{[1, 2]}{[-3, 4]} = \left[-\infty, -\frac{1}{3}\right] \cup \left[\frac{1}{4}, \infty\right] = \mathbb{R}^* \setminus \left(-\frac{1}{3}, \frac{1}{4}\right),$$

where $\mathbb{R}^*$ is the extended real number system,$^4$ consisting of the real numbers with the two additional numbers $-\infty$ and $\infty$. This extended interval arithmetic$^5$ was originally invented by William Kahan$^6$ for computations with continued fractions, but has wider use than that. Although a closed system can be defined for the sets arising from this extended arithmetic, typically, the complements of intervals (i.e., the unions of two semi-infinite intervals) are immediately intersected with intervals, to obtain zero, one, or two intervals. Interval arithmetic can then proceed using (1.2).

$^4$ also known as the two-point compactification of the real numbers.
The power of interval arithmetic lies in its implementation on computers. In particular, outwardly rounded interval arithmetic allows rigorous enclosures for the ranges of operations and functions. This makes a qualitative difference in scientific computations, since the results are now intervals in which the exact result must lie. It also enables use of floating point computations for automated theorem proving.
Outward rounding can be implemented on any machine that has downward rounding and upward rounding, such as any machine that complies with the IEEE 754 standard. For example, take $\boldsymbol{x} + \boldsymbol{y} = [\underline{x} + \underline{y}, \overline{x} + \overline{y}]$. If $\underline{x} + \underline{y}$ is computed with downward rounding, and $\overline{x} + \overline{y}$ is computed with upward rounding, then the resulting interval $\boldsymbol{z} = [\underline{z}, \overline{z}]$ that is represented in the machine must contain the exact range of $x + y$ for $x \in \boldsymbol{x}$ and $y \in \boldsymbol{y}$. We call the expansion of the interval from rounding the lower end point down and the upper end point up roundout error.
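Standard Python does not expose the hardware rounding mode, so the sketch below only simulates outward rounding: each end point is computed in round-to-nearest and then moved one floating point number outward with `math.nextafter` (Python 3.9 or later). This is an assumption-laden stand-in for true directed rounding; the result still encloses the exact range, but may be slightly wider than necessary. The function name `add_outward` is ours.

```python
import math
from fractions import Fraction

def add_outward(x, y):
    """Enclosure of x + y for intervals given as (lo, hi) pairs of floats."""
    lo = math.nextafter(x[0] + y[0], -math.inf)  # step the lower end point down
    hi = math.nextafter(x[1] + y[1], math.inf)   # step the upper end point up
    return lo, hi

x = (0.1, 0.1)
y = (0.2, 0.2)
z = add_outward(x, y)

# Exact sum of the two stored doubles, computed in rational arithmetic.
exact = Fraction(0.1) + Fraction(0.2)
print(z, Fraction(z[0]) <= exact <= Fraction(z[1]))
```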
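Interval arithmetic is only subdistributive. That is, if $\boldsymbol{x}$, $\boldsymbol{y}$, and $\boldsymbol{z}$ are intervals, then

$$\boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) \subseteq \boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z}, \quad \text{but} \quad \boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) \neq \boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z} \text{ in general.} \tag{1.4}$$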
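A small worked instance of (1.4), with intervals chosen by us for illustration, shows that the inclusion can be strict. Take $\boldsymbol{x} = [1, 2]$, $\boldsymbol{y} = [1, 1]$, and $\boldsymbol{z} = [-1, -1]$. Using the definitions (1.2),

```latex
\begin{aligned}
\boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) &= [1, 2]\,\big([1, 1] + [-1, -1]\big) = [1, 2]\,[0, 0] = [0, 0], \\
\boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z} &= [1, 2]\,[1, 1] + [1, 2]\,[-1, -1] = [1, 2] + [-2, -1] = [-1, 1].
\end{aligned}
```

Here $[0, 0] \subseteq [-1, 1]$, but the two sides are not equal, because $\boldsymbol{x}$ appears twice on the right and its two occurrences are treated as varying independently.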
As a result, algebraic expressions that would be equivalent if real values are substituted for the variables are not equivalent if interval values are used. For example, instead of writing $(x - 1)(x + 1)$ for $f(x)$ in (1.3), suppose we write

$$f(x) = x^2 - 1, \tag{1.5}$$

and suppose we provide a routine that computes an enclosure for the range of $x^2$ that is the exact range to within roundoff error. Such a routine could be as follows:
ALGORITHM 1.1
(Computing an interval whose end points are machine numbers and which encloses the range of $x^2$.)

INPUT: $\boldsymbol{x} = [\underline{x}, \overline{x}]$.

OUTPUT: a machine-representable interval that contains the range of $x^2$ over $\boldsymbol{x}$.

IF $\underline{x} \ge 0$ THEN
   RETURN $[\underline{x}^2, \overline{x}^2]$, where $\underline{x}^2$ is computed with downward rounding and $\overline{x}^2$ is computed with upward rounding.
ELSE IF $\overline{x} \le 0$ THEN
   RETURN $[\overline{x}^2, \underline{x}^2]$, where $\overline{x}^2$ is computed with downward rounding and $\underline{x}^2$ is computed with upward rounding.
ELSE
   1. Compute $\underline{x}^2$ and $\overline{x}^2$ with both downward and upward rounding; that is, compute $\underline{x}^2_l$ and $\underline{x}^2_u$ such that $\underline{x}^2_l$ and $\underline{x}^2_u$ are machine representable numbers and $\underline{x}^2 \in [\underline{x}^2_l, \underline{x}^2_u]$, and compute $\overline{x}^2_l$ and $\overline{x}^2_u$ such that $\overline{x}^2_l$ and $\overline{x}^2_u$ are machine representable numbers and $\overline{x}^2 \in [\overline{x}^2_l, \overline{x}^2_u]$.
   2. RETURN $[0, \max\{\underline{x}^2_u, \overline{x}^2_u\}]$.
END IF

END ALGORITHM 1.1

$^5$ There are small differences in current definitions of extended interval arithmetic. For example, in some systems, $-\infty$ and $\infty$ are not considered numbers, but just descriptive symbols. In those systems, $[1, 2]/[-3, 4] = (-\infty, -1/3] \cup [1/4, \infty) = \mathbb{R} \setminus (-1/3, 1/4)$. See [31] for a theoretical analysis of extended arithmetic.
$^6$ who also was a major contributor to the IEEE 754 standard.
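Algorithm 1.1 can be sketched in Python as follows. As an assumption, downward and upward rounding are simulated by computing in round-to-nearest and stepping one floating point number outward with `math.nextafter` (Python 3.9 or later), so the enclosure may be slightly wider than with true directed rounding. The names `interval_square`, `_down`, and `_up` are ours, and the clamp at 0 in `_down` is our addition, valid because a square is never negative.

```python
import math

def _down(v):
    """Simulated downward rounding of a computed square (never below 0)."""
    return max(0.0, math.nextafter(v, -math.inf))

def _up(v):
    """Simulated upward rounding."""
    return math.nextafter(v, math.inf)

def interval_square(lo, hi):
    """Machine interval (as a pair) containing {x**2 : lo <= x <= hi}."""
    if lo >= 0:                 # interval lies to the right of 0
        return _down(lo * lo), _up(hi * hi)
    if hi <= 0:                 # interval lies to the left of 0
        return _down(hi * hi), _up(lo * lo)
    # 0 is interior: the minimum of x**2 is exactly 0
    return 0.0, max(_up(lo * lo), _up(hi * hi))

square = interval_square(-2.0, 2.0)
print(square)
```

For the input $[-2, 2]$ the lower end point is exactly 0, and the upper end point is the first floating point number above 4, rather than $-4$ and 4 as a general interval product of $[-2, 2]$ with itself would give.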
With Algorithm 1.1 and rewriting $f(x)$ from (1.3) as in (1.5), we obtain

$$\boldsymbol{f}([-2, 2]) = [-2, 2]^2 - 1 = [0, 4] - 1 = [-1, 3],$$

which, in this case, is equal to the exact range of $f$ over $[-2, 2]$.

In fact, this illustrates a general principle: If each variable in the expression occurs only once, then interval arithmetic gives the exact range, to within roundout error. We state this formally as
THEOREM 1.8
(Fundamental theorem of interval arithmetic) Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with interval arithmetic. The algebraic expression or computer program can contain the four elementary operations and operations such as $x^n$, $\sin(x)$, $\exp(x)$, and $\log(x)$, etc., as long as the interval values of these functions contain their range over the input intervals. Then

1. The interval value $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ contains the range of $f$ over the interval vector (or box) $(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$.

2. If the single functions (the elementary operations and functions $x^n$, etc.) have interval values that represent their exact ranges, and if each variable $x_i$, $1 \le i \le n$, occurs only once in the expression for $f$, then the values of $\boldsymbol{f}$ obtained by interval arithmetic represent the exact ranges of $f$ over the input intervals.
If the expression for $f$ contains one or more variables more than once, then overestimation of the range can occur due to interval dependency. For example, when we evaluate our example function $\boldsymbol{f}([-2, 2])$ according to (1.3), the first factor, $[-1, 3]$, is the exact range of $x + 1$ for $x \in [-2, 2]$, while the second factor, $[-3, 1]$, is the exact range of $x - 1$ for $x \in [-2, 2]$. Thus, $[-9, 3]$ is the exact range of $\tilde{f}(x_1, x_2) = (x_1 + 1)(x_2 - 1)$ for $x_1$ and $x_2$ independent, $x_1 \in [-2, 2]$, $x_2 \in [-2, 2]$.
We now present some definitions and theorems to clarify the practical consequences of interval dependency.

DEFINITION 1.6 An expression for $f(x_1, \ldots, x_n)$ which is written so that each variable occurs only once is called a single use expression, or SUE.

Fortunately, we do not need to transform every expression into a single use expression for interval computations to be of value. In particular, the interval dependency becomes less as the widths of the input intervals become smaller. The following formal definition will help us to describe this precisely.
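DEFINITION 1.7 Suppose an interval evaluation $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ gives $[a, b]$ as a result interval, but the exact range $\{f(x_1, \ldots, x_n),\ x_i \in \boldsymbol{x}_i,\ 1 \le i \le n\}$ is $[c, d] \subseteq [a, b]$. We define the excess width $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ in the interval evaluation $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ by $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = (c - a) + (b - d)$.

For example, the excess width in evaluating $f(x)$ represented as $(x + 1)(x - 1)$ over $\boldsymbol{x} = [-2, 2]$ is $(-1 - (-9)) + (3 - 3) = 8$. In general, we have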
THEOREM 1.9
Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with interval arithmetic, as in Theorem 1.8, to obtain an interval enclosure $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ to the range of $f$ for $x_i \in \boldsymbol{x}_i$, $1 \le i \le n$. Then, if $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ is as in Definition 1.7, we have

$$E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = O\left(\max_{1 \le i \le n} w(\boldsymbol{x}_i)\right),$$

where $w(\boldsymbol{x})$ denotes the width of the interval $\boldsymbol{x}$.

That is, the overestimation becomes less as the uncertainty in the arguments to the function becomes smaller.

Interval evaluations as in Theorem 1.9 are termed first-order interval extensions. It is not difficult to obtain second-order extensions, where required. (See Exercise below.)
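The first-order behavior in Theorem 1.9 can be observed on the running example. The following Python sketch is an illustration under our own choices: it evaluates $(x + 1)(x - 1)$ over the symmetric intervals $[-h, h]$ in exact rational arithmetic (so that no roundout error enters), compares with the exact range of $x^2 - 1$, and reports the excess width from Definition 1.7. All function names are ours.

```python
from fractions import Fraction

def mul(x, y):
    """Interval product of two (lo, hi) pairs."""
    p = [a * b for a in x for b in y]
    return min(p), max(p)

def natural_extension(x):
    """Interval value of (x + 1)(x - 1), with x occurring twice."""
    return mul((x[0] + 1, x[1] + 1), (x[0] - 1, x[1] - 1))

def exact_range(x):
    """Exact range of x**2 - 1 over x = (lo, hi)."""
    lo, hi = x
    squares = (lo * lo, hi * hi)
    smallest = 0 if lo <= 0 <= hi else min(squares)
    return smallest - 1, max(squares) - 1

def excess_width(x):
    a, b = natural_extension(x)
    c, d = exact_range(x)
    return (c - a) + (b - d)

for h in (Fraction(2), Fraction(1, 2), Fraction(1, 8), Fraction(1, 32)):
    x = (-h, h)
    print(f"w = {float(2 * h):8.5f}   E = {float(excess_width(x)):8.5f}")
```

For $h = 2$ this reproduces the excess width 8 computed above, and for the narrower intervals the excess width shrinks roughly in proportion to the width.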
1.3.2 Application of Interval Arithmetic: Examples

We give one such example here.
Example 1.22
Using 4-digit decimal floating point arithmetic, compute an interval enclosure for the first two digits of $e$, and prove that these two digits are correct.

Solution: The fifth degree Taylor polynomial representation for $e$ is

$$e = 1 + 1 + \frac{1}{2!} + \frac{1}{3!} + \frac{1}{4!} + \frac{1}{5!} + \frac{1}{6!} e^{\xi}$$

for some $\xi \in [0, 1]$. If we assume we know $e < 3$ and we assume we know $e^x$ is an increasing function of $x$, then the error term is bounded by

$$\left| \frac{1}{6!} e^{\xi} \right| \le \frac{3}{6!} < 0.005,$$

so this fifth-degree polynomial representation should be adequate. We will evaluate each term with interval arithmetic, and we will replace $e^{\xi}$ with $[1, 3]$. We obtain the following computation:
[1.000, 1.000] + [1.000, 1.000] → [2.000, 2.000]
[1.000, 1.000]/[2.000, 2.000] → [0.5000, 0.5000]
[2.000, 2.000] + [0.5000, 0.5000] → [2.500, 2.500]
[1.000, 1.000]/[6.000, 6.000] → [0.1666, 0.1667]
[2.500, 2.500] + [0.1666, 0.1667] → [2.666, 2.667]
[1.000, 1.000]/[24.00, 24.00] → [0.04166, 0.04167]
[2.666, 2.667] + [0.04166, 0.04167] → [2.707, 2.709]
[1.000, 1.000]/[120.0, 120.0] → [0.008333, 0.008334]
[2.707, 2.709] + [0.008333, 0.008334] → [2.715, 2.718]
[1.000, 1.000]/[720.0, 720.0] → [0.001388, 0.001389]
[0.001388, 0.001389] × [1, 3] → [0.001388, 0.004167]
[2.715, 2.718] + [0.001388, 0.004167] → [2.716, 2.723]

Since we used outward rounding in these computations, this constitutes a mathematical proof that $e \in [2.716, 2.723]$.

Note:

1. These computations can be done automatically on a computer, as simply as evaluating the function in floating point arithmetic. We will explain some programming techniques for this in Chapter 6, Section 6.2.

2. The solution is illustrative. More sophisticated methods, such as argument reduction, would be used in practice to bound values of $e^x$ more accurately and with fewer operations.
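The hand computation in Example 1.22 can be reproduced with Python's `decimal` module, which allows both a 4-digit decimal mantissa and directed rounding. This is a sketch for illustration and is not the technique of Chapter 6; the helper names `add`, `recip`, and `mul_pos` are ours, and `mul_pos` assumes nonnegative end points, which is all this example needs.

```python
from decimal import Context, Decimal, ROUND_CEILING, ROUND_FLOOR

DOWN = Context(prec=4, rounding=ROUND_FLOOR)   # 4 digits, round toward -infinity
UP = Context(prec=4, rounding=ROUND_CEILING)   # 4 digits, round toward +infinity

def add(x, y):
    """Outwardly rounded interval sum of two (lo, hi) pairs."""
    return (DOWN.add(x[0], y[0]), UP.add(x[1], y[1]))

def recip(n):
    """Outwardly rounded enclosure of 1/n for a positive integer n."""
    n = Decimal(n)
    return (DOWN.divide(Decimal(1), n), UP.divide(Decimal(1), n))

def mul_pos(x, y):
    """Outwardly rounded product of intervals with nonnegative end points."""
    return (DOWN.multiply(x[0], y[0]), UP.multiply(x[1], y[1]))

one = (Decimal(1), Decimal(1))
total = add(one, one)                       # 1 + 1
for factorial in (2, 6, 24, 120):           # 1/2!, 1/3!, 1/4!, 1/5!
    total = add(total, recip(factorial))
remainder = mul_pos(recip(720), (Decimal(1), Decimal(3)))  # (1/6!) * [1, 3]
enclosure = add(total, remainder)
print(enclosure)
```

Each intermediate interval agrees with the corresponding line of the table, and the final pair is the enclosure $[2.716, 2.723]$.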
Proofs of the theorems, as well as greater detail, appear in various texts on interval arithmetic. A good book on interval arithmetic is R. E. Moore's classic text [27], although numerous more recent monographs and reviews are available. A World Wide Web search on the term "interval computations" will lead to some of these.

A general introduction to interval computations is [26]. That work gives not only a complete introduction, with numerous examples and explanation of pitfalls, but also provides examples with intlab, a free matlab toolbox for interval computations, and reference material for intlab. If you have matlab available, we recommend intlab for the exercises in this book involving interval computations.
1.4 Programming Environments

Modern scientific computing (with floating point numbers) is usually done with high-level "imperative" (as opposed to "functional") programming languages. Common programming environments in use for general scientific computing today are Fortran (or FORmula TRANslation), C/C++, and matlab. Fortran is the original such language, with its origins in the late 1950's. There is a large body of high-quality publicly available software in Fortran for common computations in numerical analysis and scientific computing. Such software can be found, for example, on NETLIB at

http://www.netlib.org/
Fortran has evolved over the years, becoming a modern, multi-faceted language with the Fortran 2003 standard. Throughout, the emphasis by both the standardization committee and suppliers of compilers for Fortran has been features that simplify programming of solutions to large problems in numerical analysis and scientific computing, and features that enable high performance, especially on computers that can process vectors and matrices efficiently.

The "C" language, originally developed in conjunction with the Unix operating system, was originally meant to be a higher-level language for designing and accessing the operating system, but has become more ubiquitous since then. C++, appearing in the late 1980's, was the first widely available language$^7$ to allow the object-oriented programming paradigm. In recent years, computer science departments have favored teaching C++ over teaching Fortran, and Fortran has fallen out of favor in relative terms. However, Fortran is still favored in certain large-scale applications such as fluid dynamics (e.g. in weather prediction and similar simulations), and some courses are still offered in it in engineering schools. However, some people still think of Fortran as the now somewhat rudimentary language known as FORTRAN 77.
Reasonably high-quality compilers for both Fortran and C/C++ are available free of charge with Linux operating systems. Fortran 2003, largely implemented in these compilers, has a standardized interface to C, so functions written in C can be called from Fortran programs, and vice versa. These compilers include interactive graphical-user-interface-oriented debuggers, such as "insight," available with the Linux operating system. Commercially available compilation and debugging systems are also available under Windows.
The matlab system has become increasingly popular over the last two decades or so. matlab (or MATrix LABoratory) began in the early 1980's as a National Science Foundation project, written by Cleve Moler in FORTRAN 66, to provide an interactive environment for computing with matrices and vectors, but has since evolved to be both an interactive environment and full-featured programming language. matlab is highly favored in courses such as this, because of the ease of programming, debugging, and general use (such as graphing), and because of the numerous toolboxes, supplied by both Mathworks (Cleve Moler's company) and others, for many basic computing tasks and applications. The main drawback to use of matlab in all scientific computations is that the language is interpreted, that is, matlab translates each line of a program to machine language each time that it executes the line. This makes complicated programs that involve nested iterations much slower (a factor of 60 or more) than comparable programs written in a compiled language such as C or Fortran. However, functions compiled from Fortran or C/C++ can be called from matlab. A common strategy has been to initially develop algorithms in matlab, then translate all or part of the program to a compilable language, as necessary for efficiency.
One perceived disadvantage of matlab is that it is proprietary Undesirablepossible consequences are that it is not free and there is no official guaranteethat it will be available forever unchanged However its use has becomeso widespread in recent years that these concerns do not seem to be majorSeveral projects including ldquoOctaverdquo and ldquoScilabrdquo have produced free productsthat partially support the matlab programming language The most widelydistributed of these ldquoOctaverdquo is integrated into Linux systems Howeverthe objectoriented features of Octave are rudimentary compared to those of matlab and some toolboxes such as intlab (which we will mention later)will not function with Octave
7 with others, including Fortran, to follow.
Alternative systems sometimes used for scientific computing are computer algebra systems. Perhaps the most common of these are Mathematica and Maple, while a free such system under development is "SAGE." These systems admit a different way of thinking about programming, termed functional programming, in which rules are defined and available at all times to automatically simplify expressions that are presented to the system. (In contrast, in imperative programming, a sequence of commands is executed, one after the other.) Although these systems have become comprehensive, they are based in computations of a different character, rather than in the floating point computations and linear algebra involved in numerical analysis and scientific computing.
We will use matlab in this book to illustrate the concepts, techniques, and applications. With newer versions of matlab, a student can study how to use the system and make programs largely by using the matlab help system. The first place to turn will be the "Getting started" demos, which in newer versions are presented as videos. There are also many books devoted to use of matlab. Furthermore, we will be giving examples throughout this book.
matlab programs can be written as matlab scripts and matlab functions.
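Example 1.23
The matlab script we used to produce the table following Example 1.7 (on page 8) is:

a = 2;
x = 2;
xold = x;
err_old = 1;
for k = 0:10
   k                        % display the iteration number
   x                        % display the current iterate
   err = x - sqrt(2);       % error in the current iterate
   err
   ratio = err/err_old^2    % ratio for quadratic convergence
   err_old = err;
   x = x/2 + 1/x;           % x_{k+1} = x_k/2 + a/(2 x_k) with a = 2
end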
Example 1.24
The matlab script we used to produce the table in Example 1.8 (on page 9) is:

format long
a = 2;
x = 2;
xold = x;
err_old = 1;
for k = 0:25
   k                        % display the iteration number
   x                        % display the current iterate
   err = x - sqrt(2);       % error in the current iterate
   err
   ratio = err/err_old      % ratio for linear convergence
   err_old = err;
   x = x - x^2/3.5 + 2/3.5;
end
An excellent alternative textbook that focuses on matlab functions is Cleve Moler's Numerical Computing with Matlab [25]. An on-line version, along with "m" files, etc., is currently available at http://www.mathworks.com/moler/chapters.html.
1.5 Applications
The purpose of the methods and techniques in this book ultimately is to provide both accurate predictions and insight into practical problems. This includes understanding and predicting, and managing or controlling, the evolution of ecological systems and epidemics; designing and constructing durable but inexpensive bridges, buildings, roads, and water control structures; understanding chemical and physical processes; designing chemical plants and electronic components and systems; minimizing costs or maximizing delivery of products or services within companies and governments; etc. To achieve these goals, the numerical methods are a small part of the overall modeling process, which can be viewed as consisting of the following steps.
Identify the problem: This is the first step in translating an often vague situation into a mathematical problem to be solved. What questions must be answered, and how can they be quantified?

Assumptions: Which factors are to be ignored, and which are important? The real world is usually significantly more complicated than mathematical models of it, and simplifications must be made, because some factors are poorly understood, because there isn't enough data to determine some minor factors, or because it is not practical to accurately solve the resulting equations unless the model is simplified. For example, the theory of relativity and variations in the acceleration of gravity, due to the fact that the earth is not exactly round and due to the fact that the density differs from point to point on the surface of the earth, in principle will affect the trajectory of a baseball as it leaves the bat. However, such effects can be ignored when we write down a model of the trajectory of the baseball. On the other hand, we need to include such effects if we are measuring the change in distance between two satellites in a tandem orbit to detect the location of mineral deposits on the surface of the earth.

Construction: In this step, we actually translate the problem into mathematical language.

Analysis: We solve the mathematical problem. Here is where the numerical techniques in this book come into play. With more complicated models, there is an interplay between the previous three steps and this solution process: We may need to simplify the process to enable practical solution. Also, presentation of the result is important here, to maximize the usefulness of the results. In the early days of scientific computing, printouts of numbers were used, but increasingly, results are presented as two- and three-dimensional graphs and movies.

Interpretation: The numerical solution is compared to the original problem. If it does not make sense, go back and reformulate the assumptions.

Validation: Compare the model to real data. For example, in climate models, the model might be used to predict climate changes in past years, before it is used to predict future climate changes.
Note that there is an intrinsic error introduced in the modeling process (such as when certain phenomena are ignored) that is outside the scope of our study of numerical methods. Such error can only be measured indirectly, through the interpretation and validation steps. In the model solution process (the "analysis" step), errors are also introduced by the numerical computation itself. We have seen that such error consists of approximation error and roundoff error. In a study of numerical methods and numerical analysis, we quantify and find bounds on such errors. Although this may not be the major source of error, it is important to know. Consequences of this type of error might be that a good model is rejected, that incorrect conclusions are deduced about the process being modeled, etc. The authors of this book have personal experience with these events.
Errors in the modeling process can sometimes be quantified in the solution process. If the model depends on parameters that are not known precisely, but bounds on those parameters are known, knowledge of these bounds can sometimes be incorporated into the mathematical equations, and the set of possible solutions can sometimes be computed or bounded. One tool that sometimes works is interval arithmetic. Other tools, less mathematically definite but applicable in different situations, are statistical methods and computing solutions to the model for many different values of the parameter.
Throughout this book, we introduce applications from many areas.
Example 1.25
The formula for the net capacitance when two capacitors of values $x$ and $y$ are connected in series is
\[ z = \frac{xy}{x + y}. \]
Suppose the measured values of $x$ and $y$ are $x = 1$ and $y = 2$, respectively. Estimate the range of possible values of $z$, given that the true values of $x$ and $y$ are known to be within ±10% of the measured value.
In this example, the identification, assumptions, and construction have already been done. (It is well known how capacitances in a linear electrical circuit behave.) We are asked to analyze the error in the output of the computation, due to errors in the data. We may proceed using interval arithmetic, relying on the accuracy assumptions for the measured values. In particular, these assumptions imply that $x \in [0.9, 1.1]$ and $y \in [1.8, 2.2]$. We will plug these intervals into the expression for $z$, but we first use Theorem 1.8, part (2) as a guide to rewrite the expression for $z$ so $x$ and $y$ only occur once. (We do this so we obtain sharp bounds on the range, without overestimation.) Dividing the numerator and denominator for $z$ by $xy$, we obtain
\[ z = \frac{1}{\frac{1}{x} + \frac{1}{y}}. \]
We use the intlab toolbox (footnote 8) for matlab to evaluate $z$. We have the following dialog in matlab's command window:
>> intvalinit('DisplayInfsup')
===> Default display of intervals by infimum/supremum
>> x = intval('[0.9,1.1]')
intval x =
[ 0.8999, 1.1001]
>> y = intval('[1.8,2.2]')
intval y =
[ 1.7999, 2.2001]
>> z = 1/(1/x + 1/y)
intval z =
[ 0.5999, 0.7334]
>> format long
>> z
intval z =
[ 0.59999999999999, 0.73333333333334]
>>

8 If one has matlab, intlab is available free of charge for non-commercial use from http://www.ti3.tu-harburg.de/~rump/intlab/
Thus, the capacitance must lie between 0.5999 and 0.7334.
Note that $x$ and $y$ are input as strings. This is to assure that roundoff errors in converting the decimal expressions 0.9, 1.1, 1.8, and 2.2 into internal binary format are taken into account. See [26] for more examples of the use of intlab.
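The interval evaluation in Example 1.25 can be imitated without intlab. The following Python sketch is only an illustration, not the book's matlab: the `Interval` class and its `reciprocal` method are hypothetical names introduced here, and, unlike intlab, the sketch does not round outward, so its endpoints are ordinary floating point approximations rather than guaranteed bounds.

```python
class Interval:
    """Closed interval [lo, hi]; endpoints are plain floats (no outward rounding)."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        # [a, b] + [c, d] = [a + c, b + d]
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def reciprocal(self):
        # 1/[a, b] = [1/b, 1/a], valid only when 0 is not in [a, b].
        if self.lo <= 0.0 <= self.hi:
            raise ZeroDivisionError("interval contains zero")
        return Interval(1.0 / self.hi, 1.0 / self.lo)


x = Interval(0.9, 1.1)
y = Interval(1.8, 2.2)
# z = 1/(1/x + 1/y): x and y each occur once in this form.
z = (x.reciprocal() + y.reciprocal()).reciprocal()
print(z.lo, z.hi)
```

Up to roundoff, the printed endpoints agree with the intlab enclosure; the outward rounding that intlab performs is what turns such endpoints into rigorous bounds.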
1.6 Exercises

1. Write down a polynomial $p(x)$ such that $|S(x) - p(x)| \le 10^{-10}$ for $-0.2 \le x \le 0.2$, where
\[ S(x) = \begin{cases} \dfrac{\sin(x)}{x} & \text{if } x \ne 0, \\ 1 & \text{if } x = 0. \end{cases} \]
Note: $\mathrm{sinc}(x) = S(\pi x) = \sin(\pi x)/(\pi x)$ is the "sinc" function (well-known in signal processing, etc.).

(a) Show that your polynomial $p$ satisfies the condition $|\mathrm{sinc}(x) - p(x)| \le 10^{-10}$ for $x \in [-0.2, 0.2]$.
Hint: You can obtain polynomial approximations with error terms for $\mathrm{sinc}(x)$ by writing down Taylor polynomials and corresponding error terms for $\sin(x)$, then dividing these by $x$. This can be easier than trying to differentiate $\mathrm{sinc}(x)$. For the proof part, you can use, for example, the Taylor polynomial remainder formula or the alternating series test, and you can use interval arithmetic to obtain bounds.

(b) Plot your polynomial approximation and $\mathrm{sinc}(x)$ on the same graph,
(i) over the interval $[-0.2, 0.2]$,
(ii) over the interval $[-3, 3]$,
(iii) over the interval $[-10, 10]$.

2. Suppose $f$ has a continuous third derivative. Show that
\[ \left| \frac{f(x+h) - f(x-h)}{2h} - f'(x) \right| = O(h^2). \]

3. Suppose $f$ has a continuous fourth derivative. Show that
\[ \left| \frac{f(x+h) - 2f(x) + f(x-h)}{h^2} - f''(x) \right| = O(h^2). \]

4. Let $a = 0.41$, $b = 0.36$, and $c = 0.7$. Assuming a 2-digit decimal computer arithmetic with rounding, show that $\frac{a-b}{c} \ne \frac{a}{c} - \frac{b}{c}$ when using this arithmetic.

5. Write down a formula relating the unit roundoff $\delta$ of Definition 1.3 (page 14) and the machine epsilon $\epsilon_m$ defined on page 20.
6. Store and run the following matlab script. What are your results? What does the script compute? Can you say anything about the computer arithmetic underlying matlab?

eps = 1;
x = 1+eps;
while(x~=1)
   eps = eps/2;
   x = 1+eps;
end
eps = eps+(2*eps)^2
y = 1+eps;
y-1
7. Suppose, for illustration, we have a system with base $\beta = 10$, $t = 3$ decimal digits in the mantissa, and $L = -9$, $U = 9$ for the exponent. For example, $0.123 \times 10^4$, that is, 1230, is a machine number in this system. Suppose also that "round to nearest" is used in this system.

(a) What is HUGE for this system?
(b) What is TINY for this system?
(c) What is the machine epsilon $\epsilon_m$ for this system?
(d) Let $f(x) = \sin(x) + 1$.
i. Write down $\mathrm{fl}(f(0))$ and $\mathrm{fl}(f(0.0008))$ in normalized format for this toy system.
ii. Compute $\mathrm{fl}(\mathrm{fl}(f(0.0008)) - \mathrm{fl}(f(0)))$. On the other hand, what is the nearest machine number to the exact value of $f(0.0008) - f(0)$?
iii. Compute $\mathrm{fl}(\mathrm{fl}(f(0.0008)) - \mathrm{fl}(f(0)))/\mathrm{fl}(0.0008)$. Compare this to the nearest machine number to the exact value of $(f(0.0008) - f(0))/0.0008$ and to $f'(0)$.

8. Let $f(x) = \dfrac{\ln(x+1) - \ln(x)}{2}$.

(a) Use four-digit decimal arithmetic with rounding to evaluate $f(100{,}000)$.
(b) Use the Mean Value Theorem to approximate $f(x)$ in a form that avoids the loss of significant digits. Use this form to evaluate $f(x)$ for $x = 100{,}000$ once again.
(c) Compare the relative errors for the answers obtained in (a) and (b).

9. Compute the condition number of $f(x) = e^{\sqrt{x^2 - 1}}$, $x > 1$, and discuss any possible ill-conditioning.

10. Let $f(x) = (\sin(x))^2 + x^2$. Use interval arithmetic to prove that there are no solutions to $f(x) = 0$ for $x \in [-1, -0.8]$.
With $w(x) = x^2$ and $f(x) = e^{-x^2}$, the Mean Value Theorem for integrals tells us that
\[ \int_0^1 x^2 e^{-x^2}\,dx = e^{-c^2} \int_0^1 x^2\,dx \]
for some $c \in [0, 1]$, so
\[ \frac{1}{3e} \le \int_0^1 x^2 e^{-x^2}\,dx = e^{-c^2} \int_0^1 x^2\,dx \le \frac{1}{3}. \]
The following is extremely important in scientific and engineering computing.

THEOREM 1.3
(Taylor's theorem) Suppose that $f \in C^{n+1}[a, b]$. Let $x_0 \in [a, b]$. Then, for any $x \in [a, b]$, $f(x) = P_n(x) + R_n(x)$, where
\[ P_n(x) = f(x_0) + f'(x_0)(x - x_0) + \cdots + \frac{f^{(n)}(x_0)(x - x_0)^n}{n!} = \sum_{k=0}^{n} \frac{1}{k!} f^{(k)}(x_0)(x - x_0)^k, \]
and
\[ R_n(x) = \frac{1}{n!} \int_{x_0}^{x} f^{(n+1)}(t)(x - t)^n\,dt \quad \text{(integral form of remainder)}. \]
Furthermore, there is a $\xi = \xi(x)$ between $x_0$ and $x$ with
\[ R_n(x) = \frac{f^{(n+1)}(\xi(x))(x - x_0)^{n+1}}{(n+1)!} \quad \text{(Lagrange form of remainder)}. \]

PROOF Recall the integration by parts formula $\int u\,dv = uv - \int v\,du$.
Thus,
\begin{align*}
f(x) - f(x_0) &= \int_{x_0}^{x} f'(t)\,dt \quad (\text{let } u = f'(t),\ v = t - x,\ dv = dt) \\
&= f'(x_0)(x - x_0) + \int_{x_0}^{x} (x - t) f''(t)\,dt \quad (\text{let } u = f''(t),\ dv = (x - t)\,dt) \\
&= f'(x_0)(x - x_0) - \left. \frac{(x - t)^2}{2} f''(t) \right|_{x_0}^{x} + \int_{x_0}^{x} \frac{(x - t)^2}{2} f'''(t)\,dt \\
&= f'(x_0)(x - x_0) + \frac{(x - x_0)^2}{2} f''(x_0) + \int_{x_0}^{x} \frac{(x - t)^2}{2} f'''(t)\,dt.
\end{align*}
Continuing this procedure,
\begin{align*}
f(x) &= f(x_0) + f'(x_0)(x - x_0) + \frac{(x - x_0)^2}{2} f''(x_0) + \cdots + \frac{(x - x_0)^n}{n!} f^{(n)}(x_0) + \int_{x_0}^{x} \frac{(x - t)^n}{n!} f^{(n+1)}(t)\,dt \\
&= P_n(x) + R_n(x).
\end{align*}
Now consider $R_n(x) = \int_{x_0}^{x} \frac{(x - t)^n}{n!} f^{(n+1)}(t)\,dt$, and assume that $x_0 < x$ (same argument if $x_0 > x$). Then, by Theorem 1.2,
\[ R_n(x) = f^{(n+1)}(\xi(x)) \int_{x_0}^{x} \frac{(x - t)^n}{n!}\,dt = f^{(n+1)}(\xi(x)) \frac{(x - x_0)^{n+1}}{(n+1)!}, \]
where $\xi$ is between $x_0$ and $x$, and thus $\xi = \xi(x)$.
Example 1.3
Approximate $\sin(x)$ by a polynomial $p(x)$ such that $|\sin(x) - p(x)| \le 10^{-16}$ for $-0.1 \le x \le 0.1$.

For Example 1.3, Taylor polynomials about $x_0 = 0$ are appropriate, since that is the center of the interval about which we wish to approximate. We observe that the terms of even degree in such a polynomial are absent, so, for $n$ even, Taylor's theorem gives
\[
\begin{array}{c|l|l}
n & P_n & R_n \\ \hline
2 & x & -\dfrac{x^3}{3!}\cos(c_2) \\
4 & x - \dfrac{x^3}{3!} & \dfrac{x^5}{5!}\cos(c_4) \\
6 & x - \dfrac{x^3}{3!} + \dfrac{x^5}{5!} & -\dfrac{x^7}{7!}\cos(c_6) \\
\vdots & \vdots & \vdots \\
n & \text{--} & (-1)^{n/2}\dfrac{x^{n+1}}{(n+1)!}\cos(c_n)
\end{array}
\]
Observing that $|\cos(c_n)| \le 1$, we see that
\[ |R_n(x)| \le \frac{|x|^{n+1}}{(n+1)!}. \]
We may thus form the following table:

n     bound on error $|R_n|$
2     $1.67 \times 10^{-4}$
4     $8.33 \times 10^{-8}$
6     $1.98 \times 10^{-11}$
8     $2.76 \times 10^{-15}$
10    $2.51 \times 10^{-19}$

Thus, a polynomial with the required accuracy for $x \in [-0.1, 0.1]$ is
\[ p(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \frac{x^9}{9!}. \]
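The table of remainder bounds in Example 1.3 can be regenerated with a few lines of code. The Python sketch below is illustrative only; the function name `remainder_bound` and its `radius` parameter are names introduced here, not part of the text.

```python
from math import factorial


def remainder_bound(n, radius=0.1):
    """Bound |x|^(n+1)/(n+1)! on the Taylor remainder of sin for |x| <= radius."""
    return radius ** (n + 1) / factorial(n + 1)


# Find the smallest even n whose bound meets the 1e-16 tolerance.
n = 2
while remainder_bound(n) > 1e-16:
    n += 2  # only even n are tabulated, since even-degree terms vanish
print(n, remainder_bound(n))
```

The loop stops at the first even $n$ whose bound is below the tolerance, which is how the degree of $p(x)$ in the example is selected.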
An important special case of Taylor's theorem is obtained with $n = 0$ (that is, directly from the Fundamental Theorem of Calculus).

THEOREM 1.4
(Mean value theorem) Suppose $f \in C^1[a, b]$, $x \in [a, b]$, and $y \in [a, b]$ (and, without loss of generality, $x \le y$). Then there is a $c \in [x, y] \subseteq [a, b]$ such that
\[ f(y) - f(x) = f'(c)(y - x). \]

Example 1.4
Suppose $f(1) = 1$ and $|f'(x)| \le 2$ for $x \in [1, 2]$. What are an upper bound and a lower bound on $f(2)$?
The mean value theorem tells us that
\[ f(2) = f(1) + f'(c)(2 - 1) = f(1) + f'(c) \]
for some $c \in (1, 2)$. Furthermore, the fact $|f'(x)| \le 2$ is equivalent to $-2 \le f'(x) \le 2$. Combining these facts gives
\[ 1 - 2 = -1 \le f(2) \le 1 + 2 = 3. \]
1.1.2 Big "O" Notation
We study "rates of growth" and "rates of decrease" of errors. For example, if we approximate $e^h$ by a first degree Taylor polynomial about $x = 0$, we get
\[ e^h - (1 + h) = \frac{1}{2} h^2 e^{\xi}, \]
where $\xi$ is some unknown quantity between 0 and $h$. Although we don't know exactly what $e^{\xi}$ is, we know that it is nearly constant (in this case, approximately 1) for $h$ near 0, so the error $e^h - (1 + h)$ is roughly proportional to $h^2$ for $h$ small. This approximate proportionality is often more important to know than the slowly-varying constant $e^{\xi}$. The big "O" and little "o" notation are used to describe and keep track of this approximate proportionality.

DEFINITION 1.1 Let $E(h)$ be an expression that depends on a small quantity $h$. We say that $E(h) = O(h^k)$ if there are an $\epsilon$ and $C$ such that
\[ |E(h)| \le C |h|^k \]
for all $|h| \le \epsilon$.

The "O" denotes "order." For example, if $f(h) = O(h^2)$, we say that "$f$ exhibits order 2 convergence to 0 as $h$ tends to 0."
Example 1.5
$E(h) = e^h - h - 1$. Then $E(h) = O(h^2)$.

PROOF By Taylor's Theorem,
\[ e^h = e^0 + e^0 (h - 0) + \frac{h^2}{2} e^{\xi} \]
for some $\xi$ between 0 and $h$. Thus,
\[ E(h) = e^h - 1 - h \le h^2 \left( \frac{e^1}{2} \right) \quad \text{and} \quad E(h) \ge 0 \]
for $|h| \le 1$; that is, $\epsilon = 1$ and $C = e/2$ work.
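The bound in Example 1.5 is easy to spot-check numerically. The short Python sketch below (the function name `E` mirrors the example; everything else is an assumption of this illustration) evaluates $E(h)$ and the ratio $E(h)/h^2$ for a few values of $h$ with $|h| \le 1$.

```python
from math import e, expm1


def E(h):
    # e^h - h - 1, computed via expm1 to limit cancellation for small h
    return expm1(h) - h


for h in (1.0, 0.1, 0.01, -0.5):
    # The ratio E(h)/h^2 should stay between 0 and C = e/2.
    print(h, E(h), E(h) / h**2)
```

The ratio stays below $C = e/2$ and tends to $1/2$ as $h$ shrinks, consistent with the factor $e^{\xi}/2$ in the Taylor remainder.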
Example 1.6
Show that $\left| \frac{f(x+h) - f(x)}{h} - f'(x) \right| = O(h)$ for $x, x + h \in [a, b]$, assuming that $f$ has two continuous derivatives at each point in $[a, b]$.

PROOF
\begin{align*}
\left| \frac{f(x+h) - f(x)}{h} - f'(x) \right|
&= \left| \frac{f(x) + f'(x)h + \int_x^{x+h} (x + h - t) f''(t)\,dt - f(x)}{h} - f'(x) \right| \\
&= \frac{1}{h} \left| \int_x^{x+h} (x + h - t) f''(t)\,dt \right| \le \max_{a \le t \le b} |f''(t)| \, \frac{h}{2} = ch.
\end{align*}
1.1.3 Convergence Rates

DEFINITION 1.2 Let $\{x_k\}$ be a sequence with limit $x^*$. If there are constants $C$ and $\alpha$ and an integer $N$ such that $|x_{k+1} - x^*| \le C |x_k - x^*|^{\alpha}$ for $k \ge N$, we say that the rate of convergence is of order at least $\alpha$. If $\alpha = 1$ (with $C < 1$), the rate is said to be linear. If $\alpha = 2$, the rate is said to be quadratic.
Example 1.7
A sequence sometimes learned in elementary classes for computing the square root of a number $a$ is
\[ x_{k+1} = \frac{x_k}{2} + \frac{a}{2 x_k}. \]
We have
\begin{align*}
x_{k+1} - \sqrt{a} &= \frac{x_k}{2} + \frac{a}{2 x_k} - \sqrt{a} \\
&= x_k - \frac{x_k^2 - a}{2 x_k} - \sqrt{a} \\
&= (x_k - \sqrt{a}) - (x_k - \sqrt{a}) \frac{x_k + \sqrt{a}}{2 x_k} \\
&= (x_k - \sqrt{a}) \left( 1 - \frac{x_k + \sqrt{a}}{2 x_k} \right) \\
&= (x_k - \sqrt{a}) \frac{x_k - \sqrt{a}}{2 x_k} \\
&= \frac{1}{2 x_k} (x_k - \sqrt{a})^2 \\
&\approx \frac{1}{2 \sqrt{a}} (x_k - \sqrt{a})^2
\end{align*}
for $x_k$ near $\sqrt{a}$, thus showing that the convergence rate is quadratic.
Quadratic convergence is very fast. We can think of quadratic convergence, with $C \approx 1$, as doubling the number of significant figures on each iteration. (In contrast, linear convergence with $C = 0.1$ adds one decimal digit of accuracy to the approximation on each iteration.) For example, if we use the square root computation from Example 1.7 with $a = 2$, and starting with $x_0 = 2$, we obtain the following table:

k    $x_k$                 $x_k - \sqrt{2}$             $(x_k - \sqrt{2})/(x_{k-1} - \sqrt{2})^2$
0    2                     $0.5858 \times 10^{0}$       --
1    1.5                   $0.8579 \times 10^{-1}$      0.2500
2    1.416666666666667     $0.2453 \times 10^{-2}$      0.3333
3    1.414215686274510     $0.2123 \times 10^{-6}$      0.3529
4    1.414213562374690     $0.1594 \times 10^{-13}$     0.3535
5    1.414213562373095     $0.2204 \times 10^{-17}$     --

Comparing each $x_k$ with $\sqrt{2} \approx 1.414213562373095$ shows which digits are correct: the total number of correct digits more than doubles on each iteration. In fact, the multiplying factor $C$ for the quadratic convergence appears to be approaching 0.3535. (The last error ratio is not meaningful in this sense, because only roughly 16 digits were carried in the computation.) Based on our analysis, the limiting value of $C$ should be about $1/(2\sqrt{2}) \approx 0.353553390593274$. (We explain how we computed the table at the end of this chapter.)
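The error ratios in the last column can be recomputed in a few lines. The following Python sketch is an illustration in double precision, not the matlab script used for the table; the variable names `errors` and `ratios` are introduced here.

```python
from math import sqrt

x = 2.0
errors = []
for k in range(6):
    errors.append(x - sqrt(2.0))
    x = x / 2 + 1 / x  # x_{k+1} = x_k/2 + a/(2 x_k) with a = 2

# ratio_k = err_k / err_{k-1}^2 for k = 1, ..., 4
ratios = [errors[k] / errors[k - 1] ** 2 for k in range(1, 5)]
print(ratios)
```

By the derivation in Example 1.7, each ratio equals $1/(2x_{k-1})$ in exact arithmetic, which is why the computed ratios settle near $1/(2\sqrt{2})$.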
Example 1.8
As an example of linear convergence, consider the iteration
\[ x_{k+1} = x_k - \frac{x_k^2}{3.5} + \frac{2}{3.5}, \]
which converges to $\sqrt{2}$. We obtain the following table:

k     $x_k$                 $x_k - \sqrt{2}$            $(x_k - \sqrt{2})/(x_{k-1} - \sqrt{2})$
0     2                     $0.5858 \times 10^{0}$      --
1     1.428571428571429     $0.1436 \times 10^{-1}$     $0.2451 \times 10^{-1}$
2     1.416909620991254     $0.2696 \times 10^{-2}$     0.1878
3     1.414728799831946     $0.5152 \times 10^{-3}$     0.1911
4     1.414312349239392     $0.9879 \times 10^{-4}$     0.1917
5     1.414232514607664     $0.1895 \times 10^{-4}$     0.1918
6     1.414217198786659     $0.3636 \times 10^{-5}$     0.1919
7     1.414214260116949     $0.6955 \times 10^{-6}$     0.1919
8     1.414213696254626     $0.1339 \times 10^{-6}$     0.1919
...
19    1.414213562373097     $0.1554 \times 10^{-14}$    --

Here, the constant $C$ in the linear convergence, to four significant digits, appears to be $0.1919 \approx 1/5$. That is, the error is reduced by approximately a factor of 5 each iteration. We can think of this as obtaining one more correct base-5 digit on each iteration.
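The observed constant can be compared with the derivative of the iteration map at the limit. In the Python sketch below (an illustration in double precision; `errors`, `ratios`, and `limit` are names introduced here), writing the iteration as $x_{k+1} = g(x_k)$ with $g(x) = x - x^2/3.5 + 2/3.5$ gives $g'(\sqrt{2}) = 1 - 2\sqrt{2}/3.5$, which is the value the error ratios should approach.

```python
from math import sqrt

x = 2.0
errors = []
for k in range(10):
    errors.append(x - sqrt(2.0))
    x = x - x**2 / 3.5 + 2 / 3.5

# ratio_k = err_k / err_{k-1} for k = 1, ..., 9
ratios = [errors[k] / errors[k - 1] for k in range(1, 10)]
limit = 1 - 2 * sqrt(2.0) / 3.5  # derivative of the iteration map at sqrt(2)
print(ratios[-1], limit)
```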
1.2 Computer Arithmetic
In numerical solution of mathematical problems, two common types of error are:

1. Method (algorithm or truncation) error. This is the error due to approximations made in the numerical method.
2. Rounding error. This is the error made due to the finite number of digits available on a computer.
Example 1.9
By the mean value theorem for integrals (Theorem 1.2, as in Example 1.6 on page 7), if $f \in C^2[a, b]$, then
\[ f'(x) = \frac{f(x+h) - f(x)}{h} - \frac{1}{h} \int_x^{x+h} f''(t)(x + h - t)\,dt \]
and
\[ \left| \frac{1}{h} \int_x^{x+h} f''(t)(x + h - t)\,dt \right| \le ch. \]
Thus, $f'(x) \approx (f(x+h) - f(x))/h$, and the error is $O(h)$. We will call this the method error or truncation error, as opposed to roundoff errors due to using machine approximations.
Now consider $f(x) = \ln x$, and approximate $f'(3) \approx \frac{\ln(3+h) - \ln 3}{h}$ for $h$ small using a calculator having 11 digits. The following results were obtained.

$h$           $\frac{\ln(3+h) - \ln(3)}{h}$     Error $= \frac{1}{3} - \frac{\ln(3+h) - \ln(3)}{h} = O(h)$
$10^{-1}$     0.3278982                         $5.44 \times 10^{-3}$
$10^{-2}$     0.332779                          $5.54 \times 10^{-4}$
$10^{-3}$     0.3332778                         $5.55 \times 10^{-5}$
$10^{-4}$     0.333328                          $5.33 \times 10^{-6}$
$10^{-5}$     0.333330                          $3.33 \times 10^{-6}$
$10^{-6}$     0.333300                          $3.33 \times 10^{-5}$
$10^{-7}$     0.333                             $3.33 \times 10^{-4}$
$10^{-8}$     0.33                              $3.33 \times 10^{-3}$
$10^{-9}$     0.3                               $3.33 \times 10^{-2}$
$10^{-10}$    0.0                               $3.33 \times 10^{-1}$

One sees that, in the first four steps, the error decreases by a factor of 10 as $h$ is decreased by a factor of 10. (That is, the method error dominates.) However, starting with $h = 0.00001$, the error increases. (The error due to a finite number of digits, i.e., roundoff error, dominates.)
There are two possible ways to reduce rounding error:

1. The method error can be reduced by using a more accurate method. This allows larger $h$ to be used, thus avoiding roundoff error. Consider
\[ f'(x) = \frac{f(x+h) - f(x-h)}{2h} + \text{error}, \quad \text{where error is } O(h^2). \]

$h$       $(\ln(3+h) - \ln(3-h))/(2h)$     error
0.1       0.3334568                        $1.24 \times 10^{-4}$
0.01      0.3333345                        $1.23 \times 10^{-6}$
0.001     0.3333333                        $1.91 \times 10^{-8}$

The error decreases by a factor of 100 as $h$ is decreased by a factor of 10.

2. Rounding error can be reduced by using more digits of accuracy, such as using double precision (or multiple precision) arithmetic.
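Both remedies can be seen together in a short experiment. The Python sketch below is an illustration under the assumption of IEEE double precision (roughly 16 digits rather than the calculator's 11), so its numbers differ from the tables in their later digits; the function names `forward` and `central` are introduced here.

```python
from math import log


def forward(h, x=3.0):
    # One-sided difference quotient, method error O(h)
    return (log(x + h) - log(x)) / h


def central(h, x=3.0):
    # Centered difference quotient, method error O(h^2)
    return (log(x + h) - log(x - h)) / (2 * h)


exact = 1.0 / 3.0  # derivative of ln(x) at x = 3
steps = (1e-1, 1e-2, 1e-3)
fwd = [abs(exact - forward(h)) for h in steps]
cen = [abs(exact - central(h)) for h in steps]
print(fwd)
print(cen)
```

For these step sizes the method error still dominates, so successive forward-difference errors shrink by about a factor of 10 and successive centered-difference errors by about a factor of 100.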
To fully understand and avoid roundoff error, we should study some details of how computers and calculators represent and work with approximate numbers.

1.2.1 Floating Point Arithmetic and Rounding Error
Let $\beta$ be a positive integer, the base of the computer system. (Usually, $\beta = 2$ (binary) or $\beta = 16$ (hexadecimal).) Suppose a number $x$ has the exact base representation
\[ x = (\pm 0.\alpha_1 \alpha_2 \alpha_3 \cdots \alpha_t \alpha_{t+1} \cdots)_{\beta}\, \beta^m = \pm q \beta^m, \]
where $q$ is the mantissa, $\beta$ is the base, $m$ is the exponent, $1 \le \alpha_1 \le \beta - 1$, and $0 \le \alpha_i \le \beta - 1$ for $i > 1$.
On a computer, we are restricted to a finite set of floating-point numbers $F = F(\beta, t, L, U)$ of the form $x^* = (\pm 0.a_1 a_2 \cdots a_t)_{\beta}\, \beta^m$, where $1 \le a_1 \le \beta - 1$, $0 \le a_i \le \beta - 1$ for $2 \le i \le t$, $L \le m \le U$, and $t$ is the number of digits. (In most floating point systems, $L$ is about $-64$ to $-1000$ and $U$ is about 64 to 1000.)
Example 1.10
(binary) $\beta = 2$:
\[ x^* = (0.1011)_2\, 2^3 = \left( 1 \times \frac{1}{2} + 0 \times \frac{1}{4} + 1 \times \frac{1}{8} + 1 \times \frac{1}{16} \right) \times 8 = \frac{11}{2} = 5.5 \text{ (decimal)}. \]

REMARK 1.1 Most numbers cannot be exactly represented on a computer. Consider $x = 10.1 = 1010.0001\,1001\,1001\ldots$ ($\beta = 2$). If $L = -127$, $U = 127$, $t = 24$, and $\beta = 2$, then $x \approx x^* = (0.10100001\,1001\,1001\,1001\,1001)_2\, 2^4$.
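The remark can be checked directly on a machine with binary floating point. The Python sketch below is an illustration that assumes IEEE double precision (53 binary digits rather than the $t = 24$ of the remark); it recovers the exact rational value of the stored number and compares it with $101/10$.

```python
from fractions import Fraction

x = 10.1  # rounded to the nearest double on input
stored = Fraction(x)  # exact rational value of the stored double
print(stored == Fraction(101, 10))
print(float(abs(stored - Fraction(101, 10))))
```

The first line printed is `False`, and the second shows the small but nonzero representation error.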
Question: Given a real number $x$, how do we define a floating point number $\mathrm{fl}(x)$ in $F$, such that $\mathrm{fl}(x)$ is close to $x$?
On modern machines, one of the following four ways is used to approximate a real number $x$ by a machine-representable number $\mathrm{fl}(x)$.

round down: $\mathrm{fl}(x) = x\downarrow$, the nearest machine representable number to the real number $x$ that is less than or equal to $x$.

round up: $\mathrm{fl}(x) = x\uparrow$, the nearest machine number to the real number $x$ that is greater than or equal to $x$.

round to nearest: $\mathrm{fl}(x)$ is the nearest machine number to the real number $x$.

round to zero, or "chopping": $\mathrm{fl}(x)$ is the nearest machine number to the real number $x$ that is closer to 0 than $x$. The term "chopping" is because we simply "chop" the expansion of the real number, that is, we simply ignore the digits in the expansion of $x$ beyond the $t$-th one.

The default on modern systems is usually round to nearest, although chopping is faster or requires less circuitry. Round down and round up may be used, but with care, to produce results from a string of computations that are guaranteed to be less than or greater than the exact result.
Example 1.11
$\beta = 10$, $t = 5$, $x = 0.12345666\cdots \times 10^7$. Then
\[ \mathrm{fl}(x) = 0.12345 \times 10^7 \text{ (chopping)}, \qquad \mathrm{fl}(x) = 0.12346 \times 10^7 \text{ (rounded to nearest)}. \]
(In this case, round down corresponds to chopping and round up corresponds to round to nearest.)
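Example 1.11 can be reproduced with Python's standard `decimal` module, which allows the number of significant decimal digits and the rounding mode to be chosen. This is a sketch for illustration: the infinite expansion of $x$ is truncated to `1234566.6`, and round-to-nearest is modeled by `ROUND_HALF_EVEN` (the tie-breaking rule is an assumption that plays no role here).

```python
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_EVEN

x = Decimal("1234566.6")  # 0.12345666 x 10^7, truncated for illustration

# create_decimal applies the context's precision and rounding mode.
chop = Context(prec=5, rounding=ROUND_DOWN).create_decimal(x)
nearest = Context(prec=5, rounding=ROUND_HALF_EVEN).create_decimal(x)
print(chop, nearest)
```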
See Figure 1.2 for an example with $\beta = 10$ and $t = 1$. In that figure, the exhibited floating point numbers are $(0.1) \times 10^1$, $(0.2) \times 10^1$, ..., $(0.9) \times 10^1$, $0.1 \times 10^2$.

[Figure 1.2: number line from $\beta^{m-1} = 1$ to $\beta^m = 10^1$, with successive floating point numbers spaced $\beta^{m-t} = 10^0 = 1$ apart.]
FIGURE 1.2: An example floating point system: $\beta = 10$, $t = 1$, and $m = 1$.
Example 1.12
Let $a = 0.410$, $b = 0.000135$, and $c = 0.000431$. Assuming 3-digit decimal computer arithmetic with rounding to nearest, does $a + (b + c) = (a + b) + c$ when using this arithmetic?
Following the "rounding to nearest" definition of $\mathrm{fl}$, we emulate the operations a machine would do, as follows:
\[ a \leftarrow 0.410 \times 10^0, \quad b \leftarrow 0.135 \times 10^{-3}, \quad c \leftarrow 0.431 \times 10^{-3}, \]
and
\begin{align*}
\mathrm{fl}(b + c) &= \mathrm{fl}(0.135 \times 10^{-3} + 0.431 \times 10^{-3}) \\
&= \mathrm{fl}(0.566 \times 10^{-3}) \\
&= 0.566 \times 10^{-3},
\end{align*}
so
\begin{align*}
\mathrm{fl}(a + 0.566 \times 10^{-3}) &= \mathrm{fl}(0.410 \times 10^0 + 0.566 \times 10^{-3}) \\
&= \mathrm{fl}(0.410 \times 10^0 + 0.000566 \times 10^0) \\
&= \mathrm{fl}(0.410566 \times 10^0) \\
&= 0.411 \times 10^0.
\end{align*}
On the other hand,
\begin{align*}
\mathrm{fl}(a + b) &= \mathrm{fl}(0.410 \times 10^0 + 0.135 \times 10^{-3}) \\
&= \mathrm{fl}(0.410000 \times 10^0 + 0.000135 \times 10^0) \\
&= \mathrm{fl}(0.410135 \times 10^0) \\
&= 0.410 \times 10^0,
\end{align*}
so
\begin{align*}
\mathrm{fl}(0.410 \times 10^0 + c) &= \mathrm{fl}(0.410 \times 10^0 + 0.431 \times 10^{-3}) \\
&= \mathrm{fl}(0.410 \times 10^0 + 0.000431 \times 10^0) \\
&= \mathrm{fl}(0.410431 \times 10^0) \\
&= 0.410 \times 10^0 \ne 0.411 \times 10^0.
\end{align*}
Thus, the associative law does not hold for floating point addition with "round to nearest." Furthermore, this illustrates that accuracy is improved if numbers of like magnitude are added first in a sum.
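The two orders of summation in Example 1.12 can be emulated with Python's `decimal` module set to three significant digits. This is an illustrative sketch; `ROUND_HALF_EVEN` stands in for "round to nearest" (no ties occur in this example), and the names `left` and `right` are introduced here.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

ctx = Context(prec=3, rounding=ROUND_HALF_EVEN)  # 3 significant decimal digits
a, b, c = Decimal("0.410"), Decimal("0.000135"), Decimal("0.000431")

left = ctx.add(a, ctx.add(b, c))   # a + (b + c)
right = ctx.add(ctx.add(a, b), c)  # (a + b) + c
print(left, right)
```

Each call to `ctx.add` rounds its result to three digits, mirroring one application of $\mathrm{fl}$.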
The following error bound is useful in some analyses.

THEOREM 1.5
\[ |x - \mathrm{fl}(x)| \le \frac{1}{2} |x| \beta^{1-t} p, \]
where $p = 1$ for rounding and $p = 2$ for chopping.

DEFINITION 1.3 $\delta = \frac{p}{2} \beta^{1-t}$ is called the unit roundoff error.

Let $\epsilon = \frac{\mathrm{fl}(x) - x}{x}$. Then $\mathrm{fl}(x) = (1 + \epsilon)x$, where $|\epsilon| \le \delta$. With this, we have the following.

THEOREM 1.6
Let $\odot$ denote the operation $+$, $-$, $\times$, or $\div$, and let $x$ and $y$ be machine numbers. Then
\[ \mathrm{fl}(x \odot y) = (x \odot y)(1 + \epsilon), \quad \text{where } |\epsilon| \le \delta = \frac{p}{2} \beta^{1-t}. \]

Roundoff error that accumulates as a result of a sequence of arithmetic operations can be analyzed using this theorem. Such an analysis is called forward error analysis.
Because of the properties of floating point arithmetic, it is unreasonable to demand strict tolerances when the exact result is too large.
Example 1.13
Suppose $\beta = 10$ and $t = 3$ (3-digit decimal arithmetic), and suppose we wish to compute $10^4 \pi$ with a computed value $x$ such that $|10^4 \pi - x| < 10^{-2}$. The closest floating point number in our system to $10^4 \pi$ is $x = 0.314 \times 10^5 = 31400$. However, $|10^4 \pi - x| = 15.926\ldots$. Hence, it is impossible to find a number $x$ in the system with $|10^4 \pi - x| < 10^{-2}$.

The error $|10^4 \pi - x|$ in this example is called the absolute error in approximating $10^4 \pi$. We see that absolute error is not an appropriate measure of error when using floating point arithmetic. For this reason, we use relative error:

DEFINITION 1.4 Let $x^*$ be an approximation to $x$. Then $|x - x^*|$ is called the absolute error, and $\left| \frac{x - x^*}{x} \right|$ is called the relative error.

For example,
\[ \left| \frac{x - \mathrm{fl}(x)}{x} \right| \le \delta = \frac{p}{2} \beta^{1-t} \quad \text{(unit roundoff error)}. \]
1.2.1.1 Expression Evaluation and Condition Numbers
We now examine some common situations in which roundoff error can become large, and explain how to avoid many of these situations.

Example 1.14
$\beta = 10$, $t = 4$, $p = 1$. (Thus, $\delta = \frac{1}{2} 10^{-3} = 0.0005$.) Let $x = 0.5795 \times 10^5$, $y = 0.6399 \times 10^5$. Then
\[ \mathrm{fl}(x + y) = 0.1219 \times 10^6 = (x + y)(1 + \epsilon_1), \quad \epsilon_1 \approx -3.28 \times 10^{-4}, \quad |\epsilon_1| < \delta, \]
and
\[ \mathrm{fl}(xy) = 0.3708 \times 10^{10} = (xy)(1 + \epsilon_2), \quad \epsilon_2 \approx -5.95 \times 10^{-5}, \quad |\epsilon_2| < \delta. \]
(Note: $x + y = 0.12194 \times 10^6$, $xy = 0.37082205 \times 10^{10}$.)
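The relative errors $\epsilon_1$ and $\epsilon_2$ in Example 1.14 can be recomputed as a check on Theorem 1.6. The Python sketch below is an illustration: a 4-digit `decimal` context plays the role of $\mathrm{fl}$, while the default 28-digit context supplies the exact sum and product (both fit well within 28 digits).

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

ctx = Context(prec=4, rounding=ROUND_HALF_EVEN)  # beta = 10, t = 4, p = 1
delta = Decimal("0.0005")                        # (p/2) * beta**(1 - t)
x, y = Decimal("57950"), Decimal("63990")

# (fl(result) - exact) / exact for the sum and the product
eps1 = (ctx.add(x, y) - (x + y)) / (x + y)
eps2 = (ctx.multiply(x, y) - x * y) / (x * y)
print(eps1, eps2)
```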
Example 1.15
Suppose $\beta = 10$ and $t = 4$ (4-digit arithmetic), and suppose $x_1 = 10000$ and $x_2 = x_3 = \cdots = x_{1001} = 1$. Then
\begin{align*}
\mathrm{fl}(x_1 + x_2) &= 10000, \\
\mathrm{fl}(x_1 + x_2 + x_3) &= 10000, \\
&\ \ \vdots \\
\mathrm{fl}\left( \sum_{i=1}^{1001} x_i \right) &= 10000
\end{align*}
when we sum forward from $x_1$. But going backwards,
\begin{align*}
\mathrm{fl}(x_{1001} + x_{1000}) &= 2, \\
\mathrm{fl}(x_{1001} + x_{1000} + x_{999}) &= 3, \\
&\ \ \vdots \\
\mathrm{fl}\left( \sum_{i=1001}^{1} x_i \right) &= 11000,
\end{align*}
which is the correct sum.

This example illustrates the point that large relative errors occur when a large number of small numbers is added to a large number, or when a very large number of small, almost-equal numbers is added. To avoid such large relative errors, one can sum from the smallest number to the largest number. However, this will not work if the numbers are all approximately equal. In such cases, one possibility is to group the numbers into sets of two adjacent numbers, summing two almost equal numbers together. One then groups those results into sets of two and sums these together, continuing until the total sum is reached. In this scheme, two almost equal numbers are always being summed, and the large relative error from repeatedly summing a small number to a large number is avoided.
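Both the order dependence of Example 1.15 and the grouping-by-twos scheme just described can be tried out in emulated 4-digit arithmetic. The Python sketch below is an illustration; `running_sum` and `pairwise_sum` are names introduced here, and the second data set (16384 ones) is an assumed test case, not one from the text.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

ctx = Context(prec=4, rounding=ROUND_HALF_EVEN)  # 4-digit decimal arithmetic


def running_sum(values):
    # Add the terms one at a time, rounding after every addition.
    total = Decimal(0)
    for v in values:
        total = ctx.add(total, v)
    return total


def pairwise_sum(values):
    # Sum each half separately, then add the two partial sums.
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return ctx.add(pairwise_sum(values[:mid]), pairwise_sum(values[mid:]))


xs = [Decimal(10000)] + [Decimal(1)] * 1000
forward = running_sum(xs)         # x_1 first
backward = running_sum(xs[::-1])  # x_1001 first

ones = [Decimal(1)] * 2**14       # many equal numbers; exact sum is 16384
print(forward, backward, running_sum(ones), pairwise_sum(ones))
```

For the 16384 ones, the running sum stalls once the partial sum needs more than four digits, while the pairwise sum only ever adds numbers of equal size and loses just the final rounding.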
Example 1.16
$x_1 = 15.314768$, $x_2 = 15.314899$, $\beta = 10$, $t = 6$ (6-digit decimal accuracy). Then $x_2 - x_1 \approx \mathrm{fl}(x_2) - \mathrm{fl}(x_1) = 15.3149 - 15.3148 = 0.0001$. Thus,
\[ \left| \frac{x_2 - x_1 - (\mathrm{fl}(x_2) - \mathrm{fl}(x_1))}{x_2 - x_1} \right| = \frac{0.000131 - 0.0001}{0.000131} = 0.237, \]
that is, a relative error of 23.7%.

This example illustrates that large relative errors can occur when two nearly equal numbers are subtracted on a computer. Sometimes, an algorithm can be modified to reduce rounding error occurring from this source, as the following example illustrates.
Example 1.17
Consider finding the roots of $ax^2 + bx + c = 0$, where $b^2$ is large compared with $4ac$. The most common formula for the roots is
\[ x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}. \]
Consider $x^2 + 100x + 1 = 0$, $\beta = 10$, $t = 4$, $p = 2$, and 4-digit chopped arithmetic. Then
\[ x_1 = \frac{-100 + \sqrt{9996}}{2}, \qquad x_2 = \frac{-100 - \sqrt{9996}}{2}, \]
but $\sqrt{9996} \approx 99.97$ (4 digit arithmetic, chopped). Thus,
\[ x_1 \approx \frac{-100 + 99.97}{2}, \qquad x_2 \approx \frac{-100 - 99.97}{2}. \]
Hence, $x_1 \approx -0.015$, $x_2 \approx -99.98$, but $x_1 = -0.010001$ and $x_2 = -99.989999$, so the relative errors in $x_1$ and $x_2$ are 50% and 0.01%, respectively.

Let's change the algorithm. Assume $b \ge 0$ (can always make $b \ge 0$). Then
\[
x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \cdot \frac{-b - \sqrt{b^2 - 4ac}}{-b - \sqrt{b^2 - 4ac}}
    = \frac{4ac}{2a\bigl(-b - \sqrt{b^2 - 4ac}\bigr)}
    = \frac{-2c}{b + \sqrt{b^2 - 4ac}},
\]
and
\[ x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a} \quad \text{(the same as before).} \]
Then, for the above values,
\[ x_1 = \frac{-2(1)}{100 + \sqrt{9996}} \approx \frac{-2}{100 + 99.97} = -0.0100. \]
Now, the relative error in $x_1$ is also 0.01%.
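The two formulas for $x_1$ can be compared in emulated 4-digit chopped arithmetic. The following is a minimal sketch in Python, assuming the `decimal` module with `ROUND_DOWN` as a stand-in for chopping. Because `decimal`'s square root always rounds to nearest, the sketch takes the root at higher precision and then chops it; the helper names are hypothetical.

```python
from decimal import Context, Decimal, ROUND_DOWN

chop4 = Context(prec=4, rounding=ROUND_DOWN)   # assumed: 4-digit chopped decimal arithmetic
wide = Context(prec=30)                        # used only to form sqrt before chopping

def chopped_sqrt(x):
    """Square root chopped (not rounded) to 4 significant digits."""
    return chop4.plus(wide.sqrt(x))

def x1_both_ways(a, b, c):
    """Return x1 from the standard formula and from the rearranged one."""
    four_ac = chop4.multiply(chop4.multiply(Decimal(4), a), c)
    root = chopped_sqrt(chop4.subtract(chop4.multiply(b, b), four_ac))
    # Standard formula: -b + sqrt(...) subtracts two nearly equal numbers.
    naive = chop4.divide(chop4.add(-b, root), chop4.multiply(Decimal(2), a))
    # Rearranged formula: -2c / (b + sqrt(...)) involves no cancellation.
    stable = chop4.divide(chop4.multiply(Decimal(-2), c), chop4.add(b, root))
    return naive, stable

naive_x1, stable_x1 = x1_both_ways(Decimal(1), Decimal(100), Decimal(1))
print(naive_x1, stable_x1)
```

Here `naive_x1` equals $-0.015$ and `stable_x1` equals $-0.0100$, matching the hand computation for $x^2 + 100x + 1 = 0$.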
Let us now consider error in function evaluation. Consider a single-valued function $f(x)$, and let $x^* = fl(x)$ be the floating point approximation of $x$. Therefore, the machine evaluates $f(x^*) = f(fl(x))$, which is an approximate value of $f(x)$ at $x = x^*$. Then the perturbation in $f(x)$ for small perturbations in $x$ can be computed via Taylor's formula. This is illustrated in the next theorem.

THEOREM 1.7 The relative error in functional evaluation is
\[
\left| \frac{f(x) - f(x^*)}{f(x)} \right| \approx \left| \frac{x f'(x)}{f(x)} \right| \left| \frac{x - x^*}{x} \right|.
\]
PROOF The linear Taylor approximation of $f(x^*)$ about $x$ for small values of $x - x^*$ is given by $f(x^*) \approx f(x) + f'(x)(x^* - x)$. Rearranging the terms immediately yields the result.
This leads us to the following definition.

DEFINITION 1.5 The condition number of a function $f(x)$ is
\[ \kappa_f(x) = \left| \frac{x f'(x)}{f(x)} \right|. \]
The condition number describes how large the relative error in function evaluation is with respect to the relative error in the machine representation of $x$. In other words, $\kappa_f(x)$ is a measure of the degree of sensitivity of the function at $x$.

Example 1.18
Let $f(x) = \sqrt{x}$. The condition number of $f(x)$ about $x$ is
\[ \kappa_f(x) = \left| \frac{x \cdot \frac{1}{2\sqrt{x}}}{\sqrt{x}} \right| = \frac{1}{2}. \]
This suggests that $f(x)$ is well-conditioned.
Example 1.19
Let $f(x) = \sqrt{x - 2}$. The condition number of $f(x)$ about $x$ is
\[ \kappa_f(x) = \left| \frac{x}{2(x - 2)} \right|. \]
This is not defined at $x^* = 2$. Hence the function $f(x)$ is numerically unstable and ill-conditioned for values of $x$ close to 2.
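The two condition numbers can be checked numerically. The sketch below is in Python, with the derivative supplied by hand. It evaluates $\kappa_f$ for $\sqrt{x}$ and $\sqrt{x - 2}$, and compares the predicted amplification with the one observed after a small relative perturbation of $x$. The evaluation points 9 and 2.001 and the perturbation size $10^{-8}$ are arbitrary choices for illustration.

```python
import math

def condition_number(f, df, x):
    """kappa_f(x) = |x f'(x) / f(x)|, with the derivative df supplied by hand."""
    return abs(x * df(x) / f(x))

def g(x):
    return math.sqrt(x - 2.0)

def dg(x):
    return 0.5 / math.sqrt(x - 2.0)

kappa_sqrt = condition_number(math.sqrt, lambda x: 0.5 / math.sqrt(x), 9.0)
kappa_near_2 = condition_number(g, dg, 2.001)

# Observed amplification: relative change in g divided by relative change in x.
x = 2.001
rel_dx = 1e-8
x_star = x * (1.0 + rel_dx)
amplification = abs((g(x_star) - g(x)) / g(x)) / rel_dx

print(kappa_sqrt, kappa_near_2, amplification)
```

The first value is about 1/2 regardless of $x$, while the second is about 1000 at $x = 2.001$, and the observed amplification agrees with it to within the accuracy of the linear approximation.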
REMARK 1.2 If $x = f(x) = 0$, then the condition number is simply $|f'(x)|$. If $x = 0$, $f(x) \neq 0$ (or $f(x) = 0$, $x \neq 0$), then it is more useful to consider the relation between absolute errors than relative errors. The condition number then becomes $|f'(x)/f(x)|$.

REMARK 1.3 Generally, if a numerical approximation $\tilde{z}$ to a quantity $z$ is computed, the relative error is related to the number of digits after the decimal point that are correct. For example, if $z = 0.0000123453$ and $\tilde{z} = 0.00001234543$, we say that $\tilde{z}$ is correct to 5 significant digits. Expressing $z$ as $0.123453 \times 10^{-4}$ and $\tilde{z}$ as $0.123454 \times 10^{-4}$, we see that if we round $\tilde{z}$ to the nearest number with five digits in its mantissa, all of those digits are correct, whereas, if we do the same with six digits, the sixth digit is not correct. Significant digits is the more logical way to talk about accuracy in a floating point computation where we are interested in relative error, rather than "number of digits after the decimal point," which can have a different meaning. (Here, one might say that $\tilde{z}$ is correct to 9 digits after the decimal point.)
1.2.2 Practicalities and the IEEE Floating Point Standard

Prior to 1985, different machines used different word lengths and different bases, and different machines rounded, chopped, or did something else to form the internal representation $fl(x)$ for real numbers $x$. For example, IBM mainframes generally used hexadecimal arithmetic ($\beta = 16$), with 8 hexadecimal digits total (for the base, sign, and exponent) in "single precision" numbers and 16 hexadecimal digits total in "double precision" numbers. Machines such as the Univac 1108 and Honeywell Multics systems used base $\beta = 2$ and 36 binary digits (or "bits") total in single precision numbers and 72 binary digits total in double precision numbers. An unusual machine designed at Moscow State University from 1955-1965, the "Setun," even used base-3 ($\beta = 3$, or "ternary") numbers. Some computers had 32 bits total in single precision numbers and 64 bits total in double precision numbers, while some "supercomputers" (such as the Cray-1) had 64 bits total in single precision numbers and 128 bits total in double precision numbers.

Some handheld calculators in existence today (such as some Texas Instruments calculators) can be viewed as implementing decimal (base 10, $\beta = 10$) arithmetic, say, with $L = -999$ and $U = 999$, and $t = 14$ digits in the mantissa.

Except for the Setun (the value of whose ternary digits corresponded to "positive," "negative," and "neutral" in circuit elements or switches), digital computers are mostly based on binary switches or circuit elements (that is, "on" or "off"), so the base $\beta$ is usually 2 or a power of 2. For example, the IBM hexadecimal digit could be viewed as a group of 4 binary digits.²

Older floating point implementations did not even always fit exactly into the model we have previously described. For example, if $x$ is a number in the system, then $-x$ may not have been a number in the system, or, if $x$ were a number in the system, then $1/x$ may have been too large to be representable in the system.
To promote predictability, portability, reliability, and rigorous error bounding in floating point computations, the Institute of Electrical and Electronics Engineers (IEEE) and American National Standards Institute (ANSI) published a standard for binary floating point arithmetic in 1985: IEEE/ANSI 754-1985: Standard for Binary Floating Point Arithmetic, often referenced as "IEEE-754," or simply "the IEEE standard."³ Almost all computers in existence today, including personal computers and workstations based on Intel, AMD, Motorola, etc. chips, implement most of the IEEE standard.

In this standard, $\beta = 2$, 32 bits total are used in a single precision number (an "IEEE single"), and 64 bits total are used for a double precision number ("IEEE double"). In a single precision number, 1 bit is used for the sign, 8 bits are used for the exponent, and $t = 23$ bits are used for the mantissa. In double precision numbers, 1 bit is used for the sign, 11 bits are used for the exponent, and 52 bits are used for the mantissa. Thus, for single precision numbers, the exponent is between 0 and $(11111111)_2 = 255$, and 127 is subtracted from this, to get an exponent between $-127$ and $128$. In IEEE numbers, the minimum and maximum exponent are used to denote special symbols (such as infinity and "unnormalized" numbers), so the exponent in single precision represents magnitudes between $2^{-126} \approx 10^{-38}$ and $2^{127} \approx 10^{38}$. The mantissa for single precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{23} 2^{-i} = 2(1 - 2^{-24}) \approx 2$. Similarly, the exponent for double precision numbers is effectively between $2^{-1022} \approx 10^{-308}$ and $2^{1023} \approx 10^{308}$, while the mantissa for double precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{52} 2^{-i} \approx 2$.
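The single precision layout can be inspected directly. As a minimal sketch in Python, the standard `struct` module packs a number into its 32-bit IEEE single representation, and the sign, the 8-bit biased exponent and the 23 stored mantissa bits are then separated with shifts and masks. The function names are illustrative, and the reconstruction formula applies to normalized numbers only.

```python
import struct

def decode_single(x):
    """Split the IEEE single representation of x into (sign, biased exponent, fraction)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                      # 1 bit
    biased_exponent = (bits >> 23) & 0xFF  # 8 bits, between 0 and 255
    fraction = bits & 0x7FFFFF             # 23 stored mantissa bits
    return sign, biased_exponent, fraction

def rebuild_normalized(sign, biased_exponent, fraction):
    """Value of a normalized single: (-1)^sign * (1 + fraction / 2^23) * 2^(e - 127)."""
    return (-1.0) ** sign * (1.0 + fraction / 2.0 ** 23) * 2.0 ** (biased_exponent - 127)

fields = decode_single(-6.5)          # -6.5 = -1.625 * 2^2
value = rebuild_normalized(*fields)
print(fields, value)
```

For $-6.5$ the fields are sign 1, biased exponent 129 (that is, $2 + 127$) and fraction $0.625 \cdot 2^{23} = 5242880$.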
Summarizing, the parameters for IEEE arithmetic appear in Table 1.1.

In many numerical computations, such as solving the large linear systems arising from partial differential equation models, more digits or a larger exponent range is required than is available with IEEE single precision. For this reason, many numerical analysts at present have adopted IEEE double precision as the default precision. For example, underlying computations in the popular computational environment matlab are done in IEEE double precision.

TABLE 1.1: Parameters for IEEE arithmetic

precision   β   L       U      t
single      2   -126    127    24
double      2   -1022   1023   53

² An exception is in some systems for business calculations, where base 10 is implemented.
³ An update to the 1985 standard was made in 2008. This update gives clarifications of certain ambiguous points, provides certain extensions, and specifies a standard for decimal arithmetic.
IEEE arithmetic provides four ways of defining $fl(x)$, that is, four "rounding modes": "round down," "round up," "round to nearest," and "round to zero." The four elementary operations $+$, $-$, $\times$, and $/$ must be such that $fl(x \circ y)$ is implemented for all four rounding modes, for $\circ \in \{-, +, \times, /, \sqrt{\cdot}\}$.

The default mode (if the rounding mode is not explicitly set) is normally "round to nearest," to give an approximation after a long string of computations that is hopefully near the exact value. If the mode is set to "round down" and a string of computations is done, then the result is less than or equal to the exact result. Similarly, if the mode is set to "round up," then the result of a string of computations is greater than or equal to the exact result. In this way, mathematically rigorous bounds on an exact result can be obtained. (This technique must be used astutely, since naive use could result in bounds that are too large to be meaningful.)
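Directed rounding can be illustrated without access to the hardware rounding mode. The following is a sketch in Python, assuming a 4-digit decimal system emulated with the `decimal` module, in which `ROUND_FLOOR` and `ROUND_CEILING` play the roles of "round down" and "round up." The partial harmonic sum is an arbitrary example chosen because all of its operations are on positive numbers, and the exact value is computed with rational arithmetic for comparison.

```python
from decimal import Context, Decimal, ROUND_FLOOR, ROUND_CEILING, ROUND_HALF_EVEN
from fractions import Fraction

def harmonic(n, rounding):
    """1 + 1/2 + ... + 1/n in 4-digit decimal arithmetic with the given rounding mode."""
    ctx = Context(prec=4, rounding=rounding)
    total = Decimal(0)
    for k in range(1, n + 1):
        total = ctx.add(total, ctx.divide(Decimal(1), Decimal(k)))
    return total

n = 50
lower = harmonic(n, ROUND_FLOOR)        # every operation rounded down
upper = harmonic(n, ROUND_CEILING)      # every operation rounded up
nearest = harmonic(n, ROUND_HALF_EVEN)  # the usual default mode
exact = sum(Fraction(1, k) for k in range(1, n + 1))

print(lower, nearest, upper, float(exact))
```

The exact sum lies between `lower` and `upper`. The width `upper - lower` shows how much the accumulated rounding could have affected the round-to-nearest result.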
Several parameters more directly related to numerical computations than $L$, $U$, and $t$ are associated with any floating point number system. These are

HUGE: the largest representable number in the floating point system.

TINY: the smallest positive representable number in the floating point system.

$\epsilon_m$: the machine epsilon, the smallest positive number which, when added to 1, gives something other than 1 when using the rounding mode "round to the nearest."

These so-called "machine constants" appear in Table 1.2 for the IEEE single and IEEE double precision number systems.
For IEEE arithmetic, $1/\text{TINY} < \text{HUGE}$, but $1/\text{HUGE} < \text{TINY}$. This brings up the question of what happens when the result of a computation has absolute value less than the smallest number representable in the system, or has absolute value greater than the largest number representable in the system. In the first case, an underflow occurs, while, in the second case, an overflow occurs. In floating point computations, it is usually (but not always) reasonable to replace the result of an underflow by 0, but it is usually more problematical when an overflow occurs. Many systems prior to the IEEE standard replaced an underflow by 0, but stopped when an overflow occurred.

TABLE 1.2: Machine constants for IEEE arithmetic

Precision   HUGE                                                  TINY                                     ε_m
single      2^{127} (2 - 2^{-23}) ≈ 3.40 · 10^{38}                2^{-126} ≈ 1.18 · 10^{-38}               2^{-24} + 2^{-47} ≈ 5.96 · 10^{-8}
double      2^{1023} (2 - 2^{-52}) ≈ 1.79 · 10^{308}              2^{-1022} ≈ 2.23 · 10^{-308}             2^{-53} + 2^{-105} ≈ 1.11 · 10^{-16}
The IEEE standard specifies representations for special numbers $\infty$, $-\infty$, $+0$, $-0$, and NaN, where the latter represents "not a number." The standard specifies that computations do not stop when an overflow or underflow occurs, or when quantities such as $\sqrt{-1}$, $1/0$, $-1/0$, etc. are encountered (although many programming languages by default or optionally do stop). For example, the result of an overflow is set to $\infty$, whereas the result of $\sqrt{-1}$ is set to NaN, and computation continues. The standard also specifies "gradual underflow," that is, setting the result to a "denormalized" number, or a number in the floating point format whose first digit in the mantissa is equal to 0. Computation rules for these special numbers, such as NaN $\times$ any number = NaN, $\infty$ $\times$ any positive normalized number = $\infty$, allow such "nonstop" arithmetic.

Although the IEEE nonstop arithmetic is useful in many contexts, the numerical analyst should be aware of it and be cautious in interpreting results. In particular, algorithms may not behave as expected if many intermediate results contain $\infty$ or NaN, and the accuracy is less than expected when denormalized numbers are used. In fact, many programming languages, by default or with a controllable option, stop if $\infty$ or NaN occurs, but implement IEEE nonstop arithmetic with an option.
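Both behaviors can be observed in Python, used here purely as an illustration. Products that overflow or underflow follow the IEEE nonstop rules, while division by zero and the square root of a negative number raise exceptions, so Python is an example of a language that stops by default for some exceptional operations.

```python
import math

inf = float("inf")
overflowed = 1e308 * 10.0          # overflow in a product quietly gives Infinity
not_a_number = inf - inf           # an undefined operation gives NaN
propagated = not_a_number * 2.0    # NaN times any number is NaN
still_inf = inf * 3.0              # Infinity times a positive number is Infinity
denormal = 2.0 ** -1022 / 4.0      # gradual underflow: below TINY but not 0

# Operations for which Python stops (raises) instead of returning Infinity or NaN.
stopped = []
for label, operation in (("1/0", lambda: 1.0 / 0.0),
                         ("sqrt(-1)", lambda: math.sqrt(-1.0))):
    try:
        operation()
    except (ZeroDivisionError, ValueError) as exc:
        stopped.append((label, type(exc).__name__))

print(overflowed, not_a_number, still_inf, denormal, stopped)
```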
Example 1.20
IEEE double precision floating point arithmetic underlies most computations in matlab. (This is true even if only four decimal digits are displayed.) One obtains the machine epsilon with the function eps, one obtains TINY with the function realmin, and one obtains HUGE with the function realmax. Observe the following matlab dialog:

>> epsm = eps(1d0)
epsm = 2.2204e-016
>> TINY = realmin
TINY = 2.2251e-308
>> HUGE = realmax
HUGE = 1.7977e+308
>> 1/TINY
ans = 4.4942e+307
>> 1/HUGE
ans = 5.5627e-309
>> HUGE^2
ans = Inf
>> TINY^2
ans = 0
>> new_val = 1+epsm
new_val = 1.0000
>> new_val - 1
ans = 2.2204e-016
>> too_small = epsm/2
too_small = 1.1102e-016
>> not_new = 1+too_small
not_new = 1
>> not_new - 1
ans = 0
>>
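For readers without matlab, roughly the same experiment can be sketched in Python, whose `float` is an IEEE double on essentially all current platforms (an assumption of this sketch). The standard `sys.float_info` record supplies the constants. Products are used instead of powers because Python raises an exception when a float power overflows.

```python
import sys

epsm = sys.float_info.epsilon      # spacing of doubles at 1, like eps(1) in matlab
TINY = sys.float_info.min          # smallest positive normalized double
HUGE = sys.float_info.max          # largest finite double

huge_squared = HUGE * HUGE         # overflows to inf
tiny_squared = TINY * TINY         # underflows to 0.0
new_val = 1.0 + epsm               # distinguishable from 1
not_new = 1.0 + epsm / 2.0         # rounds back to 1

print(epsm, TINY, HUGE, 1.0 / TINY, 1.0 / HUGE)
print(huge_squared, tiny_squared, new_val - 1.0, not_new - 1.0)
```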
Example 1.21
(Illustration of underflow and overflow) Suppose, for the purposes of illustration, we have a system with $\beta = 10$, $t = 2$, and one digit in the exponent, so that the positive numbers in the system range from $0.10 \times 10^{-9}$ to $0.99 \times 10^{9}$, and suppose we wish to compute $N = \sqrt{x_1^2 + x_2^2}$, where $x_1 = x_2 = 10^6$. Then both $x_1$ and $x_2$ are exactly represented in the system, and the nearest floating point number in the system to $N$ is $0.14 \times 10^7$, well within range. However, $x_1^2 = 10^{12}$, larger than the maximum floating point number in the system. In older systems, an overflow usually would result in stopping the computation, while in IEEE arithmetic, the result would be assigned the symbol "Infinity." The result of adding "Infinity" to "Infinity," then taking the square root, would be "Infinity," so that $N$ would be assigned "Infinity." Similarly, if $x_1 = x_2 = 10^{-6}$, then $x_1^2 = 10^{-12}$, smaller than the smallest representable machine number, causing an "underflow." On older systems, the result is usually set to 0. On IEEE systems, if "gradual underflow" is switched on, the result either becomes a denormalized number, with less than full accuracy, or is set to 0; without gradual underflow on IEEE systems, the result is set to 0. When the result is set to 0, a value of 0 is stored in $N$, whereas the closest floating point number in the system is $0.14 \times 10^{-5}$, well within range. To avoid this type of catastrophic underflow and overflow in the computation of $N$, we may use the following scheme.

1. $s \leftarrow \max\{ |x_1|, |x_2| \}$.
2. $\eta_1 \leftarrow x_1/s$; $\eta_2 \leftarrow x_2/s$.
3. $N \leftarrow s \sqrt{\eta_1^2 + \eta_2^2}$.
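The three-step scheme translates directly into code. The sketch below is in Python with IEEE doubles, so the magnitudes $10^{200}$ and $10^{-200}$ take the place of $10^{6}$ and $10^{-6}$ in the toy system. The guard for $s = 0$ is an addition not present in the scheme above.

```python
import math

def naive_norm(x1, x2):
    """sqrt(x1^2 + x2^2) computed directly; the squares may overflow or underflow."""
    return math.sqrt(x1 * x1 + x2 * x2)

def scaled_norm(x1, x2):
    """The same quantity using the scaling scheme: s, then eta1 and eta2, then N."""
    s = max(abs(x1), abs(x2))
    if s == 0.0:                 # added guard: both inputs are zero
        return 0.0
    eta1, eta2 = x1 / s, x2 / s
    return s * math.sqrt(eta1 * eta1 + eta2 * eta2)

print(naive_norm(1e200, 1e200), scaled_norm(1e200, 1e200))
print(naive_norm(1e-200, 1e-200), scaled_norm(1e-200, 1e-200))
```

The direct formula returns Infinity for the large inputs and 0 for the small ones, while the scaled version returns values close to $\sqrt{2} \cdot 10^{200}$ and $\sqrt{2} \cdot 10^{-200}$.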
1.2.2.1 Input and Output

For examining the output to large numerical computations arising from mathematical models, plots, graphs, and movies comprised of such plots and graphs are often preferred over tables of values. However, to develop such models and study numerical algorithms, it is necessary to examine individual numbers. Because humans are trained to comprehend decimal numbers more easily than binary numbers, the binary format used in the machine is usually converted to a decimal format for display or printing. In many programming languages and environments (such as all versions of Fortran, C, C++, and in matlab), the format is of a form similar to $\pm d_1.d_2 d_3 \ldots d_m \mathrm{e} \pm \delta_1 \delta_2 \delta_3$, or $\pm d_1.d_2 d_3 \ldots d_m \mathrm{E} \pm \delta_1 \delta_2 \delta_3$, where the "e" or "E" denotes the "exponent" of 10. For example, -1.00e+003 denotes $-1 \times 10^3 = -1000$. Numbers are usually also input either in a standard decimal form (such as 0.001), or in this exponential format (such as 1.0e-3). (This notation originates from the earliest computers, where the only output was a printer, and the printer could only print numerical digits and the 26 upper case letters in the Roman alphabet.)

Thus, for input, a decimal fraction needs to be converted to a binary floating point number, while, for output, a binary floating point number needs to be converted to a decimal fraction. This conversion necessarily is inexact. For example, the exact decimal fraction 0.1 converts to the infinitely repeating binary expansion $(0.0\overline{0011})_2$, which needs to be rounded into the binary floating point system. The IEEE 754 standard specifies that the result of a decimal to binary conversion, within a specified range of input formats, be the nearest floating point number to the exact result, over a specified range, and that, within a specified range of formats, a binary to decimal conversion be the nearest number in the specified format (which depends on the number $m$ of decimal digits requested to be printed).

Thus, the number that one sees as output is usually not exactly the number that is represented in the computer. Furthermore, while the floating point operations on binary numbers are usually implemented in hardware or "firmware" independently of the software system, the decimal to binary and binary to decimal conversions are usually implemented separately as part of the programming language (such as Fortran, C, C++, Java, etc.) or software system (such as matlab). The individual standards for these languages, if there are any, may not specify accuracy for such conversions, and the languages sometimes do not conform to the IEEE standard. That is, the number that one sees printed may not even be the closest number in that format to the actual number.
This inexactness in conversion usually does not cause a problem, but may cause much confusion in certain instances. In those instances (such as in "debugging," or finding programming blunders), one may need to examine the binary numbers directly. One way of doing this is in an "octal," or base-8 format, in which each digit (between 0 and 7) is interpreted as a group of three binary digits, or in hexadecimal format (where the digits are 0-9, A, B, C, D, E, F), in which each digit corresponds to a group of four binary digits.
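As an illustration of the difference between the stored binary number and its printed decimal form, the following Python sketch examines the double nearest to 0.1. `Decimal(0.1)` gives the exact decimal expansion of the stored value, and `float.hex` shows the mantissa in hexadecimal, where the repeating group 1001 (hex digit 9) of the binary expansion is visible.

```python
from decimal import Decimal

stored = Decimal(0.1)          # exact decimal expansion of the double nearest to 0.1
hex_digits = (0.1).hex()       # the same double, with the mantissa in hexadecimal
short = format(0.1, ".4e")     # a display with 5 significant decimal digits
round_trip = float(repr(0.1)) == 0.1   # repr prints enough digits to recover the double

print(stored)
print(hex_digits, short, round_trip)
```

The printed form `1.0000e-01` hides the fact that the stored number is slightly larger than 0.1. The last hexadecimal digit, `a`, is where the infinite expansion was rounded.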
1.2.2.2 Standard Functions

To enable accurate computation of elementary functions such as sin, cos, and exp, IEEE 754 specifies that a "long" 80-bit register (with "guard digits") be available for intermediate computations. Furthermore, IEEE 754-2008, an official update to IEEE 754-1985, provides a list of functions it recommends be implemented, and specifies accuracy requirements (in terms of correct rounding), for those functions a programming language elects to implement.

REMARK 1.4 Alternative number systems, such as variable precision arithmetic, multiple precision arithmetic, rational arithmetic, and combinations of approximate and symbolic arithmetic have been investigated and implemented. These have various advantages over the traditional floating point arithmetic we have been discussing, but also have disadvantages, and usually require more time, more circuitry, or both. Eventually, with the advance of computer hardware and better understanding of these alternative systems, their use may become more ubiquitous. However, for the foreseeable future, traditional floating point number systems will be the primary tool in numerical computations.

1.3 Interval Computations

Interval computations are useful for two main purposes:

• to use floating point computations to compute mathematically rigorous bounds on an exact result (and hence, to rigorously bound roundoff error);

• to use floating point computations to compute mathematically rigorous bounds on the ranges of functions over boxes.

In complicated traditional floating point algorithms, naive arrangement of interval computations usually gives bounds that are too wide to be of practical use. For this reason, interval computations have been ignored by many. However, used cleverly and where appropriate, interval computations are powerful, and provide rigor and validation when other techniques cannot.

Interval computations are based on interval arithmetic.
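1.3.1 Interval Arithmetic

In interval arithmetic, we define operations on intervals, which can be considered as ordered pairs of real numbers. We can think of each interval as representing the range of possible values of a quantity. The result of an operation is then an interval that represents the range of possible results of the operation as the range of all possible values, as the first argument ranges over all points in the first interval and the second argument ranges over all values in the second interval. To state this symbolically, let $\boldsymbol{x} = [\underline{x}, \overline{x}]$ and $\boldsymbol{y} = [\underline{y}, \overline{y}]$, and define the four elementary operations by
\[
\boldsymbol{x} \circ \boldsymbol{y} = \{ x \circ y \mid x \in \boldsymbol{x} \text{ and } y \in \boldsymbol{y} \} \quad \text{for } \circ \in \{ +, -, \times, \div \}. \tag{1.1}
\]
Interval arithmetic's usefulness derives from the fact that the mathematical characterization in Equation (1.1) is equivalent to the following operational definitions:
\[
\begin{aligned}
\boldsymbol{x} + \boldsymbol{y} &= [\underline{x} + \underline{y},\; \overline{x} + \overline{y}], \\
\boldsymbol{x} - \boldsymbol{y} &= [\underline{x} - \overline{y},\; \overline{x} - \underline{y}], \\
\boldsymbol{x} \times \boldsymbol{y} &= [\min\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\},\; \max\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\}], \\
\frac{1}{\boldsymbol{x}} &= \left[ \frac{1}{\overline{x}},\; \frac{1}{\underline{x}} \right] \quad \text{if } \underline{x} > 0 \text{ or } \overline{x} < 0, \\
\boldsymbol{x} \div \boldsymbol{y} &= \boldsymbol{x} \times \frac{1}{\boldsymbol{y}}.
\end{aligned} \tag{1.2}
\]
The ranges of the four elementary interval arithmetic operations are exactly the ranges of the corresponding real operations, but, if such operations are composed, bounds on the ranges of real functions can be obtained. For example, if
\[ f(x) = (x + 1)(x - 1), \tag{1.3} \]
then
\[
\boldsymbol{f}([-2, 2]) = \bigl( [-2, 2] + 1 \bigr) \bigl( [-2, 2] - 1 \bigr) = [-1, 3][-3, 1] = [-9, 3],
\]
which contains the exact range $[-1, 3]$.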
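The operational definitions (1.2) can be sketched as a small class. This is an illustrative Python sketch, not intlab or any library API: the class name `Interval` and the helper `as_interval` are hypothetical, and the end points are computed in ordinary round-to-nearest arithmetic, so no outward rounding is performed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        other = as_interval(other)
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        other = as_interval(other)
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        other = as_interval(other)
        products = (self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi)
        return Interval(min(products), max(products))

    def reciprocal(self):
        # Defined only when the interval does not contain 0, as in (1.2).
        if not (self.lo > 0 or self.hi < 0):
            raise ZeroDivisionError("interval contains 0")
        return Interval(1.0 / self.hi, 1.0 / self.lo)

    def __truediv__(self, other):
        return self * as_interval(other).reciprocal()

def as_interval(value):
    """Treat a plain number c as the degenerate interval [c, c]."""
    return value if isinstance(value, Interval) else Interval(value, value)

def f(x):
    return (x + 1) * (x - 1)      # the expression in (1.3)

print(f(Interval(-2.0, 2.0)))
```

Evaluating `f(Interval(-2.0, 2.0))` gives the interval with end points $-9$ and $3$, as in the hand computation.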
REMARK 1.5 In some definitions of interval arithmetic, division by intervals containing 0 is defined, consistent with (1.1). For example,
\[
\frac{[1, 2]}{[-3, 4]} = \left[ -\infty, -\frac{1}{3} \right] \cup \left[ \frac{1}{4}, \infty \right] = \mathbb{R}^* \setminus \left( -\frac{1}{3}, \frac{1}{4} \right),
\]
where $\mathbb{R}^*$ is the extended real number system,⁴ consisting of the real numbers with the two additional numbers $-\infty$ and $\infty$. This extended interval arithmetic⁵ was originally invented by William Kahan⁶ for computations with continued fractions, but has wider use than that. Although a closed system can be defined for the sets arising from this extended arithmetic, typically, the complements of intervals (i.e., the unions of two semi-infinite intervals) are immediately intersected with intervals, to obtain zero, one, or two intervals. Interval arithmetic can then proceed using (1.2).

⁴ also known as the two-point compactification of the real numbers.
The power of interval arithmetic lies in its implementation on computers. In particular, outwardly rounded interval arithmetic allows rigorous enclosures for the ranges of operations and functions. This makes a qualitative difference in scientific computations, since the results are now intervals in which the exact result must lie. It also enables use of floating point computations for automated theorem proving.

Outward rounding can be implemented on any machine that has downward rounding and upward rounding, such as any machine that complies with the IEEE 754 standard. For example, take $\boldsymbol{x} + \boldsymbol{y} = [\underline{x} + \underline{y}, \overline{x} + \overline{y}]$. If $\underline{x} + \underline{y}$ is computed with downward rounding, and $\overline{x} + \overline{y}$ is computed with upward rounding, then the resulting interval $\boldsymbol{z} = [\underline{z}, \overline{z}]$ that is represented in the machine must contain the exact range of $x + y$ for $x \in \boldsymbol{x}$ and $y \in \boldsymbol{y}$. We call the expansion of the interval from rounding the lower end point down and the upper end point up roundout error.
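Standard Python does not expose the hardware rounding mode, so the following sketch uses a substitute technique: it adds in round-to-nearest and then moves each end point one floating point number outward with `math.nextafter` (available from Python 3.9). This is more pessimistic than true downward and upward rounding, but the result is still a valid enclosure, which is checked here against exact rational arithmetic.

```python
import math
from fractions import Fraction

def add_outward(x, y):
    """Enclose [x_lo + y_lo, x_hi + y_hi]; x and y are (lo, hi) pairs of floats."""
    lo = math.nextafter(x[0] + y[0], -math.inf)   # one step down from the nearest sum
    hi = math.nextafter(x[1] + y[1], math.inf)    # one step up from the nearest sum
    return lo, hi

x = (0.1, 0.1)        # the doubles nearest to 0.1 and 0.2, as degenerate intervals
y = (0.2, 0.2)
z = add_outward(x, y)
exact = Fraction(x[0]) + Fraction(y[0])   # exact sum of the two stored doubles

print(z, Fraction(z[0]) <= exact <= Fraction(z[1]))
```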
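Interval arithmetic is only subdistributive. That is, if $\boldsymbol{x}$, $\boldsymbol{y}$, and $\boldsymbol{z}$ are intervals, then
\[
\boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) \subseteq \boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z}, \quad \text{but} \quad \boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) \neq \boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z} \text{ in general.} \tag{1.4}
\]
As a result, algebraic expressions that would be equivalent if real values are substituted for the variables are not equivalent if interval values are used. For example, instead of writing $(x - 1)(x + 1)$ for $f(x)$ in (1.3), suppose we write
\[ f(x) = x^2 - 1, \tag{1.5} \]
and suppose we provide a routine that computes an enclosure for the range of $x^2$ that is the exact range to within roundoff error. Such a routine could be as follows: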
ALGORITHM 1.1
(Computing an interval whose end points are machine numbers and which encloses the range of $x^2$.)

INPUT: $\boldsymbol{x} = [\underline{x}, \overline{x}]$.

OUTPUT: a machine-representable interval that contains the range of $x^2$ over $\boldsymbol{x}$.

IF $\underline{x} \ge 0$ THEN
    RETURN $[\underline{x}^2, \overline{x}^2]$, where $\underline{x}^2$ is computed with downward rounding and $\overline{x}^2$ is computed with upward rounding.
ELSE IF $\overline{x} \le 0$ THEN
    RETURN $[\overline{x}^2, \underline{x}^2]$, where $\overline{x}^2$ is computed with downward rounding and $\underline{x}^2$ is computed with upward rounding.
ELSE
    1. Compute $\underline{x}^2$ and $\overline{x}^2$ with both downward and upward rounding; that is, compute $\underline{x}^2_l$ and $\underline{x}^2_u$ such that $\underline{x}^2_l$ and $\underline{x}^2_u$ are machine representable numbers and $\underline{x}^2 \in [\underline{x}^2_l, \underline{x}^2_u]$, and compute $\overline{x}^2_l$ and $\overline{x}^2_u$ such that $\overline{x}^2_l$ and $\overline{x}^2_u$ are machine representable numbers and $\overline{x}^2 \in [\overline{x}^2_l, \overline{x}^2_u]$.
    2. RETURN $[0, \max\{ \underline{x}^2_u, \overline{x}^2_u \}]$.
END IF

END ALGORITHM 1.1

⁵ There are small differences in current definitions of extended interval arithmetic. For example, in some systems, $-\infty$ and $\infty$ are not considered numbers, but just descriptive symbols. In those systems, $[1, 2]/[-3, 4] = (-\infty, -1/3] \cup [1/4, \infty) = \mathbb{R} \setminus (-1/3, 1/4)$. See [31] for a theoretical analysis of extended arithmetic.
⁶ who also was a major contributor to the IEEE 754 standard.
With Algorithm 1.1 and rewriting $f(x)$ from (1.3) as in (1.5), we obtain
\[ \boldsymbol{f}([-2, 2]) = [-2, 2]^2 - 1 = [0, 4] - 1 = [-1, 3], \]
which, in this case, is equal to the exact range of $f$ over $[-2, 2]$.

In fact, this illustrates a general principle: If each variable in the expression occurs only once, then interval arithmetic gives the exact range, to within roundout error. We state this formally as
THEOREM 1.8
(Fundamental theorem of interval arithmetic) Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with interval arithmetic. The algebraic expression or computer program can contain the four elementary operations and operations such as $x^n$, $\sin(x)$, $\exp(x)$, and $\log(x)$, etc., as long as the interval values of these functions contain their range over the input intervals. Then

1. The interval value $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ contains the range of $f$ over the interval vector (or box) $(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$.

2. If the single functions (the elementary operations and functions $x^n$, etc.) have interval values that represent their exact ranges, and if each variable $x_i$, $1 \le i \le n$, occurs only once in the expression for $f$, then the values of $\boldsymbol{f}$ obtained by interval arithmetic represent the exact ranges of $f$ over the input intervals.
If the expression for $f$ contains one or more variables more than once, then overestimation of the range can occur due to interval dependency. For example, when we evaluate our example function $\boldsymbol{f}([-2, 2])$ according to (1.3), the first factor, $[-1, 3]$, is the exact range of $x + 1$ for $x \in [-2, 2]$, while the second factor, $[-3, 1]$, is the exact range of $x - 1$ for $x \in [-2, 2]$. Thus, $[-9, 3]$ is the exact range of $\tilde{f}(x_1, x_2) = (x_1 + 1)(x_2 - 1)$ for $x_1$ and $x_2$ independent, $x_1 \in [-2, 2]$, $x_2 \in [-2, 2]$.

We now present some definitions and theorems to clarify the practical consequences of interval dependency.

DEFINITION 1.6 An expression for $f(x_1, \ldots, x_n)$ which is written so that each variable occurs only once is called a single use expression, or SUE.

Fortunately, we do not need to transform every expression into a single use expression for interval computations to be of value. In particular, the interval dependency becomes less as the widths of the input intervals become smaller. The following formal definition will help us to describe this precisely.

DEFINITION 1.7 Suppose an interval evaluation $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ gives $[a, b]$ as a result interval, but the exact range $\{ f(x_1, \ldots, x_n),\ x_i \in \boldsymbol{x}_i,\ 1 \le i \le n \}$ is $[c, d] \subseteq [a, b]$. We define the excess width $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ in the interval evaluation $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ by $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = (c - a) + (b - d)$.

For example, the excess width in evaluating $f(x)$ represented as $(x + 1)(x - 1)$ over $\boldsymbol{x} = [-2, 2]$ is $(-1 - (-9)) + (3 - 3) = 8$. In general, we have

THEOREM 1.9
Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with interval arithmetic, as in Theorem 1.8, to obtain an interval enclosure $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ to the range of $f$ for $x_i \in \boldsymbol{x}_i$, $1 \le i \le n$. Then, if $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ is as in Definition 1.7, we have
\[
E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = O\Bigl( \max_{1 \le i \le n} w(\boldsymbol{x}_i) \Bigr),
\]
where $w(\boldsymbol{x})$ denotes the width of the interval $\boldsymbol{x}$.
That is, the overestimation becomes less as the uncertainty in the arguments to the function becomes smaller.

Interval evaluations as in Theorem 1.9 are termed first-order interval extensions. It is not difficult to obtain second-order extensions, where required. (See Exercise below.)
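The first-order behavior of the excess width can be observed on the running example. The Python sketch below evaluates $(x + 1)(x - 1)$ by interval arithmetic over intervals $[-w/2, w/2]$ of shrinking width $w$ and compares the result with the exact range of $x^2 - 1$. The choice of intervals centered at 0 and the function names are assumptions made for illustration, and ordinary rounding is used since all quantities here are exactly representable.

```python
def interval_product_form(lo, hi):
    """Evaluate (x + 1)(x - 1) over [lo, hi] by interval arithmetic."""
    a = (lo + 1.0, hi + 1.0)
    b = (lo - 1.0, hi - 1.0)
    products = (a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1])
    return min(products), max(products)

def exact_range(lo, hi):
    """Exact range of x^2 - 1 over [lo, hi]."""
    squares = (lo * lo, hi * hi)
    low_square = 0.0 if lo <= 0.0 <= hi else min(squares)
    return low_square - 1.0, max(squares) - 1.0

def excess_width(lo, hi):
    """E = (c - a) + (b - d), with [a, b] the enclosure and [c, d] the exact range."""
    a, b = interval_product_form(lo, hi)
    c, d = exact_range(lo, hi)
    return (c - a) + (b - d)

widths = [2.0, 1.0, 0.5, 0.25, 0.125]
excesses = [excess_width(-w / 2, w / 2) for w in widths]
ratios = [e / w for e, w in zip(excesses, widths)]

print(excess_width(-2.0, 2.0))
print(excesses)
print(ratios)
```

The excess width over $[-2, 2]$ is 8, as computed above, and for the shrinking intervals the ratio of excess width to $w$ stays below 2, consistent with $E = O(w)$.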
1.3.2 Application of Interval Arithmetic: Examples

We give one such example here.

Example 1.22
Using 4-digit decimal floating point arithmetic, compute an interval enclosure for the first two digits of $e$, and prove that these two digits are correct.

Solution: The fifth degree Taylor polynomial representation for $e$ is
\[
e = 1 + 1 + \frac{1}{2!} + \frac{1}{3!} + \frac{1}{4!} + \frac{1}{5!} + \frac{1}{6!} e^{\xi},
\]
for some $\xi \in [0, 1]$. If we assume we know $e < 3$ and we assume we know $e^x$ is an increasing function of $x$, then the error term is bounded by
\[
\left| \frac{1}{6!} e^{\xi} \right| \le \frac{3}{6!} < 0.005,
\]
so this fifth-degree polynomial representation should be adequate. We will evaluate each term with interval arithmetic, and we will replace $e^{\xi}$ with $[1, 3]$. We obtain the following computation:
[1.000, 1.000] + [1.000, 1.000] → [2.000, 2.000]
[1.000, 1.000] / [2.000, 2.000] → [0.5000, 0.5000]
[2.000, 2.000] + [0.5000, 0.5000] → [2.500, 2.500]
[1.000, 1.000] / [6.000, 6.000] → [0.1666, 0.1667]
[2.500, 2.500] + [0.1666, 0.1667] → [2.666, 2.667]
[1.000, 1.000] / [24.00, 24.00] → [0.04166, 0.04167]
[2.666, 2.667] + [0.04166, 0.04167] → [2.707, 2.709]
[1.000, 1.000] / [120.0, 120.0] → [0.008333, 0.008334]
[2.707, 2.709] + [0.008333, 0.008334] → [2.715, 2.718]
[1.000, 1.000] / [720.0, 720.0] → [0.001388, 0.001389]
[0.001388, 0.001389] × [1, 3] → [0.001388, 0.004167]
[2.715, 2.718] + [0.001388, 0.004167] → [2.716, 2.723]

Since we used outward rounding in these computations, this constitutes a mathematical proof that $e \in [2.716, 2.723]$.

Note:

1. These computations can be done automatically on a computer, as simply as evaluating the function in floating point arithmetic. We will explain some programming techniques for this in Chapter 6, Section 6.2.

2. The solution is illustrative. More sophisticated methods, such as argument reduction, would be used in practice to bound values of $e^x$ more accurately and with fewer operations.
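The hand computation of Example 1.22 can be reproduced mechanically. The following Python sketch assumes that 4-digit decimal arithmetic with downward and upward rounding is emulated by two `decimal` contexts. The helpers handle only positive intervals, which is all this example needs, and their names are illustrative.

```python
import math
from decimal import Context, Decimal, ROUND_FLOOR, ROUND_CEILING

down = Context(prec=4, rounding=ROUND_FLOOR)     # lower end points are rounded down
up = Context(prec=4, rounding=ROUND_CEILING)     # upper end points are rounded up

def iadd(x, y):
    return down.add(x[0], y[0]), up.add(x[1], y[1])

def idiv_pos(x, y):
    """x / y for intervals with positive end points."""
    return down.divide(x[0], y[1]), up.divide(x[1], y[0])

def imul_pos(x, y):
    """x * y for intervals with positive end points."""
    return down.multiply(x[0], y[0]), up.multiply(x[1], y[1])

one = (Decimal(1), Decimal(1))
total = iadd(one, one)                           # 1 + 1
for factorial in (2, 6, 24, 120):                # 2!, 3!, 4!, 5!
    term = idiv_pos(one, (Decimal(factorial), Decimal(factorial)))
    total = iadd(total, term)

# Remainder term (1/6!) e^xi, with e^xi replaced by [1, 3].
remainder = imul_pos(idiv_pos(one, (Decimal(720), Decimal(720))),
                     (Decimal(1), Decimal(3)))
enclosure = iadd(total, remainder)

print(enclosure)
```

The result is the interval with end points 2.716 and 2.723, the same enclosure as obtained by hand.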
Proofs of the theorems, as well as greater detail, appear in various texts on interval arithmetic. A good book on interval arithmetic is R. E. Moore's classic text [27], although numerous more recent monographs and reviews are available. A World Wide Web search on the term "interval computations" will lead to some of these.

A general introduction to interval computations is [26]. That work gives not only a complete introduction, with numerous examples and explanation of pitfalls, but also provides examples with intlab, a free matlab toolbox for interval computations, and reference material for intlab. If you have matlab available, we recommend intlab for the exercises in this book involving interval computations.
1.4 Programming Environments
Modern scientific computing (with floating point numbers) is usually done with high-level "imperative" (as opposed to "functional") programming languages. Common programming environments in use for general scientific computing today are Fortran (or FORmula TRANslation), C/C++, and matlab. Fortran is the original such language, with its origins in the late 1950's. There is a large body of high-quality, publicly available software in Fortran for common computations in numerical analysis and scientific computing. Such software can be found, for example, on NETLIB at
http://www.netlib.org/
Fortran has evolved over the years, becoming a modern, multi-faceted language with the Fortran 2003 standard. Throughout, the emphasis by both the standardization committee and suppliers of compilers for Fortran has been on features that simplify programming of solutions to large problems in numerical analysis and scientific computing, and on features that enable high performance, especially on computers that can process vectors and matrices efficiently.
The "C" language, originally developed in conjunction with the Unix operating system, was originally meant to be a higher-level language for designing and accessing the operating system, but has become more ubiquitous since then. C++, appearing in the late 1980's, was the first widely available language (see footnote 7) to allow the object-oriented programming paradigm. In recent years, computer science departments have favored teaching C++ over teaching Fortran, and Fortran has fallen out of favor in relative terms. However, Fortran is still favored in certain large-scale applications such as fluid dynamics (e.g. in weather prediction and similar simulations), and some courses are still offered in it in engineering schools. At the same time, some people still think of Fortran as the now somewhat rudimentary language known as FORTRAN 77.
Reasonably high-quality compilers for both Fortran and C/C++ are available free of charge with Linux operating systems. Fortran 2003, largely implemented in these compilers, has a standardized interface to C, so functions written in C can be called from Fortran programs, and vice versa. These compilers include interactive graphical-user-interface-oriented debuggers, such as "insight," available with the Linux operating system. Commercially available compilation and debugging systems are also available under Windows.
The matlab system has become increasingly popular over the last two decades or so. matlab (or MATrix LABoratory) began in the early 1980's as a National Science Foundation project, written by Cleve Moler in FORTRAN 66, to provide an interactive environment for computing with matrices and vectors, but has since evolved to be both an interactive environment and a full-featured programming language. matlab is highly favored in courses such as this because of the ease of programming, debugging, and general use (such as graphing), and because of the numerous toolboxes, supplied by both Mathworks (Cleve Moler's company) and others, for many basic computing tasks and applications. The main drawback to use of matlab in all scientific computations is that the language is interpretive, that is, matlab translates each line of a program to machine language each time that it executes the line. This makes complicated programs that involve nested iterations much slower (a factor of 60 or more) than comparable programs written in a compiled language such as C or Fortran. However, functions compiled from Fortran or C/C++ can be called from matlab. A common strategy has been to initially develop algorithms in matlab, then translate all or part of the program to a compilable language, as necessary for efficiency.
One perceived disadvantage of matlab is that it is proprietary. Undesirable possible consequences are that it is not free, and that there is no official guarantee that it will be available forever, unchanged. However, its use has become so widespread in recent years that these concerns do not seem to be major. Several projects, including "Octave" and "Scilab," have produced free products that partially support the matlab programming language. The most widely distributed of these, "Octave," is integrated into Linux systems. However, the object-oriented features of Octave are rudimentary compared to those of matlab, and some toolboxes, such as intlab (which we will mention later), will not function with Octave.
Footnote 7: with others, including Fortran, to follow.
Alternative systems sometimes used for scientific computing are computer algebra systems. Perhaps the most common of these are Mathematica and Maple, while a free such system under development is "SAGE." These systems admit a different way of thinking about programming, termed functional programming, in which rules are defined and available at all times to automatically simplify expressions that are presented to the system. (In contrast, in imperative programming, a sequence of commands is executed, one after the other.) Although these systems have become comprehensive, they are based in computations of a different character, rather than in the floating point computations and linear algebra involved in numerical analysis and scientific computing.
We will use matlab in this book to illustrate the concepts, techniques, and applications. With newer versions of matlab, a student can study how to use the system and make programs largely by using the matlab help system. The first place to turn will be the "Getting started" demos, which in newer versions are presented as videos. There are also many books devoted to use of matlab. Furthermore, we will be giving examples throughout this book.
matlab programs can be written as matlab scripts and matlab functions.
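Example 1.23
The matlab script we used to produce the table following Example 1.7 (on page 8) is
a = 2;
x = 2;                        % starting guess x0
xold = x;
err_old = 1;
for k = 0:10
    k                         % no semicolon: display the iteration number
    x                         % display the current iterate
    err = x - sqrt(2);
    err                       % display the current error
    ratio = err/err_old^2     % ratio for quadratic convergence
    err_old = err;
    x = x/2 + 1/x;            % x_{k+1} = x_k/2 + a/(2 x_k) with a = 2
end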
Example 1.24
The matlab script we used to produce the table in Example 1.8 (on page 9) is
format long
a = 2;
x = 2;                        % starting guess x0
xold = x;
err_old = 1;
for k = 0:25
    k                         % no semicolon: display the iteration number
    x                         % display the current iterate
    err = x - sqrt(2);
    err                       % display the current error
    ratio = err/err_old       % ratio for linear convergence
    err_old = err;
    x = x - x^2/3.5 + 2/3.5;  % iteration of Example 1.8
end
An excellent alternative textbook that focuses on matlab functions is Cleve Moler's Numerical Computing with Matlab [25]. An on-line version, along with "m" files, etc., is currently available at http://www.mathworks.com/moler/chapters.html
1.5 Applications
The purpose of the methods and techniques in this book ultimately is to provide both accurate predictions and insight into practical problems. This includes understanding and predicting, and managing or controlling, the evolution of ecological systems and epidemics; designing and constructing durable but inexpensive bridges, buildings, roads, and water control structures; understanding chemical and physical processes; designing chemical plants and electronic components and systems; minimizing costs or maximizing delivery of products or services within companies and governments; etc. To achieve these goals, the numerical methods are a small part of the overall modeling process, which can be viewed as consisting of the following steps.
Identify the problem: This is the first step in translating an often vague situation into a mathematical problem to be solved. What questions must be answered, and how can they be quantified?
Assumptions: Which factors are to be ignored, and which are important? The real world is usually significantly more complicated than mathematical models of it, and simplifications must be made, because some factors are poorly understood, because there isn't enough data to determine some minor factors, or because it is not practical to accurately solve the resulting equations unless the model is simplified. For example, the theory of relativity, and variations in the acceleration of gravity due to the fact that the earth is not exactly round and due to the fact that the density differs from point to point on the surface of the earth, in principle will affect the trajectory of a baseball as it leaves the bat. However, such effects can be ignored when we write down a model of the trajectory of the baseball. On the other hand, we need to include such effects if we are measuring the change in distance between two satellites in a tandem orbit to detect the location of mineral deposits on the surface of the earth.
Construction: In this step, we actually translate the problem into mathematical language.
Analysis: We solve the mathematical problem. Here is where the numerical techniques in this book come into play. With more complicated models, there is an interplay between the previous three steps and this solution process: We may need to simplify the process to enable practical solution. Also, presentation of the result is important here, to maximize the usefulness of the results. In the early days of scientific computing, printouts of numbers were used, but, increasingly, results are presented as two- and three-dimensional graphs and movies.
Interpretation: The numerical solution is compared to the original problem. If it does not make sense, go back and reformulate the assumptions.
Validation: Compare the model to real data. For example, in climate models, the model might be used to predict climate changes in past years, before it is used to predict future climate changes.
Note that there is an intrinsic error introduced in the modeling process (such as when certain phenomena are ignored) that is outside the scope of our study of numerical methods. Such error can only be measured indirectly, through the interpretation and validation steps. In the model solution process (the "analysis" step), errors are also introduced, due to roundoff error and the approximation process. We have seen that such error consists of approximation error and roundoff error. In a study of numerical methods and numerical analysis, we quantify and find bounds on such errors. Although this may not be the major source of error, it is important to know. Consequences of this type of error might be that a good model is rejected, that incorrect conclusions are deduced about the process being modeled, etc. The authors of this book have personal experience with these events.
Errors in the modeling process can sometimes be quantified in the solution process. If the model depends on parameters that are not known precisely, but bounds on those parameters are known, knowledge of these bounds can sometimes be incorporated into the mathematical equations, and the set of possible solutions can sometimes be computed or bounded. One tool that sometimes works is interval arithmetic. Other tools, less mathematically definite but applicable in different situations, are statistical methods and computing solutions to the model for many different values of the parameters.
Throughout this book, we introduce applications from many areas.
Example 1.25
The formula for the net capacitance when two capacitors of values $x$ and $y$ are connected in series is
$$z = \frac{xy}{x + y}.$$
Suppose the measured values of $x$ and $y$ are $x = 1$ and $y = 2$, respectively. Estimate the range of possible values of $z$, given that the true values of $x$ and $y$ are known to be within $\pm 10\%$ of the measured value.
In this example, the identification, assumptions, and construction have already been done. (It is well known how capacitances in a linear electrical circuit behave.) We are asked to analyze the error in the output of the computation, due to errors in the data. We may proceed using interval arithmetic, relying on the accuracy assumptions for the measured values. In particular, these assumptions imply that $x \in [0.9, 1.1]$ and $y \in [1.8, 2.2]$. We will plug these intervals into the expression for $z$, but we first use Theorem 1.8, part (2) as a guide to rewrite the expression for $z$ so $x$ and $y$ only occur once. (We do this so we obtain sharp bounds on the range, without overestimation.) Dividing the numerator and denominator for $z$ by $xy$, we obtain
$$z = \frac{1}{\frac{1}{x} + \frac{1}{y}}.$$
We use the intlab toolbox (see footnote 8) for matlab to evaluate $z$. We have the following dialog in matlab's command window.
>> intvalinit('DisplayInfsup')
===> Default display of intervals by infimum/supremum
>> x = intval('[0.9,1.1]')
intval x =
[ 0.8999, 1.1001]
>> y = intval('[1.8,2.2]')
intval y =
[ 1.7999, 2.2001]
>> z = 1/(1/x + 1/y)
intval z =
[ 0.5999, 0.7334]
>> format long
>> z
intval z =
[ 0.59999999999999, 0.73333333333334]
>>
Thus, the capacitance must lie between 0.5999 and 0.7334.
Note that x and y are input as strings. This is to assure that roundoff errors in converting the decimal expressions 0.9, 1.1, 1.8, and 2.2 into internal binary format are taken into account. See [26] for more examples of the use of intlab.
Footnote 8: If one has matlab, intlab is available free of charge for non-commercial use from http://www.ti3.tu-harburg.de/~rump/intlab
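As a rough cross-check that does not require intlab, one can observe that $z = 1/(1/x + 1/y)$ is increasing in both $x$ and $y$ for positive arguments, so its smallest and largest values over the box are attained at two opposite corners. The sketch below evaluates those corners in ordinary floating point. It is only an illustration: unlike the intlab computation it carries no rounding guarantee, and the names `z`, `z_lo`, and `z_hi` are ours.

```matlab
% Non-rigorous check of Example 1.25 using ordinary floating point.
% z is increasing in x and in y (for x, y > 0), so the range over the
% box [0.9, 1.1] x [1.8, 2.2] is attained at two opposite corners.
z = @(x, y) 1 ./ (1 ./ x + 1 ./ y);

z_lo = z(0.9, 1.8);   % smallest x and smallest y
z_hi = z(1.1, 2.2);   % largest x and largest y

fprintf('z ranges over approximately [%.6f, %.6f]\n', z_lo, z_hi);
```

The two values agree with the interval enclosure above to the digits shown, which is what one expects when each variable occurs only once in the expression.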
1.6 Exercises
1. Write down a polynomial $p(x)$ such that $|S(x) - p(x)| \le 10^{-10}$ for $-0.2 \le x \le 0.2$, where
$$S(x) = \begin{cases} \dfrac{\sin(x)}{x} & \text{if } x \ne 0, \\ 1 & \text{if } x = 0. \end{cases}$$
Note: $\mathrm{sinc}(x) = S(\pi x) = \sin(\pi x)/(\pi x)$ is the "sinc" function (well-known in signal processing, etc.).
(a) Show that your polynomial $p$ satisfies the condition $|\mathrm{sinc}(x) - p(x)| \le 10^{-10}$ for $x \in [-0.2, 0.2]$.
Hint: You can obtain polynomial approximations with error terms for $\mathrm{sinc}(x)$ by writing down Taylor polynomials and corresponding error terms for $\sin(x)$, then dividing these by $x$. This can be easier than trying to differentiate $\mathrm{sinc}(x)$. For the proof part, you can use, for example, the Taylor polynomial remainder formula or the alternating series test, and you can use interval arithmetic to obtain bounds.
(b) Plot your polynomial approximation and $\mathrm{sinc}(x)$ on the same graph,
(i) over the interval $[-0.2, 0.2]$,
(ii) over the interval $[-3, 3]$,
(iii) over the interval $[-10, 10]$.
2. Suppose $f$ has a continuous third derivative. Show that
$$\left| \frac{f(x + h) - f(x - h)}{2h} - f'(x) \right| = O(h^2).$$
3. Suppose $f$ has a continuous fourth derivative. Show that
$$\left| \frac{f(x + h) - 2f(x) + f(x - h)}{h^2} - f''(x) \right| = O(h^2).$$
4. Let $a = 0.41$, $b = 0.36$, and $c = 0.7$. Assuming a 2-digit decimal computer arithmetic with rounding, show that $\frac{a - b}{c} \ne \frac{a}{c} - \frac{b}{c}$ when using this arithmetic.
5. Write down a formula relating the unit roundoff $\delta$ of Definition 1.3 (page 14) and the machine epsilon $\epsilon_m$ defined on page 20.
6. Store and run the following matlab script. What are your results? What does the script compute? Can you say anything about the computer arithmetic underlying matlab?
eps = 1;
x = 1+eps;
while(x~=1)
    eps = eps/2;
    x = 1+eps;
end
eps = eps+(2*eps)^2;
y = 1+eps;
y-1
7. Suppose, for illustration, we have a system with base $\beta = 10$, $t = 3$ decimal digits in the mantissa, and $L = -9$, $U = 9$ for the exponent. For example, $0.123 \times 10^4$, that is, 1230, is a machine number in this system. Suppose also that "round to nearest" is used in this system.
(a) What is HUGE for this system?
(b) What is TINY for this system?
(c) What is the machine epsilon $\epsilon_m$ for this system?
(d) Let $f(x) = \sin(x) + 1$.
i. Write down $fl(f(0))$ and $fl(f(0.0008))$ in normalized format for this toy system.
ii. Compute $fl(fl(f(0.0008)) - fl(f(0)))$. On the other hand, what is the nearest machine number to the exact value of $f(0.0008) - f(0)$?
iii. Compute $fl(fl(f(0.0008)) - fl(f(0))) / fl(0.0008)$. Compare this to the nearest machine number to the exact value of $(f(0.0008) - f(0))/0.0008$ and to $f'(0)$.
8. Let $f(x) = \dfrac{\ln(x + 1) - \ln(x)}{2}$.
(a) Use four-digit decimal arithmetic with rounding to evaluate $f(100{,}000)$.
(b) Use the Mean Value Theorem to approximate $f(x)$ in a form that avoids the loss of significant digits. Use this form to evaluate $f(x)$ for $x = 100{,}000$ once again.
(c) Compare the relative errors for the answers obtained in (a) and (b).
9. Compute the condition number of $f(x) = e^{\sqrt{x^2 - 1}}$, $x > 1$, and discuss any possible ill-conditioning.
10. Let $f(x) = (\sin(x))^2 + x^2$. Use interval arithmetic to prove that there are no solutions to $f(x) = 0$ for $x \in [-1, -0.8]$.
Thus,
$$\begin{aligned}
f(x) - f(x_0) &= \int_{x_0}^{x} f'(t)\,dt \qquad (\text{let } u = f'(t),\ v = t - x,\ dv = dt) \\
&= f'(x_0)(x - x_0) + \int_{x_0}^{x} (x - t) f''(t)\,dt \qquad (\text{let } u = f''(t),\ dv = (x - t)\,dt) \\
&= f'(x_0)(x - x_0) - \left. \frac{(x - t)^2}{2} f''(t) \right|_{x_0}^{x} + \int_{x_0}^{x} \frac{(x - t)^2}{2} f'''(t)\,dt \\
&= f'(x_0)(x - x_0) + \frac{(x - x_0)^2}{2} f''(x_0) + \int_{x_0}^{x} \frac{(x - t)^2}{2} f'''(t)\,dt.
\end{aligned}$$
Continuing this procedure,
$$\begin{aligned}
f(x) &= f(x_0) + f'(x_0)(x - x_0) + \frac{(x - x_0)^2}{2} f''(x_0) + \cdots + \frac{(x - x_0)^n}{n!} f^{(n)}(x_0) + \int_{x_0}^{x} \frac{(x - t)^n}{n!} f^{(n+1)}(t)\,dt \\
&= P_n(x) + R_n(x).
\end{aligned}$$
Now consider $R_n(x) = \int_{x_0}^{x} \frac{(x - t)^n}{n!} f^{(n+1)}(t)\,dt$, and assume that $x_0 < x$ (same argument if $x_0 > x$). Then, by Theorem 1.2,
$$R_n(x) = f^{(n+1)}(\xi(x)) \int_{x_0}^{x} \frac{(x - t)^n}{n!}\,dt = f^{(n+1)}(\xi(x)) \frac{(x - x_0)^{n+1}}{(n + 1)!},$$
where $\xi$ is between $x_0$ and $x$, and thus $\xi = \xi(x)$.
Example 1.3
Approximate $\sin(x)$ by a polynomial $p(x)$ such that $|\sin(x) - p(x)| \le 10^{-16}$ for $-0.1 \le x \le 0.1$.
For Example 1.3, Taylor polynomials about $x_0 = 0$ are appropriate, since that is the center of the interval about which we wish to approximate. We observe that the terms of even degree in such a polynomial are absent, so, for $n$ even, Taylor's theorem gives
n     P_n                                      R_n
2     $x$                                      $-\frac{x^3}{3!} \cos(c_2)$
4     $x - \frac{x^3}{3!}$                     $\frac{x^5}{5!} \cos(c_4)$
6     $x - \frac{x^3}{3!} + \frac{x^5}{5!}$    $-\frac{x^7}{7!} \cos(c_6)$
...
n     -                                        $(-1)^{n/2} \frac{x^{n+1}}{(n+1)!} \cos(c_n)$
Observing that $|\cos(c_n)| \le 1$, we see that
$$|R_n(x)| \le \frac{|x|^{n+1}}{(n + 1)!}.$$
We may thus form the following table.
n     bound on error R_n
2     $1.67 \times 10^{-4}$
4     $8.33 \times 10^{-8}$
6     $1.98 \times 10^{-11}$
8     $2.76 \times 10^{-15}$
10    $2.51 \times 10^{-19}$
Thus, a polynomial with the required accuracy for $x \in [-0.1, 0.1]$ is
$$p(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \frac{x^9}{9!}.$$
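The table of error bounds in Example 1.3 can be reproduced with a few lines of matlab. The following is a minimal sketch that evaluates $|x|^{n+1}/(n+1)!$ at $|x| = 0.1$ and picks out the first even $n$ that meets the tolerance; the variable names (`xmax`, `bound`, `n_needed`) are ours, not the book's.

```matlab
% Remainder bounds |R_n(x)| <= |x|^(n+1) / (n+1)! on [-0.1, 0.1]
% for the Taylor polynomials of sin(x) about x0 = 0.
xmax = 0.1;
n = 2:2:10;
bound = xmax .^ (n + 1) ./ factorial(n + 1);

for j = 1:numel(n)
    fprintf('n = %2d   bound = %.2e\n', n(j), bound(j));
end

% Smallest even n whose bound meets the 1e-16 tolerance of Example 1.3.
n_needed = n(find(bound <= 1e-16, 1));
fprintf('Tolerance 1e-16 is first met at n = %d\n', n_needed);
```

Because the even-degree terms vanish, the Taylor polynomial for this $n$ has degree $n - 1$, which is why the polynomial in the example stops at the $x^9$ term.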
An important special case of Taylor's theorem is obtained with $n = 0$ (that is, directly from the Fundamental Theorem of Calculus).
THEOREM 1.4
(Mean value theorem) Suppose $f \in C^1[a, b]$, $x \in [a, b]$, and $y \in [a, b]$ (and, without loss of generality, $x \le y$). Then there is a $c \in [x, y] \subseteq [a, b]$ such that
$$f(y) - f(x) = f'(c)(y - x).$$
Example 1.4
Suppose $f(1) = 1$ and $|f'(x)| \le 2$ for $x \in [1, 2]$. What are an upper bound and a lower bound on $f(2)$?
The mean value theorem tells us that
$$f(2) = f(1) + f'(c)(2 - 1) = f(1) + f'(c)$$
for some $c \in (1, 2)$. Furthermore, the fact $|f'(x)| \le 2$ is equivalent to $-2 \le f'(x) \le 2$. Combining these facts gives
$$1 - 2 = -1 \le f(2) \le 1 + 2 = 3.$$
1.1.2 Big "O" Notation
We study "rates of growth" and "rates of decrease" of errors. For example, if we approximate $e^h$ by a first degree Taylor polynomial about $x = 0$, we get
$$e^h - (1 + h) = \frac{1}{2} h^2 e^{\xi},$$
where $\xi$ is some unknown quantity between 0 and $h$. Although we don't know exactly what $e^{\xi}$ is, we know that it is nearly constant (in this case, approximately 1) for $h$ near 0, so the error $e^h - (1 + h)$ is roughly proportional to $h^2$ for $h$ small. This approximate proportionality is often more important to know than the slowly-varying constant $e^{\xi}$. The big "O" and little "o" notation are used to describe and keep track of this approximate proportionality.
DEFINITION 1.1 Let $E(h)$ be an expression that depends on a small quantity $h$. We say that $E(h) = O(h^k)$ if there are an $\epsilon$ and $C$ such that
$$|E(h)| \le C h^k$$
for all $|h| \le \epsilon$.
The "O" denotes "order." For example, if $f(h) = O(h^2)$, we say that "$f$ exhibits order 2 convergence to 0 as $h$ tends to 0."
Example 1.5
$E(h) = e^h - h - 1$. Then $E(h) = O(h^2)$.
PROOF By Taylor's Theorem,
$$e^h = e^0 + e^0 (h - 0) + \frac{h^2}{2} e^{\xi}$$
for some $\xi$ between 0 and $h$. Thus,
$$E(h) = e^h - 1 - h \le h^2 \left( \frac{e^1}{2} \right), \quad \text{and } E(h) \ge 0,$$
for $|h| \le 1$, that is, $\epsilon = 1$ and $C = e/2$ work.
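The constants found in the proof can be checked numerically. The sketch below samples $h$ on a grid in $[-1, 1]$ and tests the inequality $|E(h)| \le (e/2) h^2$ at each sample; it is an illustration under the assumption that a finite grid is representative, not a substitute for the proof, and the variable names are ours.

```matlab
% Numerical illustration of Example 1.5: |E(h)| <= (e/2) h^2 for |h| <= 1,
% where E(h) = exp(h) - h - 1.
h = (-100:100) / 100;          % sample points in [-1, 1], including h = 0
E = exp(h) - h - 1;
C = exp(1) / 2;

holds = all(abs(E) <= C * h.^2);
fprintf('Bound holds at all %d sample points: %d\n', numel(h), holds);
```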
Example 1.6
Show that $\left| \frac{f(x+h) - f(x)}{h} - f'(x) \right| = O(h)$ for $x, x + h \in [a, b]$, assuming that $f$ has two continuous derivatives at each point in $[a, b]$.
PROOF
$$\begin{aligned}
\left| \frac{f(x + h) - f(x)}{h} - f'(x) \right|
&= \left| \frac{f(x) + f'(x) h + \int_{x}^{x+h} (x + h - t) f''(t)\,dt - f(x)}{h} - f'(x) \right| \\
&= \frac{1}{h} \left| \int_{x}^{x+h} (x + h - t) f''(t)\,dt \right| \le \max_{a \le t \le b} |f''(t)| \, \frac{h}{2} = ch.
\end{aligned}$$
1.1.3 Convergence Rates
DEFINITION 1.2 Let $\{x_k\}$ be a sequence with limit $x^*$. If there are constants $C$ and $\alpha$ and an integer $N$ such that $|x_{k+1} - x^*| \le C |x_k - x^*|^{\alpha}$ for $k \ge N$, we say that the rate of convergence is of order at least $\alpha$. If $\alpha = 1$ (with $C < 1$), the rate is said to be linear. If $\alpha = 2$, the rate is said to be quadratic.
Example 1.7
A sequence sometimes learned in elementary classes for computing the square root of a number $a$ is
$$x_{k+1} = \frac{x_k}{2} + \frac{a}{2 x_k}.$$
We have
$$\begin{aligned}
x_{k+1} - \sqrt{a} &= \frac{x_k}{2} + \frac{a}{2 x_k} - \sqrt{a} \\
&= x_k - \frac{x_k^2 - a}{2 x_k} - \sqrt{a} \\
&= (x_k - \sqrt{a}) - (x_k - \sqrt{a}) \frac{x_k + \sqrt{a}}{2 x_k} \\
&= (x_k - \sqrt{a}) \left( 1 - \frac{x_k + \sqrt{a}}{2 x_k} \right) \\
&= (x_k - \sqrt{a}) \frac{x_k - \sqrt{a}}{2 x_k} \\
&= \frac{1}{2 x_k} (x_k - \sqrt{a})^2 \\
&\approx \frac{1}{2 \sqrt{a}} (x_k - \sqrt{a})^2
\end{aligned}$$
for $x_k$ near $\sqrt{a}$, thus showing that the convergence rate is quadratic.
Quadratic convergence is very fast. We can think of quadratic convergence with $C \approx 1$ as doubling the number of significant figures on each iteration. (In contrast, linear convergence with $C = 0.1$ adds one decimal digit of accuracy to the approximation on each iteration.) For example, if we use the square root computation from Example 1.7 with $a = 2$, and starting with $x_0 = 2$, we obtain the following table.
k    x_k                  x_k - sqrt(2)               (x_k - sqrt(2)) / (x_{k-1} - sqrt(2))^2
0    2                    $0.5858 \times 10^{0}$      -
1    1.5                  $0.8579 \times 10^{-1}$     0.2500
2    1.416666666666667    $0.2453 \times 10^{-2}$     0.3333
3    1.414215686274510    $0.2123 \times 10^{-5}$     0.3529
4    1.414213562374690    $0.1594 \times 10^{-11}$    0.3535
5    1.414213562373095    $0.2204 \times 10^{-15}$    -
In the typeset original of this table, the correct digits (those agreeing with $\sqrt{2} = 1.414213562373095\ldots$) are underlined. This table illustrates that the total number of correct digits more than doubles on each iteration. In fact, the multiplying factor $C$ for the quadratic convergence appears to be approaching 0.3535. (The last error ratio is not meaningful in this sense, because only roughly 16 digits were carried in the computation.) Based on our analysis, the limiting value of $C$ should be about $1/(2\sqrt{2}) \approx 0.353553390593274$. (We explain how we computed the table at the end of this chapter.)
Example 1.8
As an example of linear convergence, consider the iteration
$$x_{k+1} = x_k - \frac{x_k^2}{3.5} + \frac{2}{3.5},$$
which converges to $\sqrt{2}$. We obtain the following table.
k     x_k                  x_k - sqrt(2)               (x_k - sqrt(2)) / (x_{k-1} - sqrt(2))
0     2                    $0.5858 \times 10^{0}$      -
1     1.428571428571429    $0.1436 \times 10^{-1}$     $0.2451 \times 10^{-1}$
2     1.416909620991254    $0.2696 \times 10^{-2}$     0.1878
3     1.414728799831946    $0.5152 \times 10^{-3}$     0.1911
4     1.414312349239392    $0.9879 \times 10^{-4}$     0.1917
5     1.414232514607664    $0.1895 \times 10^{-4}$     0.1918
6     1.414217198786659    $0.3636 \times 10^{-5}$     0.1919
7     1.414214260116949    $0.6955 \times 10^{-6}$     0.1919
8     1.414213696254626    $0.1339 \times 10^{-6}$     0.1919
19    1.414213562373097    $0.1554 \times 10^{-14}$    -
Here, the constant $C$ in the linear convergence, to four significant digits, appears to be $0.1919 \approx 1/5$. That is, the error is reduced by approximately a factor of 5 each iteration. We can think of this as obtaining one more correct base-5 digit on each iteration.
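The observed constant can be compared with a prediction. Writing the iteration as $x_{k+1} = g(x_k)$ with $g(x) = x - x^2/3.5 + 2/3.5$, a standard fact about such fixed point iterations (not derived in this section, so treat it as an assumption here) is that the error ratio tends to $|g'(x^*)|$. The sketch below computes $|g'(\sqrt{2})|$ and the ratio observed after eight iterations; the names `g`, `dg`, `C_predicted`, and `C_observed` are ours.

```matlab
% Iteration of Example 1.8: x_{k+1} = g(x_k), g(x) = x - x^2/3.5 + 2/3.5.
g  = @(x) x - x.^2 / 3.5 + 2 / 3.5;
dg = @(x) 1 - 2 * x / 3.5;            % derivative of g

C_predicted = abs(dg(sqrt(2)));       % |g'(sqrt(2))|

% Observed error ratio after a few iterations starting from x0 = 2.
x = 2;
for k = 1:8
    x_prev = x;                       % x_{k-1}
    x = g(x);                         % x_k
end
C_observed = abs(x - sqrt(2)) / abs(x_prev - sqrt(2));

fprintf('predicted C = %.4f, observed C = %.4f\n', C_predicted, C_observed);
```

Both values round to the 0.1919 seen in the last column of the table.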
1.2 Computer Arithmetic
In numerical solution of mathematical problems, two common types of error are:
1. Method (algorithm or truncation) error. This is the error due to approximations made in the numerical method.
2. Rounding error. This is the error made due to the finite number of digits available on a computer.
Example 1.9
By the mean value theorem for integrals (Theorem 1.2, as in Example 1.6 on page 7), if $f \in C^2[a, b]$, then
$$f'(x) = \frac{f(x + h) - f(x)}{h} - \frac{1}{h} \int_{x}^{x+h} f''(t)(x + h - t)\,dt$$
and
$$\left| \frac{1}{h} \int_{x}^{x+h} f''(t)(x + h - t)\,dt \right| \le ch.$$
Thus, $f'(x) \approx (f(x + h) - f(x))/h$, and the error is $O(h)$. We will call this the method error or truncation error, as opposed to roundoff errors due to using machine approximations.
Now consider $f(x) = \ln x$ and approximate $f'(3) \approx \frac{\ln(3 + h) - \ln 3}{h}$ for $h$ small, using a calculator having 11 digits. The following results were obtained.
h             (ln(3+h) - ln(3))/h    Error = 1/3 - (ln(3+h) - ln(3))/h = O(h)
$10^{-1}$     0.3278982              $5.44 \times 10^{-3}$
$10^{-2}$     0.332779               $5.54 \times 10^{-4}$
$10^{-3}$     0.3332778              $5.55 \times 10^{-5}$
$10^{-4}$     0.333328               $5.33 \times 10^{-6}$
$10^{-5}$     0.333330               $3.33 \times 10^{-6}$
$10^{-6}$     0.333300               $3.33 \times 10^{-5}$
$10^{-7}$     0.333                  $3.33 \times 10^{-4}$
$10^{-8}$     0.33                   $3.33 \times 10^{-3}$
$10^{-9}$     0.3                    $3.33 \times 10^{-2}$
$10^{-10}$    0.0                    $3.33 \times 10^{-1}$
One sees that, in the first four steps, the error decreases by a factor of 10 as $h$ is decreased by a factor of 10. (That is, the method error dominates.) However, starting with $h = 0.00001$, the error no longer decreases at this rate, and it increases for smaller $h$. (The error due to a finite number of digits, i.e., roundoff error, dominates.)
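The same experiment can be repeated in matlab. The sketch below uses IEEE double precision, which carries roughly 16 decimal digits rather than the 11 of the calculator in the text, so the individual numbers and the step size at which roundoff takes over are expected to differ from the table; only the qualitative behavior should match. Variable names are ours.

```matlab
% Forward-difference approximation to f'(3) for f(x) = ln(x), in IEEE
% double precision (about 16 decimal digits, not the 11 of the text).
f = @(x) log(x);
exact = 1 / 3;

h = 10 .^ (-(1:14));                   % h = 1e-1, 1e-2, ..., 1e-14
approx = (f(3 + h) - f(3)) ./ h;
err = abs(exact - approx);

for j = 1:numel(h)
    fprintf('h = %.0e   approx = %.10f   error = %.2e\n', h(j), approx(j), err(j));
end
```

For the larger values of $h$ the error shrinks by about a factor of 10 per row, as the $O(h)$ method error predicts; for the smallest values of $h$ it grows again because of cancellation in `f(3 + h) - f(3)`.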
There are two possible ways to reduce rounding error:
1. The method error can be reduced by using a more accurate method. This allows larger $h$ to be used, thus avoiding roundoff error. Consider
$$f'(x) = \frac{f(x + h) - f(x - h)}{2h} + \text{error}, \quad \text{where error is } O(h^2).$$
h        (ln(3+h) - ln(3-h))/(2h)    error
0.1      0.3334568                   $1.24 \times 10^{-4}$
0.01     0.3333345                   $1.23 \times 10^{-6}$
0.001    0.3333333                   $1.91 \times 10^{-8}$
The error decreases by a factor of 100 as $h$ is decreased by a factor of 10.
2. Rounding error can be reduced by using more digits of accuracy, such as using double precision (or multiple precision) arithmetic.
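The factor-of-100 behavior of the central difference in item 1 can be observed directly. The following sketch again uses double precision, so the computed errors need not match the calculator values in the table digit for digit; the names `err` and `ratios` are ours.

```matlab
% Central-difference approximation to f'(3) for f(x) = ln(x).
f = @(x) log(x);
exact = 1 / 3;

h = [0.1, 0.01, 0.001];
approx = (f(3 + h) - f(3 - h)) ./ (2 * h);
err = abs(exact - approx);

% For an O(h^2) method, reducing h by 10 should reduce the error by about 100.
ratios = err(1:end-1) ./ err(2:end);

for j = 1:numel(h)
    fprintf('h = %g   approx = %.10f   error = %.2e\n', h(j), approx(j), err(j));
end
fprintf('successive error ratios: %s\n', mat2str(ratios, 5));
```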
To fully understand and avoid roundoff error, we should study some details of how computers and calculators represent and work with approximate numbers.
1.2.1 Floating Point Arithmetic and Rounding Error
Let $\beta$ = a positive integer, the base of the computer system. (Usually, $\beta = 2$ (binary) or $\beta = 16$ (hexadecimal).) Suppose a number $x$ has the exact base representation
$$x = (\pm 0.\alpha_1 \alpha_2 \alpha_3 \cdots \alpha_t \alpha_{t+1} \cdots) \beta^m = \pm q \beta^m,$$
where $q$ is the mantissa, $\beta$ is the base, $m$ is the exponent, $1 \le \alpha_1 \le \beta - 1$ and $0 \le \alpha_i \le \beta - 1$ for $i > 1$.
On a computer, we are restricted to a finite set of floating-point numbers $F = F(\beta, t, L, U)$ of the form $x^* = (\pm 0.a_1 a_2 \cdots a_t) \beta^m$, where $1 \le a_1 \le \beta - 1$, $0 \le a_i \le \beta - 1$ for $2 \le i \le t$, $L \le m \le U$, and $t$ is the number of digits. (In most floating point systems, $L$ is about $-64$ to $-1000$ and $U$ is about 64 to 1000.)
Example 1.10
(binary) $\beta = 2$
$$x^* = (0.1011) 2^3 = \left( 1 \times \frac{1}{2} + 0 \times \frac{1}{4} + 1 \times \frac{1}{8} + 1 \times \frac{1}{16} \right) \times 8 = \frac{11}{2} = 5.5 \text{ (decimal)}.$$
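The conversion in Example 1.10 amounts to a weighted sum of the mantissa digits followed by scaling by the base raised to the exponent. A minimal matlab sketch, with variable names of our own choosing:

```matlab
% Value of the binary floating point number (0.1011)_2 x 2^3 of Example 1.10.
base = 2;                             % beta
a = [1 0 1 1];                        % mantissa digits a_1 a_2 a_3 a_4
m = 3;                                % exponent

% Digit a_i carries weight base^(-i).
mantissa = sum(a .* base .^ (-(1:numel(a))));
xstar = mantissa * base ^ m;

fprintf('mantissa = %g, x* = %g\n', mantissa, xstar);
```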
REMARK 1.1 Most numbers cannot be exactly represented on a computer. Consider $x = 10.1 = 1010.0001\ 1001\ 1001 \ldots$ ($\beta = 2$). If $L = -127$, $U = 127$, $t = 24$, and $\beta = 2$, then $x \approx x^* = (0.10100001\ 1001\ 1001\ 1001\ 1001) 2^4$.
Question: Given a real number $x$, how do we define a floating point number $fl(x)$ in $F$, such that $fl(x)$ is close to $x$?
On modern machines, one of the following four ways is used to approximate a real number $x$ by a machine-representable number $fl(x)$.
round down: $fl(x) = x\downarrow$, the nearest machine representable number to the real number $x$ that is less than or equal to $x$.
round up: $fl(x) = x\uparrow$, the nearest machine number to the real number $x$ that is greater than or equal to $x$.
round to nearest: $fl(x)$ is the nearest machine number to the real number $x$.
round to zero, or "chopping": $fl(x)$ is the nearest machine number to the real number $x$ that is closer to 0 than $x$. The term "chopping" is because we simply "chop" the expansion of the real number, that is, we simply ignore the digits in the expansion of $x$ beyond the $t$-th one.
The default on modern systems is usually round to nearest, although chopping is faster or requires less circuitry. Round down and round up may be used, but with care, to produce results from a string of computations that are guaranteed to be less than or greater than the exact result.
Example 1.11
$\beta = 10$, $t = 5$, $x = 0.12345666 \cdots \times 10^7$. Then
$$fl(x) = 0.12345 \times 10^7 \text{ (chopping)},$$
$$fl(x) = 0.12346 \times 10^7 \text{ (rounded to nearest)}.$$
(In this case, round down corresponds to chopping and round up corresponds to round to nearest.)
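The two roundings in Example 1.11 can be emulated in matlab by scaling so that the $t$-th digit sits in the units place and then applying `fix` (truncate toward zero) or `round`. This is only a sketch for nonzero $x$, it truncates the repeating expansion of $x$ to the digits shown, and the variable names are ours.

```matlab
% Emulate t-digit decimal chopping and round-to-nearest for Example 1.11.
% (Sketch for nonzero x only.)
t = 5;
x = 0.12345666e7;                     % expansion truncated to the digits shown

e = floor(log10(abs(x))) + 1;         % exponent m with x = 0.d1d2... x 10^m
scale = 10 ^ (e - t);                 % weight of the t-th digit

fl_chop  = fix(x / scale) * scale;    % discard digits beyond the t-th
fl_round = round(x / scale) * scale;  % nearest t-digit number

fprintf('chopping: %d, round to nearest: %d\n', fl_chop, fl_round);
```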
See Figure 1.2 for an example with $\beta = 10$ and $t = 1$. In that figure, the exhibited floating point numbers are $(0.1) \times 10^1, (0.2) \times 10^1, \ldots, (0.9) \times 10^1, 0.1 \times 10^2$.
[Figure 1.2: number line from $\beta^{m-1} = 1$ to $\beta^m = 10^1$, with successive floating point numbers spaced $\beta^{m-t} = 10^0 = 1$ apart.]
FIGURE 1.2: An example floating point system: $\beta = 10$, $t = 1$, and $m = 1$.
Example 1.12
Let $a = 0.410$, $b = 0.000135$, and $c = 0.000431$. Assuming 3-digit decimal computer arithmetic with rounding to nearest, does $a + (b + c) = (a + b) + c$ when using this arithmetic?
Following the "rounding to nearest" definition of $fl$, we emulate the operations a machine would do, as follows:
$$a \leftarrow 0.410 \times 10^0, \quad b \leftarrow 0.135 \times 10^{-3}, \quad c \leftarrow 0.431 \times 10^{-3},$$
and
$$\begin{aligned}
fl(b + c) &= fl(0.135 \times 10^{-3} + 0.431 \times 10^{-3}) \\
&= fl(0.566 \times 10^{-3}) \\
&= 0.566 \times 10^{-3},
\end{aligned}$$
so
$$\begin{aligned}
fl(a + 0.566 \times 10^{-3}) &= fl(0.410 \times 10^0 + 0.566 \times 10^{-3}) \\
&= fl(0.410 \times 10^0 + 0.000566 \times 10^0) \\
&= fl(0.410566 \times 10^0) \\
&= 0.411 \times 10^0.
\end{aligned}$$
On the other hand,
$$\begin{aligned}
fl(a + b) &= fl(0.410 \times 10^0 + 0.135 \times 10^{-3}) \\
&= fl(0.410000 \times 10^0 + 0.000135 \times 10^0) \\
&= fl(0.410135 \times 10^0) \\
&= 0.410 \times 10^0,
\end{aligned}$$
so
$$\begin{aligned}
fl(0.410 \times 10^0 + c) &= fl(0.410 \times 10^0 + 0.431 \times 10^{-3}) \\
&= fl(0.410 \times 10^0 + 0.000431 \times 10^0) \\
&= fl(0.410431 \times 10^0) \\
&= 0.410 \times 10^0 \ne 0.411 \times 10^0.
\end{aligned}$$
Thus, the associative law does not hold for floating point arithmetic with "round to nearest." Furthermore, this illustrates that accuracy is improved if numbers of like magnitude are added first in a sum.
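The hand computation in Example 1.12 can be mimicked in matlab by rounding every intermediate result to $t$ significant decimal digits. The helper `fl` below is a hypothetical emulation written for this illustration (it is not a built-in function, it is valid only for nonzero arguments, and it runs on top of binary double precision), so it is a sketch of the idea rather than a faithful decimal machine.

```matlab
% Emulate t-digit decimal round-to-nearest arithmetic (nonzero arguments only).
expo = @(x) floor(log10(abs(x))) + 1;    % exponent m with x = 0.d1d2... x 10^m
fl = @(x, t) round(x .* 10.^max(t - expo(x), 0) ./ 10.^max(expo(x) - t, 0)) ...
             .* 10.^max(expo(x) - t, 0) ./ 10.^max(t - expo(x), 0);

t = 3;
a = 0.410;  b = 0.000135;  c = 0.000431;

left  = fl(a + fl(b + c, t), t);         % a + (b + c)
right = fl(fl(a + b, t) + c, t);         % (a + b) + c

fprintf('a + (b + c) = %.3f,  (a + b) + c = %.3f\n', left, right);
```

Adding the two small numbers first lets their combined contribution survive the rounding, which is the point made at the end of the example.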
The following error bound is useful in some analyses.
THEOREM 1.5
$$|x - fl(x)| \le \frac{1}{2} |x| \beta^{1-t} p,$$
where $p = 1$ for rounding and $p = 2$ for chopping.
DEFINITION 1.3 $\delta = \frac{p}{2} \beta^{1-t}$ is called the unit roundoff error.
Let $\epsilon = \frac{fl(x) - x}{x}$. Then $fl(x) = (1 + \epsilon) x$, where $|\epsilon| \le \delta$. With this, we have the following.
THEOREM 1.6
Let $\odot$ denote the operation $+$, $-$, $\times$, or $\div$, and let $x$ and $y$ be machine numbers. Then
$$fl(x \odot y) = (x \odot y)(1 + \epsilon), \quad \text{where } |\epsilon| \le \delta = \frac{p}{2} \beta^{1-t}.$$
Roundoff error that accumulates as a result of a sequence of arithmetic operations can be analyzed using this theorem. Such an analysis is called forward error analysis.
Because of the properties of floating point arithmetic it is unreasonable todemand strict tolerances when the exact result is too large
Example 113
Suppose β = 10 and t = 3 (3digit decimal arithmetic) and suppose we wishto compute 104π with a computed value x such that 104π minus x lt 10minus2 Theclosest floating point number in our system to 104π is x = 0314times105 = 31400However 104π minus x = 15926 Hence it is impossible to find a number xin the system with
104π
minusx
lt 10minus2
The error 104π minus x in this example is called the absolute error in approximating 104π We see that absolute error is not an appropriate measure of error when using floating point arithmetic For this reason we use relative error
DEFINITION 1.4 Let $x^*$ be an approximation to $x$. Then $|x - x^*|$ is called the absolute error, and $\left|\dfrac{x - x^*}{x}\right|$ is called the relative error.
For example,
$$\left|\frac{x - \mathrm{fl}(x)}{x}\right| \le \delta = \frac{p}{2}\beta^{1-t} \quad \text{(unit roundoff error)}.$$
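The contrast between the two error measures in Example 1.13 can be checked numerically. The following is a minimal Python sketch, assuming the `decimal` module with a 3-digit context as a stand-in for the hypothetical $\beta = 10$, $t = 3$ system; the double precision value of $10^4\pi$ plays the role of the exact quantity.

```python
import math
from decimal import Decimal, Context, ROUND_HALF_EVEN

# Assumed model of the 3-digit decimal system (beta = 10, t = 3), round to nearest.
ctx = Context(prec=3, rounding=ROUND_HALF_EVEN)

exact = Decimal(10**4 * math.pi)   # double precision stand-in for 10^4 * pi
x = ctx.plus(exact)                # nearest 3-digit number, 0.314 x 10^5

abs_err = abs(exact - x)
rel_err = abs_err / abs(exact)
delta = Decimal("0.5") * Decimal(10) ** (1 - 3)   # (p/2) * beta^(1-t) with p = 1

print("fl(x) =", x)
print("absolute error =", abs_err)
print("relative error =", rel_err, " unit roundoff =", delta)
```

The absolute error is far above the requested $10^{-2}$, while the relative error stays below the unit roundoff, as Definition 1.4 and the bound that follows it predict.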
1.2.1.1 Expression Evaluation and Condition Numbers
We now examine some common situations in which roundoff error can become large, and explain how to avoid many of these situations.
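Example 1.14
$\beta = 10$, $t = 4$, $p = 1$. (Thus, $\delta = \frac{1}{2} 10^{-3} = 0.0005$.) Let $x = 0.5795 \times 10^5$, $y = 0.6399 \times 10^5$. Then
$$\mathrm{fl}(x + y) = 0.1219 \times 10^6 = (x + y)(1 + \epsilon_1), \quad \epsilon_1 \approx -3.28 \times 10^{-4}, \quad |\epsilon_1| < \delta,$$
and
$$\mathrm{fl}(xy) = 0.3708 \times 10^{10} = (xy)(1 + \epsilon_2), \quad \epsilon_2 \approx -5.95 \times 10^{-5}, \quad |\epsilon_2| < \delta.$$
(Note: $x + y = 0.12194 \times 10^6$, $xy = 0.37082205 \times 10^{10}$.)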
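The numbers in Example 1.14 can be reproduced with a short Python sketch. It assumes that a 4-digit `decimal` context models the $\beta = 10$, $t = 4$, round-to-nearest system; the helper name `relative_perturbation` is introduced here only for illustration.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

# Assumed model of the system in Example 1.14: beta = 10, t = 4, p = 1.
ctx = Context(prec=4, rounding=ROUND_HALF_EVEN)
delta = Decimal("0.0005")          # unit roundoff (1/2) * 10^(1-4)

x, y = Decimal("57950"), Decimal("63990")

def relative_perturbation(computed, exact):
    """Return epsilon such that computed = exact * (1 + epsilon)."""
    return (computed - exact) / exact

fl_sum = ctx.add(x, y)             # rounded to 4 significant digits
fl_prod = ctx.multiply(x, y)
eps1 = relative_perturbation(fl_sum, x + y)    # x + y is exact at 28 digits
eps2 = relative_perturbation(fl_prod, x * y)

print(fl_sum, eps1)
print(fl_prod, eps2)
```

Both perturbations are smaller in magnitude than $\delta$, in agreement with Theorem 1.6.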
Example 1.15
Suppose $\beta = 10$ and $t = 4$ (4-digit arithmetic), and suppose $x_1 = 10000$ and $x_2 = x_3 = \cdots = x_{1001} = 1$. Then
$$\mathrm{fl}(x_1 + x_2) = 10000,$$
$$\mathrm{fl}(x_1 + x_2 + x_3) = 10000,$$
$$\vdots$$
$$\mathrm{fl}\left(\sum_{i=1}^{1001} x_i\right) = 10000$$
when we sum forward from $x_1$. But, going backwards,
$$\mathrm{fl}(x_{1001} + x_{1000}) = 2,$$
$$\mathrm{fl}(x_{1001} + x_{1000} + x_{999}) = 3,$$
$$\vdots$$
$$\mathrm{fl}\left(\sum_{i=1001}^{1} x_i\right) = 11000,$$
which is the correct sum.
This example illustrates the point that large relative errors occur when a large number of small numbers is added to a large number, or when a very large number of small, almost-equal numbers is added. To avoid such large relative errors, one can sum from the smallest number to the largest number. However, this will not work if the numbers are all approximately equal. In such cases, one possibility is to group the numbers into sets of two adjacent numbers, summing two almost equal numbers together. One then groups those results into sets of two and sums these together, continuing until the total sum is reached. In this scheme, two almost equal numbers are always being summed, and the large relative error from repeatedly adding a small number to a large number is avoided.
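The effect of summation order, and of the grouping-by-twos scheme just described, can be sketched in Python. This is an illustration under assumptions: a 4-digit `decimal` context stands in for the system of Example 1.15, the pairwise scheme is written as recursive halving, and the list of 20000 ones is a made-up test case for almost-equal summands.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

ctx = Context(prec=4, rounding=ROUND_HALF_EVEN)   # beta = 10, t = 4

def sum_in_order(values):
    """Accumulate left to right, rounding after every addition."""
    total = Decimal(0)
    for v in values:
        total = ctx.add(total, v)
    return total

def pairwise_sum(values):
    """Sum a non-empty list by splitting it in half recursively."""
    n = len(values)
    if n == 1:
        return values[0]
    half = n // 2
    return ctx.add(pairwise_sum(values[:half]), pairwise_sum(values[half:]))

data = [Decimal(10000)] + [Decimal(1)] * 1000     # x_1, ..., x_1001
forward = sum_in_order(data)                      # large number first
backward = sum_in_order(reversed(data))           # small numbers first

ones = [Decimal(1)] * 20000                       # hypothetical almost-equal data
stalled = sum_in_order(ones)                      # stops growing at 10000
grouped = pairwise_sum(ones)

print(forward, backward, stalled, grouped)
```

With equal summands, sorting cannot help, but the pairwise grouping only ever adds numbers of similar size, so every partial sum in this test case is exactly representable.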
Example 1.16
$x_1 = 15.314768$, $x_2 = 15.314899$, $\beta = 10$, $t = 6$ (6-digit decimal accuracy). Then $x_2 - x_1 \approx \mathrm{fl}(x_2) - \mathrm{fl}(x_1) = 15.3149 - 15.3148 = 0.0001$. Thus,
$$\left|\frac{x_2 - x_1 - (\mathrm{fl}(x_2) - \mathrm{fl}(x_1))}{x_2 - x_1}\right| = \frac{0.000131 - 0.0001}{0.000131} = 0.237,$$
that is, a relative error of 23.7%.
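A few lines of Python confirm the figures in Example 1.16. The sketch assumes a 6-digit `decimal` context as the model of the $t = 6$ system.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

ctx = Context(prec=6, rounding=ROUND_HALF_EVEN)   # beta = 10, t = 6

x1, x2 = Decimal("15.314768"), Decimal("15.314899")
fl_x1, fl_x2 = ctx.plus(x1), ctx.plus(x2)         # round the inputs to 6 digits

computed = ctx.subtract(fl_x2, fl_x1)             # difference of rounded inputs
exact = x2 - x1                                   # exact difference
rel_err = abs(exact - computed) / abs(exact)

print(fl_x1, fl_x2, computed, rel_err)
```

The subtraction itself is exact here; the damage comes from the rounding of the inputs, whose small relative errors are magnified when the leading digits cancel.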
This example illustrates that large relative errors can occur when two nearly equal numbers are subtracted on a computer. Sometimes, an algorithm can be modified to reduce rounding error occurring from this source, as the following example illustrates.
Example 1.17
Consider finding the roots of $ax^2 + bx + c = 0$, where $b^2$ is large compared with $|4ac|$. The most common formula for the roots is
$$x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.$$
Consider $x^2 + 100x + 1 = 0$, $\beta = 10$, $t = 4$, $p = 2$, and 4-digit chopped arithmetic. Then
$$x_1 = \frac{-100 + \sqrt{9996}}{2}, \qquad x_2 = \frac{-100 - \sqrt{9996}}{2},$$
but $\sqrt{9996} \approx 99.97$ (4-digit arithmetic, chopped). Thus,
$$x_1 \approx \frac{-100 + 99.97}{2}, \qquad x_2 \approx \frac{-100 - 99.97}{2}.$$
Hence, $x_1 \approx -0.015$, $x_2 \approx -99.98$, but $x_1 = -0.010001\ldots$ and $x_2 = -99.989999\ldots$, so the relative errors in $x_1$ and $x_2$ are 50% and 0.01%, respectively.
Let's change the algorithm. Assume $b \ge 0$ (we can always make $b \ge 0$). Then
$$x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \cdot \frac{-b - \sqrt{b^2 - 4ac}}{-b - \sqrt{b^2 - 4ac}} = \frac{4ac}{2a\left(-b - \sqrt{b^2 - 4ac}\right)} = \frac{-2c}{b + \sqrt{b^2 - 4ac}},$$
and
$$x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a} \quad \text{(the same as before)}.$$
Then, for the above values,
$$x_1 = \frac{-2(1)}{100 + \sqrt{9996}} \approx \frac{-2}{100 + 99.97} = -0.0100.$$
Now, the relative error in $x_1$ is also 0.01%.
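The two formulas for $x_1$ in Example 1.17 can be compared in a small Python sketch. It assumes a 4-digit `decimal` context with `ROUND_DOWN` as the model of chopped arithmetic; `chopped_sqrt` is a helper introduced here, needed because `Decimal.sqrt` does not follow the context's rounding mode.

```python
from decimal import Decimal, Context, ROUND_DOWN

chop = Context(prec=4, rounding=ROUND_DOWN)   # beta = 10, t = 4, chopping

def chopped_sqrt(v):
    # Decimal.sqrt always rounds half-even, so take the root at the default
    # 28-digit precision first and then chop the result to 4 digits.
    return chop.plus(v.sqrt())

a, b, c = Decimal(1), Decimal(100), Decimal(1)

disc = chop.subtract(chop.multiply(b, b), chop.multiply(4 * a, c))   # 9996
root = chopped_sqrt(disc)                                            # 99.97

# Standard formula: -b and the root nearly cancel.
x1_naive = chop.divide(chop.add(-b, root), 2 * a)
# Rearranged formula: b and the root have the same sign, so no cancellation.
x1_stable = chop.divide(-2 * c, chop.add(b, root))

print(root, x1_naive, x1_stable)
```

Only $x_1$ is computed, since that is the root affected by cancellation when $b \ge 0$.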
Let us now consider error in function evaluation. Consider a single-valued function $f(x)$, and let $x^* = \mathrm{fl}(x)$ be the floating point approximation of $x$. The machine therefore evaluates $f(x^*) = f(\mathrm{fl}(x))$, which is an approximation to $f(x)$. The perturbation in $f(x)$ for small perturbations in $x$ can then be computed via Taylor's formula. This is illustrated in the next theorem.
THEOREM 1.7 The relative error in functional evaluation is
$$\left|\frac{f(x) - f(x^*)}{f(x)}\right| \approx \left|\frac{x f'(x)}{f(x)}\right| \left|\frac{x - x^*}{x}\right|.$$
PROOF The linear Taylor approximation of $f(x^*)$ about $x$, for small values of $|x - x^*|$, is given by $f(x^*) \approx f(x) + f'(x)(x^* - x)$. Rearranging the terms and dividing by $f(x)$ immediately yields the result.
This leads us to the following definition.
DEFINITION 1.5 The condition number of a function $f(x)$ is
$$\kappa_f(x) = \left|\frac{x f'(x)}{f(x)}\right|.$$
The condition number describes how large the relative error in function evaluation is with respect to the relative error in the machine representation of $x$. In other words, $\kappa_f(x)$ is a measure of the degree of sensitivity of the function at $x$.
Example 1.18
Let $f(x) = \sqrt{x}$. The condition number of $f(x)$ about $x$ is
$$\kappa_f(x) = \left|\frac{x \cdot \frac{1}{2\sqrt{x}}}{\sqrt{x}}\right| = \frac{1}{2}.$$
This suggests that $f(x)$ is well-conditioned.
Example 1.19
Let $f(x) = \sqrt{x - 2}$. The condition number of $f(x)$ about $x$ is
$$\kappa_f(x) = \left|\frac{x}{2(x - 2)}\right|.$$
This is not defined at $x^* = 2$. Hence, the function $f(x)$ is numerically unstable and ill-conditioned for values of $x$ close to 2.
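Examples 1.18 and 1.19 can be evaluated directly from Definition 1.5. The Python sketch below is illustrative; the function name `condition_number` and the sample points 4.0 and 2.001 are choices made here, and the derivatives are supplied by hand.

```python
import math

def condition_number(f, fprime, x):
    """kappa_f(x) = |x f'(x) / f(x)| from Definition 1.5."""
    return abs(x * fprime(x) / f(x))

# Example 1.18: f(x) = sqrt(x) has condition number 1/2 for every x > 0.
kappa_sqrt = condition_number(math.sqrt, lambda x: 0.5 / math.sqrt(x), 4.0)

# Example 1.19: f(x) = sqrt(x - 2) becomes badly conditioned as x approaches 2.
g = lambda x: math.sqrt(x - 2.0)
gprime = lambda x: 0.5 / math.sqrt(x - 2.0)
kappa_near_2 = condition_number(g, gprime, 2.001)

print(kappa_sqrt, kappa_near_2)
```

At $x = 2.001$ the condition number is about 1000, so a relative perturbation in $x$ is magnified roughly a thousandfold in $f(x)$.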
REMARK 1.2 If $x = f(x) = 0$, then the condition number is simply $|f'(x)|$. If $x = 0$, $f(x) \ne 0$ (or $f(x) = 0$, $x \ne 0$), then it is more useful to consider the relation between absolute errors than relative errors. The condition number then becomes $|f'(x)/f(x)|$.
REMARK 1.3 Generally, if a numerical approximation $\tilde{z}$ to a quantity $z$ is computed, the relative error is related to the number of significant digits that are correct. For example, if $z = 0.0000123453$ and $\tilde{z} = 0.00001234543$, we say that $\tilde{z}$ is correct to 5 significant digits. Expressing $z$ as $0.123453 \times 10^{-4}$ and $\tilde{z}$ as $0.123454 \times 10^{-4}$, we see that if we round $\tilde{z}$ to the nearest number with five digits in its mantissa, all of those digits are correct, whereas, if we do the same with six digits, the sixth digit is not correct. Significant digits is the more logical way to talk about accuracy in a floating point computation, where we are interested in relative error, rather than "number of digits after the decimal point," which can have a different meaning. (Here, one might say that $\tilde{z}$ is correct to 9 digits after the decimal point.)
1.2.2 Practicalities and the IEEE Floating Point Standard
Prior to 1985, different machines used different word lengths and different bases, and different machines rounded, chopped, or did something else to form the internal representation $\mathrm{fl}(x)$ for real numbers $x$. For example, IBM mainframes generally used hexadecimal arithmetic ($\beta = 16$), with 8 hexadecimal digits total (for the base, sign, and exponent) in "single precision" numbers and 16 hexadecimal digits total in "double precision" numbers. Machines such as the Univac 1108 and Honeywell Multics systems used base $\beta = 2$ and 36 binary digits (or "bits") total in single precision numbers and 72 binary digits total in double precision numbers. An unusual machine designed at Moscow State University from 1955-1965, the "Setun," even used base-3 ($\beta = 3$, or "ternary") numbers. Some computers had 32 bits total in single precision numbers and 64 bits total in double precision numbers, while some "supercomputers" (such as the Cray-1) had 64 bits total in single precision numbers and 128 bits total in double precision numbers.
Some hand-held calculators in existence today (such as some Texas Instruments calculators) can be viewed as implementing decimal (base 10, $\beta = 10$)
arithmetic, say, with $L = -999$ and $U = 999$, and $t = 14$ digits in the mantissa.
Except for the Setun (the value of whose ternary digits corresponded to "positive," "negative," and "neutral" in circuit elements or switches), digital computers are mostly based on binary switches or circuit elements (that is, "on" or "off"), so the base $\beta$ is usually 2 or a power of 2. For example, the IBM hexadecimal digit could be viewed as a group of 4 binary digits.²
Older floating point implementations did not even always fit exactly into the model we have previously described. For example, if $x$ is a number in the system, then $-x$ may not have been a number in the system, or, if $x$ were a number in the system, then $1/x$ may have been too large to be representable in the system.
To promote predictability, portability, reliability, and rigorous error bounding in floating point computations, the Institute of Electrical and Electronics Engineers (IEEE) and American National Standards Institute (ANSI) published a standard for binary floating point arithmetic in 1985: IEEE/ANSI 754-1985: Standard for Binary Floating Point Arithmetic, often referenced as "IEEE-754," or simply "the IEEE standard."³ Almost all computers in existence today, including personal computers and workstations based on Intel, AMD, Motorola, etc. chips, implement most of the IEEE standard.
In this standard, $\beta = 2$, 32 bits total are used in a single precision number (an "IEEE single"), and 64 bits total are used for a double precision number ("IEEE double"). In a single precision number, 1 bit is used for the sign, 8 bits are used for the exponent, and 23 bits are stored for the mantissa (counting the implicit leading 1, this gives the $t = 24$ listed in Table 1.1). In double precision numbers, 1 bit is used for the sign, 11 bits are used for the exponent, and 52 bits are stored for the mantissa ($t = 53$). Thus, for single precision numbers, the stored exponent is between 0 and $(11111111)_2 = 255$, and 127 is subtracted from this, to get an exponent between $-127$ and $128$. In IEEE numbers, the minimum and maximum exponent are used to denote special symbols (such as infinity and "unnormalized" numbers), so the exponent in single precision represents magnitudes between $2^{-126} \approx 10^{-38}$ and $2^{127} \approx 10^{38}$. The mantissa for single precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{23} 2^{-i} = 2(1 - 2^{-24}) \approx 2$. Similarly, the exponent for double precision numbers is effectively between $2^{-1022} \approx 10^{-308}$ and $2^{1023} \approx 10^{308}$, while the mantissa for double precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{52} 2^{-i} \approx 2$.
Summarizing, the parameters for IEEE arithmetic appear in Table 1.1.

TABLE 1.1 Parameters for IEEE arithmetic
precision   β   L       U      t
single      2   −126    127    24
double      2   −1022   1023   53

In many numerical computations, such as solving the large linear systems arising from partial differential equation models, more digits or a larger exponent range is required than is available with IEEE single precision. For this reason, many numerical analysts at present have adopted IEEE double precision as the default precision. For example, underlying computations in the popular computational environment matlab are done in IEEE double precision.

² An exception is in some systems for business calculations, where base 10 is implemented.
³ An update to the 1985 standard was made in 2008. This update gives clarifications of certain ambiguous points, provides certain extensions, and specifies a standard for decimal arithmetic.
IEEE arithmetic provides four ways of defining $\mathrm{fl}(x)$, that is, four "rounding modes," namely, "round down," "round up," "round to nearest," and "round to zero." The four elementary operations $+$, $-$, $\times$, and $\div$ must be such that $\mathrm{fl}(x \odot y)$ is implemented for all four rounding modes, for $\odot \in \{-, +, \times, \div, \sqrt{\cdot}\}$.
The default mode (if the rounding mode is not explicitly set) is normally "round to nearest," to give an approximation after a long string of computations that is hopefully near the exact value. If the mode is set to "round down" and a string of computations is done, then the result is less than or equal to the exact result. Similarly, if the mode is set to "round up," then the result of a string of computations is greater than or equal to the exact result. In this way, mathematically rigorous bounds on an exact result can be obtained. (This technique must be used astutely, since naive use could result in bounds that are too large to be meaningful.)
Several parameters more directly related to numerical computations than $L$, $U$, and $t$ are associated with any floating point number system. These are

HUGE: the largest representable number in the floating point system

TINY: the smallest positive representable number in the floating point system

$\epsilon_m$: the machine epsilon, the smallest positive number which, when added to 1, gives something other than 1 when using the rounding mode "round to the nearest"

These so-called "machine constants" appear in Table 1.2 for the IEEE single and IEEE double precision number systems.
For IEEE arithmetic, $1/\mathrm{TINY} < \mathrm{HUGE}$, but $1/\mathrm{HUGE} < \mathrm{TINY}$. This brings up the question of what happens when the result of a computation has absolute value less than the smallest number representable in the system, or has absolute value greater than the largest number representable in the system. In the first case, an underflow occurs, while, in the second case, an overflow occurs. In floating point computations, it is usually (but not always) reasonable to replace the result of an underflow by 0, but it is usually more problematical when an overflow occurs. Many systems prior to the IEEE standard replaced an underflow by 0 but stopped when an overflow occurred.

TABLE 1.2 Machine constants for IEEE arithmetic
Precision   HUGE                                  TINY                                    $\epsilon_m$
single      ≈ 2^{128} ≈ 3.40 · 10^{38}            2^{−126} ≈ 1.18 · 10^{−38}              2^{−24} + 2^{−47} ≈ 5.96 · 10^{−8}
double      ≈ 2^{1024} ≈ 1.80 · 10^{308}          2^{−1022} ≈ 2.23 · 10^{−308}            2^{−53} + 2^{−105} ≈ 1.11 · 10^{−16}
The IEEE standard specifies representations for special numbers $\infty$, $-\infty$, $+0$, $-0$, and NaN, where the latter represents "not a number." The standard specifies that computations do not stop when an overflow or underflow occurs, or when quantities such as $\sqrt{-1}$, $1/0$, $-1/0$, etc. are encountered (although many programming languages by default or optionally do stop). For example, the result of an overflow is set to $\infty$, whereas the result of $\sqrt{-1}$ is set to NaN, and computation continues. The standard also specifies "gradual underflow," that is, setting the result to a "denormalized" number, or a number in the floating point format whose first digit in the mantissa is equal to 0. Computation rules for these special numbers, such as NaN $\times$ any number = NaN, $\infty \times$ any positive normalized number = $\infty$, allow such "nonstop" arithmetic.
Although the IEEE nonstop arithmetic is useful in many contexts, the numerical analyst should be aware of it and be cautious in interpreting results. In particular, algorithms may not behave as expected if many intermediate results contain $\infty$ or NaN, and the accuracy is less than expected when denormalized numbers are used. In fact, many programming languages, by default or with a controllable option, stop if $\infty$ or NaN occurs, but implement IEEE nonstop arithmetic with an option.
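Both behaviors described above can be observed in Python, used here only as a convenient example of a language built on IEEE double precision. The sketch shows nonstop arithmetic for multiplication and subtraction, and also that Python's `math.sqrt` is one of the routines that stops (raises an exception) rather than returning NaN.

```python
import math

inf = float("inf")
nan = float("nan")

overflowed = 1e308 * 10.0      # overflow is replaced by inf, computation continues
still_inf = inf * 2.0          # inf times a positive normalized number is inf
propagated = nan * 3.0         # NaN times any number is NaN
indeterminate = inf - inf      # no meaningful value, so the result is NaN

# Python stops by default for sqrt(-1) instead of returning NaN.
try:
    math.sqrt(-1.0)
    stopped = False
except ValueError:
    stopped = True

print(overflowed, still_inf, propagated, indeterminate, stopped)
```

Since NaN compares unequal to everything, including itself, tests such as `math.isnan` are needed to detect it in results.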
Example 1.20
IEEE double precision floating point arithmetic underlies most computations in matlab. (This is true even if only four decimal digits are displayed.) One obtains the machine epsilon with the function eps, one obtains TINY with the function realmin, and one obtains HUGE with the function realmax. Observe the following matlab dialog:
>> epsm = eps(1d0)
epsm = 2.2204e-016
>> TINY = realmin
TINY = 2.2251e-308
>> HUGE = realmax
HUGE = 1.7977e+308
>> 1/TINY
ans = 4.4942e+307
>> 1/HUGE
ans = 5.5627e-309
>> HUGE^2
ans = Inf
>> TINY^2
ans = 0
>> new_val = 1+epsm
new_val = 1.0000
>> new_val - 1
ans = 2.2204e-016
>> too_small = epsm/2
too_small = 1.1102e-016
>> not_new = 1+too_small
not_new = 1
>> not_new - 1
ans = 0
>>
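For readers without matlab, the same experiment can be sketched in Python, whose `float` type is assumed here to be an IEEE double (true on essentially all current platforms). The attributes of `sys.float_info` play the roles of `eps`, `realmin`, and `realmax`.

```python
import sys

epsm = sys.float_info.epsilon   # 2**-52, the spacing between 1 and the next double
TINY = sys.float_info.min       # smallest positive normalized double
HUGE = sys.float_info.max       # largest finite double

inv_tiny = 1.0 / TINY           # still representable
inv_huge = 1.0 / HUGE           # below TINY: a denormalized number, not 0
huge_sq = HUGE * HUGE           # overflows to inf
tiny_sq = TINY * TINY           # underflows to 0

new_val = 1.0 + epsm            # distinguishable from 1
not_new = 1.0 + epsm / 2.0      # rounds back to exactly 1

print(epsm, TINY, HUGE)
print(inv_tiny, inv_huge, huge_sq, tiny_sq)
print(new_val - 1.0, not_new - 1.0)
```

Multiplication is used instead of the power operator because Python raises an exception when `**` overflows.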
Example 1.21
(Illustration of underflow and overflow) Suppose, for the purposes of illustration, we have a system with $\beta = 10$, $t = 2$, and one digit in the exponent, so that the positive numbers in the system range from $0.10 \times 10^{-9}$ to $0.99 \times 10^{9}$, and suppose we wish to compute $N = \sqrt{x_1^2 + x_2^2}$, where $x_1 = x_2 = 10^6$. Then both $x_1$ and $x_2$ are exactly represented in the system, and the nearest floating point number in the system to $N$ is $0.14 \times 10^7$, well within range. However, $x_1^2 = 10^{12}$, larger than the maximum floating point number in the system. In older systems, an overflow usually would result in stopping the computation, while in IEEE arithmetic, the result would be assigned the symbol "Infinity." The result of adding "Infinity" to "Infinity," then taking the square root, would be "Infinity," so that $N$ would be assigned "Infinity." Similarly, if $x_1 = x_2 = 10^{-6}$, then $x_1^2 = 10^{-12}$, smaller than the smallest representable machine number, causing an "underflow." On older systems, the result is usually set to 0. On IEEE systems, if "gradual underflow" is switched on, the result either becomes a denormalized number, with less than full accuracy, or is set to 0; without gradual underflow on IEEE systems, the result is set to 0. When the result is set to 0, a value of 0 is stored in $N$, whereas the closest floating point number in the system is $0.14 \times 10^{-5}$, well within range. To avoid this type of catastrophic underflow and overflow in the computation of $N$, we may use the following scheme.
1. $s \leftarrow \max\{|x_1|, |x_2|\}$.
2. $\eta_1 \leftarrow x_1/s$; $\eta_2 \leftarrow x_2/s$.
3. $N \leftarrow s\sqrt{\eta_1^2 + \eta_2^2}$.
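The three-step scheme above can be sketched in Python with IEEE doubles in place of the toy 2-digit system. The inputs $10^{200}$ and $10^{-200}$ are chosen here so that their squares overflow and underflow in double precision; the guard for $s = 0$ is an addition not in the scheme as stated.

```python
import math

def naive_norm(x1, x2):
    """sqrt(x1^2 + x2^2) computed directly; the squares may overflow or underflow."""
    return math.sqrt(x1 * x1 + x2 * x2)

def scaled_norm(x1, x2):
    """The same quantity, computed with the scaling scheme of Example 1.21."""
    s = max(abs(x1), abs(x2))          # step 1
    if s == 0.0:                       # both inputs zero: avoid dividing by 0
        return 0.0
    eta1, eta2 = x1 / s, x2 / s        # step 2: both ratios are at most 1 in size
    return s * math.sqrt(eta1 * eta1 + eta2 * eta2)   # step 3

print(naive_norm(1e200, 1e200), scaled_norm(1e200, 1e200))
print(naive_norm(1e-200, 1e-200), scaled_norm(1e-200, 1e-200))
```

The scaled version costs two extra divisions and a multiplication, which is the usual price for staying within the exponent range.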
1.2.2.1 Input and Output
For examining the output to large numerical computations arising from mathematical models, plots, graphs, and movies comprised of such plots and graphs are often preferred over tables of values. However, to develop such models and study numerical algorithms, it is necessary to examine individual numbers. Because humans are trained to comprehend decimal numbers more easily than binary numbers, the binary format used in the machine is usually converted to a decimal format for display or printing. In many programming languages and environments (such as all versions of Fortran, C, C++, and in matlab), the format is of a form similar to $\pm d_1.d_2 d_3 \ldots d_m \mathrm{e} \pm \delta_1 \delta_2 \delta_3$, or $\pm d_1.d_2 d_3 \ldots d_m \mathrm{E} \pm \delta_1 \delta_2 \delta_3$, where the "e" or "E" denotes the "exponent" of 10. For example, -1.00e+003 denotes $-1 \times 10^3 = -1000$. Numbers are usually also input either in a standard decimal form (such as 0.001), or in this exponential format (such as 1.0e-3). (This notation originates from the earliest computers, where the only output was a printer, and the printer could only print numerical digits and the 26 upper case letters in the Roman alphabet.)
Thus, for input, a decimal fraction needs to be converted to a binary floating point number, while, for output, a binary floating point number needs to be converted to a decimal fraction. This conversion necessarily is inexact. For example, the exact decimal fraction 0.1 converts to the infinitely repeating binary expansion $(0.0\overline{0011})_2$, which needs to be rounded into the binary floating point system. The IEEE 754 standard specifies that the result of a decimal to binary conversion, within a specified range of input formats, be the nearest floating point number to the exact result, over a specified range, and that, within a specified range of formats, a binary to decimal conversion be the nearest number in the specified format (which depends on the number $m$ of decimal digits requested to be printed).
Thus, the number that one sees as output is usually not exactly the number that is represented in the computer. Furthermore, while the floating point operations on binary numbers are usually implemented in hardware or "firmware" independently of the software system, the decimal to binary and binary to decimal conversions are usually implemented separately as part of the programming language (such as Fortran, C, C++, Java, etc.) or software system (such as matlab). The individual standards for these languages, if there are any, may not specify accuracy for such conversions, and the languages sometimes do not conform to the IEEE standard. That is, the number that one sees printed may not even be the closest number in that format to the actual number.
This inexactness in conversion usually does not cause a problem, but may cause much confusion in certain instances. In those instances (such as in "debugging," or finding programming blunders), one may need to examine the binary numbers directly. One way of doing this is in an "octal," or base-8 format, in which each digit (between 0 and 7) is interpreted as a group of
three binary digits, or in hexadecimal format (where the digits are 0-9, A, B, C, D, E, F), in which each digit corresponds to a group of four binary digits.
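The inexact conversion of 0.1, and the hexadecimal view of the stored binary number, can both be inspected in Python. This is a small illustrative sketch relying on two standard features: `Decimal(float)` converts a double to its exact decimal expansion, and `float.hex` prints the stored mantissa in hexadecimal.

```python
from decimal import Decimal

stored = Decimal(0.1)          # exact value of the double nearest to 0.1
bits = (0.1).hex()             # mantissa in hexadecimal, exponent as a power of 2

# A classic symptom of the conversion error: the printed decimal values
# suggest equality, but the stored binary numbers differ.
surprising = (0.1 + 0.2 == 0.3)

print(stored)
print(bits)
print(surprising)
```

The repeating hexadecimal digit 9 in the output corresponds to the repeating binary group 1001, that is, to the repeating expansion of 0.1 described above.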
1.2.2.2 Standard Functions
To enable accurate computation of elementary functions such as sin, cos, and exp, IEEE 754 specifies that a "long" 80-bit register (with "guard digits") be available for intermediate computations. Furthermore, IEEE 754-2008, an official update to IEEE 754-1985, provides a list of functions it recommends be implemented, and specifies accuracy requirements (in terms of correct rounding), for those functions a programming language elects to implement.
REMARK 1.4 Alternative number systems, such as variable precision arithmetic, multiple precision arithmetic, rational arithmetic, and combinations of approximate and symbolic arithmetic have been investigated and implemented. These have various advantages over the traditional floating point arithmetic we have been discussing, but also have disadvantages, and usually require more time, more circuitry, or both. Eventually, with the advance of computer hardware and better understanding of these alternative systems, their use may become more ubiquitous. However, for the foreseeable future, traditional floating point number systems will be the primary tool in numerical computations.
1.3 Interval Computations
Interval computations are useful for two main purposes:
• to use floating point computations to compute mathematically rigorous bounds on an exact result (and hence, to rigorously bound roundoff error);
• to use floating point computations to compute mathematically rigorous bounds on the ranges of functions over boxes.
In complicated traditional floating point algorithms, naive arrangement of interval computations usually gives bounds that are too wide to be of practical use. For this reason, interval computations have been ignored by many. However, used cleverly and where appropriate, interval computations are powerful, and provide rigor and validation when other techniques cannot.
Interval computations are based on interval arithmetic
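1.3.1 Interval Arithmetic
In interval arithmetic, we define operations on intervals, which can be considered as ordered pairs of real numbers. We can think of each interval as representing the range of possible values of a quantity. The result of an operation is then an interval that represents the range of possible results of the operation as the first argument ranges over all points in the first interval and the second argument ranges over all values in the second interval. To state this symbolically, let $\mathbf{x} = [\underline{x}, \overline{x}]$ and $\mathbf{y} = [\underline{y}, \overline{y}]$, and define the four elementary operations by
$$\mathbf{x} \odot \mathbf{y} = \{x \odot y \mid x \in \mathbf{x} \text{ and } y \in \mathbf{y}\} \quad \text{for } \odot \in \{+, -, \times, \div\}. \qquad (1.1)$$
Interval arithmetic's usefulness derives from the fact that the mathematical characterization in Equation (1.1) is equivalent to the following operational definitions.
$$\begin{aligned}
\mathbf{x} + \mathbf{y} &= [\underline{x} + \underline{y},\ \overline{x} + \overline{y}], \\
\mathbf{x} - \mathbf{y} &= [\underline{x} - \overline{y},\ \overline{x} - \underline{y}], \\
\mathbf{x} \times \mathbf{y} &= [\min\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\},\ \max\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\}], \\
\frac{1}{\mathbf{x}} &= \left[\frac{1}{\overline{x}},\ \frac{1}{\underline{x}}\right] \quad \text{if } \underline{x} > 0 \text{ or } \overline{x} < 0, \\
\mathbf{x} \div \mathbf{y} &= \mathbf{x} \times \frac{1}{\mathbf{y}}.
\end{aligned} \qquad (1.2)$$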
The ranges of the four elementary interval arithmetic operations are exactly the ranges of the corresponding real operations, but, if such operations are composed, bounds on the ranges of real functions can be obtained. For example, if
$$f(x) = (x + 1)(x - 1), \qquad (1.3)$$
then
$$f([-2, 2]) = ([-2, 2] + 1)([-2, 2] - 1) = [-1, 3][-3, 1] = [-9, 3],$$
which contains the exact range $[-1, 3]$.
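The operational definitions (1.2) translate almost line for line into code. The following Python sketch is illustrative only: the class name `Interval` and the helper `_as_interval` are introduced here, and ordinary floating point operations are used for the end points, so roundout is ignored.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    """Closed interval [lo, hi] with the operations of (1.2); no outward rounding."""
    lo: float
    hi: float

    def __add__(self, other):
        other = _as_interval(other)
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        other = _as_interval(other)
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        other = _as_interval(other)
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

    def reciprocal(self):
        if self.lo <= 0 <= self.hi:
            raise ZeroDivisionError("interval contains 0")
        return Interval(1 / self.hi, 1 / self.lo)

    def __truediv__(self, other):
        return self * _as_interval(other).reciprocal()

def _as_interval(value):
    """Treat a plain number c as the degenerate interval [c, c]."""
    return value if isinstance(value, Interval) else Interval(value, value)

def f(x):
    return (x + 1) * (x - 1)       # expression (1.3)

print(f(Interval(-2, 2)))
```

Evaluating `f(Interval(-2, 2))` reproduces the enclosure $[-9, 3]$ obtained above.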
REMARK 1.5 In some definitions of interval arithmetic, division by intervals containing 0 is defined, consistent with (1.1). For example,
$$\frac{[1, 2]}{[-3, 4]} = \left[-\infty, -\frac{1}{3}\right] \cup \left[\frac{1}{4}, \infty\right] = \mathbb{R}^* \setminus \left(-\frac{1}{3}, \frac{1}{4}\right),$$
where $\mathbb{R}^*$ is the extended real number system,⁴ consisting of the real numbers with the two additional numbers $-\infty$ and $\infty$. This extended interval arithmetic⁵ was originally invented by William Kahan⁶ for computations with continued fractions, but has wider use than that. Although a closed system can be defined for the sets arising from this extended arithmetic, typically, the complements of intervals (i.e., the unions of two semi-infinite intervals) are immediately intersected with intervals, to obtain zero, one, or two intervals. Interval arithmetic can then proceed using (1.2).

⁴ Also known as the two-point compactification of the real numbers.
The power of interval arithmetic lies in its implementation on computers. In particular, outwardly rounded interval arithmetic allows rigorous enclosures for the ranges of operations and functions. This makes a qualitative difference in scientific computations, since the results are now intervals in which the exact result must lie. It also enables use of floating point computations for automated theorem proving.
Outward rounding can be implemented on any machine that has downward rounding and upward rounding, such as any machine that complies with the IEEE 754 standard. For example, take $\mathbf{x} + \mathbf{y} = [\underline{x} + \underline{y}, \overline{x} + \overline{y}]$. If $\underline{x} + \underline{y}$ is computed with downward rounding, and $\overline{x} + \overline{y}$ is computed with upward rounding, then the resulting interval $\mathbf{z} = [\underline{z}, \overline{z}]$ that is represented in the machine must contain the exact range of $x + y$ for $x \in \mathbf{x}$ and $y \in \mathbf{y}$. We call the expansion of the interval from rounding the lower end point down and the upper end point up roundout error.
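Python does not expose the IEEE rounding modes for its floats, so the sketch below uses a substitute technique: it computes each end point with round to nearest and then moves it one machine number outward with `math.nextafter` (available in Python 3.9 and later). This is slightly more pessimistic than true directed rounding, but it still yields a rigorous enclosure; the function name `add_outward` is introduced here for illustration.

```python
import math
from fractions import Fraction

def add_outward(x, y):
    """Enclose [x_lo + y_lo, x_hi + y_hi] between two machine numbers.

    Each rounded sum is within half a unit in the last place of the exact
    sum, so stepping one machine number outward is always sufficient.
    """
    lo = math.nextafter(x[0] + y[0], -math.inf)
    hi = math.nextafter(x[1] + y[1], math.inf)
    return (lo, hi)

x = (0.1, 0.1)                 # degenerate intervals at the doubles nearest 0.1, 0.2
y = (0.2, 0.2)
z = add_outward(x, y)

# Fractions give the exact rational sum of the two stored doubles.
exact = Fraction(x[0]) + Fraction(y[0])
print(z, Fraction(z[0]) <= exact <= Fraction(z[1]))
```

The width of `z` is the roundout error for this single operation.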
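Interval arithmetic is only subdistributive. That is, if $\mathbf{x}$, $\mathbf{y}$, and $\mathbf{z}$ are intervals, then
$$\mathbf{x}(\mathbf{y} + \mathbf{z}) \subseteq \mathbf{x}\mathbf{y} + \mathbf{x}\mathbf{z}, \quad \text{but } \mathbf{x}(\mathbf{y} + \mathbf{z}) \ne \mathbf{x}\mathbf{y} + \mathbf{x}\mathbf{z} \text{ in general.} \qquad (1.4)$$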
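A small worked instance of (1.4), with intervals chosen here purely for illustration, is $\mathbf{x} = [1, 2]$, $\mathbf{y} = [1, 1]$, $\mathbf{z} = [-1, -1]$. Applying the rules in (1.2):

```latex
\begin{aligned}
\mathbf{x}(\mathbf{y} + \mathbf{z}) &= [1, 2]\,([1, 1] + [-1, -1]) = [1, 2]\,[0, 0] = [0, 0], \\
\mathbf{x}\mathbf{y} + \mathbf{x}\mathbf{z} &= [1, 2]\,[1, 1] + [1, 2]\,[-1, -1] = [1, 2] + [-2, -1] = [-1, 1].
\end{aligned}
```

The inclusion $[0, 0] \subseteq [-1, 1]$ holds, but it is strict: in the second form $\mathbf{x}$ appears twice, and the two occurrences are treated as if they varied independently.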
As a result, algebraic expressions that would be equivalent if real values are substituted for the variables are not equivalent if interval values are used. For example, instead of writing $(x - 1)(x + 1)$ for $f(x)$ as in (1.3), suppose we write
$$f(x) = x^2 - 1, \qquad (1.5)$$
and suppose we provide a routine that computes an enclosure for the range of $x^2$ that is the exact range to within roundoff error. Such a routine could be as follows.
ALGORITHM 1.1
(Computing an interval whose end points are machine numbers and which encloses the range of $x^2$.)
⁵ There are small differences in current definitions of extended interval arithmetic. For example, in some systems, $-\infty$ and $\infty$ are not considered numbers, but just descriptive symbols. In those systems, $[1, 2]/[-3, 4] = (-\infty, -1/3] \cup [1/4, \infty) = \mathbb{R} \setminus (-1/3, 1/4)$. See [31] for a theoretical analysis of extended arithmetic.
⁶ Who also was a major contributor to the IEEE 754 standard.
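INPUT: $\mathbf{x} = [\underline{x}, \overline{x}]$.
OUTPUT: a machine-representable interval that contains the range of $x^2$ over $\mathbf{x}$.
IF $\underline{x} \ge 0$ THEN
  RETURN $[\underline{x}^2, \overline{x}^2]$, where $\underline{x}^2$ is computed with downward rounding and $\overline{x}^2$ is computed with upward rounding.
ELSE IF $\overline{x} \le 0$ THEN
  RETURN $[\overline{x}^2, \underline{x}^2]$, where $\overline{x}^2$ is computed with downward rounding and $\underline{x}^2$ is computed with upward rounding.
ELSE
  1. Compute $\underline{x}^2$ and $\overline{x}^2$ with both downward and upward rounding; that is, compute $\underline{x}^2_l$ and $\underline{x}^2_u$ such that $\underline{x}^2_l$ and $\underline{x}^2_u$ are machine representable numbers and $\underline{x}^2 \in [\underline{x}^2_l, \underline{x}^2_u]$, and compute $\overline{x}^2_l$ and $\overline{x}^2_u$ such that $\overline{x}^2_l$ and $\overline{x}^2_u$ are machine representable numbers and $\overline{x}^2 \in [\overline{x}^2_l, \overline{x}^2_u]$.
  2. RETURN $[0, \max\{\underline{x}^2_u, \overline{x}^2_u\}]$.
END IF
END ALGORITHM 1.1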
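Algorithm 1.1 can be sketched in Python as follows. As an assumption standing in for true downward and upward rounding, each product is computed with round to nearest and then moved one machine number outward with `math.nextafter` (Python 3.9 or later); the helper names are introduced here. Lower bounds are clamped at 0, which is valid because $x^2 \ge 0$.

```python
import math

def round_down(v):
    """Next machine number below v: a conservative stand-in for downward rounding."""
    return math.nextafter(v, -math.inf)

def round_up(v):
    """Next machine number above v: a conservative stand-in for upward rounding."""
    return math.nextafter(v, math.inf)

def square_enclosure(lo, hi):
    """Machine interval containing the range of x^2 over [lo, hi]."""
    if lo >= 0:                                   # interval lies right of 0
        return (max(0.0, round_down(lo * lo)), round_up(hi * hi))
    if hi <= 0:                                   # interval lies left of 0
        return (max(0.0, round_down(hi * hi)), round_up(lo * lo))
    # 0 is interior: the minimum of x^2 is exactly 0.
    return (0.0, max(round_up(lo * lo), round_up(hi * hi)))

print(square_enclosure(-2.0, 2.0))
print(square_enclosure(1.0, 3.0))
print(square_enclosure(-3.0, -1.0))
```

For $[-2, 2]$ the result is $[0, 4]$ widened by one machine number at the upper end, in contrast to the product $[-2, 2] \times [-2, 2] = [-4, 4]$ that (1.2) would give.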
With Algorithm 1.1 and rewriting $f(x)$ from (1.3) as in (1.5), we obtain
$$f([-2, 2]) = [-2, 2]^2 - 1 = [0, 4] - 1 = [-1, 3],$$
which, in this case, is equal to the exact range of $f$ over $[-2, 2]$.
In fact, this illustrates a general principle: If each variable in the expression occurs only once, then interval arithmetic gives the exact range, to within roundout error. We state this formally as
THEOREM 1.8
(Fundamental theorem of interval arithmetic.) Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with interval arithmetic. The algebraic expression or computer program can contain the four elementary operations and operations such as $x^n$, $\sin(x)$, $\exp(x)$, and $\log(x)$, etc., as long as the interval values of these functions contain their range over the input intervals. Then
1. The interval value $f(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ contains the range of $f$ over the interval vector (or box) $(\mathbf{x}_1, \ldots, \mathbf{x}_n)$.
2. If the single functions (the elementary operations and functions $x^n$, etc.) have interval values that represent their exact ranges, and if each variable $x_i$, $1 \le i \le n$, occurs only once in the expression for $f$, then the values of $f$ obtained by interval arithmetic represent the exact ranges of $f$ over the input intervals.
If the expression for $f$ contains one or more variables more than once, then overestimation of the range can occur due to interval dependency. For example, when we evaluate our example function $f([-2, 2])$ according to (1.3), the first factor, $[-1, 3]$, is the exact range of $x + 1$ for $x \in [-2, 2]$, while the second factor, $[-3, 1]$, is the exact range of $x - 1$ for $x \in [-2, 2]$. Thus, $[-9, 3]$ is the exact range of $f(x_1, x_2) = (x_1 + 1)(x_2 - 1)$ for $x_1$ and $x_2$ independent, $x_1 \in [-2, 2]$, $x_2 \in [-2, 2]$.
We now present some definitions and theorems to clarify the practical consequences of interval dependency.
DEFINITION 1.6 An expression for $f(x_1, \ldots, x_n)$ which is written so that each variable occurs only once is called a single use expression, or SUE.
Fortunately, we do not need to transform every expression into a single use expression for interval computations to be of value. In particular, the interval dependency becomes less as the widths of the input intervals become smaller. The following formal definition will help us to describe this precisely.
DEFINITION 1.7 Suppose an interval evaluation $f(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ gives $[a, b]$ as a result interval, but the exact range $\{f(x_1, \ldots, x_n),\ x_i \in \mathbf{x}_i,\ 1 \le i \le n\}$ is $[c, d] \subseteq [a, b]$. We define the excess width $E(f; \mathbf{x}_1, \ldots, \mathbf{x}_n)$ in the interval evaluation $f(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ by $E(f; \mathbf{x}_1, \ldots, \mathbf{x}_n) = (c - a) + (b - d)$.
For example, the excess width in evaluating $f(x)$ represented as $(x + 1)(x - 1)$ over $\mathbf{x} = [-2, 2]$ is $(-1 - (-9)) + (3 - 3) = 8$. In general, we have
THEOREM 1.9
Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with interval arithmetic, as in Theorem 1.8, to obtain an interval enclosure $f(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ to the range of $f$ for $x_i \in \mathbf{x}_i$, $1 \le i \le n$. Then, if $E(f; \mathbf{x}_1, \ldots, \mathbf{x}_n)$ is as in Definition 1.7, we have
$$E(f; \mathbf{x}_1, \ldots, \mathbf{x}_n) = O\left(\max_{1 \le i \le n} w(\mathbf{x}_i)\right),$$
where $w(\mathbf{x})$ denotes the width of the interval $\mathbf{x}$.
That is, the overestimation becomes less as the uncertainty in the arguments to the function becomes smaller.

Interval evaluations as in Theorem 1.9 are termed first-order interval extensions. It is not difficult to obtain second-order extensions, where required. (See Exercise below.)
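The first-order behavior of the excess width can be checked numerically. The following is a minimal Python sketch (the book's own scripts are in matlab), under the assumption that ordinary floating point is good enough for illustration: it uses no outward rounding, so it is not a rigorous interval computation. The helper names `imul`, `iadd`, `naive_f`, `exact_range`, and `excess_width` are hypothetical. It evaluates $(x+1)(x-1)$ over $[-r, r]$ and compares with the exact range of $x^2 - 1$.

```python
def imul(x, y):
    """Product of two intervals given as (lo, hi) pairs."""
    products = [a * b for a in x for b in y]
    return (min(products), max(products))


def iadd(x, c):
    """Add the scalar c to the interval x."""
    return (x[0] + c, x[1] + c)


def naive_f(x):
    """Interval evaluation of (x + 1)(x - 1); the variable x occurs twice."""
    return imul(iadd(x, 1.0), iadd(x, -1.0))


def exact_range(x):
    """Exact range of x**2 - 1 over the interval x = (lo, hi)."""
    lo, hi = x
    sq_hi = max(lo * lo, hi * hi)
    sq_lo = 0.0 if lo <= 0.0 <= hi else min(lo * lo, hi * hi)
    return (sq_lo - 1.0, sq_hi - 1.0)


def excess_width(x):
    """(c - a) + (b - d), following the definition of excess width."""
    a, b = naive_f(x)
    c, d = exact_range(x)
    return (c - a) + (b - d)


for r in (2.0, 0.2, 0.02, 0.002):
    x = (-r, r)
    width = 2.0 * r
    print(f"w(x) = {width:<6g}  excess width = {excess_width(x):.6g}")
```

For small $r$ the excess width printed is close to twice the input width, which is the proportionality that the theorem describes.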
1.3.2 Application of Interval Arithmetic: Examples

We give one such example here.

Example 1.22
Using 4-digit decimal floating point arithmetic, compute an interval enclosure for the first two digits of $e$, and prove that these two digits are correct.
Solution: The fifth degree Taylor polynomial representation for $e$ is
\[ e = 1 + 1 + \frac{1}{2!} + \frac{1}{3!} + \frac{1}{4!} + \frac{1}{5!} + \frac{1}{6!} e^{\xi} \]
for some $\xi \in [0, 1]$. If we assume we know $e < 3$ and we assume we know $e^x$ is an increasing function of $x$, then the error term is bounded by
\[ \left| \frac{1}{6!} e^{\xi} \right| \le \frac{3}{6!} < 0.005, \]
so this fifth-degree polynomial representation should be adequate. We will evaluate each term with interval arithmetic, and we will replace $e^{\xi}$ with $[1, 3]$. We obtain the following computation:
[1.000, 1.000] + [1.000, 1.000] → [2.000, 2.000]
[1.000, 1.000] / [2.000, 2.000] → [0.5000, 0.5000]
[2.000, 2.000] + [0.5000, 0.5000] → [2.500, 2.500]
[1.000, 1.000] / [6.000, 6.000] → [0.1666, 0.1667]
[2.500, 2.500] + [0.1666, 0.1667] → [2.666, 2.667]
[1.000, 1.000] / [24.00, 24.00] → [0.04166, 0.04167]
[2.666, 2.667] + [0.04166, 0.04167] → [2.707, 2.709]
[1.000, 1.000] / [120.0, 120.0] → [0.008333, 0.008334]
[2.707, 2.709] + [0.008333, 0.008334] → [2.715, 2.718]
[1.000, 1.000] / [720.0, 720.0] → [0.001388, 0.001389]
[0.001388, 0.001389] × [1, 3] → [0.001388, 0.004167]
[2.715, 2.718] + [0.001388, 0.004167] → [2.716, 2.723]
Since we used outward rounding in these computations, this constitutes a mathematical proof that $e \in [2.716, 2.723]$.

Note:

1. These computations can be done automatically on a computer, as simply as evaluating the function in floating point arithmetic. We will explain some programming techniques for this in Chapter 6, Section 6.2.

2. The solution is illustrative. More sophisticated methods, such as argument reduction, would be used in practice to bound values of $e^x$ more accurately and with fewer operations.
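As an illustration of how such a computation might be automated, here is a minimal Python sketch that emulates 4-digit decimal arithmetic with outward rounding using the standard `decimal` module. It is not the technique of Chapter 6; the helper names `iadd`, `imul`, `idiv`, `point`, and `enclose_e` are hypothetical, and intervals are stored as plain `(lower, upper)` pairs.

```python
from decimal import Decimal, Context, ROUND_FLOOR, ROUND_CEILING

# Two 4-digit contexts: lower bounds are rounded down, upper bounds up.
DOWN = Context(prec=4, rounding=ROUND_FLOOR)
UP = Context(prec=4, rounding=ROUND_CEILING)


def iadd(x, y):
    """Outwardly rounded sum of two intervals."""
    return (DOWN.add(x[0], y[0]), UP.add(x[1], y[1]))


def imul(x, y):
    """Outwardly rounded product of two intervals."""
    lo = min(DOWN.multiply(a, b) for a in x for b in y)
    hi = max(UP.multiply(a, b) for a in x for b in y)
    return (lo, hi)


def idiv(x, y):
    """Outwardly rounded quotient; assumes 0 is not contained in y."""
    lo = min(DOWN.divide(a, b) for a in x for b in y)
    hi = max(UP.divide(a, b) for a in x for b in y)
    return (lo, hi)


def point(value):
    """Degenerate interval [value, value]."""
    d = Decimal(value)
    return (d, d)


def enclose_e():
    """Enclose e from 1 + 1 + 1/2! + ... + 1/5! + (1/6!) * [1, 3]."""
    one = point(1)
    total = iadd(one, one)
    factorial = 1
    for k in range(2, 6):
        factorial *= k
        total = iadd(total, idiv(one, point(factorial)))
    remainder = imul(idiv(one, point(720)), (Decimal(1), Decimal(3)))
    return iadd(total, remainder)


lower, upper = enclose_e()
print(f"e is enclosed in [{lower}, {upper}]")
```

Using two rounding contexts keeps the direction of every rounding explicit, which is the property the proof relies on.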
Proofs of the theorems, as well as greater detail, appear in various texts on interval arithmetic. A good book on interval arithmetic is R. E. Moore's classic text [27], although numerous more recent monographs and reviews are available. A World Wide Web search on the term "interval computations" will lead to some of these.

A general introduction to interval computations is [26]. That work gives not only a complete introduction, with numerous examples and explanation of pitfalls, but also provides examples with intlab, a free matlab toolbox for interval computations, and reference material for intlab. If you have matlab available, we recommend intlab for the exercises in this book involving interval computations.
1.4 Programming Environments

Modern scientific computing (with floating point numbers) is usually done with high-level "imperative" (as opposed to "functional") programming languages. Common programming environments in use for general scientific computing today are Fortran (or FORmula TRANslation), C/C++, and matlab. Fortran is the original such language, with its origins in the late 1950's. There is a large body of high-quality publicly available software in Fortran for common computations in numerical analysis and scientific computing. Such software can be found, for example, on NETLIB at

http://www.netlib.org

Fortran has evolved over the years, becoming a modern, multifaceted language with the Fortran 2003 standard. Throughout, the emphasis by both the standardization committee and suppliers of compilers for Fortran has been features that simplify programming of solutions to large problems in numerical analysis and scientific computing, and features that enable high performance,
especially on computers that can process vectors and matrices efficiently.

The "C" language, originally developed in conjunction with the Unix operating system, was originally meant to be a higher-level language for designing and accessing the operating system, but has become more ubiquitous since then. C++, appearing in the late 1980's, was the first widely available language⁷ to allow the object-oriented programming paradigm. In recent years, computer science departments have favored teaching C++ over teaching Fortran, and Fortran has fallen out of favor in relative terms. However, Fortran is still favored in certain large-scale applications such as fluid dynamics (e.g. in weather prediction and similar simulations), and some courses are still offered in it in engineering schools. Even so, some people still think of Fortran as the now somewhat rudimentary language known as FORTRAN 77.
Reasonably high-quality compilers for both Fortran and C/C++ are available free of charge with Linux operating systems. Fortran 2003, largely implemented in these compilers, has a standardized interface to C, so functions written in C can be called from Fortran programs, and vice versa. These compilers include interactive graphical-user-interface-oriented debuggers, such as "insight," available with the Linux operating system. Commercially available compilation and debugging systems are also available under Windows.
The matlab system has become increasingly popular over the last two decades or so. matlab (or MATrix LABoratory) began in the early 1980's as a National Science Foundation project, written by Cleve Moler in FORTRAN 66, to provide an interactive environment for computing with matrices and vectors, but has since evolved to be both an interactive environment and full-featured programming language. matlab is highly favored in courses such as this because of the ease of programming, debugging, and general use (such as graphing), and because of the numerous toolboxes, supplied by both Mathworks (Cleve Moler's company) and others, for many basic computing tasks and applications. The main drawback to use of matlab in all scientific computations is that the language is interpretive, that is, matlab translates each line of a program to machine language each time that it executes the line. This makes complicated programs that involve nested iterations much slower (a factor of 60 or more) than comparable programs written in a compiled language such as C or Fortran. However, functions compiled from Fortran or C/C++ can be called from matlab. A common strategy has been to initially develop algorithms in matlab, then translate all or part of the program to a compilable language, as necessary for efficiency.
One perceived disadvantage of matlab is that it is proprietary. Undesirable possible consequences are that it is not free, and there is no official guarantee that it will be available forever, unchanged. However, its use has become so widespread in recent years that these concerns do not seem to be major. Several projects, including "Octave" and "Scilab," have produced free products that partially support the matlab programming language. The most widely distributed of these, "Octave," is integrated into Linux systems. However, the object-oriented features of Octave are rudimentary compared to those of matlab, and some toolboxes, such as intlab (which we will mention later), will not function with Octave.

⁷ With others, including Fortran, to follow.
Alternative systems sometimes used for scientific computing are computer algebra systems. Perhaps the most common of these are Mathematica and Maple, while a free such system under development is "SAGE." These systems admit a different way of thinking about programming, termed functional programming, in which rules are defined and available at all times to automatically simplify expressions that are presented to the system. (In contrast, in imperative programming, a sequence of commands is executed, one after the other.) Although these systems have become comprehensive, they are based in computations of a different character, rather than in the floating point computations and linear algebra involved in numerical analysis and scientific computing.
We will use matlab in this book to illustrate the concepts, techniques, and applications. With newer versions of matlab, a student can study how to use the system and make programs largely by using the matlab help system. The first place to turn will be the "Getting started" demos, which in newer versions are presented as videos. There are also many books devoted to use of matlab. Furthermore, we will be giving examples throughout this book.

matlab programs can be written as matlab scripts and matlab functions.

Example 1.23
The matlab script we used to produce the table following Example 1.7 (on page 8) is:
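% Square root iteration x_{k+1} = x_k/2 + a/(2 x_k) with a = 2
a = 2
x=2
xold=x
err_old = 1
for k=0:10
   k
   x
   err = x - sqrt(2)
   err
   % ratio of the error to the square of the previous error
   ratio = err/err_old^2
   err_old = err
   x = x/2 + 1/x
end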
Example 1.24
The matlab script we used to produce the table in Example 1.8 (on page 9) is:
format long
% Linearly convergent iteration x_{k+1} = x_k - x_k^2/3.5 + 2/3.5
a = 2
x=2
xold=x
err_old = 1
for k=0:25
   k
   x
   err = x - sqrt(2)
   err
   % ratio of the error to the previous error
   ratio = err/err_old
   err_old = err
   x = x - x^2/3.5 + 2/3.5
end
An excellent alternative text book that focuses on matlab functions is Cleve Moler's Numerical Computing with Matlab [25]. An on-line version, along with "m" files, etc., is currently available at http://www.mathworks.com/moler/chapters.html
1.5 Applications

The purpose of the methods and techniques in this book ultimately is to provide both accurate predictions and insight into practical problems. This includes understanding and predicting, and managing or controlling, the evolution of ecological systems and epidemics; designing and constructing durable but inexpensive bridges, buildings, roads, and water control structures; understanding chemical and physical processes; designing chemical plants and electronic components and systems; minimizing costs or maximizing delivery of products or services within companies and governments; etc. To achieve these goals, the numerical methods are a small part of the overall modeling process, which can be viewed as consisting of the following steps.
Identify the problem: This is the first step in translating an often vague situation into a mathematical problem to be solved. What questions must be answered, and how can they be quantified?

Assumptions: Which factors are to be ignored, and which are important? The real world is usually significantly more complicated than mathematical models of it, and simplifications must be made, because some factors are poorly understood, because there isn't enough data to determine some minor factors, or because it is not practical to accurately solve the resulting equations unless the model is simplified. For example, the theory of relativity and variations in the acceleration of gravity, due to the fact that the earth is not exactly round and due to the fact that the density differs from point to point on the surface of the earth, in principle will affect the trajectory of a baseball as it leaves the bat. However, such effects can be ignored when we write down a model of the trajectory of the baseball. On the other hand, we need to include such effects if we are measuring the change in distance between two satellites in a tandem orbit to detect the location of mineral deposits on the surface of the earth.

Construction: In this step, we actually translate the problem into mathematical language.

Analysis: We solve the mathematical problem. Here is where the numerical techniques in this book come into play. With more complicated models, there is an interplay between the previous three steps and this solution process: we may need to simplify the process to enable practical solution. Also, presentation of the result is important here, to maximize the usefulness of the results. In the early days of scientific computing, printouts of numbers were used, but, increasingly, results are presented as two- and three-dimensional graphs and movies.

Interpretation: The numerical solution is compared to the original problem. If it does not make sense, go back and reformulate the assumptions.

Validation: Compare the model to real data. For example, in climate models, the model might be used to predict climate changes in past years before it is used to predict future climate changes.
Note that there is an intrinsic error introduced in the modeling process (such as when certain phenomena are ignored) that is outside the scope of our study of numerical methods. Such error can only be measured indirectly, through the interpretation and validation steps. In the model solution process (the "analysis" step), errors are also introduced; as we have seen, such error consists of approximation error and roundoff error. In a study of numerical methods and numerical analysis, we quantify and find bounds on such errors. Although this may not be the major source of error, it is important to know. Consequences of this type of error might be that a good model is rejected, that incorrect conclusions are deduced about the process being modeled, etc. The authors of this book have personal experience with these events.
Errors in the modeling process can sometimes be quantified in the solution process. If the model depends on parameters that are not known precisely, but bounds on those parameters are known, knowledge of these bounds can sometimes be incorporated into the mathematical equations, and the set of possible solutions can sometimes be computed or bounded. One tool that sometimes works is interval arithmetic. Other tools, less mathematically definite but applicable in different situations, are statistical methods and computing solutions to the model for many different values of the parameter.

Throughout this book, we introduce applications from many areas.
Example 1.25
The formula for the net capacitance when two capacitors of values $x$ and $y$ are connected in series is
\[ z = \frac{xy}{x + y}. \]
Suppose the measured values of $x$ and $y$ are $x = 1$ and $y = 2$, respectively. Estimate the range of possible values of $z$, given that the true values of $x$ and $y$ are known to be within $\pm 10\%$ of the measured value.
In this example, the identification, assumptions, and construction have already been done. (It is well known how capacitances in a linear electrical circuit behave.) We are asked to analyze the error in the output of the computation, due to errors in the data. We may proceed using interval arithmetic, relying on the accuracy assumptions for the measured values. In particular, these assumptions imply that $x \in [0.9, 1.1]$ and $y \in [1.8, 2.2]$. We will plug these intervals into the expression for $z$, but we first use Theorem 1.8, part (2) as a guide to rewrite the expression for $z$ so $x$ and $y$ only occur once. (We do this so we obtain sharp bounds on the range, without overestimation.) Dividing the numerator and denominator for $z$ by $xy$, we obtain
\[ z = \frac{1}{\frac{1}{x} + \frac{1}{y}}. \]
We use the intlab toolbox⁸ for matlab to evaluate $z$. We have the following dialog in matlab's command window:

>> intvalinit('DisplayInfsup')
===> Default display of intervals by infimum/supremum
>> x = intval('[0.9,1.1]')
intval x =
[ 0.8999, 1.1001]
>> y = intval('[1.8,2.2]')
intval y =
[ 1.7999, 2.2001]
>> z = 1/(1/x + 1/y)
intval z =
[ 0.5999, 0.7334]
>> format long
>> z
intval z =
[ 0.59999999999999, 0.73333333333334]
>>

⁸ If one has matlab, intlab is available free of charge for non-commercial use from http://www.ti3.tu-harburg.de/~rump/intlab
Thus, the capacitance must lie between 0.5999 and 0.7334.

Note that $x$ and $y$ are input as strings. This is to assure that roundoff errors in converting the decimal expressions 0.9, 1.1, 1.8, and 2.2 into internal binary format are taken into account. See [26] for more examples of the use of intlab.
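To see why the rewriting step matters, the two forms of $z$ can be compared with a small Python sketch. This is only an illustration under stated assumptions: it uses ordinary floating point without outward rounding, so unlike intlab it does not give mathematically rigorous bounds, and the helper names `iadd`, `imul`, and `irecip` are hypothetical.

```python
def iadd(x, y):
    """Sum of two intervals given as (lo, hi) pairs."""
    return (x[0] + y[0], x[1] + y[1])


def imul(x, y):
    """Product of two intervals."""
    products = [a * b for a in x for b in y]
    return (min(products), max(products))


def irecip(x):
    """Reciprocal of an interval; assumes 0 is not contained in x."""
    return (1.0 / x[1], 1.0 / x[0])


x = (0.9, 1.1)
y = (1.8, 2.2)

# z = xy / (x + y): x and y each occur twice, so dependency inflates the result.
naive = imul(imul(x, y), irecip(iadd(x, y)))

# z = 1 / (1/x + 1/y): a single use expression, so the bounds are sharp.
sharp = irecip(iadd(irecip(x), irecip(y)))

print("xy/(x+y)      ->", naive)
print("1/(1/x + 1/y) ->", sharp)
```

The first form gives an enclosure of roughly [0.49, 0.90], noticeably wider than the range [0.6, 0.7333...] obtained from the single use expression.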
1.6 Exercises

1. Write down a polynomial $p(x)$ such that $|S(x) - p(x)| \le 10^{-10}$ for $-0.2 \le x \le 0.2$, where
\[ S(x) = \begin{cases} \dfrac{\sin(x)}{x} & \text{if } x \ne 0, \\ 1 & \text{if } x = 0. \end{cases} \]
Note: $\mathrm{sinc}(x) = S(\pi x) = \sin(\pi x)/(\pi x)$ is the "sinc" function (well-known in signal processing, etc.).

(a) Show that your polynomial $p$ satisfies the condition $|\mathrm{sinc}(x) - p(x)| \le 10^{-10}$ for $x \in [-0.2, 0.2]$.
Hint: You can obtain polynomial approximations with error terms for $\mathrm{sinc}(x)$ by writing down Taylor polynomials and corresponding error terms for $\sin(x)$, then dividing these by $x$. This can be easier than trying to differentiate $\mathrm{sinc}(x)$. For the proof part, you can use, for example, the Taylor polynomial remainder formula or the alternating series test, and you can use interval arithmetic to obtain bounds.

(b) Plot your polynomial approximation and $\mathrm{sinc}(x)$ on the same graph,
(i) over the interval $[-0.2, 0.2]$,
(ii) over the interval $[-3, 3]$,
(iii) over the interval $[-10, 10]$.
2. Suppose $f$ has a continuous third derivative. Show that
\[ \left| \frac{f(x + h) - f(x - h)}{2h} - f'(x) \right| = O(h^2). \]

3. Suppose $f$ has a continuous fourth derivative. Show that
\[ \left| \frac{f(x + h) - 2f(x) + f(x - h)}{h^2} - f''(x) \right| = O(h^2). \]

4. Let $a = 0.41$, $b = 0.36$, and $c = 0.7$. Assuming a 2-digit decimal computer arithmetic with rounding, show that $\frac{a - b}{c} \ne \frac{a}{c} - \frac{b}{c}$ when using this arithmetic.
5. Write down a formula relating the unit roundoff $\delta$ of Definition 1.3 (page 14) and the machine epsilon $\epsilon_m$ defined on page 20.

6. Store and run the following matlab script. What are your results? What does the script compute? Can you say anything about the computer arithmetic underlying matlab?

eps = 1
x = 1+eps
while(x~=1)
   eps = eps/2
   x = 1+eps
end
eps = eps+(2*eps)^2
y = 1+eps
y-1

7. Suppose, for illustration, we have a system with base $\beta = 10$, $t = 3$ decimal digits in the mantissa, and $L = -9$, $U = 9$ for the exponent. For example, $0.123 \times 10^4$, that is, 1230, is a machine number in this system. Suppose also that "round to nearest" is used in this system.

(a) What is HUGE for this system?
(b) What is TINY for this system?
(c) What is the machine epsilon $\epsilon_m$ for this system?
(d) Let $f(x) = \sin(x) + 1$.
i. Write down $fl(f(0))$ and $fl(f(0.0008))$ in normalized format for this toy system.
ii. Compute $fl(fl(f(0.0008)) - fl(f(0)))$. On the other hand, what is the nearest machine number to the exact value of $f(0.0008) - f(0)$?
iii. Compute $fl(fl(f(0.0008)) - fl(f(0))) / fl(0.0008)$. Compare this to the nearest machine number to the exact value of $(f(0.0008) - f(0))/0.0008$ and to $f'(0)$.
8. Let $f(x) = \dfrac{\ln(x + 1) - \ln(x)}{2}$.

(a) Use four-digit decimal arithmetic with rounding to evaluate $f(100{,}000)$.
(b) Use the Mean Value Theorem to approximate $f(x)$ in a form that avoids the loss of significant digits. Use this form to evaluate $f(x)$ for $x = 100{,}000$ once again.
(c) Compare the relative errors for the answers obtained in (a) and (b).

9. Compute the condition number of $f(x) = e^{\sqrt{x^2 - 1}}$, $x > 1$, and discuss any possible ill-conditioning.
10 Let f (x) = (sin(x))2 + x2 Use interval arithmetic to prove that thereare no solutions to f (x) = 0 for x isin [minus1 minus08]
n even, Taylor's theorem gives:

n     $P_n$                                         $R_n$
2     $x$                                           $-\frac{x^3}{3!} \cos(c_2)$
4     $x - \frac{x^3}{3!}$                          $\frac{x^5}{5!} \cos(c_4)$
6     $x - \frac{x^3}{3!} + \frac{x^5}{5!}$         $-\frac{x^7}{7!} \cos(c_6)$
...
n     --                                            $(-1)^{n/2} \frac{x^{n+1}}{(n+1)!} \cos(c_n)$

Observing that $|\cos(c_n)| \le 1$, we see that
\[ |R_n(x)| \le \frac{|x|^{n+1}}{(n + 1)!}. \]
We may thus form the following table:

n     bound on error $R_n$
2     $1.67 \times 10^{-4}$
4     $8.33 \times 10^{-8}$
6     $1.98 \times 10^{-11}$
8     $2.76 \times 10^{-15}$
10    $2.51 \times 10^{-19}$

Thus, a polynomial with the required accuracy for $x \in [-0.1, 0.1]$ is
\[ p(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \frac{x^9}{9!}. \]
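The bound table above can be reproduced in a few lines. The following Python sketch is an illustration only; it assumes the bounds are evaluated at $|x| = 0.1$, the endpoint of the interval, and the function names `remainder_bound` and `taylor_sin` are hypothetical.

```python
from math import factorial, sin


def remainder_bound(n, x):
    """Bound |x|**(n+1) / (n+1)! on the remainder R_n(x) for sin."""
    return abs(x) ** (n + 1) / factorial(n + 1)


def taylor_sin(n, x):
    """Taylor polynomial P_n of sin about 0 for even n (terms up to x**(n-1))."""
    return sum(
        (-1) ** j * x ** (2 * j + 1) / factorial(2 * j + 1)
        for j in range(n // 2)
    )


x = 0.1
print(" n   bound on |R_n|   |sin(x) - P_n(x)|")
for n in (2, 4, 6, 8, 10):
    actual = abs(sin(x) - taylor_sin(n, x))
    print(f"{n:2d}   {remainder_bound(n, x):.2e}         {actual:.2e}")
```

For the larger values of $n$, the observed difference is limited by double precision roundoff rather than by the Taylor remainder, so only the bound column is meaningful there.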
An important special case of Taylor's theorem is obtained with $n = 0$ (that is, directly from the Fundamental Theorem of Calculus).

THEOREM 1.4
(Mean value theorem) Suppose $f \in C^1[a, b]$, $x \in [a, b]$ and $y \in [a, b]$ (and, without loss of generality, $x \le y$). Then there is a $c \in [x, y] \subseteq [a, b]$ such that
\[ f(y) - f(x) = f'(c)(y - x). \]

Example 1.4
Suppose $f(1) = 1$ and $|f'(x)| \le 2$ for $x \in [1, 2]$. What are an upper bound and a lower bound on $f(2)$?
The mean value theorem tells us that
\[ f(2) = f(1) + f'(c)(2 - 1) = f(1) + f'(c) \]
for some $c \in (1, 2)$. Furthermore, the fact $|f'(x)| \le 2$ is equivalent to $-2 \le f'(x) \le 2$. Combining these facts gives
\[ 1 - 2 = -1 \le f(2) \le 1 + 2 = 3. \]
1.1.2 Big "O" Notation

We study "rates of growth" and "rates of decrease" of errors. For example, if we approximate $e^h$ by a first degree Taylor polynomial about $x = 0$, we get
\[ e^h - (1 + h) = \frac{1}{2} h^2 e^{\xi}, \]
where $\xi$ is some unknown quantity between 0 and $h$. Although we don't know exactly what $e^{\xi}$ is, we know that it is nearly constant (in this case, approximately 1) for $h$ near 0, so the error $e^h - (1 + h)$ is roughly proportional to $h^2$ for $h$ small. This approximate proportionality is often more important to know than the slowly-varying constant $e^{\xi}$. The big "O" and little "o" notation are used to describe and keep track of this approximate proportionality.

DEFINITION 1.1 Let $E(h)$ be an expression that depends on a small quantity $h$. We say that $E(h) = O(h^k)$ if there are an $\epsilon$ and $C$ such that
\[ |E(h)| \le C |h|^k \]
for all $|h| \le \epsilon$.

The "O" denotes "order." For example, if $f(h) = O(h^2)$, we say that "$f$ exhibits order 2 convergence to 0 as $h$ tends to 0."
Example 1.5
$E(h) = e^h - h - 1$. Then $E(h) = O(h^2)$.

PROOF By Taylor's Theorem,
\[ e^h = e^0 + e^0 (h - 0) + \frac{h^2}{2} e^{\xi} \]
for some $\xi$ between 0 and $h$. Thus,
\[ E(h) = e^h - 1 - h \le h^2 \left( \frac{e^1}{2} \right) \quad \text{and} \quad E(h) \ge 0 \]
for $|h| \le 1$; that is, $\epsilon = 1$ and $C = e/2$ work.
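The constants in this proof can be checked numerically. The short Python sketch below is an illustration only; the function name `ratio` is hypothetical, and `math.expm1` is used for $e^h - 1$ to limit cancellation for small $h$. It prints $E(h)/h^2$ for several $h$ with $|h| \le 1$.

```python
from math import e, expm1


def ratio(h):
    """E(h) / h**2 for E(h) = exp(h) - 1 - h, with h nonzero."""
    return (expm1(h) - h) / h**2


samples = (1.0, -1.0, 0.1, -0.1, 0.01, 0.001)
for h in samples:
    print(f"h = {h:7.3f}   E(h)/h^2 = {ratio(h):.6f}")
print(f"C = e/2 = {e / 2:.6f}")
```

Every printed ratio lies between 0 and $e/2$, and the ratios approach $1/2$ as $h$ tends to 0, in line with the factor $\frac{1}{2} e^{\xi}$ in the Taylor remainder.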
Example 1.6
Show that $\left| \frac{f(x+h) - f(x)}{h} - f'(x) \right| = O(h)$ for $x, x + h \in [a, b]$, assuming that $f$ has two continuous derivatives at each point in $[a, b]$.

PROOF
\[
\begin{aligned}
\left| \frac{f(x + h) - f(x)}{h} - f'(x) \right|
&= \left| \frac{f(x) + f'(x) h + \int_x^{x+h} (x + h - t) f''(t)\,dt - f(x)}{h} - f'(x) \right| \\
&= \frac{1}{h} \left| \int_x^{x+h} (x + h - t) f''(t)\,dt \right|
\le \max_{a \le t \le b} |f''(t)| \, \frac{h}{2} = ch.
\end{aligned}
\]
1.1.3 Convergence Rates

DEFINITION 1.2 Let $\{x_k\}$ be a sequence with limit $x^*$. If there are constants $C$ and $\alpha$ and an integer $N$ such that $|x_{k+1} - x^*| \le C |x_k - x^*|^{\alpha}$ for $k \ge N$, we say that the rate of convergence is of order at least $\alpha$. If $\alpha = 1$ (with $C < 1$), the rate is said to be linear. If $\alpha = 2$, the rate is said to be quadratic.
Example 1.7
A sequence sometimes learned in elementary classes for computing the square root of a number $a$ is
\[ x_{k+1} = \frac{x_k}{2} + \frac{a}{2 x_k}. \]
We have
\[
\begin{aligned}
x_{k+1} - \sqrt{a} &= \frac{x_k}{2} + \frac{a}{2 x_k} - \sqrt{a} \\
&= x_k - \frac{x_k^2 - a}{2 x_k} - \sqrt{a} \\
&= (x_k - \sqrt{a}) - (x_k - \sqrt{a}) \frac{x_k + \sqrt{a}}{2 x_k} \\
&= (x_k - \sqrt{a}) \left( 1 - \frac{x_k + \sqrt{a}}{2 x_k} \right) \\
&= (x_k - \sqrt{a}) \, \frac{x_k - \sqrt{a}}{2 x_k} \\
&= \frac{1}{2 x_k} (x_k - \sqrt{a})^2 \\
&\approx \frac{1}{2 \sqrt{a}} (x_k - \sqrt{a})^2
\end{aligned}
\]
for $x_k$ near $\sqrt{a}$, thus showing that the convergence rate is quadratic.
Quadratic convergence is very fast. We can think of quadratic convergence, with $C \approx 1$, as doubling the number of significant figures on each iteration. (In contrast, linear convergence with $C = 0.1$ adds one decimal digit of accuracy to the approximation on each iteration.) For example, if we use the square root computation from Example 1.7 with $a = 2$, and starting with $x_0 = 2$, we obtain the following table:

k    $x_k$                  $x_k - \sqrt{2}$           $(x_k - \sqrt{2}) / (x_{k-1} - \sqrt{2})^2$
0    2                      $0.5858 \times 10^{0}$     --
1    1.5                    $0.8579 \times 10^{-1}$    0.2500
2    1.416666666666667      $0.2453 \times 10^{-2}$    0.3333
3    1.414215686274510      $0.2123 \times 10^{-6}$    0.3529
4    1.414213562374690      $0.1594 \times 10^{-13}$   0.3535
5    1.414213562373095      $0.2204 \times 10^{-17}$   --

Comparing each iterate with $\sqrt{2} = 1.414213562373095\ldots$ shows which of its digits are correct. This table illustrates that the number of correct digits more than doubles on each iteration. In fact, the multiplying factor $C$ for the quadratic convergence appears to be approaching 0.3535. (The last error ratio is not meaningful in this sense, because only roughly 16 digits were carried in the computation.) Based on our analysis, the limiting value of $C$ should be about $1/(2\sqrt{2}) \approx 0.353553390593274$. (We explain how we computed the table at the end of this chapter.)
Example 1.8
As an example of linear convergence, consider the iteration
\[ x_{k+1} = x_k - \frac{x_k^2}{3.5} + \frac{2}{3.5}, \]
which converges to $\sqrt{2}$. We obtain the following table:

k    $x_k$                  $x_k - \sqrt{2}$           $(x_k - \sqrt{2}) / (x_{k-1} - \sqrt{2})$
0    2                      $0.5858 \times 10^{0}$     --
1    1.428571428571429      $0.1436 \times 10^{-1}$    $0.2451 \times 10^{-1}$
2    1.416909620991254      $0.2696 \times 10^{-2}$    0.1878
3    1.414728799831946      $0.5152 \times 10^{-3}$    0.1911
4    1.414312349239392      $0.9879 \times 10^{-4}$    0.1917
5    1.414232514607664      $0.1895 \times 10^{-4}$    0.1918
6    1.414217198786659      $0.3636 \times 10^{-5}$    0.1919
7    1.414214260116949      $0.6955 \times 10^{-6}$    0.1919
8    1.414213696254626      $0.1339 \times 10^{-6}$    0.1919
...
19   1.414213562373097      $0.1554 \times 10^{-14}$   --

Here, the constant $C$ in the linear convergence, to four significant digits, appears to be $0.1919 \approx 1/5$. That is, the error is reduced by approximately a factor of 5 each iteration. We can think of this as obtaining one more correct base-5 digit on each iteration.
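The error ratios for the two iterations in Examples 1.7 and 1.8 can be recomputed with a short Python sketch (the book's own scripts are in matlab). This is an illustration only: the function names `quadratic_ratios` and `linear_ratios` are hypothetical, and double precision limits how many steps give meaningful ratios.

```python
from math import sqrt

ROOT = sqrt(2.0)


def quadratic_ratios(steps=4):
    """Ratios err_k / err_{k-1}**2 for x_{k+1} = x_k/2 + 1/x_k, x_0 = 2."""
    x = 2.0
    ratios = []
    for _ in range(steps):
        err_old = x - ROOT
        x = x / 2 + 1 / x
        ratios.append((x - ROOT) / err_old**2)
    return ratios


def linear_ratios(steps=8):
    """Ratios err_k / err_{k-1} for x_{k+1} = x_k - x_k**2/3.5 + 2/3.5."""
    x = 2.0
    ratios = []
    for _ in range(steps):
        err_old = x - ROOT
        x = x - x**2 / 3.5 + 2 / 3.5
        ratios.append((x - ROOT) / err_old)
    return ratios


print("quadratic:", [f"{r:.4f}" for r in quadratic_ratios()])
print("linear:   ", [f"{r:.4f}" for r in linear_ratios()])
print("1/(2*sqrt(2)) =", 1 / (2 * ROOT), "  1 - 2*sqrt(2)/3.5 =", 1 - 2 * ROOT / 3.5)
```

The last printed line shows the limiting constants suggested by the analysis: $1/(2\sqrt{2})$ for the quadratic iteration, and the derivative of the linear iteration function at $\sqrt{2}$, namely $1 - 2\sqrt{2}/3.5$.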
1.2 Computer Arithmetic

In numerical solution of mathematical problems, two common types of error are:

1. Method (algorithm or truncation) error. This is the error due to approximations made in the numerical method.

2. Rounding error. This is the error made due to the finite number of digits available on a computer.
Example 1.9
By the mean value theorem for integrals (Theorem 1.2, as in Example 1.6 on page 7), if $f \in C^2[a, b]$, then
\[ f'(x) = \frac{f(x + h) - f(x)}{h} - \frac{1}{h} \int_x^{x+h} f''(t)(x + h - t)\,dt \]
and
\[ \left| \frac{1}{h} \int_x^{x+h} f''(t)(x + h - t)\,dt \right| \le ch. \]
Thus, $f'(x) \approx (f(x + h) - f(x))/h$, and the error is $O(h)$. We will call this the method error or truncation error, as opposed to roundoff errors due to using machine approximations.
Now consider $f(x) = \ln x$ and approximate $f'(3) \approx \frac{\ln(3 + h) - \ln 3}{h}$ for $h$ small, using a calculator having 11 digits. The following results were obtained.

h            $\frac{\ln(3 + h) - \ln(3)}{h}$    Error $= \frac{1}{3} - \frac{\ln(3 + h) - \ln(3)}{h} = O(h)$
$10^{-1}$    0.3278982                          $5.44 \times 10^{-3}$
$10^{-2}$    0.332779                           $5.54 \times 10^{-4}$
$10^{-3}$    0.3332778                          $5.55 \times 10^{-5}$
$10^{-4}$    0.333328                           $5.33 \times 10^{-6}$
$10^{-5}$    0.333330                           $3.33 \times 10^{-6}$
$10^{-6}$    0.333300                           $3.33 \times 10^{-5}$
$10^{-7}$    0.333                              $3.33 \times 10^{-4}$
$10^{-8}$    0.33                               $3.33 \times 10^{-3}$
$10^{-9}$    0.3                                $3.33 \times 10^{-2}$
$10^{-10}$   0.0                                $3.33 \times 10^{-1}$
One sees that, in the first four steps, the error decreases by a factor of 10 as $h$ is decreased by a factor of 10. (That is, the method error dominates.) However, starting with $h = 0.00001$, the error increases. (The error due to a finite number of digits, i.e., roundoff error, dominates.)
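The same experiment can be repeated in double precision with a short Python sketch. This is an illustration only, not the 11-digit calculator computation reported in the table: because about 16 digits are carried, the turning point occurs at a smaller $h$, and the individual error values differ from the table once roundoff matters. The function name `forward_error` is hypothetical.

```python
from math import log


def forward_error(h, x=3.0):
    """|f'(x) - (f(x+h) - f(x))/h| for f = ln, where f'(x) = 1/x."""
    approx = (log(x + h) - log(x)) / h
    return abs(1.0 / x - approx)


# h = 10**-1, 10**-2, ..., 10**-15
errors = {k: forward_error(10.0 ** -k) for k in range(1, 16)}
for k, err in errors.items():
    print(f"h = 1e-{k:02d}   error = {err:.3e}")
```

The printed errors first shrink roughly in proportion to $h$ and then grow again once the subtraction $\ln(3 + h) - \ln(3)$ loses most of its significant digits.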
There are two possible ways to reduce rounding error:

1. The method error can be reduced by using a more accurate method. This allows larger $h$ to be used, thus avoiding roundoff error. Consider
\[ f'(x) = \frac{f(x + h) - f(x - h)}{2h} + \text{error}, \quad \text{where error is } O(h^2). \]

h        $\frac{\ln(3 + h) - \ln(3 - h)}{2h}$    error
0.1      0.3334568                               $1.24 \times 10^{-4}$
0.01     0.3333345                               $1.23 \times 10^{-6}$
0.001    0.3333333                               $1.91 \times 10^{-8}$

The error decreases by a factor of 100 as $h$ is decreased by a factor of 10.

2. Rounding error can be reduced by using more digits of accuracy, such as using double precision (or multiple precision) arithmetic.
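Both remedies can be seen together in a small Python sketch, which evaluates the central difference for $f = \ln$ at $x = 3$ in double precision. It is an illustration under that assumption, so the last digits differ from the calculator table above; the function name `central_error` is hypothetical.

```python
from math import log


def central_error(h, x=3.0):
    """|f'(x) - (f(x+h) - f(x-h))/(2h)| for f = ln, where f'(x) = 1/x."""
    approx = (log(x + h) - log(x - h)) / (2 * h)
    return abs(1.0 / x - approx)


previous = None
for h in (0.1, 0.01, 0.001):
    err = central_error(h)
    factor = "" if previous is None else f"   reduction factor = {previous / err:.1f}"
    print(f"h = {h:<6g} error = {err:.3e}{factor}")
    previous = err
```

With double precision the reduction factor stays close to 100 for all three step sizes, whereas the third row of the calculator table is already affected by roundoff.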
To fully understand and avoid roundoff error, we should study some details of how computers and calculators represent and work with approximate numbers.

1.2.1 Floating Point Arithmetic and Rounding Error

Let $\beta$ be a positive integer, the base of the computer system. (Usually, $\beta = 2$ (binary) or $\beta = 16$ (hexadecimal).) Suppose a number $x$ has the exact base representation
\[ x = (\pm 0.\alpha_1 \alpha_2 \alpha_3 \cdots \alpha_t \alpha_{t+1} \cdots)_{\beta} \, \beta^m = \pm q \beta^m, \]
where $q$ is the mantissa, $\beta$ is the base, $m$ is the exponent, $1 \le \alpha_1 \le \beta - 1$ and $0 \le \alpha_i \le \beta - 1$ for $i > 1$.

On a computer, we are restricted to a finite set of floating-point numbers $F = F(\beta, t, L, U)$ of the form $x^* = (\pm 0.a_1 a_2 \cdots a_t)_{\beta} \, \beta^m$, where $1 \le a_1 \le \beta - 1$, $0 \le a_i \le \beta - 1$ for $2 \le i \le t$, $L \le m \le U$, and $t$ is the number of digits. (In most floating point systems, $L$ is about $-64$ to $-1000$ and $U$ is about 64 to 1000.)
Example 1.10
(binary) $\beta = 2$
\[ x^* = (0.1011)_2 \, 2^3 = \left( 1 \times \frac{1}{2} + 0 \times \frac{1}{4} + 1 \times \frac{1}{8} + 1 \times \frac{1}{16} \right) \times 8 = \frac{11}{2} = 5.5 \text{ (decimal)}. \]

REMARK 1.1 Most numbers cannot be exactly represented on a computer. Consider $x = 10.1 = 1010.0001\,1001\,1001 \ldots$ ($\beta = 2$). If $L = -127$, $U = 127$, $t = 24$, and $\beta = 2$, then $x \approx x^* = (0.10100001\,1001\,1001\,1001\,1001)_2 \, 2^4$.
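The remark can be observed directly on a typical machine. The following Python sketch is an illustration under the assumption that Python floats are IEEE double precision numbers ($\beta = 2$, $t = 53$ rather than the $t = 24$ of the remark). It prints the exact value that is actually stored for the literal 10.1.

```python
from decimal import Decimal
from fractions import Fraction

stored = Fraction(10.1)     # exact rational value of the double nearest to 10.1
exact = Fraction(101, 10)   # the real number 10.1

print("stored value :", Decimal(10.1))
print("stored - 10.1:", float(stored - exact))
print("representable exactly?", stored == exact)
```

The stored value differs from 10.1 by at most half a unit in the last place, which for numbers between 8 and 16 in double precision is $2^{-50}$.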
Question: Given a real number $x$, how do we define a floating point number $fl(x)$ in $F$, such that $fl(x)$ is close to $x$?

On modern machines, one of the following four ways is used to approximate a real number $x$ by a machine-representable number $fl(x)$.

round down: $fl(x) = x\downarrow$, the nearest machine representable number to the real number $x$ that is less than or equal to $x$.

round up: $fl(x) = x\uparrow$, the nearest machine number to the real number $x$ that is greater than or equal to $x$.

round to nearest: $fl(x)$ is the nearest machine number to the real number $x$.

round to zero, or "chopping": $fl(x)$ is the nearest machine number to the real number $x$ that is closer to 0 than $x$. The term "chopping" is because we simply "chop" the expansion of the real number, that is, we simply ignore the digits in the expansion of $x$ beyond the $t$-th one.

The default on modern systems is usually round to nearest, although chopping is faster or requires less circuitry. Round down and round up may be used, but with care, to produce results from a string of computations that are guaranteed to be less than or greater than the exact result.
Example 1.11
$\beta = 10$, $t = 5$, $x = 0.12345666\cdots \times 10^7$. Then
\[ fl(x) = 0.12345 \times 10^7 \text{ (chopping)}, \]
\[ fl(x) = 0.12346 \times 10^7 \text{ (rounded to nearest)}. \]
(In this case, round down corresponds to chopping and round up corresponds to round to nearest.)
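The four rounding modes can be tried out with Python's `decimal` module, which supports a configurable number of decimal digits. The sketch below is an illustration only: the value `1234566.6` is a truncation of $0.12345666\cdots \times 10^7$ chosen for the example, the helper name `fl` is hypothetical, and `ROUND_HALF_EVEN` stands in for round to nearest.

```python
from decimal import (
    Decimal,
    Context,
    ROUND_DOWN,
    ROUND_HALF_EVEN,
    ROUND_FLOOR,
    ROUND_CEILING,
)

x = Decimal("1234566.6")   # truncated stand-in for 0.12345666... x 10^7


def fl(value, rounding, digits=5):
    """Round value to the given number of significant decimal digits."""
    return Context(prec=digits, rounding=rounding).plus(value)


modes = {
    "round down": ROUND_FLOOR,
    "round up": ROUND_CEILING,
    "round to nearest": ROUND_HALF_EVEN,
    "chopping": ROUND_DOWN,
}
for name, mode in modes.items():
    print(f"{name:17s} fl(x) = {fl(x, mode)}")
```

For this positive $x$, round down and chopping agree, as do round up and round to nearest, matching the parenthetical remark in the example.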
See Figure 1.2 for an example with $\beta = 10$ and $t = 1$. In that figure, the exhibited floating point numbers are $(0.1) \times 10^1$, $(0.2) \times 10^1$, ..., $(0.9) \times 10^1$, $0.1 \times 10^2$.

[Figure: number line from $\beta^{m-1} = 1$ to $\beta^m = 10^1$, with tick marks at the successive floating point numbers, spaced $\beta^{m-t} = 10^0 = 1$ apart.]

FIGURE 1.2: An example floating point system: $\beta = 10$, $t = 1$, and $m = 1$.
Example 112
Let a = 0410 b = 0000135 and c = 0000431 Assuming 3digit decimalcomputer arithmetic with rounding to nearest does a + (b + c) = (a + b) + cwhen using this arithmetic
8132019 num chap 1
httpslidepdfcomreaderfullnumchap1 1338
Mathematical Review and Computer Arithmetic 13
Following the "rounding to nearest" definition of $fl$, we emulate the operations a machine would do, as follows:
\[ a \leftarrow 0.410 \times 10^0, \quad b \leftarrow 0.135 \times 10^{-3}, \quad c \leftarrow 0.431 \times 10^{-3}, \]
and
\[ \begin{aligned}
fl(b + c) &= fl(0.135 \times 10^{-3} + 0.431 \times 10^{-3}) \\
&= fl(0.566 \times 10^{-3}) \\
&= 0.566 \times 10^{-3},
\end{aligned} \]
so
\[ \begin{aligned}
fl(a + 0.566 \times 10^{-3}) &= fl(0.410 \times 10^0 + 0.566 \times 10^{-3}) \\
&= fl(0.410 \times 10^0 + 0.000566 \times 10^0) \\
&= fl(0.410566 \times 10^0) \\
&= 0.411 \times 10^0.
\end{aligned} \]
On the other hand,
\[ \begin{aligned}
fl(a + b) &= fl(0.410 \times 10^0 + 0.135 \times 10^{-3}) \\
&= fl(0.410000 \times 10^0 + 0.000135 \times 10^0) \\
&= fl(0.410135 \times 10^0) \\
&= 0.410 \times 10^0,
\end{aligned} \]
so
\[ \begin{aligned}
fl(0.410 \times 10^0 + c) &= fl(0.410 \times 10^0 + 0.431 \times 10^{-3}) \\
&= fl(0.410 \times 10^0 + 0.000431 \times 10^0) \\
&= fl(0.410431 \times 10^0) \\
&= 0.410 \times 10^0 \ne 0.411 \times 10^0.
\end{aligned} \]
Thus, the associative law does not hold for floating point addition with "round to nearest." Furthermore, this illustrates that accuracy is improved if numbers of like magnitude are added first in a sum.
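The computation in Example 1.12 can be replayed mechanically. The following is a minimal sketch, assuming Python's `decimal` module with a 3-digit context as the emulated machine; the variable names are illustrative.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

# 3-digit decimal arithmetic with rounding to nearest.
ctx = Context(prec=3, rounding=ROUND_HALF_EVEN)

a, b, c = Decimal("0.410"), Decimal("0.000135"), Decimal("0.000431")

# Each ctx.add call rounds its result to 3 significant digits, like fl(. + .).
small_first = ctx.add(a, ctx.add(b, c))   # a + (b + c)
large_first = ctx.add(ctx.add(a, b), c)   # (a + b) + c

print("a + (b + c) =", small_first)
print("(a + b) + c =", large_first)
```

Adding the two small numbers first lets their combined contribution survive the rounding, which is the point made at the end of the example.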
The following error bound is useful in some analyses.

THEOREM 1.5
\[ |x - fl(x)| \le \frac{1}{2} |x| \beta^{1-t} p, \]
where $p = 1$ for rounding and $p = 2$ for chopping.
DEFINITION 1.3 $\delta = \frac{p}{2}\beta^{1-t}$ is called the unit roundoff error.

Let $\epsilon = \frac{fl(x) - x}{x}$. Then $fl(x) = (1 + \epsilon)x$, where $|\epsilon| \le \delta$. With this, we have the following.

THEOREM 1.6
Let $\odot$ denote the operation $+$, $-$, $\times$, or $\div$, and let $x$ and $y$ be machine numbers. Then
\[ fl(x \odot y) = (x \odot y)(1 + \epsilon), \quad \text{where } |\epsilon| \le \delta = \frac{p}{2}\beta^{1-t}. \]

Roundoff error that accumulates as a result of a sequence of arithmetic operations can be analyzed using this theorem. Such an analysis is called forward error analysis.

Because of the properties of floating point arithmetic, it is unreasonable to demand strict tolerances when the exact result is too large.
Example 1.13
Suppose $\beta = 10$ and $t = 3$ (3-digit decimal arithmetic), and suppose we wish to compute $10^4\pi$ with a computed value $x$ such that $|10^4\pi - x| < 10^{-2}$. The closest floating point number in our system to $10^4\pi$ is $x = 0.314 \times 10^5 = 31400$. However, $|10^4\pi - x| = 15.926\ldots$. Hence, it is impossible to find a number $x$ in the system with $|10^4\pi - x| < 10^{-2}$.

The error $|10^4\pi - x|$ in this example is called the absolute error in approximating $10^4\pi$. We see that absolute error is not an appropriate measure of error when using floating point arithmetic. For this reason, we use relative error:
DEFINITION 1.4 Let $x^*$ be an approximation to $x$. Then $|x - x^*|$ is called the absolute error, and $\left|\frac{x - x^*}{x}\right|$ is called the relative error.

For example, $\left|\frac{x - fl(x)}{x}\right| \le \delta = \frac{p}{2}\beta^{1-t}$ (unit roundoff error).
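The contrast between the two error measures in Example 1.13 can be checked numerically. This is a rough sketch: the helper `fl` is hypothetical, and the "exact" value is itself only the double precision approximation of $10^4\pi$, which is accurate enough for the comparison.

```python
import math
from decimal import Decimal, Context, ROUND_HALF_EVEN

def fl(x, t):
    """Round the float x to t significant decimal digits (round to nearest)."""
    ctx = Context(prec=t, rounding=ROUND_HALF_EVEN)
    return float(ctx.create_decimal(Decimal(x)))

exact = 1.0e4 * math.pi          # double precision stand-in for 10^4 * pi
approx = fl(exact, 3)            # nearest 3-digit decimal machine number

abs_err = abs(exact - approx)
rel_err = abs_err / abs(exact)
delta = 0.5 * 10.0 ** (1 - 3)    # unit roundoff for beta = 10, t = 3, p = 1

print("fl(10^4 pi)    =", approx)
print("absolute error =", abs_err)
print("relative error =", rel_err, "<= delta =", delta)
```

The absolute error is far above the requested $10^{-2}$, yet the relative error stays below the unit roundoff, as Definition 1.4 and the bound after it predict.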
1.2.1.1 Expression Evaluation and Condition Numbers

We now examine some common situations in which roundoff error can become large, and explain how to avoid many of these situations.
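Example 1.14
$\beta = 10$, $t = 4$, $p = 1$. (Thus, $\delta = \frac{1}{2} 10^{-3} = 0.0005$.) Let $x = 0.5795 \times 10^5$, $y = 0.6399 \times 10^5$. Then
\[ fl(x + y) = 0.1219 \times 10^6 = (x + y)(1 + \epsilon_1), \quad \epsilon_1 \approx -3.28 \times 10^{-4}, \quad |\epsilon_1| < \delta, \]
and
\[ fl(xy) = 0.3708 \times 10^{10} = (xy)(1 + \epsilon_2), \quad \epsilon_2 \approx -5.95 \times 10^{-5}, \quad |\epsilon_2| < \delta. \]
(Note: $x + y = 0.12194 \times 10^6$, $xy = 0.37082205 \times 10^{10}$.)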
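Example 1.15
Suppose $\beta = 10$ and $t = 4$ (4-digit arithmetic), and suppose $x_1 = 10000$ and $x_2 = x_3 = \cdots = x_{1001} = 1$. Then
\[ \begin{aligned}
fl(x_1 + x_2) &= 10000, \\
fl(x_1 + x_2 + x_3) &= 10000, \\
&\;\;\vdots \\
fl\Bigl(\sum_{i=1}^{1001} x_i\Bigr) &= 10000,
\end{aligned} \]
when we sum forward from $x_1$. But, going backwards,
\[ \begin{aligned}
fl(x_{1001} + x_{1000}) &= 2, \\
fl(x_{1001} + x_{1000} + x_{999}) &= 3, \\
&\;\;\vdots \\
fl\Bigl(\sum_{i=1001}^{1} x_i\Bigr) &= 11000,
\end{aligned} \]
which is the correct sum.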
This example illustrates the point that large relative errors occur when a large number of small numbers is added to a large number, or when a very large number of small, almost-equal numbers is added. To avoid such large relative errors, one can sum from the smallest number to the largest number. However, this will not work if the numbers are all approximately equal. In such cases, one possibility is to group the numbers into sets of two adjacent numbers, summing two almost equal numbers together. One then groups those results into sets of two and sums these together, continuing until the total sum is reached. In this scheme, two almost equal numbers are always being summed, and the large relative error from repeatedly summing a small number to a large number is avoided.
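Both effects described above can be reproduced in emulated 4-digit arithmetic. The sketch below assumes Python's `decimal` module as the machine; `forward_sum` and `pairwise_sum` are illustrative names, and the pairwise routine splits the list in halves recursively, which is one common way to realize the grouping-by-twos idea (it expects a nonempty list).

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

CTX = Context(prec=4, rounding=ROUND_HALF_EVEN)   # beta = 10, t = 4

def forward_sum(values):
    """Add the values one after another, rounding after every addition."""
    total = Decimal(0)
    for v in values:
        total = CTX.add(total, v)
    return total

def pairwise_sum(values):
    """Sum by recursive halving, so operands of similar size are added."""
    n = len(values)
    if n == 1:
        return CTX.plus(values[0])
    mid = n // 2
    return CTX.add(pairwise_sum(values[:mid]), pairwise_sum(values[mid:]))

# Example 1.15: one large number followed by 1000 ones.
xs = [Decimal(10000)] + [Decimal(1)] * 1000
print("large number first:", forward_sum(xs))
print("small numbers first:", forward_sum(reversed(xs)))

# Many equal numbers: sorting cannot help, but pairwise grouping does.
ones = [Decimal(1)] * 16384
print("forward :", forward_sum(ones))
print("pairwise:", pairwise_sum(ones))
```

In the last comparison the forward sum stalls once the running total needs more than four digits, while the pairwise sum is off only by the final rounding of 16384 to four digits.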
Example 1.16
$x_1 = 15.314768$, $x_2 = 15.314899$, $\beta = 10$, $t = 6$ (6-digit decimal accuracy). Then $x_2 - x_1 \approx fl(x_2) - fl(x_1) = 15.3149 - 15.3148 = 0.0001$. Thus,
\[ \left| \frac{x_2 - x_1 - (fl(x_2) - fl(x_1))}{x_2 - x_1} \right| = \frac{0.000131 - 0.0001}{0.000131} = 0.237, \]
that is, a relative error of 23.7%.

This example illustrates that large relative errors can occur when two nearly equal numbers are subtracted on a computer. Sometimes, an algorithm can be modified to reduce rounding error occurring from this source, as the following example illustrates.
Example 1.17
Consider finding the roots of $ax^2 + bx + c = 0$, where $b^2$ is large compared with $|4ac|$. The most common formula for the roots is
\[ x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}. \]
Consider $x^2 + 100x + 1 = 0$, $\beta = 10$, $t = 4$, $p = 2$, and 4-digit chopped arithmetic. Then
\[ x_1 = \frac{-100 + \sqrt{9996}}{2}, \quad x_2 = \frac{-100 - \sqrt{9996}}{2}, \]
but $\sqrt{9996} \approx 99.97$ (4-digit arithmetic, chopped). Thus,
\[ x_1 \approx \frac{-100 + 99.97}{2}, \quad x_2 \approx \frac{-100 - 99.97}{2}. \]
Hence, $x_1 \approx -0.015$, $x_2 \approx -99.98$, but $x_1 = -0.010001\ldots$ and $x_2 = -99.989999\ldots$, so the relative errors in $x_1$ and $x_2$ are 50% and 0.01%, respectively.
Let's change the algorithm. Assume $b \ge 0$ (we can always make $b \ge 0$). Then
\[ x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \cdot \frac{-b - \sqrt{b^2 - 4ac}}{-b - \sqrt{b^2 - 4ac}} = \frac{4ac}{2a\bigl(-b - \sqrt{b^2 - 4ac}\bigr)} = \frac{-2c}{b + \sqrt{b^2 - 4ac}}, \]
and
\[ x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a} \quad \text{(the same as before)}. \]
Then, for the above values,
\[ x_1 = \frac{-2(1)}{100 + \sqrt{9996}} \approx \frac{-2}{100 + 99.97} = -0.0100. \]
Now, the relative error in $x_1$ is also 0.01%.
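The two formulas for $x_1$ can be compared in emulated 4-digit chopped arithmetic. This is a sketch under stated assumptions: Python's `decimal` module plays the machine, every elementary operation is chopped (slightly stricter than the hand computation, which chops only the square root and the final result), and the function names are illustrative. Because `decimal`'s square root always rounds to nearest, the root is formed with extra digits and then chopped.

```python
from decimal import Decimal, Context, ROUND_DOWN

CHOP = Context(prec=4, rounding=ROUND_DOWN)   # beta = 10, t = 4, chopping
WIDE = Context(prec=30)                       # extra digits for the square root

def chopped_sqrt(x):
    """Square root chopped to 4 digits (computed accurately, then chopped)."""
    return CHOP.create_decimal(x.sqrt(WIDE))

def discriminant(a, b, c):
    four_ac = CHOP.multiply(CHOP.multiply(Decimal(4), a), c)
    return CHOP.subtract(CHOP.multiply(b, b), four_ac)

def x1_standard(a, b, c):
    """(-b + sqrt(b^2 - 4ac)) / (2a): subtracts two nearly equal numbers."""
    s = chopped_sqrt(discriminant(a, b, c))
    return CHOP.divide(CHOP.add(CHOP.minus(b), s), CHOP.multiply(Decimal(2), a))

def x1_rationalized(a, b, c):
    """-2c / (b + sqrt(b^2 - 4ac)): no cancellation when b >= 0."""
    s = chopped_sqrt(discriminant(a, b, c))
    return CHOP.divide(CHOP.multiply(Decimal(-2), c), CHOP.add(b, s))

a, b, c = Decimal(1), Decimal(100), Decimal(1)
print("standard formula    :", x1_standard(a, b, c))
print("rationalized formula:", x1_rationalized(a, b, c))
```

The rationalized form replaces the subtraction of nearly equal quantities by an addition, which is why its result agrees with the true root $-0.010001\ldots$ to the digits carried.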
Let us now consider error in function evaluation. Consider a single valued function $f(x)$, and let $x^* = fl(x)$ be the floating point approximation of $x$. Therefore, the machine evaluates $f(x^*) = f(fl(x))$, which is an approximation to $f(x)$. Then the perturbation in $f(x)$ for small perturbations in $x$ can be computed via Taylor's formula. This is illustrated in the next theorem.

THEOREM 1.7 The relative error in functional evaluation is
\[ \left| \frac{f(x) - f(x^*)}{f(x)} \right| \approx \left| \frac{x f'(x)}{f(x)} \right| \left| \frac{x - x^*}{x} \right|. \]

PROOF The linear Taylor approximation of $f(x^*)$ about $x$, for small values of $|x - x^*|$, is given by $f(x^*) \approx f(x) + f'(x)(x^* - x)$. Rearranging the terms immediately yields the result.

This leads us to the following definition.

DEFINITION 1.5 The condition number of a function $f(x)$ is
\[ \kappa_f(x) = \left| \frac{x f'(x)}{f(x)} \right|. \]

The condition number describes how large the relative error in function evaluation is with respect to the relative error in the machine representation of $x$. In other words, $\kappa_f(x)$ is a measure of the degree of sensitivity of the function at $x$.
Example 1.18
Let $f(x) = \sqrt{x}$. The condition number of $f(x)$ about $x$ is
\[ \kappa_f(x) = \left| \frac{x \cdot \frac{1}{2\sqrt{x}}}{\sqrt{x}} \right| = \frac{1}{2}. \]
This suggests that $f(x)$ is well-conditioned.
Example 1.19
Let $f(x) = \sqrt{x - 2}$. The condition number of $f(x)$ about $x$ is
\[ \kappa_f(x) = \left| \frac{x}{2(x - 2)} \right|. \]
This is not defined at $x^* = 2$. Hence, the function $f(x)$ is numerically unstable and ill-conditioned for values of $x$ close to 2.
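The condition numbers in Examples 1.18 and 1.19 can be evaluated directly from Definition 1.5. The sketch below assumes the derivative is supplied by hand; the function name `condition_number` and the sample points are illustrative.

```python
import math

def condition_number(f, dfdx, x):
    """kappa_f(x) = |x f'(x) / f(x)|, with the derivative supplied by the caller."""
    return abs(x * dfdx(x) / f(x))

# Example 1.18: f(x) = sqrt(x) has condition number 1/2 for every x > 0.
kappa_sqrt = condition_number(math.sqrt, lambda x: 0.5 / math.sqrt(x), 4.0)

# Example 1.19: f(x) = sqrt(x - 2) becomes ill-conditioned as x approaches 2.
def shifted(x):
    return math.sqrt(x - 2.0)

def d_shifted(x):
    return 0.5 / math.sqrt(x - 2.0)

kappa_near_2 = condition_number(shifted, d_shifted, 2.0001)
kappa_far = condition_number(shifted, d_shifted, 102.0)

print("sqrt(x) at x = 4          :", kappa_sqrt)
print("sqrt(x - 2) at x = 2.0001 :", kappa_near_2)
print("sqrt(x - 2) at x = 102    :", kappa_far)
```

A condition number near $10^4$ means that a relative perturbation of the size of the unit roundoff in $x$ can be magnified by about four orders of magnitude in $f(x)$.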
REMARK 1.2 If $x = f(x) = 0$, then the condition number is simply $|f'(x)|$. If $x = 0$, $f(x) \ne 0$ (or $f(x) = 0$, $x \ne 0$), then it is more useful to consider the relation between absolute errors than relative errors. The condition number then becomes $|f'(x)/f(x)|$.
REMARK 1.3 Generally, if a numerical approximation $\tilde{z}$ to a quantity $z$ is computed, the relative error is related to the number of digits after the decimal point that are correct. For example, if $z = 0.0000123453$ and $\tilde{z} = 0.00001234543$, we say that $\tilde{z}$ is correct to 5 significant digits. Expressing $z$ as $0.123453 \times 10^{-4}$ and $\tilde{z}$ as $0.123454 \times 10^{-4}$, we see that if we round $\tilde{z}$ to the nearest number with five digits in its mantissa, all of those digits are correct, whereas, if we do the same with six digits, the sixth digit is not correct. Significant digits is the more logical way to talk about accuracy in a floating point computation where we are interested in relative error, rather than "number of digits after the decimal point," which can have a different meaning. (Here, one might say that $\tilde{z}$ is correct to 9 digits after the decimal point.)
1.2.2 Practicalities and the IEEE Floating Point Standard

Prior to 1985, different machines used different word lengths and different bases, and different machines rounded, chopped, or did something else to form the internal representation $fl(x)$ for real numbers $x$. For example, IBM mainframes generally used hexadecimal arithmetic ($\beta = 16$), with 8 hexadecimal digits total (for the base, sign, and exponent) in "single precision" numbers and 16 hexadecimal digits total in "double precision" numbers. Machines such as the Univac 1108 and Honeywell Multics systems used base $\beta = 2$ and 36 binary digits (or "bits") total in single precision numbers and 72 binary digits total in double precision numbers. An unusual machine designed at Moscow State University from 1955-1965, the "Setun," even used base-3 ($\beta = 3$, or "ternary") numbers. Some computers had 32 bits total in single precision numbers and 64 bits total in double precision numbers, while some "supercomputers" (such as the Cray-1) had 64 bits total in single precision numbers and 128 bits total in double precision numbers.
Some hand-held calculators in existence today (such as some Texas Instruments calculators) can be viewed as implementing decimal (base 10, $\beta = 10$) arithmetic, say, with $L = -999$ and $U = 999$, and $t = 14$ digits in the mantissa.

Except for the Setun (the value of whose ternary digits corresponded to "positive," "negative," and "neutral" in circuit elements or switches), digital computers are mostly based on binary switches or circuit elements (that is, "on" or "off"), so the base $\beta$ is usually 2 or a power of 2. For example, the IBM hexadecimal digit could be viewed as a group of 4 binary digits.²

Older floating point implementations did not even always fit exactly into the model we have previously described. For example, if $x$ is a number in the system, then $-x$ may not have been a number in the system, or, if $x$ were a number in the system, then $1/x$ may have been too large to be representable in the system.
To promote predictability, portability, reliability, and rigorous error bounding in floating point computations, the Institute of Electrical and Electronics Engineers (IEEE) and American National Standards Institute (ANSI) published a standard for binary floating point arithmetic in 1985: IEEE/ANSI 754-1985: Standard for Binary Floating Point Arithmetic, often referenced as "IEEE-754," or simply "the IEEE standard."³ Almost all computers in existence today, including personal computers and workstations based on Intel, AMD, Motorola, etc. chips, implement most of the IEEE standard.
In this standard, $\beta = 2$, 32 bits total are used in a single precision number (an "IEEE single"), and 64 bits total are used for a double precision number ("IEEE double"). In a single precision number, 1 bit is used for the sign, 8 bits are used for the exponent, and $t = 23$ bits are used for the mantissa. In double precision numbers, 1 bit is used for the sign, 11 bits are used for the exponent, and 52 bits are used for the mantissa. Thus, for single precision numbers, the exponent is between 0 and $(11111111)_2 = 255$, and 127 is subtracted from this, to get an exponent between $-127$ and $128$. In IEEE numbers, the minimum and maximum exponent are used to denote special symbols (such as infinity and "unnormalized" numbers), so the exponent in single precision represents magnitudes between $2^{-126} \approx 10^{-38}$ and $2^{127} \approx 10^{38}$. The mantissa for single precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{23} 2^{-i} = 2(1 - 2^{-24}) \approx 2$. Similarly, the exponent for double precision numbers is effectively between $2^{-1022} \approx 10^{-308}$ and $2^{1023} \approx 10^{308}$, while the mantissa for double precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{52} 2^{-i} \approx 2$.
Summarizing, the parameters for IEEE arithmetic appear in Table 1.1.

TABLE 1.1: Parameters for IEEE arithmetic
precision   β   L       U      t
single      2   -126    127    24
double      2   -1022   1023   53

In many numerical computations, such as solving the large linear systems arising from partial differential equation models, more digits or a larger exponent range is required than is available with IEEE single precision. For this reason, many numerical analysts at present have adopted IEEE double precision as the default precision. For example, underlying computations in the popular computational environment matlab are done in IEEE double precision.

² An exception is in some systems for business calculations, where base 10 is implemented.
³ An update to the 1985 standard was made in 2008. This update gives clarifications of certain ambiguous points, provides certain extensions, and specifies a standard for decimal arithmetic.
IEEE arithmetic provides four ways of defining $fl(x)$, that is, four "rounding modes," namely, "round down," "round up," "round to nearest," and "round to zero." The four elementary operations $+$, $-$, $\times$, and $\div$ must be such that $fl(x \odot y)$ is implemented for all four rounding modes, for $\odot \in \{-, +, \times, \div, \sqrt{\cdot}\}$.

The default mode (if the rounding mode is not explicitly set) is normally "round to nearest," to give an approximation after a long string of computations that is hopefully near the exact value. If the mode is set to "round down" and a string of computations is done, then the result is less than or equal to the exact result. Similarly, if the mode is set to "round up," then the result of a string of computations is greater than or equal to the exact result. In this way, mathematically rigorous bounds on an exact result can be obtained. (This technique must be used astutely, since naive use could result in bounds that are too large to be meaningful.)
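The bounding idea can be illustrated by running the same computation twice, once with round down and once with round up. The sketch below is only an emulation: it assumes Python's `decimal` module with 4 digits rather than hardware rounding modes, the function name is hypothetical, and the simple "round everything the same way" recipe is valid here only because all terms are positive and only additions and divisions of positive quantities occur.

```python
from decimal import Decimal, Context, ROUND_FLOOR, ROUND_CEILING
from fractions import Fraction

DOWN = Context(prec=4, rounding=ROUND_FLOOR)     # "round down"
UP = Context(prec=4, rounding=ROUND_CEILING)     # "round up"

def harmonic_bounds(n):
    """Return (lower, upper) bounds on 1 + 1/2 + ... + 1/n in 4-digit arithmetic."""
    lo = Decimal(0)
    hi = Decimal(0)
    for k in range(1, n + 1):
        lo = DOWN.add(lo, DOWN.divide(Decimal(1), Decimal(k)))
        hi = UP.add(hi, UP.divide(Decimal(1), Decimal(k)))
    return lo, hi

lo, hi = harmonic_bounds(20)
exact = sum(Fraction(1, k) for k in range(1, 21))   # exact rational value

print("lower bound:", lo)
print("exact value:", float(exact))
print("upper bound:", hi)
```

With subtractions or sign changes, the lower bound of a difference needs the upper bound of the subtrahend, which is what interval arithmetic organizes systematically.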
Several parameters more directly related to numerical computations than $L$, $U$, and $t$ are associated with any floating point number system. These are

HUGE: the largest representable number in the floating point system;

TINY: the smallest positive representable number in the floating point system;

$\epsilon_m$: the machine epsilon, the smallest positive number which, when added to 1, gives something other than 1 when using the rounding mode "round to the nearest."

These so-called "machine constants" appear in Table 1.2 for the IEEE single and IEEE double precision number systems.

TABLE 1.2: Machine constants for IEEE arithmetic
Precision   HUGE                                      TINY                                      $\epsilon_m$
single      $\approx 2^{128} \approx 3.40 \cdot 10^{38}$      $2^{-126} \approx 1.18 \cdot 10^{-38}$      $2^{-24} + 2^{-45} \approx 5.96 \cdot 10^{-8}$
double      $\approx 2^{1024} \approx 1.79 \cdot 10^{308}$    $2^{-1022} \approx 2.23 \cdot 10^{-308}$    $2^{-53} + 2^{-105} \approx 1.11 \cdot 10^{-16}$

For IEEE arithmetic, $1/\text{TINY} < \text{HUGE}$, but $1/\text{HUGE} < \text{TINY}$. This brings up the question of what happens when the result of a computation has absolute value less than the smallest number representable in the system, or has absolute value greater than the largest number representable in the system. In the first case, an underflow occurs, while, in the second case, an overflow occurs. In floating point computations, it is usually (but not always) reasonable to replace the result of an underflow by 0, but it is usually more problematical when an overflow occurs. Many systems prior to the IEEE standard replaced an underflow by 0, but stopped when an overflow occurred.
The IEEE standard specifies representations for special numbers $\infty$, $-\infty$, $+0$, $-0$, and NaN, where the latter represents "not a number." The standard specifies that computations do not stop when an overflow or underflow occurs, or when quantities such as $\sqrt{-1}$, $1/0$, $-1/0$, etc. are encountered (although many programming languages by default or optionally do stop). For example, the result of an overflow is set to $\infty$, whereas the result of $\sqrt{-1}$ is set to NaN, and computation continues. The standard also specifies "gradual underflow," that is, setting the result to a "denormalized" number, or a number in the floating point format whose first digit in the mantissa is equal to 0. Computation rules for these special numbers, such as NaN $\times$ any number = NaN, $\infty \times$ any positive normalized number = $\infty$, allow such "nonstop" arithmetic.

Although the IEEE nonstop arithmetic is useful in many contexts, the numerical analyst should be aware of it and be cautious in interpreting results. In particular, algorithms may not behave as expected if many intermediate results contain $\infty$ or NaN, and the accuracy is less than expected when denormalized numbers are used. In fact, many programming languages, by default or with a controllable option, stop if $\infty$ or NaN occurs, but implement IEEE nonstop arithmetic with an option.
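A few of these rules can be observed from Python, whose `float` type is an IEEE double on essentially all current platforms (an assumption of this sketch). Python is also an example of a language that continues for some exceptional operations but stops for others, as the last lines show.

```python
import math

inf = float("inf")
nan = float("nan")

# Overflow in a multiplication yields infinity, and computation continues.
overflowed = 1.0e308 * 10.0
print("1e308 * 10 =", overflowed)

# Rules for the special numbers.
print("inf * 2    =", inf * 2.0)
print("inf - inf  =", inf - inf)      # not a number
print("nan == nan :", nan == nan)     # NaN compares unequal even to itself

# Gradual underflow: a result below TINY becomes a denormalized number.
tiny_normal = 2.0 ** -1022
subnormal = tiny_normal / 4.0
print("TINY / 4   =", subnormal)

# Python stops (raises an exception) for 1/0 instead of returning infinity.
try:
    1.0 / 0.0
except ZeroDivisionError as err:
    print("1.0 / 0.0 raised:", err)
```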
Example 1.20
IEEE double precision floating point arithmetic underlies most computations in matlab. (This is true even if only four decimal digits are displayed.) One obtains the machine epsilon with the function eps, one obtains TINY with the function realmin, and one obtains HUGE with the function realmax. Observe the following matlab dialog:

>> epsm = eps(1d0)
epsm = 2.2204e-016
>> TINY = realmin
TINY = 2.2251e-308
>> HUGE = realmax
HUGE = 1.7977e+308
>> 1/TINY
ans = 4.4942e+307
>> 1/HUGE
ans = 5.5627e-309
>> HUGE^2
ans = Inf
>> TINY^2
ans = 0
>> new_val = 1+epsm
new_val = 1.0000
>> new_val - 1
ans = 2.2204e-016
>> too_small = epsm/2
too_small = 1.1102e-016
>> not_new = 1+too_small
not_new = 1
>> not_new - 1
ans = 0
>>
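For readers without matlab at hand, the same experiment can be repeated in Python, assuming (as is the case on essentially all current platforms) that Python's `float` is an IEEE double. Here `sys.float_info.epsilon` is the spacing between 1 and the next larger double, the same quantity matlab's `eps` returns; the variable names are illustrative.

```python
import sys

epsm = sys.float_info.epsilon   # spacing between 1 and the next double
tiny = sys.float_info.min       # smallest positive normalized double
huge = sys.float_info.max       # largest finite double

print("epsm   =", epsm)
print("TINY   =", tiny)
print("HUGE   =", huge)
print("1/TINY =", 1.0 / tiny)
print("1/HUGE =", 1.0 / huge)             # denormalized, below TINY
print("HUGE*HUGE =", huge * huge)         # overflows to inf
print("TINY*TINY =", tiny * tiny)         # underflows to 0
print("(1 + epsm) - 1   =", (1.0 + epsm) - 1.0)
print("(1 + epsm/2) - 1 =", (1.0 + epsm / 2.0) - 1.0)
```

The products are written as multiplications because Python's `**` operator raises an exception on overflow instead of returning infinity.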
Example 1.21
(Illustration of underflow and overflow) Suppose, for the purposes of illustration, we have a system with $\beta = 10$, $t = 2$, and one digit in the exponent, so that the positive numbers in the system range from $0.10 \times 10^{-9}$ to $0.99 \times 10^9$, and suppose we wish to compute $N = \sqrt{x_1^2 + x_2^2}$, where $x_1 = x_2 = 10^6$. Then both $x_1$ and $x_2$ are exactly represented in the system, and the nearest floating point number in the system to $N$ is $0.14 \times 10^7$, well within range. However, $x_1^2 = 10^{12}$, larger than the maximum floating point number in the system. In older systems, an overflow usually would result in stopping the computation, while in IEEE arithmetic, the result would be assigned the symbol "Infinity." The result of adding "Infinity" to "Infinity" then taking the square root would be "Infinity," so that $N$ would be assigned "Infinity." Similarly, if $x_1 = x_2 = 10^{-6}$, then $x_1^2 = 10^{-12}$, smaller than the smallest representable machine number, causing an "underflow." On older systems, the result is usually set to 0. On IEEE systems, if "gradual underflow" is switched on, the result either becomes a denormalized number, with less than full accuracy, or is set to 0; without gradual underflow on IEEE systems, the result is set to 0. When the result is set to 0, a value of 0 is stored in $N$, whereas the closest floating point number in the system is $0.14 \times 10^{-5}$, well within range. To avoid this type of catastrophic underflow and overflow in the computation of $N$, we may use the following scheme.

1. $s \leftarrow \max\{|x_1|, |x_2|\}$.

2. $\eta_1 \leftarrow x_1/s$; $\eta_2 \leftarrow x_2/s$.

3. $N \leftarrow s\sqrt{\eta_1^2 + \eta_2^2}$.
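The three-step scheme carries over directly to IEEE double precision, where the same failure occurs for magnitudes near $10^{\pm 200}$. The sketch below uses illustrative function names, and the guard for $s = 0$ is an addition not spelled out in the scheme.

```python
import math

def norm2_naive(x1, x2):
    """sqrt(x1^2 + x2^2) computed directly; the squares may over- or underflow."""
    return math.sqrt(x1 * x1 + x2 * x2)

def norm2_scaled(x1, x2):
    """The scaled scheme: divide by the larger magnitude before squaring."""
    s = max(abs(x1), abs(x2))
    if s == 0.0:                 # both inputs are zero; avoid dividing by zero
        return 0.0
    eta1, eta2 = x1 / s, x2 / s
    return s * math.sqrt(eta1 * eta1 + eta2 * eta2)

for x in (1.0e200, 1.0e-200):
    print(f"x1 = x2 = {x:g}: naive = {norm2_naive(x, x)}, "
          f"scaled = {norm2_scaled(x, x)}")
```

After scaling, the larger of $|\eta_1|$ and $|\eta_2|$ equals 1, so the sum of squares lies between 1 and 2 and can neither overflow nor underflow to 0.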
1.2.2.1 Input and Output

For examining the output to large numerical computations arising from mathematical models, plots, graphs, and movies comprised of such plots and graphs are often preferred over tables of values. However, to develop such models and study numerical algorithms, it is necessary to examine individual numbers. Because humans are trained to comprehend decimal numbers more easily than binary numbers, the binary format used in the machine is usually converted to a decimal format for display or printing. In many programming languages and environments (such as all versions of Fortran, C, C++, and in matlab), the format is of a form similar to $\pm d_1.d_2 d_3 \ldots d_m \mathrm{e} {\pm} \delta_1 \delta_2 \delta_3$, or $\pm d_1.d_2 d_3 \ldots d_m \mathrm{E} {\pm} \delta_1 \delta_2 \delta_3$, where the "e" or "E" denotes the "exponent" of 10. For example, -1.00e+003 denotes $-1 \times 10^3 = -1000$. Numbers are usually also input either in a standard decimal form (such as 0.001), or in this exponential format (such as 1.0e-3). (This notation originates from the earliest computers, where the only output was a printer, and the printer could only print numerical digits and the 26 upper case letters in the Roman alphabet.)

Thus, for input, a decimal fraction needs to be converted to a binary floating point number, while, for output, a binary floating point number needs to be converted to a decimal fraction. This conversion necessarily is inexact. For example, the exact decimal fraction 0.1 converts to the infinitely repeating binary expansion $(0.0\overline{0011})_2$, which needs to be rounded into the binary floating point system. The IEEE 754 standard specifies that the result of a decimal to binary conversion, within a specified range of input formats, be the nearest floating point number to the exact result, over a specified range, and that, within a specified range of formats, a binary to decimal conversion be the nearest number in the specified format (which depends on the number $m$ of decimal digits requested to be printed).

Thus, the number that one sees as output is usually not exactly the number that is represented in the computer. Furthermore, while the floating point operations on binary numbers are usually implemented in hardware or "firmware" independently of the software system, the decimal to binary and binary to decimal conversions are usually implemented separately as part of the programming language (such as Fortran, C, C++, Java, etc.) or software system (such as matlab). The individual standards for these languages, if there are any, may not specify accuracy for such conversions, and the languages sometimes do not conform to the IEEE standard. That is, the number that one sees printed may not even be the closest number in that format to the actual number.
This inexactness in conversion usually does not cause a problem, but may cause much confusion in certain instances. In those instances (such as in "debugging," or finding programming blunders), one may need to examine the binary numbers directly. One way of doing this is in an "octal," or base-8 format, in which each digit (between 0 and 7) is interpreted as a group of three binary digits, or in hexadecimal format (where the digits are 0-9, A, B, C, D, E, F), in which each digit corresponds to a group of four binary digits.
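The conversion issue is easy to observe for the decimal fraction 0.1. The sketch below assumes Python floats are IEEE doubles; it prints the exact decimal value of the stored binary number and its hexadecimal form, in which the repeating bit pattern 0011 of the binary expansion shows up as a run of the hexadecimal digit 9 in the mantissa.

```python
from decimal import Decimal

# Exact decimal value of the double nearest to 0.1 (no rounding in Decimal(float)).
stored = Decimal(0.1)
print("stored value of 0.1:", stored)

# Hexadecimal view of the same double: mantissa digits and a power-of-2 exponent.
print("hexadecimal form   :", (0.1).hex())

# What is printed depends on how many decimal digits are requested.
print("3 digits :", f"{0.1:.3f}")
print("20 digits:", f"{0.1:.20f}")

# A visible consequence of the inexact conversions.
print("0.1 + 0.2 == 0.3 ?", 0.1 + 0.2 == 0.3)
```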
1.2.2.2 Standard Functions

To enable accurate computation of elementary functions such as sin, cos, and exp, IEEE 754 specifies that a "long" 80-bit register (with "guard digits") be available for intermediate computations. Furthermore, IEEE 754-2008, an official update to IEEE 754-1985, provides a list of functions it recommends be implemented, and specifies accuracy requirements (in terms of correct rounding), for those functions a programming language elects to implement.

REMARK 1.4 Alternative number systems, such as variable precision arithmetic, multiple precision arithmetic, rational arithmetic, and combinations of approximate and symbolic arithmetic have been investigated and implemented. These have various advantages over the traditional floating point arithmetic we have been discussing, but also have disadvantages, and usually require more time, more circuitry, or both. Eventually, with the advance of computer hardware and better understanding of these alternative systems, their use may become more ubiquitous. However, for the foreseeable future, traditional floating point number systems will be the primary tool in numerical computations.
1.3 Interval Computations

Interval computations are useful for two main purposes:

• to use floating point computations to compute mathematically rigorous bounds on an exact result (and hence to rigorously bound roundoff error);

• to use floating point computations to compute mathematically rigorous bounds on the ranges of functions over boxes.

In complicated traditional floating point algorithms, naive arrangement of interval computations usually gives bounds that are too wide to be of practical use. For this reason, interval computations have been ignored by many. However, used cleverly and where appropriate, interval computations are powerful, and provide rigor and validation when other techniques cannot.

Interval computations are based on interval arithmetic.
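1.3.1 Interval Arithmetic

In interval arithmetic, we define operations on intervals, which can be considered as ordered pairs of real numbers. We can think of each interval as representing the range of possible values of a quantity. The result of an operation is then an interval that represents the range of possible results of the operation, as the first argument ranges over all points in the first interval and the second argument ranges over all values in the second interval. To state this symbolically, let $\boldsymbol{x} = [\underline{x}, \overline{x}]$ and $\boldsymbol{y} = [\underline{y}, \overline{y}]$, and define the four elementary operations by
\[ \boldsymbol{x} \odot \boldsymbol{y} = \{ x \odot y \mid x \in \boldsymbol{x} \text{ and } y \in \boldsymbol{y} \} \quad \text{for } \odot \in \{+, -, \times, \div\}. \tag{1.1} \]
Interval arithmetic's usefulness derives from the fact that the mathematical characterization in Equation (1.1) is equivalent to the following operational definitions.
\[ \begin{aligned}
\boldsymbol{x} + \boldsymbol{y} &= [\underline{x} + \underline{y}, \; \overline{x} + \overline{y}], \\
\boldsymbol{x} - \boldsymbol{y} &= [\underline{x} - \overline{y}, \; \overline{x} - \underline{y}], \\
\boldsymbol{x} \times \boldsymbol{y} &= [\min\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\}, \; \max\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\}], \\
\frac{1}{\boldsymbol{x}} &= \Bigl[\frac{1}{\overline{x}}, \frac{1}{\underline{x}}\Bigr] \quad \text{if } \underline{x} > 0 \text{ or } \overline{x} < 0, \\
\boldsymbol{x} \div \boldsymbol{y} &= \boldsymbol{x} \times \frac{1}{\boldsymbol{y}}.
\end{aligned} \tag{1.2} \]
The ranges of the four elementary interval arithmetic operations are exactly the ranges of the corresponding real operations, but, if such operations are composed, bounds on the ranges of real functions can be obtained. For example, if
\[ f(x) = (x + 1)(x - 1), \tag{1.3} \]
then
\[ f([-2, 2]) = \bigl([-2, 2] + 1\bigr)\bigl([-2, 2] - 1\bigr) = [-1, 3][-3, 1] = [-9, 3], \]
which contains the exact range $[-1, 3]$.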
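The operational definitions (1.2) translate almost line by line into code. The following is a minimal sketch, not a rigorous implementation: the class name `Interval` and its methods are illustrative, and end points are computed in ordinary arithmetic without outward rounding, which is harmless for the small integer end points used here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float   # lower end point
    hi: float   # upper end point

    @staticmethod
    def _lift(v):
        """Treat a plain number v as the degenerate interval [v, v]."""
        return v if isinstance(v, Interval) else Interval(v, v)

    def __add__(self, other):
        o = Interval._lift(other)
        return Interval(self.lo + o.lo, self.hi + o.hi)

    def __sub__(self, other):
        o = Interval._lift(other)
        return Interval(self.lo - o.hi, self.hi - o.lo)

    def __mul__(self, other):
        o = Interval._lift(other)
        p = (self.lo * o.lo, self.lo * o.hi, self.hi * o.lo, self.hi * o.hi)
        return Interval(min(p), max(p))

    def reciprocal(self):
        if not (self.lo > 0 or self.hi < 0):
            raise ZeroDivisionError("interval contains 0")
        return Interval(1 / self.hi, 1 / self.lo)

    def __truediv__(self, other):
        return self * Interval._lift(other).reciprocal()

x = Interval(-2, 2)
print((x + 1) * (x - 1))   # interval value of f in the form (1.3)
```

The printed enclosure is wider than the exact range of $f$, because the two occurrences of $x$ are treated as if they varied independently.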
REMARK 1.5 In some definitions of interval arithmetic, division by intervals containing 0 is defined, consistent with (1.1). For example,
\[ \frac{[1, 2]}{[-3, 4]} = \Bigl[-\infty, -\frac{1}{3}\Bigr] \cup \Bigl[\frac{1}{4}, \infty\Bigr] = \mathbb{R}^* \setminus \Bigl(-\frac{1}{3}, \frac{1}{4}\Bigr), \]
where $\mathbb{R}^*$ is the extended real number system,⁴ consisting of the real numbers with the two additional numbers $-\infty$ and $\infty$. This extended interval arithmetic⁵ was originally invented by William Kahan⁶ for computations with continued fractions, but has wider use than that. Although a closed system can be defined for the sets arising from this extended arithmetic, typically, the complements of intervals (i.e., the unions of two semi-infinite intervals) are immediately intersected with intervals, to obtain zero, one, or two intervals. Interval arithmetic can then proceed using (1.2).

⁴ also known as the two-point compactification of the real numbers.
The power of interval arithmetic lies in its implementation on computers. In particular, outwardly rounded interval arithmetic allows rigorous enclosures for the ranges of operations and functions. This makes a qualitative difference in scientific computations, since the results are now intervals in which the exact result must lie. It also enables use of floating point computations for automated theorem proving.

Outward rounding can be implemented on any machine that has downward rounding and upward rounding, such as any machine that complies with the IEEE 754 standard. For example, take $\boldsymbol{x} + \boldsymbol{y} = [\underline{x} + \underline{y}, \overline{x} + \overline{y}]$. If $\underline{x} + \underline{y}$ is computed with downward rounding, and $\overline{x} + \overline{y}$ is computed with upward rounding, then the resulting interval $\boldsymbol{z} = [\underline{z}, \overline{z}]$ that is represented in the machine must contain the exact range of $x + y$ for $x \in \boldsymbol{x}$ and $y \in \boldsymbol{y}$. We call the expansion of the interval from rounding the lower end point down and the upper end point up roundout error.

Interval arithmetic is only subdistributive. That is, if $\boldsymbol{x}$, $\boldsymbol{y}$, and $\boldsymbol{z}$ are intervals, then
\[ \boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) \subseteq \boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z}, \quad \text{but} \quad \boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) \ne \boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z} \text{ in general.} \tag{1.4} \]
As a result, algebraic expressions that would be equivalent if real values are substituted for the variables are not equivalent if interval values are used. For example, instead of writing $(x - 1)(x + 1)$ for $f(x)$ in (1.3), suppose we write
\[ f(x) = x^2 - 1, \tag{1.5} \]
and suppose we provide a routine that computes an enclosure for the range of $x^2$ that is the exact range to within roundoff error. Such a routine could be as follows:
ALGORITHM 1.1
(Computing an interval whose end points are machine numbers and which encloses the range of $x^2$.)

INPUT: $\boldsymbol{x} = [\underline{x}, \overline{x}]$.

OUTPUT: a machine-representable interval that contains the range of $x^2$ over $\boldsymbol{x}$.

IF $\underline{x} \ge 0$ THEN
    RETURN $[\underline{x}^2, \overline{x}^2]$, where $\underline{x}^2$ is computed with downward rounding and $\overline{x}^2$ is computed with upward rounding.
ELSE IF $\overline{x} \le 0$ THEN
    RETURN $[\overline{x}^2, \underline{x}^2]$, where $\overline{x}^2$ is computed with downward rounding and $\underline{x}^2$ is computed with upward rounding.
ELSE
    1. Compute $\underline{x}^2$ and $\overline{x}^2$ with both downward and upward rounding; that is, compute $\underline{x}^2_l$ and $\underline{x}^2_u$ such that $\underline{x}^2_l$ and $\underline{x}^2_u$ are machine representable numbers and $\underline{x}^2 \in [\underline{x}^2_l, \underline{x}^2_u]$, and compute $\overline{x}^2_l$ and $\overline{x}^2_u$ such that $\overline{x}^2_l$ and $\overline{x}^2_u$ are machine representable numbers and $\overline{x}^2 \in [\overline{x}^2_l, \overline{x}^2_u]$.
    2. RETURN $[0, \max\{\underline{x}^2_u, \overline{x}^2_u\}]$.
END IF

END ALGORITHM 1.1

⁵ There are small differences in current definitions of extended interval arithmetic. For example, in some systems, $-\infty$ and $\infty$ are not considered numbers, but just descriptive symbols. In those systems, $[1, 2]/[-3, 4] = (-\infty, -1/3] \cup [1/4, \infty) = \mathbb{R} \setminus (-1/3, 1/4)$. See [31] for a theoretical analysis of extended arithmetic.
⁶ who also was a major contributor to the IEEE 754 standard.
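Algorithm 1.1 can be sketched with emulated directed rounding. The code below assumes Python's `decimal` module with 4-digit contexts in place of hardware rounding modes, takes the end points as numbers already representable with 4 digits, and uses the illustrative name `interval_square`. In the mixed-sign branch only the upward-rounded squares are formed, since the lower end point is exactly 0.

```python
from decimal import Decimal, Context, ROUND_FLOOR, ROUND_CEILING

DOWN = Context(prec=4, rounding=ROUND_FLOOR)     # downward rounding
UP = Context(prec=4, rounding=ROUND_CEILING)     # upward rounding

def interval_square(lo, hi):
    """Enclose the range of x^2 over [lo, hi] by a 4-digit machine interval."""
    lo, hi = Decimal(lo), Decimal(hi)
    if lo >= 0:
        return DOWN.multiply(lo, lo), UP.multiply(hi, hi)
    if hi <= 0:
        return DOWN.multiply(hi, hi), UP.multiply(lo, lo)
    # 0 lies inside the interval, so the minimum of x^2 is exactly 0.
    return Decimal(0), max(UP.multiply(lo, lo), UP.multiply(hi, hi))

# f(x) = x^2 - 1 over [-2, 2], with the subtraction also rounded outward.
sq_lo, sq_hi = interval_square(-2, 2)
f_lo = DOWN.subtract(sq_lo, Decimal(1))
f_hi = UP.subtract(sq_hi, Decimal(1))
print("x^2 over [-2, 2]    :", (sq_lo, sq_hi))
print("x^2 - 1 over [-2, 2]:", (f_lo, f_hi))

# An input whose squares are not representable shows the outward rounding.
print("x^2 over [1.234, 1.235]:", interval_square("1.234", "1.235"))
```

Treating $x^2$ as a single operation, rather than as the product $\boldsymbol{x} \times \boldsymbol{x}$, is what keeps the lower end point at 0 for intervals containing 0.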
With Algorithm 1.1 and rewriting $f(x)$ from (1.3) as in (1.5), we obtain
\[ f([-2, 2]) = [-2, 2]^2 - 1 = [0, 4] - 1 = [-1, 3], \]
which, in this case, is equal to the exact range of $f$ over $[-2, 2]$.

In fact, this illustrates a general principle: If each variable in the expression occurs only once, then interval arithmetic gives the exact range, to within roundout error. We state this formally as

THEOREM 1.8
(Fundamental theorem of interval arithmetic.) Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with interval arithmetic. The algebraic expression or computer program can contain the four elementary operations and operations such as $x^n$, $\sin(x)$, $\exp(x)$, and $\log(x)$, etc., as long as the interval values of these functions contain their range over the input intervals. Then

1. The interval value $f(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ contains the range of $f$ over the interval vector (or box) $(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$.
2 If the single functions (the elementary operations and functions xn etc)
have interval values that represent their exact ranges and if each variable xi 1 le i le n occurs only once in the expression for f then the values of f obtained by interval arithmetic represent the exact ranges of f over the input intervals
If the expression for $f$ contains one or more variables more than once, then overestimation of the range can occur due to interval dependency. For example, when we evaluate our example function $f([-2, 2])$ according to (1.3), the first factor, $[-1, 3]$, is the exact range of $x + 1$ for $x \in [-2, 2]$, while the second factor, $[-3, 1]$, is the exact range of $x - 1$ for $x \in [-2, 2]$. Thus, $[-9, 3]$ is the exact range of $f(x_1, x_2) = (x_1 + 1)(x_2 - 1)$ for $x_1$ and $x_2$ independent, $x_1 \in [-2, 2]$, $x_2 \in [-2, 2]$.
We now present some definitions and theorems to clarify the practical consequences of interval dependency.
DEFINITION 1.6 An expression for $f(x_1, \ldots, x_n)$ which is written so that each variable occurs only once is called a single use expression, or SUE.
Fortunately, we do not need to transform every expression into a single use expression for interval computations to be of value. In particular, the interval dependency becomes less as the widths of the input intervals become smaller. The following formal definition will help us to describe this precisely.
DEFINITION 1.7 Suppose an interval evaluation $f(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ gives $[a, b]$ as a result interval, but the exact range $\{f(x_1, \ldots, x_n) : x_i \in \boldsymbol{x}_i, 1 \le i \le n\}$ is $[c, d] \subseteq [a, b]$. We define the excess width $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ in the interval evaluation $f(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ by $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = (c - a) + (b - d)$.
For example, the excess width in evaluating $f(x)$ represented as $(x + 1)(x - 1)$ over $\boldsymbol{x} = [-2, 2]$ is $(-1 - (-9)) + (3 - 3) = 8$. In general, we have
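THEOREM 1.9
Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with interval arithmetic, as in Theorem 1.8, to obtain an interval enclosure $f(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ to the range of $f$ for $x_i \in \boldsymbol{x}_i$, $1 \le i \le n$.
Then, if $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ is as in Definition 1.7, we have
$$E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = O\left( \max_{1 \le i \le n} w(\boldsymbol{x}_i) \right),$$
where $w(\boldsymbol{x})$ denotes the width of the interval $\boldsymbol{x}$.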
That is, the overestimation becomes less as the uncertainty in the arguments to the function becomes smaller.
Interval evaluations as in Theorem 1.9 are termed first-order interval extensions. It is not difficult to obtain second-order extensions, where required. (See Exercise below.)
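The excess width of Definition 1.7 and the first-order behavior in Theorem 1.9 can be checked numerically for the running example $(x + 1)(x - 1)$. The following is a minimal sketch that ignores rounding (an assumption; a rigorous version would round outward), with the hypothetical helper imul standing in for an interval product. The shrinking intervals $[-w/2, w/2]$ are our own choice for illustration; over them the exact range of $x^2 - 1$ is $[-1, w^2/4 - 1]$.

```matlab
% Naive interval product [a1,a2]*[b1,b2]; rounding is ignored in this sketch.
imul = @(a, b) [min([a(1)*b(1), a(1)*b(2), a(2)*b(1), a(2)*b(2)]), ...
                max([a(1)*b(1), a(1)*b(2), a(2)*b(1), a(2)*b(2)])];

% x = [-2,2]: the exact range of x^2 - 1 is [-1,3].
x = [-2, 2];
naive = imul(x + 1, x - 1);                 % x occurs twice in (x+1)(x-1)
E = (-1 - naive(1)) + (naive(2) - 3);       % excess width, Definition 1.7

% Shrinking intervals [-w/2, w/2]: exact range is [-1, w^2/4 - 1].
w = [1, 0.1, 0.01];
ratio = zeros(size(w));
for k = 1:numel(w)
    xk = [-w(k)/2, w(k)/2];
    nk = imul(xk + 1, xk - 1);
    Ek = (-1 - nk(1)) + (nk(2) - (w(k)^2/4 - 1));
    ratio(k) = Ek / w(k);                   % stays bounded, so E = O(w)
end
disp(naive), disp(E), disp(ratio)
```

The ratio of excess width to interval width stays bounded as the width shrinks, which is what the $O(\cdot)$ statement in Theorem 1.9 asserts.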
1.3.2 Application of Interval Arithmetic: Examples
We give one such example here.
Example 1.22
Using 4-digit decimal floating point arithmetic, compute an interval enclosure for the first two digits of $e$, and prove that these two digits are correct.
Solution: The fifth degree Taylor polynomial representation for $e$ is
$$e = 1 + 1 + \frac{1}{2!} + \frac{1}{3!} + \frac{1}{4!} + \frac{1}{5!} + \frac{1}{6!} e^{\xi}$$
for some $\xi \in [0, 1]$. If we assume we know $e < 3$ and we assume we know $e^x$ is an increasing function of $x$, then the error term is bounded by
$$\left| \frac{1}{6!} e^{\xi} \right| \le \frac{3}{6!} < 0.005,$$
so this fifth-degree polynomial representation should be adequate. We will evaluate each term with interval arithmetic, and we will replace $e^{\xi}$ with $[1, 3]$. We obtain the following computation:
[1.000, 1.000] + [1.000, 1.000] → [2.000, 2.000]
[1.000, 1.000] / [2.000, 2.000] → [0.5000, 0.5000]
[2.000, 2.000] + [0.5000, 0.5000] → [2.500, 2.500]
[1.000, 1.000] / [6.000, 6.000] → [0.1666, 0.1667]
[2.500, 2.500] + [0.1666, 0.1667] → [2.666, 2.667]
[1.000, 1.000] / [24.00, 24.00] → [0.04166, 0.04167]
[2.666, 2.667] + [0.04166, 0.04167] → [2.707, 2.709]
[1.000, 1.000] / [120.0, 120.0] → [0.008333, 0.008334]
[2.707, 2.709] + [0.008333, 0.008334] → [2.715, 2.718]
[1.000, 1.000] / [720.0, 720.0] → [0.001388, 0.001389]
[0.001388, 0.001389] × [1, 3] → [0.001388, 0.004167]
[2.715, 2.718] + [0.001388, 0.004167] → [2.716, 2.723]
Since we used outward rounding in these computations, this constitutes a mathematical proof that $e \in [2.716, 2.723]$, so the first two digits of $e$ are 2.7.
Note:
1. These computations can be done automatically on a computer, as simply as evaluating the function in floating point arithmetic. We will explain some programming techniques for this in Chapter 6, Section 6.2.
2. The solution is illustrative. More sophisticated methods, such as argument reduction, would be used in practice to bound values of $e^x$ more accurately and with fewer operations.
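The hand computation of Example 1.22 can be imitated in plain matlab. The sketch below emulates 4-digit decimal arithmetic with outward rounding on top of binary double precision. The helpers rd, ru, iadd, idiv, and imul are hypothetical names, and the small tolerance tol is an assumption that guards the emulation against binary representation error, so this is a demonstration of the mechanics rather than a rigorous proof.

```matlab
% Emulated 4-digit decimal interval arithmetic (positive intervals only).
tol = 1e-6;   % ASSUMPTION: absorbs binary representation error of decimals
p10 = @(v) 10.^(3 - floor(log10(v)));             % scale v to 4 integer digits
rd  = @(v) floor(v .* p10(v) + tol) ./ p10(v);    % round down to 4 digits
ru  = @(v) ceil(v .* p10(v) - tol) ./ p10(v);     % round up to 4 digits
iadd = @(a, b) [rd(a(1) + b(1)), ru(a(2) + b(2))];
idiv = @(a, b) [rd(a(1) / b(2)), ru(a(2) / b(1))];
imul = @(a, b) [rd(a(1) * b(1)), ru(a(2) * b(2))];

one = [1, 1];
s = iadd(one, one);                  % 1 + 1
fact = 1;
for k = 2:5
    fact = fact * k;                 % 2!, 3!, 4!, 5!
    s = iadd(s, idiv(one, [fact, fact]));
end
tail = imul(idiv(one, [720, 720]), [1, 3]);   % (1/6!) * [1,3] bounds the remainder
e_encl = iadd(s, tail);
disp(e_encl)
```

The final interval agrees with the enclosure $[2.716, 2.723]$ obtained by hand.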
Proofs of the theorems, as well as greater detail, appear in various texts on interval arithmetic. A good book on interval arithmetic is R. E. Moore's classic text [27], although numerous more recent monographs and reviews are available. A World Wide Web search on the term "interval computations" will lead to some of these.
A general introduction to interval computations is [26]. That work gives not only a complete introduction, with numerous examples and explanation of pitfalls, but also provides examples with intlab, a free matlab toolbox for interval computations, and reference material for intlab. If you have matlab available, we recommend intlab for the exercises in this book involving interval computations.
1.4 Programming Environments
Modern scientific computing (with floating point numbers) is usually done with high-level "imperative" (as opposed to "functional") programming languages. Common programming environments in use for general scientific computing today are Fortran (or FORmula TRANslation), C/C++, and matlab. Fortran is the original such language, with its origins in the late 1950's. There is a large body of high-quality publicly available software in Fortran for common computations in numerical analysis and scientific computing. Such software can be found, for example, on NETLIB at
http://www.netlib.org/
Fortran has evolved over the years, becoming a modern, multi-faceted language with the Fortran 2003 standard. Throughout, the emphasis by both the standardization committee and suppliers of compilers for Fortran has been on features that simplify programming of solutions to large problems in numerical analysis and scientific computing, and features that enable high performance, especially on computers that can process vectors and matrices efficiently.
The "C" language, originally developed in conjunction with the Unix operating system, was originally meant to be a higher-level language for designing and accessing the operating system, but has become more ubiquitous since then. C++, appearing in the late 1980's, was the first widely available language$^7$ to allow the object-oriented programming paradigm. In recent years, computer science departments have favored teaching C++ over teaching Fortran, and Fortran has fallen out of favor in relative terms. However, Fortran is still favored in certain large-scale applications such as fluid dynamics (e.g. in weather prediction and similar simulations), and some courses are still offered in it in engineering schools. Nonetheless, some people still think of Fortran as the now somewhat rudimentary language known as FORTRAN 77.
Reasonably high-quality compilers for both Fortran and C/C++ are available free of charge with Linux operating systems. Fortran 2003, largely implemented in these compilers, has a standardized interface to C, so functions written in C can be called from Fortran programs, and vice versa. These compilers include interactive graphical-user-interface-oriented debuggers, such as "insight," available with the Linux operating system. Commercially available compilation and debugging systems are also available under Windows.
The matlab system has become increasingly popular over the last two decades or so. matlab (or MATrix LABoratory) began in the early 1980's as a National Science Foundation project, written by Cleve Moler in FORTRAN 66, to provide an interactive environment for computing with matrices and vectors, but has since evolved to be both an interactive environment and full-featured programming language. matlab is highly favored in courses such as this, because of the ease of programming, debugging, and general use (such as graphing), and because of the numerous toolboxes, supplied by both Mathworks (Cleve Moler's company) and others, for many basic computing tasks and applications. The main drawback to use of matlab in all scientific computations is that the language is interpretive, that is, matlab translates each line of a program to machine language each time that it executes the line. This makes complicated programs that involve nested iterations much slower (a factor of 60 or more) than comparable programs written in a compiled language such as C or Fortran. However, functions compiled from Fortran or C/C++ can be called from matlab. A common strategy has been to initially develop algorithms in matlab, then translate all or part of the program to a compilable language, as necessary for efficiency.
One perceived disadvantage of matlab is that it is proprietary. Undesirable possible consequences are that it is not free, and there is no official guarantee that it will be available forever, unchanged. However, its use has become so widespread in recent years that these concerns do not seem to be major. Several projects, including "Octave" and "Scilab," have produced free products that partially support the matlab programming language. The most widely distributed of these, "Octave," is integrated into Linux systems. However, the object-oriented features of Octave are rudimentary compared to those of matlab, and some toolboxes, such as intlab (which we will mention later), will not function with Octave.
$^7$ with others, including Fortran, to follow
Alternative systems sometimes used for scientific computing are computer algebra systems. Perhaps the most common of these are Mathematica and Maple, while a free such system under development is "SAGE." These systems admit a different way of thinking about programming, termed functional programming, in which rules are defined and available at all times to automatically simplify expressions that are presented to the system. (In contrast, in imperative programming, a sequence of commands is executed, one after the other.) Although these systems have become comprehensive, they are based in computations of a different character, rather than in the floating point computations and linear algebra involved in numerical analysis and scientific computing.
We will use matlab in this book to illustrate the concepts, techniques, and applications. With newer versions of matlab, a student can study how to use the system and make programs largely by using the matlab help system. The first place to turn will be the "Getting started" demos, which, in newer versions, are presented as videos. There are also many books devoted to use of matlab. Furthermore, we will be giving examples throughout this book.
matlab programs can be written as matlab scripts and matlab functions.
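Example 1.23
The matlab script we used to produce the table following Example 1.7 (on page 8) is:
a = 2;
x = 2;
xold = x;
err_old = 1;
for k=0:10
   k                           % display the iteration number
   x                           % display the current iterate
   err = x - sqrt(2);
   err                         % display the current error
   ratio = err/err_old^2       % ratio for quadratic convergence
   err_old = err;
   x = x/2 + 1/x;              % x_{k+1} = x_k/2 + a/(2 x_k) with a = 2
end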
Example 1.24
The matlab script we used to produce the table in Example 1.8 (on page 9) is:
format long
a = 2;
x = 2;
xold = x;
err_old = 1;
for k=0:25
   k                           % display the iteration number
   x                           % display the current iterate
   err = x - sqrt(2);
   err                         % display the current error
   ratio = err/err_old         % ratio for linear convergence
   err_old = err;
   x = x - x^2/3.5 + 2/3.5;    % iteration of Example 1.8
end
An excellent alternative textbook that focuses on matlab functions is Cleve Moler's Numerical Computing with Matlab [25]. An on-line version, along with "m" files, etc., is currently available at http://www.mathworks.com/moler/chapters.html.
1.5 Applications
The purpose of the methods and techniques in this book ultimately is to provide both accurate predictions and insight into practical problems. This includes understanding and predicting, and managing or controlling, the evolution of ecological systems and epidemics; designing and constructing durable but inexpensive bridges, buildings, roads, and water control structures; understanding chemical and physical processes; designing chemical plants and electronic components and systems; minimizing costs or maximizing delivery of products or services within companies and governments; etc. To achieve these goals, the numerical methods are a small part of the overall modeling process, which can be viewed as consisting of the following steps.
Identify the problem: This is the first step in translating an often vague situation into a mathematical problem to be solved. What questions must be answered, and how can they be quantified?
Assumptions: Which factors are to be ignored and which are important? The real world is usually significantly more complicated than mathematical models of it, and simplifications must be made, because some factors are poorly understood, because there isn't enough data to determine some minor factors, or because it is not practical to accurately solve the resulting equations unless the model is simplified. For example, the theory of relativity and variations in the acceleration of gravity, due to the fact that the earth is not exactly round and due to the fact that the density differs from point to point on the surface of the earth, in principle will affect the trajectory of a baseball as it leaves the bat. However, such effects can be ignored when we write down a model of the trajectory of the baseball. On the other hand, we need to include such effects if we are measuring the change in distance between two satellites in a tandem orbit to detect the location of mineral deposits on the surface of the earth.
Construction: In this step, we actually translate the problem into mathematical language.
Analysis: We solve the mathematical problem. Here is where the numerical techniques in this book come into play. With more complicated models, there is an interplay between the previous three steps and this solution process: We may need to simplify the process to enable practical solution. Also, presentation of the result is important here, to maximize the usefulness of the results. In the early days of scientific computing, printouts of numbers were used, but, increasingly, results are presented as two- and three-dimensional graphs and movies.
Interpretation: The numerical solution is compared to the original problem. If it does not make sense, go back and reformulate the assumptions.
Validation: Compare the model to real data. For example, in climate models, the model might be used to predict climate changes in past years, before it is used to predict future climate changes.
Note that there is an intrinsic error introduced in the modeling process (such as when certain phenomena are ignored) that is outside the scope of our study of numerical methods. Such error can only be measured indirectly through the interpretation and validation steps. In the model solution process (the "analysis" step), errors are also introduced; we have seen that such error consists of approximation error and roundoff error. In a study of numerical methods and numerical analysis, we quantify and find bounds on such errors. Although this may not be the major source of error, it is important to know. Consequences of this type of error might be that a good model is rejected, that incorrect conclusions are deduced about the process being modeled, etc. The authors of this book have personal experience with these events.
Errors in the modeling process can sometimes be quantified in the solution process. If the model depends on parameters that are not known precisely, but bounds on those parameters are known, knowledge of these bounds can sometimes be incorporated into the mathematical equations, and the set of possible solutions can sometimes be computed or bounded. One tool that sometimes works is interval arithmetic. Other tools, less mathematically definite but applicable in different situations, are statistical methods and computing solutions to the model for many different values of the parameter.
Throughout this book, we introduce applications from many areas.
Example 1.25
The formula for the net capacitance when two capacitors of values $x$ and $y$ are connected in series is
$$z = \frac{xy}{x + y}.$$
Suppose the measured values of $x$ and $y$ are $x = 1$ and $y = 2$, respectively. Estimate the range of possible values of $z$, given that the true values of $x$ and $y$ are known to be within ±10% of the measured value.
In this example, the identification, assumptions, and construction have already been done. (It is well known how capacitances in a linear electrical circuit behave.) We are asked to analyze the error in the output of the computation, due to errors in the data. We may proceed using interval arithmetic, relying on the accuracy assumptions for the measured values. In particular, these assumptions imply that $x \in [0.9, 1.1]$ and $y \in [1.8, 2.2]$. We will plug these intervals into the expression for $z$, but we first use Theorem 1.8, part (2) as a guide to rewrite the expression for $z$ so $x$ and $y$ only occur once. (We do this so we obtain sharp bounds on the range, without overestimation.) Dividing the numerator and denominator for $z$ by $xy$, we obtain
$$z = \frac{1}{\frac{1}{x} + \frac{1}{y}}.$$
We use the intlab toolbox$^8$ for matlab to evaluate $z$. We have the following dialog in matlab's command window.
>> intvalinit('DisplayInfsup')
===> Default display of intervals by infimum/supremum
>> x = intval('[0.9,1.1]')
intval x =
[ 0.8999, 1.1001]
>> y = intval('[1.8,2.2]')
intval y =
[ 1.7999, 2.2001]
>> z = 1/(1/x + 1/y)
intval z =
[ 0.5999, 0.7334]
>> format long
>> z
intval z =
[ 0.59999999999999, 0.73333333333334]
>>
$^8$ If one has matlab, intlab is available free of charge for non-commercial use from http://www.ti3.tu-harburg.de/~rump/intlab/
Thus, the capacitance must lie between 0.5999 and 0.7334.
Note that x and y are input as strings. This is to assure that roundoff errors in converting the decimal expressions 0.9, 1.1, 1.8, and 2.2 into internal binary format are taken into account. See [26] for more examples of the use of intlab.
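Without intlab, the effect of rewriting $z$ as a single use expression can still be seen in plain matlab. The sketch below ignores rounding (an assumption, so the bounds are not rigorous) and uses the fact that, for positive $x$ and $y$, $1/(1/x + 1/y)$ is increasing in both arguments, so its range is attained at the endpoints. The variable names z_sue and z_naive are ours, chosen for illustration.

```matlab
% Bounds on the series capacitance for x in [0.9,1.1], y in [1.8,2.2].
% ASSUMPTION: rounding is ignored, so these are not rigorous enclosures.
xlo = 0.9;  xhi = 1.1;
ylo = 1.8;  yhi = 2.2;

% Single use expression z = 1/(1/x + 1/y): increasing in x and in y.
z_sue = [1/(1/xlo + 1/ylo), 1/(1/xhi + 1/yhi)];

% Naive interval evaluation of z = x*y/(x + y) for positive intervals:
% numerator [xlo*ylo, xhi*yhi] divided by denominator [xlo+ylo, xhi+yhi].
z_naive = [(xlo*ylo)/(xhi + yhi), (xhi*yhi)/(xlo + ylo)];

disp(z_sue)
disp(z_naive)
```

The single use form reproduces the sharp bounds of roughly 0.6 and 0.7333, while the naive form, in which x and y each occur twice, gives a visibly wider interval.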
1.6 Exercises
1. Write down a polynomial $p(x)$ such that $|S(x) - p(x)| \le 10^{-10}$ for $-0.2 \le x \le 0.2$, where
$$S(x) = \begin{cases} \dfrac{\sin(x)}{x} & \text{if } x \ne 0, \\ 1 & \text{if } x = 0. \end{cases}$$
Note: $\mathrm{sinc}(x) = S(\pi x) = \sin(\pi x)/(\pi x)$ is the "sinc" function (well-known in signal processing, etc.).
(a) Show that your polynomial $p$ satisfies the condition $|\mathrm{sinc}(x) - p(x)| \le 10^{-10}$ for $x \in [-0.2, 0.2]$.
Hint: You can obtain polynomial approximations with error terms for $\mathrm{sinc}(x)$ by writing down Taylor polynomials and corresponding error terms for $\sin(x)$, then dividing these by $x$. This can be easier than trying to differentiate $\mathrm{sinc}(x)$. For the proof part, you can use, for example, the Taylor polynomial remainder formula or the alternating series test, and you can use interval arithmetic to obtain bounds.
(b) Plot your polynomial approximation and $\mathrm{sinc}(x)$ on the same graph,
(i) over the interval $[-0.2, 0.2]$,
(ii) over the interval $[-3, 3]$,
(iii) over the interval $[-10, 10]$.
2. Suppose $f$ has a continuous third derivative. Show that
$$\left| \frac{f(x + h) - f(x - h)}{2h} - f'(x) \right| = O(h^2).$$
3. Suppose $f$ has a continuous fourth derivative. Show that
$$\left| \frac{f(x + h) - 2f(x) + f(x - h)}{h^2} - f''(x) \right| = O(h^2).$$
4. Let $a = 0.41$, $b = 0.36$, and $c = 0.7$. Assuming a 2-digit decimal computer arithmetic with rounding, show that $\frac{a - b}{c} \ne \frac{a}{c} - \frac{b}{c}$ when using this arithmetic.
5. Write down a formula relating the unit roundoff $\delta$ of Definition 1.3 (page 14) and the machine epsilon $\epsilon_m$ defined on page 20.
6. Store and run the following matlab script. What are your results? What does the script compute? Can you say anything about the computer arithmetic underlying matlab?
eps = 1;
x = 1+eps;
while(x~=1)
   eps = eps/2;
   x = 1+eps;
end
eps = eps+(2*eps)^2
y = 1+eps;
y-1
7. Suppose, for illustration, we have a system with base $\beta = 10$, $t = 3$ decimal digits in the mantissa, and $L = -9$, $U = 9$ for the exponent. For example, $0.123 \times 10^4$, that is, 1230, is a machine number in this system. Suppose also that "round to nearest" is used in this system.
(a) What is HUGE for this system?
(b) What is TINY for this system?
(c) What is the machine epsilon $\epsilon_m$ for this system?
(d) Let $f(x) = \sin(x) + 1$.
i. Write down $fl(f(0))$ and $fl(f(0.0008))$ in normalized format for this toy system.
ii. Compute $fl(fl(f(0.0008)) - fl(f(0)))$. On the other hand, what is the nearest machine number to the exact value of $f(0.0008) - f(0)$?
iii. Compute $fl(fl(f(0.0008)) - fl(f(0))) / fl(0.0008)$. Compare this to the nearest machine number to the exact value of $(f(0.0008) - f(0))/0.0008$ and to $f'(0)$.
8. Let $f(x) = \dfrac{\ln(x + 1) - \ln(x)}{2}$.
(a) Use four-digit decimal arithmetic with rounding to evaluate $f(100{,}000)$.
(b) Use the Mean Value Theorem to approximate $f(x)$ in a form that avoids the loss of significant digits. Use this form to evaluate $f(x)$ for $x = 100{,}000$ once again.
(c) Compare the relative errors for the answers obtained in (a) and (b).
9. Compute the condition number of $f(x) = e^{\sqrt{x^2 - 1}}$, $x > 1$, and discuss any possible ill-conditioning.
10 Let f (x) = (sin(x))2 + x2 Use interval arithmetic to prove that thereare no solutions to f (x) = 0 for x isin [minus1 minus08]
The mean value theorem tells us that
$$f(2) = f(1) + f'(c)(2 - 1) = f(1) + f'(c)$$
for some $c \in (1, 2)$. Furthermore, the fact $|f'(x)| \le 2$ is equivalent to $-2 \le f'(x) \le 2$. Combining these facts gives
$$1 - 2 = -1 \le f(2) \le 1 + 2 = 3.$$
1.1.2 Big "O" Notation
We study "rates of growth" and "rates of decrease" of errors. For example, if we approximate $e^h$ by a first degree Taylor polynomial about $x = 0$, we get
$$e^h - (1 + h) = \frac{1}{2} h^2 e^{\xi},$$
where $\xi$ is some unknown quantity between 0 and $h$. Although we don't know exactly what $e^{\xi}$ is, we know that it is nearly constant (in this case, approximately 1) for $h$ near 0, so the error $e^h - (1 + h)$ is roughly proportional to $h^2$ for $h$ small. This approximate proportionality is often more important to know than the slowly-varying constant $e^{\xi}$. The big "O" and little "o" notation are used to describe and keep track of this approximate proportionality.
DEFINITION 1.1 Let $E(h)$ be an expression that depends on a small quantity $h$. We say that $E(h) = O(h^k)$ if there are an $\epsilon$ and $C$ such that
$$|E(h)| \le C |h|^k$$
for all $|h| \le \epsilon$.
The "O" denotes "order." For example, if $f(h) = O(h^2)$, we say that "$f$ exhibits order 2 convergence to 0 as $h$ tends to 0."
Example 1.5
$E(h) = e^h - h - 1$. Then $E(h) = O(h^2)$.
PROOF By Taylor's Theorem,
$$e^h = e^0 + e^0 (h - 0) + \frac{h^2}{2} e^{\xi}$$
for some $\xi$ between 0 and $h$. Thus,
$$E(h) = e^h - 1 - h \le h^2 \frac{e^1}{2} \quad \text{and} \quad E(h) \ge 0$$
for $|h| \le 1$, that is, $\epsilon = 1$ and $C = e/2$ work.
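The bound in Example 1.5 is easy to check numerically. The following short sketch (with step sizes chosen by us for illustration) evaluates $E(h)/h^2$ for decreasing $h$; the ratio should stay below $C = e/2$ and tend toward $1/2$, the value of $e^{\xi}/2$ at $\xi = 0$.

```matlab
% Numerical check that E(h) = exp(h) - h - 1 behaves like h^2.
h = [1, 0.1, 0.01, 0.001];
E = exp(h) - h - 1;
ratio = E ./ h.^2;      % bounded by e/2, approaching 1/2 as h -> 0
disp([h; ratio]')
```

Much smaller values of $h$ are deliberately avoided here, since subtracting nearly equal quantities in $e^h - h - 1$ then loses significant digits.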
Example 1.6
Show that $\left| \frac{f(x + h) - f(x)}{h} - f'(x) \right| = O(h)$ for $x, x + h \in [a, b]$, assuming that $f$ has two continuous derivatives at each point in $[a, b]$.
PROOF
$$\left| \frac{f(x + h) - f(x)}{h} - f'(x) \right| = \left| \frac{f(x) + f'(x) h + \int_x^{x+h} (x + h - t) f''(t)\,dt - f(x)}{h} - f'(x) \right|$$
$$= \frac{1}{h} \left| \int_x^{x+h} (x + h - t) f''(t)\,dt \right| \le \max_{a \le t \le b} |f''(t)| \, \frac{h}{2} = ch.$$
1.1.3 Convergence Rates
DEFINITION 1.2 Let $\{x_k\}$ be a sequence with limit $x^*$. If there are constants $C$ and $\alpha$ and an integer $N$ such that $|x_{k+1} - x^*| \le C |x_k - x^*|^{\alpha}$ for $k \ge N$, we say that the rate of convergence is of order at least $\alpha$. If $\alpha = 1$ (with $C < 1$), the rate is said to be linear. If $\alpha = 2$, the rate is said to be quadratic.
Example 1.7
A sequence sometimes learned in elementary classes for computing the square root of a number $a$ is
$$x_{k+1} = \frac{x_k}{2} + \frac{a}{2 x_k}.$$
We have
$$x_{k+1} - \sqrt{a} = \frac{x_k}{2} + \frac{a}{2 x_k} - \sqrt{a}$$
$$= x_k - \frac{x_k^2 - a}{2 x_k} - \sqrt{a}$$
$$= (x_k - \sqrt{a}) - (x_k - \sqrt{a}) \frac{x_k + \sqrt{a}}{2 x_k}$$
$$= (x_k - \sqrt{a}) \left( 1 - \frac{x_k + \sqrt{a}}{2 x_k} \right)$$
$$= (x_k - \sqrt{a}) \frac{x_k - \sqrt{a}}{2 x_k}$$
$$= \frac{1}{2 x_k} (x_k - \sqrt{a})^2$$
$$\approx \frac{1}{2 \sqrt{a}} (x_k - \sqrt{a})^2$$
for $x_k$ near $\sqrt{a}$, thus showing that the convergence rate is quadratic.
Quadratic convergence is very fast. We can think of quadratic convergence, with $C \approx 1$, as doubling the number of significant figures on each iteration. (In contrast, linear convergence with $C = 0.1$ adds one decimal digit of accuracy to the approximation on each iteration.) For example, if we use the square root computation from Example 1.7 with $a = 2$, and starting with $x_0 = 2$, we obtain the following table:
k    x_k                  x_k - sqrt(2)        (x_k - sqrt(2)) / (x_{k-1} - sqrt(2))^2
0    2                    0.5858 × 10^0        -
1    1.5                  0.8579 × 10^-1       0.2500
2    1.416666666666667    0.2453 × 10^-2       0.3333
3    1.414215686274510    0.2123 × 10^-5       0.3529
4    1.414213562374690    0.1594 × 10^-11      0.3535
5    1.414213562373095    0.2204 × 10^-15      -
In this table, the correct digits are underlined. This table illustrates that the total number of correct digits more than doubles on each iteration. In fact, the multiplying factor $C$ for the quadratic convergence appears to be approaching 0.3535. (The last error ratio is not meaningful in this sense, because only roughly 16 digits were carried in the computation.) Based on our analysis, the limiting value of $C$ should be about $1/(2\sqrt{2}) \approx 0.353553390593274$. (We explain how we computed the table at the end of this chapter.)
Example 1.8
As an example of linear convergence, consider the iteration
$$x_{k+1} = x_k - \frac{x_k^2}{3.5} + \frac{2}{3.5},$$
which converges to $\sqrt{2}$. We obtain the following table:
k     x_k                  x_k - sqrt(2)        (x_k - sqrt(2)) / (x_{k-1} - sqrt(2))
0     2                    0.5858 × 10^0        -
1     1.428571428571429    0.1436 × 10^-1       0.2451 × 10^-1
2     1.416909620991254    0.2696 × 10^-2       0.1878
3     1.414728799831946    0.5152 × 10^-3       0.1911
4     1.414312349239392    0.9879 × 10^-4       0.1917
5     1.414232514607664    0.1895 × 10^-4       0.1918
6     1.414217198786659    0.3636 × 10^-5       0.1919
7     1.414214260116949    0.6955 × 10^-6       0.1919
8     1.414213696254626    0.1339 × 10^-6       0.1919
19    1.414213562373097    0.1554 × 10^-14      -
Here, the constant $C$ in the linear convergence, to four significant digits, appears to be $0.1919 \approx 1/5$. That is, the error is reduced by approximately a factor of 5 each iteration. We can think of this as obtaining one more correct base-5 digit on each iteration.
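The observed constant 0.1919 can be compared with a prediction. Writing the iteration of Example 1.8 as $x_{k+1} = g(x_k)$ with $g(x) = x - x^2/3.5 + 2/3.5$, one has $g(x) - \sqrt{2} = (x - \sqrt{2})(1 - (x + \sqrt{2})/3.5)$, so the error ratio tends to $g'(\sqrt{2}) = 1 - 2\sqrt{2}/3.5$. The sketch below (variable names are ours) checks this numerically.

```matlab
% Compare the observed error ratio of Example 1.8 with g'(sqrt(2)).
g = @(x) x - x.^2/3.5 + 2/3.5;    % iteration function of Example 1.8
C = 1 - 2*sqrt(2)/3.5;            % predicted limiting ratio g'(sqrt(2))
x = 2;
for k = 1:8
    err_old = x - sqrt(2);
    x = g(x);
    ratio = (x - sqrt(2)) / err_old;   % observed ratio at step k
end
disp([C, ratio])
```

After eight steps the observed ratio and the predicted constant agree to several digits.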
1.2 Computer Arithmetic
In numerical solution of mathematical problems, two common types of error are:
1. Method (algorithm or truncation) error. This is the error due to approximations made in the numerical method.
2. Rounding error. This is the error made due to the finite number of digits available on a computer.
Example 1.9
By the mean value theorem for integrals (Theorem 1.2, as in Example 1.6 on page 7), if $f \in C^2[a, b]$, then
$$f'(x) = \frac{f(x + h) - f(x)}{h} - \frac{1}{h} \int_x^{x+h} f''(t)(x + h - t)\,dt$$
and
$$\left| \frac{1}{h} \int_x^{x+h} f''(t)(x + h - t)\,dt \right| \le ch.$$
Thus, $f'(x) \approx (f(x + h) - f(x))/h$, and the error is $O(h)$. We will call this the method error or truncation error, as opposed to roundoff errors due to using machine approximations.
Now consider $f(x) = \ln x$ and approximate $f'(3) \approx \frac{\ln(3 + h) - \ln 3}{h}$ for $h$ small, using a calculator having 11 digits. The following results were obtained.
h         (ln(3 + h) - ln(3)) / h     Error = 1/3 - (ln(3 + h) - ln(3)) / h = O(h)
10^-1     0.3278982                   5.44 × 10^-3
10^-2     0.332779                    5.54 × 10^-4
10^-3     0.3332778                   5.55 × 10^-5
10^-4     0.333328                    5.33 × 10^-6
10^-5     0.333330                    3.33 × 10^-6
10^-6     0.333300                    3.33 × 10^-5
10^-7     0.333                       3.33 × 10^-4
10^-8     0.33                        3.33 × 10^-3
10^-9     0.3                         3.33 × 10^-2
10^-10    0.0                         3.33 × 10^-1
One sees that, in the first four steps, the error decreases by a factor of 10 as $h$ is decreased by a factor of 10. (That is, the method error dominates.) However, starting with $h = 0.00001$, the error increases. (The error due to a finite number of digits, i.e., roundoff error, dominates.)
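The same experiment can be repeated in matlab's double precision arithmetic, which carries roughly 16 decimal digits rather than the 11 of the calculator, so the turnaround point moves to a smaller $h$. This is a minimal sketch with a range of step sizes chosen by us; the exact error values depend on the machine arithmetic and are not quoted here.

```matlab
% Forward difference approximation of f'(3) for f(x) = ln(x).
x = 3;
h = 10.^(-(1:15));
d = (log(x + h) - log(x)) ./ h;    % difference quotients
err = abs(1/3 - d);                % exact derivative is 1/3
[~, ibest] = min(err);             % index of the most accurate step size
disp([h; err]')
```

For the larger step sizes the error shrinks roughly in proportion to $h$ (method error), while for the smallest step sizes it grows again (roundoff error).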
There are two possible ways to reduce rounding error:
1. The method error can be reduced by using a more accurate method. This allows larger $h$ to be used, thus avoiding roundoff error. Consider
$$f'(x) = \frac{f(x + h) - f(x - h)}{2h} + \text{error}, \quad \text{where error is } O(h^2).$$
h        (ln(3 + h) - ln(3 - h)) / (2h)     error
0.1      0.3334568                          1.24 × 10^-4
0.01     0.3333345                          1.23 × 10^-6
0.001    0.3333333                          1.91 × 10^-8
The error decreases by a factor of 100 as $h$ is decreased by a factor of 10.
2. Rounding error can be reduced by using more digits of accuracy, such as using double precision (or multiple precision) arithmetic.
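The factor-of-100 behavior of the central difference in item 1 can be checked with a few lines of matlab. This sketch uses double precision, so the last digits differ from the 11-digit calculator results in the table.

```matlab
% Central difference approximation of f'(3) for f(x) = ln(x).
x = 3;
h = [0.1, 0.01, 0.001];
d = (log(x + h) - log(x - h)) ./ (2*h);
err = abs(d - 1/3);
factor = err(1:end-1) ./ err(2:end);   % error reduction per tenfold decrease in h
disp(factor)
```

Each tenfold decrease in $h$ reduces the error by a factor close to 100, consistent with an $O(h^2)$ method error.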
To fully understand and avoid roundoff error, we should study some details of how computers and calculators represent and work with approximate numbers.
1.2.1 Floating Point Arithmetic and Rounding Error
Let $\beta$ = a positive integer, the base of the computer system. (Usually, $\beta = 2$ (binary) or $\beta = 16$ (hexadecimal).) Suppose a number $x$ has the exact base representation
$$x = (\pm 0.\alpha_1 \alpha_2 \alpha_3 \cdots \alpha_t \alpha_{t+1} \cdots)_{\beta} \, \beta^m = \pm q \beta^m,$$
where $q$ is the mantissa, $\beta$ is the base, $m$ is the exponent, $1 \le \alpha_1 \le \beta - 1$, and $0 \le \alpha_i \le \beta - 1$ for $i > 1$.
On a computer, we are restricted to a finite set of floating-point numbers $F = F(\beta, t, L, U)$ of the form $x^* = (\pm 0.a_1 a_2 \cdots a_t)_{\beta} \, \beta^m$, where $1 \le a_1 \le \beta - 1$, $0 \le a_i \le \beta - 1$ for $2 \le i \le t$, $L \le m \le U$, and $t$ is the number of digits. (In most floating point systems, $L$ is about $-64$ to $-1000$ and $U$ is about 64 to 1000.)
Example 1.10
(binary) $\beta = 2$
$$x^* = (0.1011)_2 \, 2^3 = \left( 1 \times \frac{1}{2} + 0 \times \frac{1}{4} + 1 \times \frac{1}{8} + 1 \times \frac{1}{16} \right) \times 8 = \frac{11}{2} = 5.5 \text{ (decimal)}.$$
REMARK 1.1 Most numbers cannot be exactly represented on a computer. Consider $x = 10.1 = 1010.0001\,1001\,1001 \ldots$ ($\beta = 2$). If $L = -127$, $U = 127$, $t = 24$, and $\beta = 2$, then $x \approx x^* = (0.10100001\,1001\,1001\,1001\,1001)_2 \, 2^4$.
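Both statements can be tried out in matlab. The sketch below evaluates the binary expansion of Example 1.10 directly and then uses matlab's single type, whose mantissa has 24 binary digits, to illustrate Remark 1.1. The variable names are ours, and the use of single as a stand-in for the system with $t = 24$ is an assumption for illustration (its exponent range differs slightly from $L = -127$, $U = 127$).

```matlab
% Example 1.10: (0.1011)_2 * 2^3
bits  = [1 0 1 1];
xstar = sum(bits .* 2.^(-(1:4))) * 2^3;

% Remark 1.1: 10.1 is not exactly representable with a 24-bit mantissa.
gap = double(single(10.1)) - 10.1;    % nonzero representation error
fprintf('xstar = %g, gap = %g\n', xstar, gap);
```

The first value is exactly 5.5, while the second is a small nonzero number, the error committed when 10.1 is stored with 24 binary digits.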
Question: Given a real number $x$, how do we define a floating point number $fl(x)$ in $F$, such that $fl(x)$ is close to $x$?
On modern machines, one of the following four ways is used to approximate a real number $x$ by a machine-representable number $fl(x)$.
round down: $fl(x) = x\downarrow$, the nearest machine representable number to the real number $x$ that is less than or equal to $x$.
round up: $fl(x) = x\uparrow$, the nearest machine number to the real number $x$ that is greater than or equal to $x$.
round to nearest: $fl(x)$ is the nearest machine number to the real number $x$.
round to zero, or "chopping": $fl(x)$ is the nearest machine number to the real number $x$ that is closer to 0 than $x$. The term "chopping" is because we simply "chop" the expansion of the real number, that is, we simply ignore the digits in the expansion of $x$ beyond the $t$-th one.
The default on modern systems is usually round to nearest, although chopping is faster or requires less circuitry. Round down and round up may be used, but with care, to produce results from a string of computations that are guaranteed to be less than or greater than the exact result.
Example 1.11
$\beta = 10$, $t = 5$, $x = 0.12345666 \cdots \times 10^7$. Then
$fl(x) = 0.12345 \times 10^7$ (chopping),
$fl(x) = 0.12346 \times 10^7$ (rounded to nearest).
(In this case, round down corresponds to chopping and round up corresponds to round to nearest.)
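Chopping and rounding to nearest with $t$ decimal digits can be emulated in matlab by scaling, applying fix or round, and scaling back. The sketch below does this for the number of Example 1.11, truncated to the digits shown; it runs on top of binary double precision, so it is an emulation of the decimal system rather than the system itself.

```matlab
% Emulate t-digit decimal chopping and rounding for Example 1.11.
t = 5;
x = 0.12345666e7;
m = floor(log10(abs(x))) + 1;             % exponent with 0.1 <= |x|/10^m < 1
chopped = fix(x / 10^(m - t)) * 10^(m - t);     % drop digits beyond the t-th
nearest = round(x / 10^(m - t)) * 10^(m - t);   % round to the nearest t digits
fprintf('%d  %d\n', chopped, nearest);
```

The two results are 1234500 and 1234600, that is, $0.12345 \times 10^7$ and $0.12346 \times 10^7$.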
See Figure 1.2 for an example with $\beta = 10$ and $t = 1$. In that figure, the exhibited floating point numbers are $(0.1) \times 10^1$, $(0.2) \times 10^1$, $\ldots$, $(0.9) \times 10^1$, $0.1 \times 10^2$.

[Figure 1.2: number line from $\beta^{m-1} = 1$ to $\beta^m = 10^1$; successive floating point numbers are $\beta^{m-t} = 10^0 = 1$ apart.]

FIGURE 1.2: An example floating point system: $\beta = 10$, $t = 1$, and $m = 1$.
Example 1.12
Let $a = 0.410$, $b = 0.000135$, and $c = 0.000431$. Assuming 3-digit decimal computer arithmetic with rounding to nearest, does $a + (b + c) = (a + b) + c$ when using this arithmetic?
Following the "rounding to nearest" definition of $fl$, we emulate the operations a machine would do, as follows:
\[
a \leftarrow 0.410 \times 10^0, \quad b \leftarrow 0.135 \times 10^{-3}, \quad c \leftarrow 0.431 \times 10^{-3},
\]
and
\begin{align*}
fl(b + c) &= fl(0.135 \times 10^{-3} + 0.431 \times 10^{-3}) \\
          &= fl(0.566 \times 10^{-3}) \\
          &= 0.566 \times 10^{-3},
\end{align*}
so
\begin{align*}
fl(a + 0.566 \times 10^{-3}) &= fl(0.410 \times 10^0 + 0.566 \times 10^{-3}) \\
          &= fl(0.410 \times 10^0 + 0.000566 \times 10^0) \\
          &= fl(0.410566 \times 10^0) \\
          &= 0.411 \times 10^0.
\end{align*}
On the other hand,
\begin{align*}
fl(a + b) &= fl(0.410 \times 10^0 + 0.135 \times 10^{-3}) \\
          &= fl(0.410000 \times 10^0 + 0.000135 \times 10^0) \\
          &= fl(0.410135 \times 10^0) \\
          &= 0.410 \times 10^0,
\end{align*}
so
\begin{align*}
fl(0.410 \times 10^0 + c) &= fl(0.410 \times 10^0 + 0.431 \times 10^{-3}) \\
          &= fl(0.410 \times 10^0 + 0.000431 \times 10^0) \\
          &= fl(0.410431 \times 10^0) \\
          &= 0.410 \times 10^0 \ne 0.411 \times 10^0.
\end{align*}
Thus, the associative law does not hold for floating point addition with "round to nearest." Furthermore, this illustrates that accuracy is improved if numbers of like magnitude are added first in a sum.
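The computation in Example 1.12 can be replayed mechanically. The sketch below assumes Python's `decimal` module with a 3-digit context as a model of the hypothetical machine; the names `ctx`, `left`, and `right` are illustrative.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# beta = 10, t = 3, round to nearest: every result is rounded to 3 significant digits.
ctx = Context(prec=3, rounding=ROUND_HALF_EVEN)

a, b, c = Decimal("0.410"), Decimal("0.000135"), Decimal("0.000431")

left = ctx.add(a, ctx.add(b, c))    # fl(a + fl(b + c)): small terms combined first
right = ctx.add(ctx.add(a, b), c)   # fl(fl(a + b) + c): each small term is lost in turn

print(left, right, left == right)
```

The two groupings give 0.411 and 0.410 respectively, in agreement with the hand computation.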
The following error bound is useful in some analyses.

THEOREM 1.5
\[
|x - fl(x)| \le \tfrac{1}{2}\,|x|\,\beta^{1-t}\,p,
\]
where $p = 1$ for rounding and $p = 2$ for chopping.
DEFINITION 1.3 $\delta = \dfrac{p}{2}\,\beta^{1-t}$ is called the unit roundoff error.

Let $\epsilon = \dfrac{fl(x) - x}{x}$. Then $fl(x) = (1 + \epsilon)x$, where $|\epsilon| \le \delta$. With this, we have the following.

THEOREM 1.6
Let $\odot$ denote the operation $+$, $-$, $\times$, or $\div$, and let $x$ and $y$ be machine numbers. Then
\[
fl(x \odot y) = (x \odot y)(1 + \epsilon), \quad \text{where } |\epsilon| \le \delta = \tfrac{p}{2}\,\beta^{1-t}.
\]
Roundoff error that accumulates as a result of a sequence of arithmetic operations can be analyzed using this theorem. Such an analysis is called forward error analysis.

Because of the properties of floating point arithmetic, it is unreasonable to demand strict tolerances when the exact result is too large.
Example 1.13
Suppose $\beta = 10$ and $t = 3$ (3-digit decimal arithmetic), and suppose we wish to compute $10^4\pi$ with a computed value $x$ such that $|10^4\pi - x| < 10^{-2}$. The closest floating point number in our system to $10^4\pi$ is $x = 0.314 \times 10^5 = 31400$. However, $|10^4\pi - x| = 15.926\ldots$ Hence, it is impossible to find a number $x$ in the system with $|10^4\pi - x| < 10^{-2}$.
The error $|10^4\pi - x|$ in this example is called the absolute error in approximating $10^4\pi$. We see that absolute error is not an appropriate measure of error when using floating point arithmetic. For this reason, we use relative error.
DEFINITION 1.4 Let $x^*$ be an approximation to $x$. Then $|x - x^*|$ is called the absolute error, and $\left|\dfrac{x - x^*}{x}\right|$ is called the relative error.

For example,
\[
\left|\frac{x - fl(x)}{x}\right| \le \delta = \tfrac{p}{2}\,\beta^{1-t} \quad \text{(unit roundoff error)}.
\]
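The bound on the relative error of $fl(x)$ can be checked exactly for IEEE double precision, under the assumption that Python's `float` is an IEEE double (so $\beta = 2$, $t = 53$, $p = 1$). The sketch uses exact rational arithmetic from the standard `fractions` module; the function name and the sample values are illustrative.

```python
from fractions import Fraction

def relative_rounding_error(x):
    """Exact relative error |x - fl(x)| / |x| when the rational x is rounded to a double."""
    fl_x = Fraction(float(x))  # float(x) rounds to nearest; Fraction(...) recovers it exactly
    return abs(x - fl_x) / abs(x)

# delta = (p/2) * beta^(1-t) with p = 1, beta = 2, t = 53, i.e. 2^-53
delta = Fraction(1, 2) * Fraction(2) ** (1 - 53)

samples = (Fraction(1, 10), Fraction(1, 3), Fraction(22, 7), Fraction(10**20 + 1))
for x in samples:
    err = relative_rounding_error(x)
    print(f"x = {x}: relative error = {float(err):.3e}, within delta: {err <= delta}")
```

Because every quantity is a `Fraction`, the comparison with $\delta$ involves no further rounding.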
1.2.1.1 Expression Evaluation and Condition Numbers

We now examine some common situations in which roundoff error can become large, and explain how to avoid many of these situations.
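Example 1.14
$\beta = 10$, $t = 4$, $p = 1$. (Thus, $\delta = \tfrac{1}{2}\,10^{-3} = 0.0005$.) Let $x = 0.5795 \times 10^5$, $y = 0.6399 \times 10^5$. Then
\[
fl(x + y) = 0.1219 \times 10^6 = (x + y)(1 + \epsilon_1), \quad \epsilon_1 \approx -3.28 \times 10^{-4}, \quad |\epsilon_1| < \delta,
\]
and
\[
fl(xy) = 0.3708 \times 10^{10} = (xy)(1 + \epsilon_2), \quad \epsilon_2 \approx -5.95 \times 10^{-5}, \quad |\epsilon_2| < \delta.
\]
(Note: $x + y = 0.12194 \times 10^6$, $xy = 0.37082205 \times 10^{10}$.)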
Example 1.15
Suppose $\beta = 10$ and $t = 4$ (4 digit arithmetic), suppose $x_1 = 10000$ and $x_2 = x_3 = \cdots = x_{1001} = 1$. Then
\begin{align*}
fl(x_1 + x_2) &= 10000, \\
fl(x_1 + x_2 + x_3) &= 10000, \\
&\;\;\vdots \\
fl\left(\sum_{i=1}^{1001} x_i\right) &= 10000,
\end{align*}
when we sum forward from $x_1$. But going backwards,
\begin{align*}
fl(x_{1001} + x_{1000}) &= 2, \\
fl(x_{1001} + x_{1000} + x_{999}) &= 3, \\
&\;\;\vdots \\
fl\left(\sum_{i=1001}^{1} x_i\right) &= 11000,
\end{align*}
which is the correct sum.
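Example 1.15 can be reproduced with a simulated 4-digit machine. The sketch assumes Python's `decimal` module with `prec=4` as that machine; `fl_sum` is a hypothetical helper that rounds every partial sum.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

ctx = Context(prec=4, rounding=ROUND_HALF_EVEN)  # beta = 10, t = 4

def fl_sum(values):
    """Add the values left to right, rounding every partial sum to 4 digits."""
    total = Decimal(0)
    for v in values:
        total = ctx.add(total, v)
    return total

xs = [Decimal(10000)] + [Decimal(1)] * 1000   # x_1 = 10000, x_2 = ... = x_1001 = 1

forward = fl_sum(xs)             # starts with the large term; each added 1 is rounded away
backward = fl_sum(reversed(xs))  # accumulates the 1000 small terms before meeting x_1
print(forward, backward)
```

Summing forward stalls at 10000, while summing backward reaches 11000.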
This example illustrates the point that large relative errors occur when a large number of small numbers is added to a large number, or when a very large number of small, almost-equal numbers is added. To avoid such large relative errors, one can sum from the smallest number to the largest number. However, this will not work if the numbers are all approximately equal. In such cases, one possibility is to group the numbers into sets of two adjacent numbers, summing two almost equal numbers together. One then groups those results into sets of two and sums these together, continuing until the total sum is reached. In this scheme, two almost equal numbers are always being summed, and the large relative error from repeatedly summing a small number to a large number is avoided.
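The grouping idea can be sketched as follows, again assuming a 4-digit `decimal` context as the machine. The recursive halving used here is one variant of pairwise summation (it splits from the top rather than pairing adjacent terms from the bottom), and both function names are illustrative; the input list is assumed to be non-empty.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

ctx = Context(prec=4, rounding=ROUND_HALF_EVEN)  # beta = 10, t = 4

def naive_sum(values):
    """Left-to-right summation with rounding after every addition."""
    total = Decimal(0)
    for v in values:
        total = ctx.add(total, v)
    return total

def pairwise_sum(values):
    """Split the list in half recursively so that each addition combines
    two partial sums of similar magnitude."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return ctx.add(pairwise_sum(values[:mid]), pairwise_sum(values[mid:]))

ones = [Decimal(1)] * 20000
print(naive_sum(ones), pairwise_sum(ones))
```

For 20000 equal terms, the left-to-right sum stalls once the running total reaches 10000, whereas every addition in the pairwise scheme happens to be exact in 4 digits.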
Example 1.16
$x_1 = 15.314768$, $x_2 = 15.314899$, $\beta = 10$, $t = 6$ (6-digit decimal accuracy). Then $x_2 - x_1 \approx fl(x_2) - fl(x_1) = 15.3149 - 15.3148 = 0.0001$. Thus,
\[
\left|\frac{x_2 - x_1 - (fl(x_2) - fl(x_1))}{x_2 - x_1}\right| = \frac{0.000131 - 0.0001}{0.000131} = 0.237 = 23.7\% \text{ relative error}.
\]
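The cancellation in Example 1.16 can be replayed in a few lines, assuming a 6-digit `decimal` context as the machine; the variable names are illustrative.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

ctx = Context(prec=6, rounding=ROUND_HALF_EVEN)  # beta = 10, t = 6

x1, x2 = Decimal("15.314768"), Decimal("15.314899")

exact_diff = x2 - x1                                       # exact: default precision is ample
computed_diff = ctx.subtract(ctx.plus(x2), ctx.plus(x1))   # fl(fl(x2) - fl(x1))
rel_err = abs(exact_diff - computed_diff) / abs(exact_diff)

print(ctx.plus(x1), ctx.plus(x2), computed_diff, rel_err)
```

The subtraction itself is exact here; the error comes entirely from rounding $x_1$ and $x_2$ before subtracting, and it is magnified because the difference is so much smaller than the operands.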
This example illustrates that large relative errors can occur when two nearly equal numbers are subtracted on a computer. Sometimes, an algorithm can be modified to reduce rounding error occurring from this source, as the following example illustrates.
Example 1.17
Consider finding the roots of $ax^2 + bx + c = 0$, where $b^2$ is large compared with $|4ac|$. The most common formula for the roots is
\[
x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.
\]
Consider $x^2 + 100x + 1 = 0$, $\beta = 10$, $t = 4$, $p = 2$, and 4-digit chopped arithmetic. Then
\[
x_1 = \frac{-100 + \sqrt{9996}}{2}, \qquad x_2 = \frac{-100 - \sqrt{9996}}{2},
\]
but $\sqrt{9996} \approx 99.97$ (4 digit arithmetic, chopped). Thus,
\[
x_1 \approx \frac{-100 + 99.97}{2}, \qquad x_2 \approx \frac{-100 - 99.97}{2}.
\]
Hence, $x_1 \approx -0.015$, $x_2 \approx -99.98$, but $x_1 = -0.010001$ and $x_2 = -99.989999$, so the relative errors in $x_1$ and $x_2$ are 50% and 0.01%, respectively.

Let's change the algorithm. Assume $b \ge 0$ (can always make $b \ge 0$). Then
\begin{align*}
x_1 &= \frac{-b + \sqrt{b^2 - 4ac}}{2a} \cdot \frac{-b - \sqrt{b^2 - 4ac}}{-b - \sqrt{b^2 - 4ac}} \\
    &= \frac{4ac}{2a\left(-b - \sqrt{b^2 - 4ac}\right)} = \frac{-2c}{b + \sqrt{b^2 - 4ac}},
\end{align*}
and
\[
x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a} \quad \text{(the same as before)}.
\]
Then, for the above values,
\[
x_1 = \frac{-2(1)}{100 + \sqrt{9996}} \approx \frac{-2}{100 + 99.97} = -0.0100.
\]
Now, the relative error in $x_1$ is also 0.01%.
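Both formulas for $x_1$ can be compared in simulated 4-digit chopped arithmetic. The sketch assumes Python's `decimal` module with `ROUND_DOWN` as the chopping machine. Two details are assumptions of the sketch rather than of the example: the square root is computed at the module's default precision and then chopped to 4 digits, and the intermediate sum $100 + 99.97$ is also chopped (to 199.9), which the hand computation does not do; the final 4-digit result is the same.

```python
from decimal import Context, Decimal, ROUND_DOWN

chop = Context(prec=4, rounding=ROUND_DOWN)  # beta = 10, t = 4, chopping

a, b, c = Decimal(1), Decimal(100), Decimal(1)

# b^2 - 4ac = 9996 is exact in 4 digits; its square root is chopped to 4 digits.
disc = chop.subtract(chop.multiply(b, b), chop.multiply(4 * a, c))
root = chop.plus(disc.sqrt())

x1_naive = chop.divide(chop.add(-b, root), 2 * a)    # (-b + sqrt(b^2 - 4ac)) / (2a)
x1_stable = chop.divide(-2 * c, chop.add(b, root))   # -2c / (b + sqrt(b^2 - 4ac))

x1_true = Decimal("-0.010001")  # value quoted in the example
for name, x1 in (("naive", x1_naive), ("stable", x1_stable)):
    print(name, x1, "relative error:", abs((x1 - x1_true) / x1_true))
```

The rearranged formula avoids subtracting the two nearly equal quantities $b$ and $\sqrt{b^2 - 4ac}$, which is where the naive formula loses its digits.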
Let us now consider error in function evaluation. Consider a single-valued function $f(x)$, and let $x^* = fl(x)$ be the floating point approximation of $x$. Therefore, the machine evaluates $f(x^*) = f(fl(x))$, which is an approximate value of $f(x)$ at $x = x^*$. Then the perturbation in $f(x)$ for small perturbations in $x$ can be computed via Taylor's formula. This is illustrated in the next theorem.
THEOREM 1.7 The relative error in functional evaluation is
\[
\left|\frac{f(x) - f(x^*)}{f(x)}\right| \approx \left|\frac{x\,f'(x)}{f(x)}\right| \left|\frac{x - x^*}{x}\right|.
\]

PROOF The linear Taylor approximation of $f(x^*)$ about $x$, for small values of $|x - x^*|$, is given by $f(x^*) \approx f(x) + f'(x)(x^* - x)$. Rearranging the terms immediately yields the result.
This leads us to the following definition.

DEFINITION 1.5 The condition number of a function $f(x)$ is
\[
\kappa_f(x) = \left|\frac{x\,f'(x)}{f(x)}\right|.
\]

The condition number describes how large the relative error in function evaluation is with respect to the relative error in the machine representation of $x$. In other words, $\kappa_f(x)$ is a measure of the degree of sensitivity of the function at $x$.
Example 1.18
Let $f(x) = \sqrt{x}$. The condition number of $f(x)$ about $x$ is
\[
\kappa_f(x) = \left|\frac{x \cdot \frac{1}{2\sqrt{x}}}{\sqrt{x}}\right| = \frac{1}{2}.
\]
This suggests that $f(x)$ is well-conditioned.
Example 1.19
Let $f(x) = \sqrt{x - 2}$. The condition number of $f(x)$ about $x$ is
\[
\kappa_f(x) = \left|\frac{x}{2(x - 2)}\right|.
\]
This is not defined at $x^* = 2$. Hence, the function $f(x)$ is numerically unstable and ill-conditioned for values of $x$ close to 2.
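The meaning of the condition number can be probed numerically. The sketch below, in ordinary double precision, compares $\kappa_f(x)$ for $f(x) = \sqrt{x - 2}$ with the observed ratio of relative output change to relative input change; the function names and the perturbation size $10^{-8}$ are illustrative assumptions.

```python
import math

def condition_number(f, dfdx, x):
    """kappa_f(x) = |x f'(x) / f(x)|."""
    return abs(x * dfdx(x) / f(x))

def observed_amplification(f, x, rel_perturbation=1e-8):
    """Ratio of the relative change in f to the relative change in x."""
    x_star = x * (1.0 + rel_perturbation)
    rel_in = abs((x - x_star) / x)
    rel_out = abs((f(x) - f(x_star)) / f(x))
    return rel_out / rel_in

def f(x):
    return math.sqrt(x - 2.0)

def dfdx(x):
    return 0.5 / math.sqrt(x - 2.0)

for x in (10.0, 2.1, 2.001):
    print(x, condition_number(f, dfdx, x), observed_amplification(f, x))
```

As $x$ approaches 2, both columns grow together, which is the sense in which $f$ is ill-conditioned there.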
REMARK 1.2 If $x = f(x) = 0$, then the condition number is simply $|f'(x)|$. If $x = 0$, $f(x) \ne 0$ (or $f(x) = 0$, $x \ne 0$), then it is more useful to consider the relation between absolute errors than relative errors. The condition number then becomes $|f'(x)/f(x)|$.
REMARK 1.3 Generally, if a numerical approximation $\tilde{z}$ to a quantity $z$ is computed, the relative error is related to the number of digits after the decimal point that are correct. For example, if $z = 0.0000123453$ and $\tilde{z} = 0.00001234543$, we say that $\tilde{z}$ is correct to 5 significant digits. Expressing $z$ as $0.123453 \times 10^{-4}$ and $\tilde{z}$ as $0.123454 \times 10^{-4}$, we see that if we round $\tilde{z}$ to the nearest number with five digits in its mantissa, all of those digits are correct, whereas, if we do the same with six digits, the sixth digit is not correct. Significant digits is the more logical way to talk about accuracy in a floating point computation where we are interested in relative error, rather than "number of digits after the decimal point," which can have a different meaning. (Here, one might say that $\tilde{z}$ is correct to 9 digits after the decimal point.)
1.2.2 Practicalities and the IEEE Floating Point Standard
Prior to 1985, different machines used different word lengths and different bases, and different machines rounded, chopped, or did something else to form the internal representation $fl(x)$ for real numbers $x$. For example, IBM mainframes generally used hexadecimal arithmetic ($\beta = 16$), with 8 hexadecimal digits total (for the base, sign, and exponent) in "single precision" numbers and 16 hexadecimal digits total in "double precision" numbers. Machines such as the Univac 1108 and Honeywell Multics systems used base $\beta = 2$ and 36 binary digits (or "bits") total in single precision numbers and 72 binary digits total in double precision numbers. An unusual machine designed at Moscow State University from 1955-1965, the "Setun," even used base-3 ($\beta = 3$, or "ternary") numbers. Some computers had 32 bits total in single precision numbers and 64 bits total in double precision numbers, while some "supercomputers" (such as the Cray-1) had 64 bits total in single precision numbers and 128 bits total in double precision numbers.
Some hand-held calculators in existence today (such as some Texas Instruments calculators) can be viewed as implementing decimal (base 10, $\beta = 10$) arithmetic, say, with $L = -999$ and $U = 999$, and $t = 14$ digits in the mantissa.

Except for the Setun (the value of whose ternary digits corresponded to "positive," "negative," and "neutral" in circuit elements or switches), digital computers are mostly based on binary switches or circuit elements (that is, "on" or "off"), so the base $\beta$ is usually 2 or a power of 2. For example, the IBM hexadecimal digit could be viewed as a group of 4 binary digits.²
Older floating point implementations did not even always fit exactly into the model we have previously described. For example, if $x$ is a number in the system, then $-x$ may not have been a number in the system, or, if $x$ were a number in the system, then $1/x$ may have been too large to be representable in the system.
To promote predictability, portability, reliability, and rigorous error bounding in floating point computations, the Institute of Electrical and Electronics Engineers (IEEE) and American National Standards Institute (ANSI) published a standard for binary floating point arithmetic in 1985: IEEE/ANSI 754-1985: Standard for Binary Floating Point Arithmetic, often referenced as "IEEE-754," or simply "the IEEE standard.³" Almost all computers in existence today, including personal computers and workstations based on Intel, AMD, Motorola, etc. chips, implement most of the IEEE standard.
In this standard, $\beta = 2$, 32 bits total are used in a single precision number (an "IEEE single"), and 64 bits total are used for a double precision number ("IEEE double"). In a single precision number, 1 bit is used for the sign, 8 bits are used for the exponent, and $t = 23$ bits are used for the mantissa. In double precision numbers, 1 bit is used for the sign, 11 bits are used for the exponent, and 52 bits are used for the mantissa. Thus, for single precision numbers, the exponent is between 0 and $(11111111)_2 = 255$, and 127 is subtracted from this, to get an exponent between $-127$ and $128$. In IEEE numbers, the minimum and maximum exponent are used to denote special symbols (such as infinity and "unnormalized" numbers), so the exponent in single precision represents magnitudes between $2^{-126} \approx 10^{-38}$ and $2^{127} \approx 10^{38}$. The mantissa for single precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{23} 2^{-i} = 2(1 - 2^{-24}) \approx 2$. Similarly, the exponent for double precision numbers is effectively between $2^{-1022} \approx 10^{-308}$ and $2^{1023} \approx 10^{308}$, while the mantissa for double precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{52} 2^{-i} \approx 2$.
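The bit layout of an IEEE single can be inspected directly. The sketch below uses Python's standard `struct` module to reinterpret the 32 bits as an integer; the function name `single_fields` and the sample values are illustrative.

```python
import struct

def single_fields(x):
    """Split x, rounded to IEEE single precision, into
    (sign bit, biased 8-bit exponent, 23 stored mantissa bits)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    biased_exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, biased_exponent, fraction

for x in (1.0, -0.15625, 0.1):
    sign, e, frac = single_fields(x)
    print(f"{x}: sign={sign} exponent={e} (unbiased {e - 127}) fraction={frac:023b}")
```

For normalized numbers the leading 1 of the mantissa is not stored, which is how 23 stored bits yield the 24 digits of precision listed for single precision.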
Summarizing, the parameters for IEEE arithmetic appear in Table 1.1.

TABLE 1.1: Parameters for IEEE arithmetic
precision   β   L       U      t
single      2   -126    127    24
double      2   -1022   1023   53

In many numerical computations, such as solving the large linear systems arising from partial differential equation models, more digits or a larger exponent range is required than is available with IEEE single precision. For this reason, many numerical analysts at present have adopted IEEE double precision as the default precision. For example, underlying computations in the popular computational environment matlab are done in IEEE double precision.

² An exception is in some systems for business calculations, where base 10 is implemented.
³ An update to the 1985 standard was made in 2008. This update gives clarifications of certain ambiguous points, provides certain extensions, and specifies a standard for decimal arithmetic.
IEEE arithmetic provides four ways of defining $fl(x)$, that is, four "rounding modes," namely, "round down," "round up," "round to nearest," and "round to zero," as specified above. The four elementary operations $+$, $-$, $\times$, and $\div$ must be such that $fl(x \odot y)$ is implemented for all four rounding modes, for $\odot \in \{-, +, \times, \div, \sqrt{\cdot}\}$.

The default mode (if the rounding mode is not explicitly set) is normally "round to nearest," to give an approximation after a long string of computations that is hopefully near the exact value. If the mode is set to "round down" and a string of computations is done, then the result is less than or equal to the exact result. Similarly, if the mode is set to "round up," then the result of a string of computations is greater than or equal to the exact result. In this way, mathematically rigorous bounds on an exact result can be obtained. (This technique must be used astutely, since naive use could result in bounds that are too large to be meaningful.)
Several parameters more directly related to numerical computations than $L$, $U$, and $t$ are associated with any floating point number system. These are

HUGE: the largest representable number in the floating point system.

TINY: the smallest positive representable number in the floating point system.

$\epsilon_m$: the machine epsilon, the smallest positive number which, when added to 1, gives something other than 1 when using the rounding mode "round to the nearest."

These so-called "machine constants" appear in Table 1.2 for the IEEE single and IEEE double precision number systems.
TABLE 1.2: Machine constants for IEEE arithmetic
Precision   HUGE                                      TINY                              ε_m
single      (2 - 2^-23)·2^127 ≈ 3.40·10^38            2^-126 ≈ 1.18·10^-38              2^-24 + 2^-47 ≈ 5.96·10^-8
double      (2 - 2^-52)·2^1023 ≈ 1.80·10^308          2^-1022 ≈ 2.23·10^-308            2^-53 + 2^-105 ≈ 1.11·10^-16

For IEEE arithmetic, $1/\text{TINY} < \text{HUGE}$, but $1/\text{HUGE} < \text{TINY}$. This brings up the question of what happens when the result of a computation has absolute value less than the smallest number representable in the system, or has absolute value greater than the largest number representable in the system. In the first case, an underflow occurs, while, in the second case, an overflow occurs. In floating point computations, it is usually (but not always) reasonable to replace the result of an underflow by 0, but it is usually more problematical when an overflow occurs. Many systems prior to the IEEE standard replaced an underflow by 0, but stopped when an overflow occurred.
The IEEE standard specifies representations for special numbers $\infty$, $-\infty$, $+0$, $-0$, and NaN, where the latter represents "not a number." The standard specifies that computations do not stop when an overflow or underflow occurs, or when quantities such as $\sqrt{-1}$, $1/0$, $-1/0$, etc. are encountered (although many programming languages by default or optionally do stop). For example, the result of an overflow is set to $\infty$, whereas the result of $\sqrt{-1}$ is set to NaN, and computation continues. The standard also specifies "gradual underflow," that is, setting the result to a "denormalized" number, or a number in the floating point format whose first digit in the mantissa is equal to 0. Computation rules for these special numbers, such as NaN $\times$ any number = NaN, $\infty \times$ any positive normalized number = $\infty$, allow such "nonstop" arithmetic.
Although the IEEE nonstop arithmetic is useful in many contexts, the numerical analyst should be aware of it and be cautious in interpreting results. In particular, algorithms may not behave as expected if many intermediate results contain $\infty$ or NaN, and the accuracy is less than expected when denormalized numbers are used. In fact, many programming languages, by default or with a controllable option, stop if $\infty$ or NaN occurs, but implement IEEE nonstop arithmetic with an option.
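A few of these rules can be observed from Python, assuming its `float` type is an IEEE double. Python is also an instance of a language that stops in some cases: it raises an exception for $1/0$ and for `math.sqrt(-1.0)` instead of returning $\infty$ or NaN. The variable names are illustrative.

```python
import math

huge = 1.7976931348623157e308    # largest finite double
overflowed = huge * 10           # overflow: the result is set to inf
invalid = math.inf - math.inf    # invalid operation: the result is NaN
propagated = invalid * 2.0       # NaN times any number is NaN
smallest = 5e-324                # smallest positive denormalized double
underflowed = smallest / 2       # too small even for gradual underflow: set to 0

print(overflowed, invalid, propagated, smallest, underflowed)
print(invalid == invalid)        # NaN compares unequal even to itself

# Python itself stops for these two operations rather than continuing nonstop.
stopped = []
for label, op in (("1/0", lambda: 1.0 / 0.0), ("sqrt(-1)", lambda: math.sqrt(-1.0))):
    try:
        op()
    except (ZeroDivisionError, ValueError) as exc:
        stopped.append((label, type(exc).__name__))
print(stopped)
```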
Example 1.20
IEEE double precision floating point arithmetic underlies most computations in matlab. (This is true even if only four decimal digits are displayed.) One obtains the machine epsilon with the function eps, one obtains TINY with the function realmin, and one obtains HUGE with the function realmax. Observe the following matlab dialog:

>> epsm = eps(1d0)
epsm = 2.2204e-016
>> TINY = realmin
TINY = 2.2251e-308
>> HUGE = realmax
HUGE = 1.7977e+308
>> 1/TINY
ans = 4.4942e+307
>> 1/HUGE
ans = 5.5627e-309
>> HUGE^2
ans = Inf
>> TINY^2
ans = 0
>> new_val = 1+epsm
new_val = 1.0000
>> new_val - 1
ans = 2.2204e-016
>> too_small = epsm/2
too_small = 1.1102e-016
>> not_new = 1+too_small
not_new = 1
>> not_new - 1
ans = 0
>>
Example 1.21
(Illustration of underflow and overflow) Suppose, for the purposes of illustration, we have a system with $\beta = 10$, $t = 2$ and one digit in the exponent, so that the positive numbers in the system range from $0.10 \times 10^{-9}$ to $0.99 \times 10^9$, and suppose we wish to compute $N = \sqrt{x_1^2 + x_2^2}$, where $x_1 = x_2 = 10^6$. Then both $x_1$ and $x_2$ are exactly represented in the system, and the nearest floating point number in the system to $N$ is $0.14 \times 10^7$, well within range. However, $x_1^2 = 10^{12}$, larger than the maximum floating point number in the system. In older systems, an overflow usually would result in stopping the computation, while in IEEE arithmetic, the result would be assigned the symbol "Infinity." The result of adding "Infinity" to "Infinity" then taking the square root would be "Infinity," so that $N$ would be assigned "Infinity." Similarly, if $x_1 = x_2 = 10^{-6}$, then $x_1^2 = 10^{-12}$, smaller than the smallest representable machine number, causing an "underflow." On older systems, the result is usually set to 0. On IEEE systems, if "gradual underflow" is switched on, the result either becomes a denormalized number, with less than full accuracy, or is set to 0; without gradual underflow on IEEE systems, the result is set to 0. When the result is set to 0, a value of 0 is stored in $N$, whereas the closest floating point number in the system is $0.14 \times 10^{-5}$, well within range. To avoid this type of catastrophic underflow and overflow in the computation of $N$, we may use the following scheme.

1. $s \leftarrow \max\{|x_1|, |x_2|\}$.

2. $\eta_1 \leftarrow x_1/s$; $\eta_2 \leftarrow x_2/s$.

3. $N \leftarrow s\sqrt{\eta_1^2 + \eta_2^2}$.
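The three steps above can be sketched in double precision, where the same failure appears at larger magnitudes (around $10^{\pm 200}$ instead of $10^{\pm 6}$). The function names are illustrative, and the guard for $s = 0$ is an addition not present in the scheme.

```python
import math

def norm_naive(x1, x2):
    """Direct evaluation of sqrt(x1^2 + x2^2); the squares may overflow or underflow."""
    return math.sqrt(x1 * x1 + x2 * x2)

def norm_scaled(x1, x2):
    s = max(abs(x1), abs(x2))                         # step 1
    if s == 0.0:
        return 0.0                                    # both inputs are zero; avoid 0/0
    eta1, eta2 = x1 / s, x2 / s                       # step 2: |eta_i| <= 1
    return s * math.sqrt(eta1 * eta1 + eta2 * eta2)   # step 3

for x in (1e200, 1e-200):
    print(x, norm_naive(x, x), norm_scaled(x, x))
```

In practice one would likely call the standard library's `math.hypot`, which exists for this purpose; the sketch only mirrors the scheme in the text.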
1.2.2.1 Input and Output

For examining the output to large numerical computations arising from mathematical models, plots, graphs, and movies comprised of such plots and graphs are often preferred over tables of values. However, to develop such models and study numerical algorithms, it is necessary to examine individual numbers. Because humans are trained to comprehend decimal numbers more easily than binary numbers, the binary format used in the machine is usually converted to a decimal format for display or printing. In many programming languages and environments (such as all versions of Fortran, C, C++, and in matlab), the format is of a form similar to $\pm d_1.d_2d_3\ldots d_m\mathrm{e}{\pm}\delta_1\delta_2\delta_3$, or $\pm d_1.d_2d_3\ldots d_m\mathrm{E}{\pm}\delta_1\delta_2\delta_3$, where the "e" or "E" denotes the "exponent" of 10. For example, -1.00e+003 denotes $-1 \times 10^3 = -1000$. Numbers are usually also input either in a standard decimal form (such as 0.001), or in this exponential format (such as 1.0e-3). (This notation originates from the earliest computers, where the only output was a printer, and the printer could only print numerical digits and the 26 upper case letters in the Roman alphabet.)
Thus, for input, a decimal fraction needs to be converted to a binary floating point number, while, for output, a binary floating point number needs to be converted to a decimal fraction. This conversion necessarily is inexact. For example, the exact decimal fraction 0.1 converts to the infinitely repeating binary expansion $(0.0\overline{0011})_2$, which needs to be rounded into the binary floating point system. The IEEE 754 standard specifies that the result of a decimal to binary conversion, within a specified range of input formats, be the nearest floating point number to the exact result, over a specified range, and that, within a specified range of formats, a binary to decimal conversion be the nearest number in the specified format (which depends on the number $m$ of decimal digits requested to be printed).
Thus, the number that one sees as output is usually not exactly the number that is represented in the computer. Furthermore, while the floating point operations on binary numbers are usually implemented in hardware or "firmware" independently of the software system, the decimal to binary and binary to decimal conversions are usually implemented separately as part of the programming language (such as Fortran, C, C++, Java, etc.) or software system (such as matlab). The individual standards for these languages, if there are any, may not specify accuracy for such conversions, and the languages sometimes do not conform to the IEEE standard. That is, the number that one sees printed may not even be the closest number in that format to the actual number.
This inexactness in conversion usually does not cause a problem, but may cause much confusion in certain instances. In those instances (such as in "debugging," or finding programming blunders), one may need to examine the binary numbers directly. One way of doing this is in an "octal," or base-8 format, in which each digit (between 0 and 7) is interpreted as a group of three binary digits, or in hexadecimal format (where the digits are 0-9, A, B, C, D, E, F), in which each digit corresponds to a group of four binary digits.
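The gap between the printed decimal number and the stored binary number can be made visible from Python, assuming its `float` is an IEEE double. `Decimal(x)` converts the stored binary value exactly, and `float.hex` gives the hexadecimal view mentioned above.

```python
from decimal import Decimal

x = 0.1
shortest = repr(x)             # shortest decimal string that converts back to the same double
exact = Decimal(x)             # exact decimal value of the stored binary number
hex_view = x.hex()             # each hexadecimal digit stands for four binary digits
sums_match = (0.1 + 0.2 == 0.3)

print(shortest)
print(exact)
print(hex_view)
print(sums_match)
```

The repeating hexadecimal digit 9 in the output is the repeating binary group 1001 of the expansion of 0.1, and the final digit shows where the expansion was rounded.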
1.2.2.2 Standard Functions

To enable accurate computation of elementary functions such as sin, cos, and exp, IEEE 754 specifies that a "long" 80-bit register (with "guard digits") be available for intermediate computations. Furthermore, IEEE 754-2008, an official update to IEEE 754-1985, provides a list of functions it recommends be implemented, and specifies accuracy requirements (in terms of correct rounding), for those functions a programming language elects to implement.
REMARK 1.4 Alternative number systems, such as variable precision arithmetic, multiple precision arithmetic, rational arithmetic, and combinations of approximate and symbolic arithmetic, have been investigated and implemented. These have various advantages over the traditional floating point arithmetic we have been discussing, but also have disadvantages, and usually require more time, more circuitry, or both. Eventually, with the advance of computer hardware and better understanding of these alternative systems, their use may become more ubiquitous. However, for the foreseeable future, traditional floating point number systems will be the primary tool in numerical computations.
1.3 Interval Computations

Interval computations are useful for two main purposes:

• to use floating point computations to compute mathematically rigorous bounds on an exact result (and hence to rigorously bound roundoff error);

• to use floating point computations to compute mathematically rigorous bounds on the ranges of functions over boxes.

In complicated traditional floating point algorithms, naive arrangement of interval computations usually gives bounds that are too wide to be of practical use. For this reason, interval computations have been ignored by many. However, used cleverly and where appropriate, interval computations are powerful, and provide rigor and validation when other techniques cannot.

Interval computations are based on interval arithmetic.
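1.3.1 Interval Arithmetic

In interval arithmetic, we define operations on intervals, which can be considered as ordered pairs of real numbers. We can think of each interval as representing the range of possible values of a quantity. The result of an operation is then an interval that represents the range of possible results of the operation, as the first argument ranges over all points in the first interval and the second argument ranges over all values in the second interval. To state this symbolically, let $\boldsymbol{x} = [\underline{x}, \overline{x}]$ and $\boldsymbol{y} = [\underline{y}, \overline{y}]$, and define the four elementary operations by
\[
\boldsymbol{x} \odot \boldsymbol{y} = \{\, x \odot y \mid x \in \boldsymbol{x} \text{ and } y \in \boldsymbol{y} \,\} \quad \text{for } \odot \in \{+, -, \times, \div\}. \tag{1.1}
\]
Interval arithmetic's usefulness derives from the fact that the mathematical characterization in Equation (1.1) is equivalent to the following operational definitions.
\begin{align*}
\boldsymbol{x} + \boldsymbol{y} &= [\underline{x} + \underline{y},\; \overline{x} + \overline{y}], \\
\boldsymbol{x} - \boldsymbol{y} &= [\underline{x} - \overline{y},\; \overline{x} - \underline{y}], \\
\boldsymbol{x} \times \boldsymbol{y} &= [\min\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\},\; \max\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\}], \\
\frac{1}{\boldsymbol{x}} &= \left[\frac{1}{\overline{x}},\; \frac{1}{\underline{x}}\right] \quad \text{if } \underline{x} > 0 \text{ or } \overline{x} < 0, \\
\boldsymbol{x} \div \boldsymbol{y} &= \boldsymbol{x} \times \frac{1}{\boldsymbol{y}}.
\end{align*}
\[ \tag{1.2} \]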
The ranges of the four elementary interval arithmetic operations are exactly the ranges of the corresponding real operations, but, if such operations are composed, bounds on the ranges of real functions can be obtained. For example, if
\[
f(x) = (x + 1)(x - 1), \tag{1.3}
\]
then
\[
f([-2, 2]) = \big([-2, 2] + 1\big)\big([-2, 2] - 1\big) = [-1, 3]\,[-3, 1] = [-9, 3],
\]
which contains the exact range $[-1, 3]$.
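The operational definitions (1.2) translate almost line for line into code. The class below is a minimal sketch with hypothetical names (`Interval`, `_lift`, `reciprocal`); it uses ordinary round-to-nearest floating point for the end points, so it does not provide rigorous enclosures in the presence of roundoff.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    @staticmethod
    def _lift(v):
        """Treat a plain number v as the degenerate interval [v, v]."""
        return v if isinstance(v, Interval) else Interval(v, v)

    def __add__(self, other):
        o = Interval._lift(other)
        return Interval(self.lo + o.lo, self.hi + o.hi)

    def __sub__(self, other):
        o = Interval._lift(other)
        return Interval(self.lo - o.hi, self.hi - o.lo)

    def __mul__(self, other):
        o = Interval._lift(other)
        products = (self.lo * o.lo, self.lo * o.hi, self.hi * o.lo, self.hi * o.hi)
        return Interval(min(products), max(products))

    def reciprocal(self):
        if not (self.lo > 0 or self.hi < 0):
            raise ZeroDivisionError("interval contains 0")
        return Interval(1 / self.hi, 1 / self.lo)

    def __truediv__(self, other):
        return self * Interval._lift(other).reciprocal()

def f(x):
    return (x + 1) * (x - 1)   # equation (1.3)

print(f(Interval(-2, 2)))
```

Evaluating `f` on $[-2, 2]$ reproduces the enclosure $[-9, 3]$ computed by hand.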
REMARK 1.5 In some definitions of interval arithmetic, division by intervals containing 0 is defined, consistent with (1.1). For example,
\[
\frac{[1, 2]}{[-3, 4]} = \left[-\infty, -\tfrac{1}{3}\right] \cup \left[\tfrac{1}{4}, \infty\right] = \mathbb{R}^* \setminus \left(-\tfrac{1}{3}, \tfrac{1}{4}\right),
\]
where $\mathbb{R}^*$ is the extended real number system,⁴ consisting of the real numbers with the two additional numbers $-\infty$ and $\infty$. This extended interval arithmetic⁵ was originally invented by William Kahan⁶ for computations with continued fractions, but has wider use than that. Although a closed system can be defined for the sets arising from this extended arithmetic, typically, the complements of intervals (i.e., the unions of two semi-infinite intervals) are immediately intersected with intervals, to obtain zero, one, or two intervals. Interval arithmetic can then proceed using (1.2).

⁴ Also known as the two-point compactification of the real numbers.
The power of interval arithmetic lies in its implementation on computers. In particular, outwardly rounded interval arithmetic allows rigorous enclosures for the ranges of operations and functions. This makes a qualitative difference in scientific computations, since the results are now intervals in which the exact result must lie. It also enables use of floating point computations for automated theorem proving.

Outward rounding can be implemented on any machine that has downward rounding and upward rounding, such as any machine that complies with the IEEE 754 standard. For example, take $\boldsymbol{x} + \boldsymbol{y} = [\underline{x} + \underline{y},\; \overline{x} + \overline{y}]$. If $\underline{x} + \underline{y}$ is computed with downward rounding, and $\overline{x} + \overline{y}$ is computed with upward rounding, then the resulting interval $\boldsymbol{z} = [\underline{z}, \overline{z}]$ that is represented in the machine must contain the exact range of $x + y$ for $x \in \boldsymbol{x}$ and $y \in \boldsymbol{y}$. We call the expansion of the interval from rounding the lower end point down and the upper end point up roundout error.
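Standard Python does not expose the IEEE rounding mode, so the sketch below imitates outward rounding under a stated assumption: each end point is computed with round to nearest and then moved one floating point number outward with `math.nextafter` (available from Python 3.9). This widens the interval slightly more than true directed rounding would, but the enclosure remains valid as long as no overflow occurs. The function name `add_outward` is illustrative.

```python
import math
from fractions import Fraction

def add_outward(x, y):
    """Enclosure of [x_lo + y_lo, x_hi + y_hi] for intervals given as (lo, hi) pairs."""
    lo = math.nextafter(x[0] + y[0], -math.inf)   # stands in for downward rounding
    hi = math.nextafter(x[1] + y[1], math.inf)    # stands in for upward rounding
    return lo, hi

x = (0.1, 0.1)
y = (0.2, 0.2)
z = add_outward(x, y)

# Exact sum of the two stored doubles, computed with rational arithmetic.
exact = Fraction(0.1) + Fraction(0.2)
print(z, Fraction(z[0]) <= exact <= Fraction(z[1]))
```

The check with `Fraction` confirms that the exact sum of the two machine numbers lies inside the computed interval.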
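Interval arithmetic is only subdistributive. That is, if $\boldsymbol{x}$, $\boldsymbol{y}$, and $\boldsymbol{z}$ are intervals, then
\[
\boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) \subseteq \boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z}, \quad \text{but } \boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) \ne \boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z} \text{ in general.} \tag{1.4}
\]
As a result, algebraic expressions that would be equivalent if real values are substituted for the variables are not equivalent if interval values are used. For example, instead of writing $(x - 1)(x + 1)$ for $f(x)$ in (1.3), suppose we write
\[
f(x) = x^2 - 1, \tag{1.5}
\]
and suppose we provide a routine that computes an enclosure for the range of $x^2$ that is the exact range to within roundoff error. Such a routine could be as follows: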
ALGORITHM 1.1
(Computing an interval whose end points are machine numbers and which encloses the range of $x^2$.)

INPUT: $\boldsymbol{x} = [\underline{x}, \overline{x}]$.

OUTPUT: a machine-representable interval that contains the range of $x^2$ over $\boldsymbol{x}$.

IF $\underline{x} \ge 0$ THEN
    RETURN $[\underline{x}^2, \overline{x}^2]$, where $\underline{x}^2$ is computed with downward rounding and $\overline{x}^2$ is computed with upward rounding.
ELSE IF $\overline{x} \le 0$ THEN
    RETURN $[\overline{x}^2, \underline{x}^2]$, where $\overline{x}^2$ is computed with downward rounding and $\underline{x}^2$ is computed with upward rounding.
ELSE
    1. Compute $\underline{x}^2$ and $\overline{x}^2$ with both downward and upward rounding; that is, compute $\underline{x}^2_l$ and $\underline{x}^2_u$ such that $\underline{x}^2_l$ and $\underline{x}^2_u$ are machine representable numbers and $\underline{x}^2 \in [\underline{x}^2_l, \underline{x}^2_u]$, and compute $\overline{x}^2_l$ and $\overline{x}^2_u$ such that $\overline{x}^2_l$ and $\overline{x}^2_u$ are machine representable numbers and $\overline{x}^2 \in [\overline{x}^2_l, \overline{x}^2_u]$.
    2. RETURN $[0, \max\{\underline{x}^2_u, \overline{x}^2_u\}]$.
END IF

END ALGORITHM 1.1

⁵ There are small differences in current definitions of extended interval arithmetic. For example, in some systems, $-\infty$ and $\infty$ are not considered numbers, but just descriptive symbols. In those systems, $[1, 2]/[-3, 4] = (-\infty, -1/3] \cup [1/4, \infty) = \mathbb{R} \setminus (-1/3, 1/4)$. See [31] for a theoretical analysis of extended arithmetic.
⁶ Who also was a major contributor to the IEEE 754 standard.
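Algorithm 1.1 might be sketched as below. As an assumption of the sketch, downward and upward rounding are imitated by stepping one floating point number outward with `math.nextafter` (Python 3.9 or later), and the lower end point is clamped at 0 because a square cannot be negative; the names `down`, `up`, and `interval_square` are illustrative.

```python
import math

def down(v):
    """One floating point number below v (stand-in for downward rounding)."""
    return math.nextafter(v, -math.inf)

def up(v):
    """One floating point number above v (stand-in for upward rounding)."""
    return math.nextafter(v, math.inf)

def interval_square(lo, hi):
    """Enclosure of {x^2 : lo <= x <= hi} whose end points are machine numbers."""
    if lo >= 0:
        return max(down(lo * lo), 0.0), up(hi * hi)
    if hi <= 0:
        return max(down(hi * hi), 0.0), up(lo * lo)
    return 0.0, max(up(lo * lo), up(hi * hi))   # 0 lies inside the interval

sq = interval_square(-2.0, 2.0)
f_range = (down(sq[0] - 1.0), up(sq[1] - 1.0))   # enclosure of x^2 - 1, equation (1.5)
print(sq, f_range)
```

The computed enclosure of $x^2 - 1$ over $[-2, 2]$ differs from $[-1, 3]$ only by the outward steps at the end points.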
With Algorithm 1.1 and rewriting $f(x)$ from (1.3) as in (1.5), we obtain
\[
f([-2, 2]) = [-2, 2]^2 - 1 = [0, 4] - 1 = [-1, 3],
\]
which, in this case, is equal to the exact range of $f$ over $[-2, 2]$.

In fact, this illustrates a general principle: If each variable in the expression occurs only once, then interval arithmetic gives the exact range, to within roundout error. We state this formally as
THEOREM 1.8
(Fundamental theorem of interval arithmetic) Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with interval arithmetic. The algebraic expression or computer program can contain the four elementary operations and operations such as $x^n$, $\sin(x)$, $\exp(x)$, and $\log(x)$, etc., as long as the interval values of these functions contain their range over the input intervals. Then

1. The interval value $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ contains the range of $f$ over the interval vector (or box) $(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$.

2. If the single functions (the elementary operations and functions $x^n$, etc.) have interval values that represent their exact ranges, and if each variable $x_i$, $1 \le i \le n$, occurs only once in the expression for $f$, then the values of $f$ obtained by interval arithmetic represent the exact ranges of $f$ over the input intervals.
If the expression for $f$ contains one or more variables more than once, then overestimation of the range can occur due to interval dependency. For example, when we evaluate our example function $f([-2, 2])$ according to (1.3), the first factor, $[-1, 3]$, is the exact range of $x + 1$ for $x \in [-2, 2]$, while the second factor, $[-3, 1]$, is the exact range of $x - 1$ for $x \in [-2, 2]$. Thus, $[-9, 3]$ is the exact range of $f(x_1, x_2) = (x_1 + 1)(x_2 - 1)$ for $x_1$ and $x_2$ independent, $x_1 \in [-2, 2]$, $x_2 \in [-2, 2]$.
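The dependency effect in this paragraph can be reproduced with a few lines of naive interval arithmetic. The following Python sketch (not matlab) uses exact rational end points, so roundoff plays no role; the helper names `iadd`, `isub`, `imul`, and `isqr` are illustrative assumptions.

```python
from fractions import Fraction


def iadd(x, y):
    """Interval sum [x] + [y]."""
    return (x[0] + y[0], x[1] + y[1])


def isub(x, y):
    """Interval difference [x] - [y]."""
    return (x[0] - y[1], x[1] - y[0])


def imul(x, y):
    """Interval product: min and max over the four end point products."""
    products = [a * b for a in x for b in y]
    return (min(products), max(products))


def isqr(x):
    """Exact range of t^2 for t in [x] (the case analysis of a squaring routine)."""
    lo, hi = x
    if lo >= 0:
        return (lo * lo, hi * hi)
    if hi <= 0:
        return (hi * hi, lo * lo)
    return (0, max(lo * lo, hi * hi))


x = (Fraction(-2), Fraction(2))
one = (Fraction(1), Fraction(1))

# (x + 1)(x - 1): x occurs twice, so the two factors are treated as independent.
product_form = imul(iadd(x, one), isub(x, one))
# x^2 - 1: x occurs once, so the exact range is obtained.
single_use_form = isub(isqr(x), one)

print(product_form, single_use_form)
```

Both results enclose the true range of $f$, but only the form in which $x$ occurs once is sharp.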
We now present some definitions and theorems to clarify the practical consequences of interval dependency.
DEFINITION 1.6 An expression for $f(x_1, \ldots, x_n)$ which is written so that each variable occurs only once is called a single use expression, or SUE.

Fortunately, we do not need to transform every expression into a single use expression for interval computations to be of value. In particular, the interval dependency becomes less as the widths of the input intervals become smaller. The following formal definition will help us to describe this precisely.

DEFINITION 1.7 Suppose an interval evaluation $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ gives $[a, b]$ as a result interval, but the exact range $\{f(x_1, \ldots, x_n) : x_i \in \boldsymbol{x}_i, 1 \le i \le n\}$ is $[c, d] \subseteq [a, b]$. We define the excess width $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ in the interval evaluation $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ by $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = (c - a) + (b - d)$.

For example, the excess width in evaluating $f(x)$ represented as $(x + 1)(x - 1)$ over $\boldsymbol{x} = [-2, 2]$ is $(-1 - (-9)) + (3 - 3) = 8$. In general, we have
THEOREM 1.9
Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with interval arithmetic, as in Theorem 1.8, to obtain an interval enclosure $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ to the range of $f$ for $x_i \in \boldsymbol{x}_i$, $1 \le i \le n$. Then, if $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ is as in Definition 1.7, we have

\[ E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = O\left( \max_{1 \le i \le n} w(\boldsymbol{x}_i) \right), \]

where $w(\boldsymbol{x})$ denotes the width of the interval $\boldsymbol{x}$.

That is, the overestimation becomes less as the uncertainty in the arguments to the function becomes smaller.

Interval evaluations as in Theorem 1.9 are termed first-order interval extensions. It is not difficult to obtain second-order extensions, where required. (See Exercise below.)
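The first-order behavior in Theorem 1.9 can be observed numerically. The Python sketch below (an illustration, not code from the text) evaluates $(x + 1)(x - 1)$ over $[-h, h]$ with exact rational end points and compares the result with the exact range $[-1, h^2 - 1]$; the function name `excess_width` and the choice of intervals are assumptions made for this example.

```python
from fractions import Fraction


def imul(x, y):
    """Interval product: min and max over the four end point products."""
    products = [a * b for a in x for b in y]
    return (min(products), max(products))


def excess_width(h):
    """Excess width of the interval evaluation of (x+1)(x-1) over [-h, h]."""
    x = (-h, h)
    enclosure = imul((x[0] + 1, x[1] + 1), (x[0] - 1, x[1] - 1))
    exact = (Fraction(-1), h * h - 1)  # exact range of x^2 - 1 over [-h, h]
    return (exact[0] - enclosure[0]) + (enclosure[1] - exact[1])


if __name__ == "__main__":
    for k in range(0, 6):
        h = Fraction(1, 2 ** k)
        width = 2 * h
        e = excess_width(h)
        # The ratio E / w stays bounded as the width w shrinks.
        print(float(width), float(e), float(e / width))
```

For $h = 2$ the function reproduces the excess width 8 computed after Definition 1.7, and for smaller $h$ the ratio of excess width to interval width remains bounded, as the theorem predicts.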
1.3.2 Application of Interval Arithmetic: Examples

We give one such example here.

Example 1.22
Using 4-digit decimal floating point arithmetic, compute an interval enclosure for the first two digits of $e$, and prove that these two digits are correct.

Solution: The fifth degree Taylor polynomial representation for $e$ is

\[ e = 1 + 1 + \frac{1}{2!} + \frac{1}{3!} + \frac{1}{4!} + \frac{1}{5!} + \frac{1}{6!} e^{\xi} \]

for some $\xi \in [0, 1]$. If we assume we know $e < 3$ and we assume we know $e^x$ is an increasing function of $x$, then the error term is bounded by

\[ \left| \frac{1}{6!} e^{\xi} \right| \le \frac{3}{6!} < 0.005, \]

so this fifth-degree polynomial representation should be adequate. We will evaluate each term with interval arithmetic, and we will replace $e^{\xi}$ with $[1, 3]$. We obtain the following computation:

[1.000, 1.000] + [1.000, 1.000] → [2.000, 2.000]
[1.000, 1.000] / [2.000, 2.000] → [0.5000, 0.5000]
[2.000, 2.000] + [0.5000, 0.5000] → [2.500, 2.500]
[1.000, 1.000] / [6.000, 6.000] → [0.1666, 0.1667]
[2.500, 2.500] + [0.1666, 0.1667] → [2.666, 2.667]
[1.000, 1.000] / [24.00, 24.00] → [0.04166, 0.04167]
[2.666, 2.667] + [0.04166, 0.04167] → [2.707, 2.709]
[1.000, 1.000] / [120.0, 120.0] → [0.008333, 0.008334]
[2.707, 2.709] + [0.008333, 0.008334] → [2.715, 2.718]
[1.000, 1.000] / [720.0, 720.0] → [0.001388, 0.001389]
[0.001388, 0.001389] × [1, 3] → [0.001388, 0.004167]
[2.715, 2.718] + [0.001388, 0.004167] → [2.716, 2.723]
Since we used outward rounding in these computations, this constitutes a mathematical proof that $e \in [2.716, 2.723]$.

Note:

1. These computations can be done automatically on a computer, as simply as evaluating the function in floating point arithmetic. We will explain some programming techniques for this in Chapter 6, Section 6.2.

2. The solution is illustrative. More sophisticated methods, such as argument reduction, would be used in practice to bound values of $e^x$ more accurately and with fewer operations.
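As one way to see how the computation of Example 1.22 can be automated, the following Python sketch emulates 4-digit decimal arithmetic with outward rounding using the standard `decimal` module. It is only an illustration under that assumption; the helper names (`iadd`, `imul_nonneg`, `irecip`, `enclose_e`) are hypothetical and are not the programming techniques of Chapter 6.

```python
from decimal import Decimal, Context, ROUND_FLOOR, ROUND_CEILING

DOWN = Context(prec=4, rounding=ROUND_FLOOR)    # lower end points round down
UP = Context(prec=4, rounding=ROUND_CEILING)    # upper end points round up


def iadd(x, y):
    """Outwardly rounded interval sum."""
    return (DOWN.add(x[0], y[0]), UP.add(x[1], y[1]))


def imul_nonneg(x, y):
    """Outwardly rounded product, valid when all end points are nonnegative."""
    return (DOWN.multiply(x[0], y[0]), UP.multiply(x[1], y[1]))


def irecip(n):
    """Outwardly rounded enclosure of 1/n for a positive integer n."""
    one, den = Decimal(1), Decimal(n)
    return (DOWN.divide(one, den), UP.divide(one, den))


def enclose_e():
    """Enclose e by 1 + 1 + 1/2! + ... + 1/5! + (1/6!) * [1, 3]."""
    total = (Decimal(1), Decimal(1))
    factorial = 1
    for k in range(1, 6):
        factorial *= k
        total = iadd(total, irecip(factorial))
    remainder = imul_nonneg(irecip(720), (Decimal(1), Decimal(3)))
    return iadd(total, remainder)


if __name__ == "__main__":
    print(enclose_e())  # (Decimal('2.716'), Decimal('2.723'))
```

Each intermediate interval produced by this sketch matches the corresponding line of the hand computation above.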
Proofs of the theorems, as well as greater detail, appear in various texts on interval arithmetic. A good book on interval arithmetic is R. E. Moore's classic text [27], although numerous more recent monographs and reviews are available. A World Wide Web search on the term "interval computations" will lead to some of these.

A general introduction to interval computations is [26]. That work not only gives a complete introduction, with numerous examples and explanation of pitfalls, but also provides examples with intlab, a free matlab toolbox for interval computations, and reference material for intlab. If you have matlab available, we recommend intlab for the exercises in this book involving interval computations.
1.4 Programming Environments

Modern scientific computing (with floating point numbers) is usually done with high-level "imperative" (as opposed to "functional") programming languages. Common programming environments in use for general scientific computing today are Fortran (or FORmula TRANslation), C/C++, and matlab. Fortran is the original such language, with its origins in the late 1950's. There is a large body of high-quality publicly available software in Fortran for common computations in numerical analysis and scientific computing. Such software can be found, for example, on NETLIB at

http://www.netlib.org/

Fortran has evolved over the years, becoming a modern, multi-faceted language with the Fortran 2003 standard. Throughout, the emphasis by both the standardization committee and suppliers of compilers for Fortran has been on features that simplify programming of solutions to large problems in numerical analysis and scientific computing, and features that enable high performance, especially on computers that can process vectors and matrices efficiently.

The "C" language, originally developed in conjunction with the Unix operating system, was originally meant to be a higher-level language for designing and accessing the operating system, but has become more ubiquitous since then. C++, appearing in the late 1980's, was the first widely available language (footnote 7) to allow the object-oriented programming paradigm. In recent years, computer science departments have favored teaching C++ over teaching Fortran, and Fortran has fallen out of favor in relative terms. However, Fortran is still favored in certain large-scale applications such as fluid dynamics (e.g. in weather prediction and similar simulations), and some courses are still offered in it in engineering schools. Some people, however, still think of Fortran as the now somewhat rudimentary language known as FORTRAN 77.

Reasonably high-quality compilers for both Fortran and C/C++ are available free of charge with Linux operating systems. Fortran 2003, largely implemented in these compilers, has a standardized interface to C, so functions written in C can be called from Fortran programs, and vice versa. These compilers include interactive graphical-user-interface-oriented debuggers, such as "insight," available with the Linux operating system. Commercially available compilation and debugging systems are also available under Windows.
The matlab system has become increasingly popular over the last two decades or so. matlab (or MATrix LABoratory) began in the early 1980's as a National Science Foundation project, written by Cleve Moler in FORTRAN 66, to provide an interactive environment for computing with matrices and vectors, but has since evolved to be both an interactive environment and full-featured programming language. matlab is highly favored in courses such as this because of the ease of programming, debugging, and general use (such as graphing), and because of the numerous toolboxes, supplied by both Mathworks (Cleve Moler's company) and others, for many basic computing tasks and applications. The main drawback to use of matlab in all scientific computations is that the language is interpreted, that is, matlab translates each line of a program to machine language each time that it executes the line. This makes complicated programs that involve nested iterations much slower (a factor of 60 or more) than comparable programs written in a compiled language such as C or Fortran. However, functions compiled from Fortran or C/C++ can be called from matlab. A common strategy has been to initially develop algorithms in matlab, then translate all or part of the program to a compilable language, as necessary for efficiency.

One perceived disadvantage of matlab is that it is proprietary. Undesirable possible consequences are that it is not free, and there is no official guarantee that it will be available forever, unchanged. However, its use has become so widespread in recent years that these concerns do not seem to be major. Several projects, including "Octave" and "Scilab," have produced free products that partially support the matlab programming language. The most widely distributed of these, "Octave," is integrated into Linux systems. However, the object-oriented features of Octave are rudimentary compared to those of matlab, and some toolboxes, such as intlab (which we will mention later), will not function with Octave.
Footnote 7: with others, including Fortran, to follow.
Alternative systems sometimes used for scientific computing are computer algebra systems. Perhaps the most common of these are Mathematica and Maple, while a free such system under development is "SAGE." These systems admit a different way of thinking about programming, termed functional programming, in which rules are defined and available at all times to automatically simplify expressions that are presented to the system. (In contrast, in imperative programming, a sequence of commands is executed, one after the other.) Although these systems have become comprehensive, they are based in computations of a different character, rather than in the floating point computations and linear algebra involved in numerical analysis and scientific computing.

We will use matlab in this book to illustrate the concepts, techniques, and applications. With newer versions of matlab, a student can study how to use the system and make programs largely by using the matlab help system. The first place to turn will be the "Getting started" demos, which in newer versions are presented as videos. There are also many books devoted to use of matlab. Furthermore, we will be giving examples throughout this book.

matlab programs can be written as matlab scripts and matlab functions.
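Example 1.23
The matlab script we used to produce the table following Example 1.7 (on page 8) is:

a = 2;
x = 2;
xold = x;
err_old = 1;
for k = 0:10
    k                        % display the iteration number
    x                        % display the current iterate
    err = x - sqrt(2);       % error in the current iterate
    err
    ratio = err/err_old^2    % error ratio for quadratic convergence
    err_old = err;
    x = x/2 + 1/x;           % x_{k+1} = x_k/2 + a/(2 x_k) with a = 2
end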
Example 1.24
The matlab script we used to produce the table in Example 1.8 (on page 9) is:

format long
a = 2;
x = 2;
xold = x;
err_old = 1;
for k = 0:25
    k                          % display the iteration number
    x                          % display the current iterate
    err = x - sqrt(2);         % error in the current iterate
    err
    ratio = err/err_old        % error ratio for linear convergence
    err_old = err;
    x = x - x^2/3.5 + 2/3.5;   % x_{k+1} = x_k - x_k^2/3.5 + 2/3.5
end
An excellent alternative text book that focuses on matlab functions is Cleve Moler's Numerical Computing with Matlab [25]. An on-line version, along with "m" files, etc., is currently available at http://www.mathworks.com/moler/chapters.html
1.5 Applications

The purpose of the methods and techniques in this book ultimately is to provide both accurate predictions and insight into practical problems. This includes understanding and predicting, and managing or controlling, the evolution of ecological systems and epidemics; designing and constructing durable but inexpensive bridges, buildings, roads, and water control structures; understanding chemical and physical processes; designing chemical plants and electronic components and systems; minimizing costs or maximizing delivery of products or services within companies and governments; etc. To achieve these goals, the numerical methods are a small part of the overall modeling process, which can be viewed as consisting of the following steps.
Identify the problem: This is the first step in translating an often vague situation into a mathematical problem to be solved. What questions must be answered, and how can they be quantified?

Assumptions: Which factors are to be ignored, and which are important? The real world is usually significantly more complicated than mathematical models of it, and simplifications must be made, because some factors are poorly understood, because there isn't enough data to determine some minor factors, or because it is not practical to accurately solve the resulting equations unless the model is simplified. For example, the theory of relativity and variations in the acceleration of gravity, due to the fact that the earth is not exactly round and due to the fact that the density differs from point to point on the surface of the earth, in principle will affect the trajectory of a baseball as it leaves the bat. However, such effects can be ignored when we write down a model of the trajectory of the baseball. On the other hand, we need to include such effects if we are measuring the change in distance between two satellites in a tandem orbit to detect the location of mineral deposits on the surface of the earth.

Construction: In this step, we actually translate the problem into mathematical language.

Analysis: We solve the mathematical problem. Here is where the numerical techniques in this book come into play. With more complicated models, there is an interplay between the previous three steps and this solution process: We may need to simplify the process to enable practical solution. Also, presentation of the result is important here, to maximize the usefulness of the results. In the early days of scientific computing, printouts of numbers were used, but, increasingly, results are presented as two- and three-dimensional graphs and movies.

Interpretation: The numerical solution is compared to the original problem. If it does not make sense, go back and reformulate the assumptions.

Validation: Compare the model to real data. For example, in climate models, the model might be used to predict climate changes in past years, before it is used to predict future climate changes.
Note that there is an intrinsic error introduced in the modeling process (such as when certain phenomena are ignored) that is outside the scope of our study of numerical methods. Such error can only be measured indirectly, through the interpretation and validation steps. In the model solution process (the "analysis" step), errors are also introduced, due to roundoff error and the approximation process. We have seen that such error consists of approximation error and roundoff error. In a study of numerical methods and numerical analysis, we quantify and find bounds on such errors. Although this may not be the major source of error, it is important to know. Consequences of this type of error might be that a good model is rejected, that incorrect conclusions are deduced about the process being modeled, etc. The authors of this book have personal experience with these events.

Errors in the modeling process can sometimes be quantified in the solution process. If the model depends on parameters that are not known precisely, but bounds on those parameters are known, knowledge of these bounds can sometimes be incorporated into the mathematical equations, and the set of possible solutions can sometimes be computed or bounded. One tool that sometimes works is interval arithmetic. Other tools, less mathematically definite but applicable in different situations, are statistical methods and computing solutions to the model for many different values of the parameter.

Throughout this book, we introduce applications from many areas.
Example 1.25
The formula for the net capacitance when two capacitors of values $x$ and $y$ are connected in series is

\[ z = \frac{xy}{x + y}. \]

Suppose the measured values of $x$ and $y$ are $x = 1$ and $y = 2$, respectively. Estimate the range of possible values of $z$, given that the true values of $x$ and $y$ are known to be within ±10% of the measured value.

In this example, the identification, assumptions, and construction have already been done. (It is well known how capacitances in a linear electrical circuit behave.) We are asked to analyze the error in the output of the computation, due to errors in the data. We may proceed using interval arithmetic, relying on the accuracy assumptions for the measured values. In particular, these assumptions imply that $x \in [0.9, 1.1]$ and $y \in [1.8, 2.2]$. We will plug these intervals into the expression for $z$, but we first use Theorem 1.8, part (2) as a guide to rewrite the expression for $z$ so $x$ and $y$ only occur once. (We do this so we obtain sharp bounds on the range, without overestimation.) Dividing the numerator and denominator for $z$ by $xy$, we obtain

\[ z = \frac{1}{\frac{1}{x} + \frac{1}{y}}. \]

We use the intlab toolbox (footnote 8) for matlab to evaluate $z$. We have the following dialog in matlab's command window:
>> intvalinit('DisplayInfsup')
===> Default display of intervals by infimum/supremum
>> x = intval('[0.9,1.1]')
intval x =
[ 0.8999, 1.1001]
>> y = intval('[1.8,2.2]')
intval y =
[ 1.7999, 2.2001]
>> z = 1/(1/x + 1/y)
intval z =
[ 0.5999, 0.7334]
>> format long
>> z
intval z =
[ 0.59999999999999, 0.73333333333334]
>>

Footnote 8: If one has matlab, intlab is available free of charge for non-commercial use from http://www.ti3.tu-harburg.de/~rump/intlab/
Thus, the capacitance must lie between 0.5999 and 0.7334.

Note that $x$ and $y$ are input as strings. This is to assure that roundoff errors in converting the decimal expressions 0.9, 1.1, 1.8, and 2.2 into internal binary format are taken into account. See [26] for more examples of the use of intlab.
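For readers without intlab, the effect of rewriting $z$ as a single use expression can be checked with a small Python sketch. It uses exact rational end points, so, unlike intlab, it ignores roundoff entirely; the helper names (`iadd`, `imul_pos`, `irecip_pos`) are illustrative assumptions and are valid only for intervals with positive end points.

```python
from fractions import Fraction


def iadd(x, y):
    """Interval sum."""
    return (x[0] + y[0], x[1] + y[1])


def imul_pos(x, y):
    """Interval product for intervals with positive end points."""
    return (x[0] * y[0], x[1] * y[1])


def irecip_pos(x):
    """Interval reciprocal for an interval with positive end points."""
    return (1 / x[1], 1 / x[0])


x = (Fraction(9, 10), Fraction(11, 10))    # 1 +/- 10%
y = (Fraction(18, 10), Fraction(22, 10))   # 2 +/- 10%

# z = xy / (x + y): x and y each occur twice, so the range is overestimated.
naive = imul_pos(imul_pos(x, y), irecip_pos(iadd(x, y)))
# z = 1 / (1/x + 1/y): each variable occurs once, so the range is sharp.
single_use = irecip_pos(iadd(irecip_pos(x), irecip_pos(y)))

print([float(v) for v in naive])
print([float(v) for v in single_use])
```

The single use form gives the exact range $[3/5, 11/15]$, in agreement with the intlab output up to outward rounding, while the original form of the expression gives a noticeably wider enclosure.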
1.6 Exercises

1. Write down a polynomial $p(x)$ such that $|S(x) - p(x)| \le 10^{-10}$ for $-0.2 \le x \le 0.2$, where

\[ S(x) = \begin{cases} \dfrac{\sin(x)}{x} & \text{if } x \ne 0, \\ 1 & \text{if } x = 0. \end{cases} \]

Note: $\mathrm{sinc}(x) = S(\pi x) = \sin(\pi x)/(\pi x)$ is the "sinc" function (well-known in signal processing, etc.).

(a) Show that your polynomial $p$ satisfies the condition $|\mathrm{sinc}(x) - p(x)| \le 10^{-10}$ for $x \in [-0.2, 0.2]$.
Hint: You can obtain polynomial approximations with error terms for $\mathrm{sinc}(x)$ by writing down Taylor polynomials and corresponding error terms for $\sin(x)$, then dividing these by $x$. This can be easier than trying to differentiate $\mathrm{sinc}(x)$. For the proof part, you can use, for example, the Taylor polynomial remainder formula or the alternating series test, and you can use interval arithmetic to obtain bounds.

(b) Plot your polynomial approximation and $\mathrm{sinc}(x)$ on the same graph,
(i) over the interval $[-0.2, 0.2]$,
(ii) over the interval $[-3, 3]$,
(iii) over the interval $[-10, 10]$.
2. Suppose $f$ has a continuous third derivative. Show that

\[ \left| \frac{f(x + h) - f(x - h)}{2h} - f'(x) \right| = O(h^2). \]

3. Suppose $f$ has a continuous fourth derivative. Show that

\[ \left| \frac{f(x + h) - 2f(x) + f(x - h)}{h^2} - f''(x) \right| = O(h^2). \]

4. Let $a = 0.41$, $b = 0.36$, and $c = 0.7$. Assuming a 2-digit decimal computer arithmetic with rounding, show that $\frac{a - b}{c} \ne \frac{a}{c} - \frac{b}{c}$ when using this arithmetic.

5. Write down a formula relating the unit roundoff $\delta$ of Definition 1.3 (page 14) and the machine epsilon $\epsilon_m$ defined on page 20.
6. Store and run the following matlab script. What are your results? What does the script compute? Can you say anything about the computer arithmetic underlying matlab?

eps = 1;
x = 1+eps;
while(x~=1)
    eps = eps/2;
    x = 1+eps;
end
eps = eps+(2*eps)^2
y = 1+eps;
y-1
7. Suppose, for illustration, we have a system with base $\beta = 10$, $t = 3$ decimal digits in the mantissa, and $L = -9$, $U = 9$ for the exponent. For example, $0.123 \times 10^4$, that is, 1230, is a machine number in this system. Suppose also that "round to nearest" is used in this system.

(a) What is HUGE for this system?

(b) What is TINY for this system?

(c) What is the machine epsilon $\epsilon_m$ for this system?

(d) Let $f(x) = \sin(x) + 1$.
i. Write down $\mathrm{fl}(f(0))$ and $\mathrm{fl}(f(0.0008))$ in normalized format for this toy system.
ii. Compute $\mathrm{fl}(\mathrm{fl}(f(0.0008)) - \mathrm{fl}(f(0)))$. On the other hand, what is the nearest machine number to the exact value of $f(0.0008) - f(0)$?
iii. Compute $\mathrm{fl}(\mathrm{fl}(f(0.0008)) - \mathrm{fl}(f(0)))/\mathrm{fl}(0.0008)$. Compare this to the nearest machine number to the exact value of $(f(0.0008) - f(0))/0.0008$ and to $f'(0)$.

8. Let $f(x) = \dfrac{\ln(x + 1) - \ln(x)}{2}$.

(a) Use four-digit decimal arithmetic with rounding to evaluate $f(100{,}000)$.

(b) Use the Mean Value Theorem to approximate $f(x)$ in a form that avoids the loss of significant digits. Use this form to evaluate $f(x)$ for $x = 100{,}000$ once again.

(c) Compare the relative errors for the answers obtained in (a) and (b).

9. Compute the condition number of $f(x) = e^{\sqrt{x^2 - 1}}$, $x > 1$, and discuss any possible ill-conditioning.

10. Let $f(x) = (\sin(x))^2 + x^2$. Use interval arithmetic to prove that there are no solutions to $f(x) = 0$ for $x \in [-1, -0.8]$.
for $h \le 1$, that is, $\epsilon = 1$ and $C = e/2$ work.
Example 1.6
Show that $\left| \frac{f(x+h) - f(x)}{h} - f'(x) \right| = O(h)$ for $x, x + h \in [a, b]$, assuming that $f$ has two continuous derivatives at each point in $[a, b]$.

PROOF

\begin{align*}
\left| \frac{f(x + h) - f(x)}{h} - f'(x) \right|
&= \left| \frac{f(x) + f'(x)h + \int_x^{x+h} (x + h - t) f''(t)\,dt - f(x)}{h} - f'(x) \right| \\
&= \frac{1}{h} \left| \int_x^{x+h} (x + h - t) f''(t)\,dt \right|
\le \max_{a \le t \le b} |f''(t)| \, \frac{h}{2} = ch.
\end{align*}
1.1.3 Convergence Rates

DEFINITION 1.2 Let $\{x_k\}$ be a sequence with limit $x^*$. If there are constants $C$ and $\alpha$ and an integer $N$ such that $|x_{k+1} - x^*| \le C |x_k - x^*|^{\alpha}$ for $k \ge N$, we say that the rate of convergence is of order at least $\alpha$. If $\alpha = 1$ (with $C < 1$), the rate is said to be linear. If $\alpha = 2$, the rate is said to be quadratic.
Example 1.7
A sequence sometimes learned in elementary classes for computing the square root of a number $a$ is

\[ x_{k+1} = \frac{x_k}{2} + \frac{a}{2x_k}. \]

We have

\begin{align*}
x_{k+1} - \sqrt{a} &= \frac{x_k}{2} + \frac{a}{2x_k} - \sqrt{a} \\
&= x_k - \frac{x_k^2 - a}{2x_k} - \sqrt{a} \\
&= (x_k - \sqrt{a}) - \frac{(x_k - \sqrt{a})(x_k + \sqrt{a})}{2x_k} \\
&= (x_k - \sqrt{a}) \left( 1 - \frac{x_k + \sqrt{a}}{2x_k} \right) \\
&= (x_k - \sqrt{a}) \, \frac{x_k - \sqrt{a}}{2x_k} \\
&= \frac{1}{2x_k} (x_k - \sqrt{a})^2 \\
&\approx \frac{1}{2\sqrt{a}} (x_k - \sqrt{a})^2
\end{align*}

for $x_k$ near $\sqrt{a}$, thus showing that the convergence rate is quadratic.
Quadratic convergence is very fast. We can think of quadratic convergence, with $C \approx 1$, as doubling the number of significant figures on each iteration. (In contrast, linear convergence with $C = 0.1$ adds one decimal digit of accuracy to the approximation on each iteration.) For example, if we use the square root computation from Example 1.7 with $a = 2$, and starting with $x_0 = 2$, we obtain the following table.

k    x_k                  |x_k - √2|           |x_k - √2| / (x_{k-1} - √2)^2
0    2                    0.5858 × 10^0        --
1    1.5                  0.8579 × 10^-1       0.2500
2    1.416666666666667    0.2453 × 10^-2       0.3333
3    1.414215686274510    0.2123 × 10^-5       0.3529
4    1.414213562374690    0.1594 × 10^-11      0.3535
5    1.414213562373095    0.2220 × 10^-15      --

This table illustrates that the total number of correct digits more than doubles on each iteration. In fact, the multiplying factor $C$ for the quadratic convergence appears to be approaching 0.3535. (The last error ratio is not meaningful in this sense, because only roughly 16 digits were carried in the computation.) Based on our analysis, the limiting value of $C$ should be about $1/(2\sqrt{2}) \approx 0.353553390593274$. (We explain how we computed the table at the end of this chapter.)
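The columns of this table can be regenerated with a short program. The sketch below is a Python rendering (the book's own scripts are in matlab) in IEEE double precision; the function name `sqrt_iteration` and the row layout are assumptions made for illustration.

```python
import math


def sqrt_iteration(a, x0, steps):
    """Rows (k, x_k, |x_k - sqrt(a)|, error ratio) for x_{k+1} = x_k/2 + a/(2 x_k)."""
    root = math.sqrt(a)
    rows = []
    x, prev_err = x0, None
    for k in range(steps + 1):
        err = abs(x - root)
        # Ratio |e_k| / e_{k-1}^2; undefined for k = 0 or once e_{k-1} is 0.
        ratio = err / prev_err ** 2 if prev_err else None
        rows.append((k, x, err, ratio))
        prev_err = err
        x = x / 2 + a / (2 * x)
    return rows


if __name__ == "__main__":
    for k, x, err, ratio in sqrt_iteration(2.0, 2.0, 5):
        print(k, repr(x), err, ratio)
```

From the derivation in Example 1.7, each ratio equals $1/(2x_{k-1})$ up to roundoff, which is why the ratios approach $1/(2\sqrt{2})$.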
Example 1.8
As an example of linear convergence, consider the iteration

\[ x_{k+1} = x_k - \frac{x_k^2}{3.5} + \frac{2}{3.5}, \]

which converges to $\sqrt{2}$. We obtain the following table.

k     x_k                  |x_k - √2|           |x_k - √2| / |x_{k-1} - √2|
0     2                    0.5858 × 10^0        --
1     1.428571428571429    0.1436 × 10^-1       0.2451 × 10^-1
2     1.416909620991254    0.2696 × 10^-2       0.1878
3     1.414728799831946    0.5152 × 10^-3       0.1911
4     1.414312349239392    0.9879 × 10^-4       0.1917
5     1.414232514607664    0.1895 × 10^-4       0.1918
6     1.414217198786659    0.3636 × 10^-5       0.1919
7     1.414214260116949    0.6955 × 10^-6       0.1919
8     1.414213696254626    0.1339 × 10^-6       0.1919
19    1.414213562373097    0.1554 × 10^-14      --

Here, the constant $C$ in the linear convergence, to four significant digits, appears to be $0.1919 \approx 1/5$. That is, the error is reduced by approximately a factor of 5 each iteration. We can think of this as obtaining one more correct base-5 digit on each iteration.
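The observed constant can be compared with the derivative of the iteration function at the fixed point, $1 - 2\sqrt{2}/3.5$. This comparison is an added observation rather than a statement from the text, and the Python sketch below (function name `linear_iteration` assumed) is only one way to check it in double precision.

```python
import math


def linear_iteration(x0, steps):
    """Run x_{k+1} = x_k - x_k^2/3.5 + 2/3.5 and record successive error ratios."""
    root = math.sqrt(2.0)
    x, ratios = x0, []
    for _ in range(steps):
        x_next = x - x * x / 3.5 + 2.0 / 3.5
        ratios.append(abs(x_next - root) / abs(x - root))
        x = x_next
    return x, ratios


if __name__ == "__main__":
    x, ratios = linear_iteration(2.0, 8)
    print(repr(x))
    print([round(r, 4) for r in ratios])
    # Derivative of g(x) = x - x^2/3.5 + 2/3.5 at sqrt(2), for comparison.
    print(1.0 - 2.0 * math.sqrt(2.0) / 3.5)
```

The first ratio is small because $x_0 = 2$ is far from the fixed point; the later ratios settle near 0.1919.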
1.2 Computer Arithmetic

In numerical solution of mathematical problems, two common types of error are:

1. Method (algorithm or truncation) error. This is the error due to approximations made in the numerical method.

2. Rounding error. This is the error made due to the finite number of digits available on a computer.
Example 1.9
By the mean value theorem for integrals (Theorem 1.2, as in Example 1.6 on page 7), if $f \in C^2[a, b]$, then

\[ f'(x) = \frac{f(x + h) - f(x)}{h} - \frac{1}{h} \int_x^{x+h} f''(t)(x + h - t)\,dt \]

and

\[ \left| \frac{1}{h} \int_x^{x+h} f''(t)(x + h - t)\,dt \right| \le ch. \]

Thus, $f'(x) \approx (f(x + h) - f(x))/h$, and the error is $O(h)$. We will call this the method error or truncation error, as opposed to roundoff errors due to using machine approximations.

Now consider $f(x) = \ln x$, and approximate $f'(3) \approx \frac{\ln(3 + h) - \ln 3}{h}$ for $h$ small, using a calculator having 11 digits. The following results were obtained.

h         (ln(3 + h) - ln(3))/h    Error = 1/3 - (ln(3 + h) - ln(3))/h = O(h)
10^-1     0.3278982                5.44 × 10^-3
10^-2     0.332779                 5.54 × 10^-4
10^-3     0.3332778                5.55 × 10^-5
10^-4     0.333328                 5.33 × 10^-6
10^-5     0.333330                 3.33 × 10^-6
10^-6     0.333300                 3.33 × 10^-5
10^-7     0.333                    3.33 × 10^-4
10^-8     0.33                     3.33 × 10^-3
10^-9     0.3                      3.33 × 10^-2
10^-10    0.0                      3.33 × 10^-1

One sees that, in the first four steps, the error decreases by a factor of 10 as $h$ is decreased by a factor of 10. (That is, the method error dominates.) However, starting with $h = 0.00001$, the error no longer decreases by this factor, and it then increases as $h$ is decreased further. (The error due to a finite number of digits, i.e., roundoff error, dominates.)
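The same trade-off between method error and roundoff error appears in IEEE double precision, only at smaller $h$ than on an 11-digit calculator. The following Python sketch (an illustration with an assumed function name `forward_difference_error`) tabulates the forward difference error for $f(x) = \ln x$ at $x = 3$.

```python
import math


def forward_difference_error(h, x=3.0):
    """|f'(x) - (f(x+h) - f(x))/h| for f = ln, whose derivative is 1/x."""
    approx = (math.log(x + h) - math.log(x)) / h
    return abs(1.0 / x - approx)


# Errors for h = 10^-1, 10^-2, ..., 10^-15, keyed by the exponent.
errors = {k: forward_difference_error(10.0 ** (-k)) for k in range(1, 16)}

if __name__ == "__main__":
    for k, err in errors.items():
        print(f"h = 1e-{k:02d}   error = {err:.3e}")
```

With roughly 16 significant digits, the error typically keeps decreasing until $h$ is near $10^{-8}$ and then grows again, mirroring the behavior in the table above.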
There are two possible ways to reduce rounding error:

1. The method error can be reduced by using a more accurate method. This allows larger $h$ to be used, thus avoiding roundoff error. Consider

\[ f'(x) = \frac{f(x + h) - f(x - h)}{2h} + \text{error}, \quad \text{where error is } O(h^2). \]

h        (ln(3 + h) - ln(3 - h))/(2h)    error
0.1      0.3334568                       1.24 × 10^-4
0.01     0.3333345                       1.23 × 10^-6
0.001    0.3333333                       1.91 × 10^-8

The error decreases by a factor of 100 as $h$ is decreased by a factor of 10.

2. Rounding error can be reduced by using more digits of accuracy, such as using double precision (or multiple precision) arithmetic.
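The second-order behavior of the central difference in item 1 is easy to confirm. The sketch below uses Python in double precision rather than the 11-digit calculator of the example, and the function name `central_difference_error` is an assumption made for illustration.

```python
import math


def central_difference_error(h, x=3.0):
    """|f'(x) - (f(x+h) - f(x-h))/(2h)| for f = ln, whose derivative is 1/x."""
    approx = (math.log(x + h) - math.log(x - h)) / (2.0 * h)
    return abs(1.0 / x - approx)


if __name__ == "__main__":
    previous = None
    for h in (0.1, 0.01, 0.001):
        err = central_difference_error(h)
        factor = previous / err if previous else float("nan")
        print(f"h = {h:<6} error = {err:.3e}  reduction factor = {factor:.1f}")
        previous = err
```

Because more digits are carried here than on the calculator, the factor of 100 persists for $h = 0.001$ as well.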
To fully understand and avoid roundoff error, we should study some details of how computers and calculators represent and work with approximate numbers.

1.2.1 Floating Point Arithmetic and Rounding Error
Let $\beta$ be a positive integer, the base of the computer system. (Usually, $\beta = 2$ (binary) or $\beta = 16$ (hexadecimal).) Suppose a number $x$ has the exact base representation

\[ x = (\pm 0.\alpha_1 \alpha_2 \alpha_3 \cdots \alpha_t \alpha_{t+1} \cdots) \beta^m = \pm q \beta^m, \]

where $q$ is the mantissa, $\beta$ is the base, $m$ is the exponent, $1 \le \alpha_1 \le \beta - 1$, and $0 \le \alpha_i \le \beta - 1$ for $i > 1$.

On a computer, we are restricted to a finite set of floating-point numbers $F = F(\beta, t, L, U)$ of the form $x^* = (\pm 0.a_1 a_2 \cdots a_t) \beta^m$, where $1 \le a_1 \le \beta - 1$, $0 \le a_i \le \beta - 1$ for $2 \le i \le t$, $L \le m \le U$, and $t$ is the number of digits. (In most floating point systems, $L$ is about $-64$ to $-1000$ and $U$ is about 64 to 1000.)
Example 1.10
(binary) $\beta = 2$

\[ x^* = (0.1011)_2 \times 2^3 = \left( 1 \times \frac{1}{2} + 0 \times \frac{1}{4} + 1 \times \frac{1}{8} + 1 \times \frac{1}{16} \right) \times 8 = \frac{11}{2} = 5.5 \text{ (decimal)}. \]

REMARK 1.1 Most numbers cannot be exactly represented on a computer. Consider $x = 10.1 = 1010.0001\,1001\,1001\ldots$ ($\beta = 2$). If $L = -127$, $U = 127$, $t = 24$, and $\beta = 2$, then $x \approx x^* = (0.10100001\,1001\,1001\,1001\,1001)_2 \times 2^4$.
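Both the conversion in Example 1.10 and the claim in Remark 1.1 can be checked with exact rational arithmetic. The Python sketch below is illustrative only; the helper name `binary_fraction_value` is an assumption, and `Fraction(10.1)` exposes the exact value of the IEEE double nearest to 10.1, which has 53 rather than 24 significant bits.

```python
from fractions import Fraction


def binary_fraction_value(bits, exponent):
    """Exact value of (0.b1 b2 ... bt)_2 * 2**exponent."""
    mantissa = sum(Fraction(int(b), 2 ** (i + 1)) for i, b in enumerate(bits))
    return mantissa * Fraction(2) ** exponent


# Example 1.10: (0.1011)_2 * 2^3
example = binary_fraction_value("1011", 3)

# Remark 1.1: the 24-bit representation of 10.1 versus the exact value 101/10.
exact = Fraction(101, 10)
t24 = binary_fraction_value("101000011001100110011001", 4)

# The double-precision number stored for the literal 10.1 is also not exact.
stored = Fraction(10.1)

if __name__ == "__main__":
    print(example)               # 11/2
    print(float(exact - t24))    # small positive representation error
    print(stored == exact)       # False
```

The 24-bit value lies slightly below 10.1, with an error smaller than the spacing $2^{4-24}$ between neighboring machine numbers.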
Question: Given a real number $x$, how do we define a floating point number $\mathrm{fl}(x)$ in $F$ such that $\mathrm{fl}(x)$ is close to $x$?

On modern machines, one of the following four ways is used to approximate a real number $x$ by a machine-representable number $\mathrm{fl}(x)$.

round down: $\mathrm{fl}(x) = x\downarrow$, the nearest machine representable number to the real number $x$ that is less than or equal to $x$.

round up: $\mathrm{fl}(x) = x\uparrow$, the nearest machine number to the real number $x$ that is greater than or equal to $x$.

round to nearest: $\mathrm{fl}(x)$ is the nearest machine number to the real number $x$.

round to zero, or "chopping": $\mathrm{fl}(x)$ is the nearest machine number to the real number $x$ that is no farther from 0 than $x$. The term "chopping" is because we simply "chop" the expansion of the real number, that is, we simply ignore the digits in the expansion of $x$ beyond the $t$-th one.

The default on modern systems is usually round to nearest, although chopping is faster or requires less circuitry. Round down and round up may be used, but with care, to produce results from a string of computations that are guaranteed to be less than or greater than the exact result.
Example 1.11
$\beta = 10$, $t = 5$, $x = 0.12345666\cdots \times 10^7$. Then

$\mathrm{fl}(x) = 0.12345 \times 10^7$ (chopping),
$\mathrm{fl}(x) = 0.12346 \times 10^7$ (rounded to nearest).

(In this case, round down corresponds to chopping, and round up corresponds to round to nearest.)
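The four rounding modes can be tried out on a decimal toy system with Python's standard `decimal` module. In the sketch below, the mapping `MODES`, the function name `fl`, and the use of the finite string "1234566.6" as a stand-in for $0.12345666\cdots \times 10^7$ are all assumptions made for illustration; ties in "round to nearest" are broken to even here, a detail the text does not specify.

```python
from decimal import (Context, Decimal, ROUND_CEILING, ROUND_DOWN,
                     ROUND_FLOOR, ROUND_HALF_EVEN)

MODES = {
    "round down": ROUND_FLOOR,
    "round up": ROUND_CEILING,
    "round to nearest": ROUND_HALF_EVEN,
    "chopping": ROUND_DOWN,
}


def fl(x, t, mode):
    """Round the decimal string x to t significant digits in the given mode."""
    return Context(prec=t, rounding=MODES[mode]).create_decimal(x)


if __name__ == "__main__":
    for mode in MODES:
        print(f"{mode:17s} {fl('1234566.6', 5, mode)}   {fl('-1234566.6', 5, mode)}")
```

For positive $x$, round down agrees with chopping, as remarked in Example 1.11; for negative $x$, it is round up that agrees with chopping.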
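See Figure 1.2 for an example with $\beta = 10$ and $t = 1$. In that figure, the exhibited floating point numbers are $(0.1) \times 10^1$, $(0.2) \times 10^1$, $\ldots$, $(0.9) \times 10^1$, $0.1 \times 10^2$.

[Figure: number line with tick marks at successive floating point numbers, from $\beta^{m-1} = 1$ to $\beta^m = 10^1$, with spacing $\beta^{m-t} = 10^0 = 1$ between successive floating point numbers.]

FIGURE 1.2: An example floating point system: $\beta = 10$, $t = 1$, and $m = 1$.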
Example 1.12
Let $a = 0.410$, $b = 0.000135$, and $c = 0.000431$. Assuming 3-digit decimal computer arithmetic with rounding to nearest, does $a + (b + c) = (a + b) + c$ when using this arithmetic?

Following the "rounding to nearest" definition of $fl$, we emulate the operations a machine would do, as follows:

$$a \leftarrow 0.410 \times 10^0, \quad b \leftarrow 0.135 \times 10^{-3}, \quad c \leftarrow 0.431 \times 10^{-3},$$

and

$$\begin{aligned}
fl(b + c) &= fl(0.135 \times 10^{-3} + 0.431 \times 10^{-3}) \\
&= fl(0.566 \times 10^{-3}) \\
&= 0.566 \times 10^{-3},
\end{aligned}$$

so

$$\begin{aligned}
fl(a + 0.566 \times 10^{-3}) &= fl(0.410 \times 10^0 + 0.566 \times 10^{-3}) \\
&= fl(0.410 \times 10^0 + 0.000566 \times 10^0) \\
&= fl(0.410566 \times 10^0) \\
&= 0.411 \times 10^0.
\end{aligned}$$

On the other hand,

$$\begin{aligned}
fl(a + b) &= fl(0.410 \times 10^0 + 0.135 \times 10^{-3}) \\
&= fl(0.410000 \times 10^0 + 0.000135 \times 10^0) \\
&= fl(0.410135 \times 10^0) \\
&= 0.410 \times 10^0,
\end{aligned}$$

so

$$\begin{aligned}
fl(0.410 \times 10^0 + c) &= fl(0.410 \times 10^0 + 0.431 \times 10^{-3}) \\
&= fl(0.410 \times 10^0 + 0.000431 \times 10^0) \\
&= fl(0.410431 \times 10^0) \\
&= 0.410 \times 10^0 \ne 0.411 \times 10^0.
\end{aligned}$$

Thus, the associative law does not hold for floating point addition with "round to nearest." Furthermore, this illustrates that accuracy is improved if numbers of like magnitude are added first in a sum.
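The two orders of summation in Example 1.12 can be replayed mechanically. The following is a minimal sketch, assuming that Python's `decimal` module with a 3-digit context is an acceptable model of the 3-digit machine; the variable names are illustrative only.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# A context that rounds every result to 3 significant decimal digits.
ctx = Context(prec=3, rounding=ROUND_HALF_EVEN)

a = Decimal("0.410")
b = Decimal("0.000135")
c = Decimal("0.000431")

small_first = ctx.add(a, ctx.add(b, c))   # a + (b + c)
large_first = ctx.add(ctx.add(a, b), c)   # (a + b) + c

print("a + (b + c) =", small_first)
print("(a + b) + c =", large_first)
```

Adding the two small numbers first keeps their combined contribution, which is large enough to change the third digit of $a$; adding them to $a$ one at a time loses both.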
The following error bound is useful in some analyses.

THEOREM 1.5
$$|x - fl(x)| \le \frac{1}{2}\,|x|\,\beta^{1-t}\,p,$$
where $p = 1$ for rounding and $p = 2$ for chopping.

DEFINITION 1.3 $\delta = \frac{p}{2}\beta^{1-t}$ is called the unit roundoff error.

Let $\epsilon = \dfrac{fl(x) - x}{x}$. Then $fl(x) = (1 + \epsilon)x$, where $|\epsilon| \le \delta$. With this, we have the following.

THEOREM 1.6
Let $\odot$ denote the operation $+$, $-$, $\times$, or $\div$, and let $x$ and $y$ be machine numbers. Then
$$fl(x \odot y) = (x \odot y)(1 + \epsilon), \quad \text{where } |\epsilon| \le \delta = \frac{p}{2}\beta^{1-t}.$$

Roundoff error that accumulates as a result of a sequence of arithmetic operations can be analyzed using this theorem. Such an analysis is called forward error analysis.

Because of the properties of floating point arithmetic, it is unreasonable to demand strict tolerances when the exact result is too large.
Example 1.13
Suppose $\beta = 10$ and $t = 3$ (3-digit decimal arithmetic), and suppose we wish to compute $10^4\pi$ with a computed value $x$ such that $|10^4\pi - x| < 10^{-2}$. The closest floating point number in our system to $10^4\pi$ is $x = 0.314 \times 10^5 = 31400$. However, $|10^4\pi - x| = 15.926\ldots$. Hence, it is impossible to find a number $x$ in the system with $|10^4\pi - x| < 10^{-2}$.

The error $|10^4\pi - x|$ in this example is called the absolute error in approximating $10^4\pi$. We see that absolute error is not an appropriate measure of error when using floating point arithmetic. For this reason, we use relative error.

DEFINITION 1.4 Let $x^*$ be an approximation to $x$. Then $|x - x^*|$ is called the absolute error, and $\left|\dfrac{x - x^*}{x}\right|$ is called the relative error.

For example,
$$\left|\frac{x - fl(x)}{x}\right| \le \delta = \frac{p}{2}\beta^{1-t} \quad \text{(unit roundoff error)}.$$
1.2.1.1 Expression Evaluation and Condition Numbers

We now examine some common situations in which roundoff error can become large, and explain how to avoid many of these situations.

Example 1.14
$\beta = 10$, $t = 4$, $p = 1$. (Thus, $\delta = \frac{1}{2} 10^{-3} = 0.0005$.) Let $x = 0.5795 \times 10^5$, $y = 0.6399 \times 10^5$. Then

$$fl(x + y) = 0.1219 \times 10^6 = (x + y)(1 + \epsilon_1), \quad \epsilon_1 \approx -3.28 \times 10^{-4}, \quad |\epsilon_1| < \delta,$$

and

$$fl(xy) = 0.3708 \times 10^{10} = (xy)(1 + \epsilon_2), \quad \epsilon_2 \approx -5.95 \times 10^{-5}, \quad |\epsilon_2| < \delta.$$

(Note: $x + y = 0.12194 \times 10^6$, $xy = 0.37082205 \times 10^{10}$.)
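The values of $\epsilon_1$ and $\epsilon_2$ in Example 1.14 can be checked with exact rational arithmetic. This is a sketch under the assumption that a 4-digit `decimal` context models the machine; the helper `relative_rounding_error` is a hypothetical name introduced here.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN
from fractions import Fraction

# 4-digit decimal arithmetic with round to nearest (beta = 10, t = 4, p = 1).
ctx = Context(prec=4, rounding=ROUND_HALF_EVEN)

# Unit roundoff delta = (p / 2) * beta**(1 - t) with p = 1.
delta = Fraction(1, 2) * Fraction(10) ** (1 - 4)


def relative_rounding_error(exact, computed):
    """Return epsilon such that computed = exact * (1 + epsilon), exactly."""
    return (Fraction(computed) - Fraction(exact)) / Fraction(exact)


x = Decimal("0.5795e5")
y = Decimal("0.6399e5")

# x + y and x * y below are exact: the default context carries 28 digits.
eps_sum = relative_rounding_error(x + y, ctx.add(x, y))
eps_prod = relative_rounding_error(x * y, ctx.multiply(x, y))

print("delta    =", float(delta))
print("eps_sum  =", float(eps_sum))
print("eps_prod =", float(eps_prod))
```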
Example 1.15
Suppose $\beta = 10$ and $t = 4$ (4-digit arithmetic), and suppose $x_1 = 10000$ and $x_2 = x_3 = \cdots = x_{1001} = 1$. Then

$$\begin{aligned}
fl(x_1 + x_2) &= 10000, \\
fl(x_1 + x_2 + x_3) &= 10000, \\
&\;\;\vdots \\
fl\Big(\sum_{i=1}^{1001} x_i\Big) &= 10000
\end{aligned}$$

when we sum forward from $x_1$. But going backwards,

$$\begin{aligned}
fl(x_{1001} + x_{1000}) &= 2, \\
fl(x_{1001} + x_{1000} + x_{999}) &= 3, \\
&\;\;\vdots \\
fl\Big(\sum_{i=1001}^{1} x_i\Big) &= 11000,
\end{aligned}$$

which is the correct sum.

This example illustrates the point that large relative errors occur when a large number of small numbers is added to a large number, or when a very large number of small, almost-equal numbers is added. To avoid such large relative errors, one can sum from the smallest number to the largest number. However, this will not work if the numbers are all approximately equal. In such cases, one possibility is to group the numbers into sets of two adjacent numbers, summing two almost equal numbers together. One then groups those results into sets of two and sums these together, continuing until the total sum is reached. In this scheme, two almost equal numbers are always being summed, and the large relative error from repeatedly summing a small number to a large number is avoided.
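Both effects can be reproduced in a few lines. The sketch below assumes a 4-digit `decimal` context as the machine model; `sum_in_order` and `pairwise_sum` are hypothetical helper names, and the list of 16384 ones is an extra test case chosen for this illustration, not taken from the text.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# 4-digit decimal arithmetic with round to nearest (beta = 10, t = 4).
CTX = Context(prec=4, rounding=ROUND_HALF_EVEN)


def sum_in_order(values):
    """Accumulate left to right, rounding after every addition."""
    total = Decimal(0)
    for v in values:
        total = CTX.add(total, v)
    return total


def pairwise_sum(values):
    """Sum by repeatedly splitting the list in half (pairwise grouping)."""
    if not values:
        return Decimal(0)
    if len(values) == 1:
        return CTX.plus(values[0])
    mid = len(values) // 2
    return CTX.add(pairwise_sum(values[:mid]), pairwise_sum(values[mid:]))


# Example 1.15: one large number followed by 1000 ones.
data = [Decimal(10000)] + [Decimal(1)] * 1000
forward = sum_in_order(data)             # large number first
backward = sum_in_order(data[::-1])      # small numbers first

# Many equal numbers: ordering cannot help, but pairwise grouping does.
ones = [Decimal(1)] * 16384
stalled = sum_in_order(ones)
grouped = pairwise_sum(ones)

print(forward, backward, stalled, grouped)
```

In the second experiment the running sum stops growing once it reaches 10000, whereas the pairwise scheme returns the exact sum 16384 rounded to four digits.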
Example 1.16
$x_1 = 15.314768$, $x_2 = 15.314899$, $\beta = 10$, $t = 6$ (6-digit decimal accuracy). Then $x_2 - x_1 \approx fl(x_2) - fl(x_1) = 15.3149 - 15.3148 = 0.0001$. Thus,

$$\left|\frac{x_2 - x_1 - (fl(x_2) - fl(x_1))}{x_2 - x_1}\right| = \frac{0.000131 - 0.0001}{0.000131} = 0.237,$$

that is, a relative error of 23.7%.

This example illustrates that large relative errors can occur when two nearly equal numbers are subtracted on a computer. Sometimes, an algorithm can be modified to reduce rounding error occurring from this source, as the following example illustrates.
Example 1.17
Consider finding the roots of $ax^2 + bx + c = 0$, where $b^2$ is large compared with $|4ac|$. The most common formula for the roots is

$$x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.$$

Consider $x^2 + 100x + 1 = 0$, $\beta = 10$, $t = 4$, $p = 2$, and 4-digit chopped arithmetic. Then

$$x_1 = \frac{-100 + \sqrt{9996}}{2}, \quad x_2 = \frac{-100 - \sqrt{9996}}{2},$$

but $\sqrt{9996} \approx 99.97$ (4-digit arithmetic, chopped). Thus,

$$x_1 \approx \frac{-100 + 99.97}{2}, \quad x_2 \approx \frac{-100 - 99.97}{2}.$$

Hence, $x_1 \approx -0.015$, $x_2 \approx -99.98$, but $x_1 = -0.010001\ldots$ and $x_2 = -99.989999\ldots$, so the relative errors in $x_1$ and $x_2$ are 50% and 0.01%, respectively.

Let's change the algorithm. Assume $b \ge 0$ (we can always make $b \ge 0$). Then

$$x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \cdot \frac{-b - \sqrt{b^2 - 4ac}}{-b - \sqrt{b^2 - 4ac}} = \frac{4ac}{2a\left(-b - \sqrt{b^2 - 4ac}\right)} = \frac{-2c}{b + \sqrt{b^2 - 4ac}}$$

and

$$x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a} \quad \text{(the same as before)}.$$

Then, for the above values,

$$x_1 = \frac{-2(1)}{100 + \sqrt{9996}} \approx \frac{-2}{100 + 99.97} = -0.0100.$$

Now, the relative error in $x_1$ is also 0.01%.
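The same rearrangement helps in IEEE double precision once $b^2$ is large enough relative to $4ac$. The sketch below is an illustration only: the function names are hypothetical, the coefficients $a = 1$, $b = 10^8$, $c = 1$ are chosen here rather than taken from the example, and real roots with $b \ge 0$ and $b + \sqrt{b^2 - 4ac} > 0$ are assumed.

```python
import math


def roots_naive(a, b, c):
    """Both roots from the textbook formula (suffers cancellation in x1)."""
    d = math.sqrt(b * b - 4.0 * a * c)
    return (-b + d) / (2.0 * a), (-b - d) / (2.0 * a)


def roots_stable(a, b, c):
    """Rearranged formula; assumes real roots, b >= 0 and b + d > 0."""
    d = math.sqrt(b * b - 4.0 * a * c)
    x1 = (-2.0 * c) / (b + d)      # no subtraction of nearly equal numbers
    x2 = (-b - d) / (2.0 * a)      # same as before
    return x1, x2


a, b, c = 1.0, 1.0e8, 1.0
x1_naive, _ = roots_naive(a, b, c)
x1_stable, x2_stable = roots_stable(a, b, c)

# The small root is -1e-8 to many more digits than a double carries.
reference = -1.0e-8
err_naive = abs((x1_naive - reference) / reference)
err_stable = abs((x1_stable - reference) / reference)

print("naive  x1 =", x1_naive, " relative error", err_naive)
print("stable x1 =", x1_stable, " relative error", err_stable)
```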
Let us now consider error in function evaluation. Consider a single-valued function $f(x)$ and let $x^* = fl(x)$ be the floating point approximation of $x$. The machine therefore evaluates $f(x^*) = f(fl(x))$, which is an approximation to $f(x)$. The perturbation in $f(x)$ for small perturbations in $x$ can then be computed via Taylor's formula. This is illustrated in the next theorem.

THEOREM 1.7 The relative error in functional evaluation is
$$\left|\frac{f(x) - f(x^*)}{f(x)}\right| \approx \left|\frac{x\,f'(x)}{f(x)}\right| \left|\frac{x - x^*}{x}\right|.$$

PROOF The linear Taylor approximation of $f(x^*)$ about $x$ for small values of $|x - x^*|$ is given by $f(x^*) \approx f(x) + f'(x)(x^* - x)$. Rearranging the terms immediately yields the result.

This leads us to the following definition.

DEFINITION 1.5 The condition number of a function $f(x)$ is
$$\kappa_f(x) = \left|\frac{x\,f'(x)}{f(x)}\right|.$$

The condition number describes how large the relative error in function evaluation is with respect to the relative error in the machine representation of $x$. In other words, $\kappa_f(x)$ is a measure of the degree of sensitivity of the function at $x$.
Example 1.18
Let $f(x) = \sqrt{x}$. The condition number of $f(x)$ about $x$ is
$$\kappa_f(x) = \left|\frac{x \cdot \frac{1}{2\sqrt{x}}}{\sqrt{x}}\right| = \frac{1}{2}.$$
This suggests that $f(x)$ is well-conditioned.

Example 1.19
Let $f(x) = \sqrt{x - 2}$. The condition number of $f(x)$ about $x$ is
$$\kappa_f(x) = \left|\frac{x}{2(x - 2)}\right|.$$
This is not defined at $x^* = 2$. Hence, the function $f(x)$ is numerically unstable and ill-conditioned for values of $x$ close to 2.
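The condition numbers in Examples 1.18 and 1.19 can be compared with the amplification actually observed when $x$ is perturbed slightly. The sketch below is illustrative: `condition_number` and `observed_amplification` are hypothetical helpers, the derivative is supplied by hand, and the evaluation points 4 and 2.001 are chosen for this illustration.

```python
import math


def condition_number(f, fprime, x):
    """kappa_f(x) = |x f'(x) / f(x)| from Definition 1.5."""
    return abs(x * fprime(x) / f(x))


def observed_amplification(f, x, rel_perturbation=1e-10):
    """Ratio of the relative change in f to the relative change in x."""
    x_star = x * (1.0 + rel_perturbation)
    rel_in = abs((x - x_star) / x)
    rel_out = abs((f(x) - f(x_star)) / f(x))
    return rel_out / rel_in


def g(x):
    return math.sqrt(x - 2.0)


def g_prime(x):
    return 0.5 / math.sqrt(x - 2.0)


kappa_sqrt = condition_number(math.sqrt, lambda x: 0.5 / math.sqrt(x), 4.0)
kappa_g = condition_number(g, g_prime, 2.001)

print("sqrt(x)   at x = 4:     kappa =", kappa_sqrt,
      " observed =", observed_amplification(math.sqrt, 4.0))
print("sqrt(x-2) at x = 2.001: kappa =", kappa_g,
      " observed =", observed_amplification(g, 2.001))
```

Near $x = 2$ a relative perturbation in $x$ is magnified by roughly a factor of 1000 in $\sqrt{x - 2}$, while for $\sqrt{x}$ it is halved.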
REMARK 1.2 If $x = f(x) = 0$, then the condition number is simply $|f'(x)|$. If $x = 0$, $f(x) \ne 0$ (or $f(x) = 0$, $x \ne 0$), then it is more useful to consider the relation between absolute errors than relative errors. The condition number then becomes $|f'(x)/f(x)|$.

REMARK 1.3 Generally, if a numerical approximation $\tilde{z}$ to a quantity $z$ is computed, the relative error is related to the number of digits after the decimal point that are correct. For example, if $z = 0.0000123453$ and $\tilde{z} = 0.00001234543$, we say that $\tilde{z}$ is correct to 5 significant digits. Expressing $z$ as $0.123453 \times 10^{-4}$ and $\tilde{z}$ as $0.123454 \times 10^{-4}$, we see that if we round $\tilde{z}$ to the nearest number with five digits in its mantissa, all of those digits are correct, whereas, if we do the same with six digits, the sixth digit is not correct. Significant digits is the more logical way to talk about accuracy in a floating point computation where we are interested in relative error, rather than "number of digits after the decimal point," which can have a different meaning. (Here, one might say that $\tilde{z}$ is correct to 9 digits after the decimal point.)
1.2.2 Practicalities and the IEEE Floating Point Standard

Prior to 1985, different machines used different word lengths and different bases, and different machines rounded, chopped, or did something else to form the internal representation $fl(x)$ for real numbers $x$. For example, IBM mainframes generally used hexadecimal arithmetic ($\beta = 16$), with 8 hexadecimal digits total (for the base, sign, and exponent) in "single precision" numbers and 16 hexadecimal digits total in "double precision" numbers. Machines such as the Univac 1108 and Honeywell Multics systems used base $\beta = 2$ and 36 binary digits (or "bits") total in single precision numbers and 72 binary digits total in double precision numbers. An unusual machine designed at Moscow State University from 1955-1965, the "Setun," even used base-3 ($\beta = 3$, or "ternary") numbers. Some computers had 32 bits total in single precision numbers and 64 bits total in double precision numbers, while some "supercomputers" (such as the Cray-1) had 64 bits total in single precision numbers and 128 bits total in double precision numbers.

Some hand-held calculators in existence today (such as some Texas Instruments calculators) can be viewed as implementing decimal (base 10, $\beta = 10$) arithmetic, say, with $L = -999$ and $U = 999$, and $t = 14$ digits in the mantissa.

Except for the Setun (the value of whose ternary digits corresponded to "positive," "negative," and "neutral" in circuit elements or switches), digital computers are mostly based on binary switches or circuit elements (that is, "on" or "off"), so the base $\beta$ is usually 2 or a power of 2. For example, the IBM hexadecimal digit could be viewed as a group of 4 binary digits (see footnote 2).

Older floating point implementations did not even always fit exactly into the model we have previously described. For example, if $x$ is a number in the system, then $-x$ may not have been a number in the system, or, if $x$ were a number in the system, then $1/x$ may have been too large to be representable in the system.
To promote predictability, portability, reliability, and rigorous error bounding in floating point computations, the Institute of Electrical and Electronics Engineers (IEEE) and American National Standards Institute (ANSI) published a standard for binary floating point arithmetic in 1985: IEEE/ANSI 754-1985, Standard for Binary Floating Point Arithmetic, often referenced as "IEEE-754," or simply "the IEEE standard" (see footnote 3). Almost all computers in existence today, including personal computers and workstations based on Intel, AMD, Motorola, etc. chips, implement most of the IEEE standard.

In this standard, $\beta = 2$, 32 bits total are used in a single precision number (an "IEEE single"), and 64 bits total are used for a double precision number ("IEEE double"). In a single precision number, 1 bit is used for the sign, 8 bits are used for the exponent, and 23 bits are used for the mantissa (an implicit leading 1 bit gives the $t = 24$ significant bits listed in Table 1.1). In double precision numbers, 1 bit is used for the sign, 11 bits are used for the exponent, and 52 bits are used for the mantissa ($t = 53$ with the implicit leading bit). Thus, for single precision numbers, the exponent is between 0 and $(11111111)_2 = 255$, and 127 is subtracted from this, to get an exponent between $-127$ and $128$. In IEEE numbers, the minimum and maximum exponent are used to denote special symbols (such as infinity and "unnormalized" numbers), so the exponent in single precision represents magnitudes between $2^{-126} \approx 10^{-38}$ and $2^{127} \approx 10^{38}$. The mantissa for single precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{23} 2^{-i} = 2(1 - 2^{-24}) \approx 2$. Similarly, the exponent for double precision numbers is effectively between $2^{-1022} \approx 10^{-308}$ and $2^{1023} \approx 10^{308}$, while the mantissa for double precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{52} 2^{-i} \approx 2$.
Summarizing, the parameters for IEEE arithmetic appear in Table 1.1.

TABLE 1.1: Parameters for IEEE arithmetic

    precision   β    L        U       t
    single      2    -126     127     24
    double      2    -1022    1023    53

In many numerical computations, such as solving the large linear systems arising from partial differential equation models, more digits or a larger exponent range is required than is available with IEEE single precision. For this reason, many numerical analysts at present have adopted IEEE double precision as the default precision. For example, underlying computations in the popular computational environment matlab are done in IEEE double precision.

Footnote 2: An exception is in some systems for business calculations, where base 10 is implemented.

Footnote 3: An update to the 1985 standard was made in 2008. This update gives clarifications of certain ambiguous points, provides certain extensions, and specifies a standard for decimal arithmetic.
IEEE arithmetic provides four ways of defining $fl(x)$, that is, four "rounding modes," namely, "round down," "round up," "round to nearest," and "round to zero," which are specified as follows. The four elementary operations $+$, $-$, $\times$, and $\div$ must be such that $fl(x \odot y)$ is implemented for all four rounding modes, for $\odot \in \{-, +, \times, \div, \sqrt{\cdot}\}$.

The default mode (if the rounding mode is not explicitly set) is normally "round to nearest," to give an approximation after a long string of computations that is hopefully near the exact value. If the mode is set to "round down" and a string of computations is done, then the result is less than or equal to the exact result. Similarly, if the mode is set to "round up," then the result of a string of computations is greater than or equal to the exact result. In this way, mathematically rigorous bounds on an exact result can be obtained. (This technique must be used astutely, since naive use could result in bounds that are too large to be meaningful.)
Several parameters more directly related to numerical computations than $L$, $U$, and $t$ are associated with any floating point number system. These are

HUGE: the largest representable number in the floating point system

TINY: the smallest positive representable number in the floating point system

$\epsilon_m$: the machine epsilon, the smallest positive number which, when added to 1, gives something other than 1 when using the rounding mode round to the nearest

These so-called "machine constants" appear in Table 1.2 for the IEEE single and IEEE double precision number systems.
TABLE 1.2: Machine constants for IEEE arithmetic

    Precision   HUGE                                                           TINY                                          $\epsilon_m$
    single      $(2 - 2^{-23}) \cdot 2^{127} \approx 3.40 \cdot 10^{38}$       $2^{-126} \approx 1.18 \cdot 10^{-38}$        $2^{-24} + 2^{-47} \approx 5.96 \cdot 10^{-8}$
    double      $(2 - 2^{-52}) \cdot 2^{1023} \approx 1.79 \cdot 10^{308}$     $2^{-1022} \approx 2.23 \cdot 10^{-308}$      $2^{-53} + 2^{-105} \approx 1.11 \cdot 10^{-16}$

For IEEE arithmetic, $1/\text{TINY} < \text{HUGE}$, but $1/\text{HUGE} < \text{TINY}$. This brings up the question of what happens when the result of a computation has absolute value less than the smallest number representable in the system, or has absolute value greater than the largest number representable in the system. In the first case, an underflow occurs, while, in the second case, an overflow occurs. In floating point computations, it is usually (but not always) reasonable to replace the result of an underflow by 0, but it is usually more problematical when an overflow occurs. Many systems prior to the IEEE standard replaced an underflow by 0 but stopped when an overflow occurred.
The IEEE standard specifies representations for special numbers $\infty$, $-\infty$, $+0$, $-0$, and NaN, where the latter represents "not a number." The standard specifies that computations do not stop when an overflow or underflow occurs, or when quantities such as $\sqrt{-1}$, $1/0$, $-1/0$, etc. are encountered (although many programming languages by default or optionally do stop). For example, the result of an overflow is set to $\infty$, whereas the result of $\sqrt{-1}$ is set to NaN, and computation continues. The standard also specifies "gradual underflow," that is, setting the result to a "denormalized" number, or a number in the floating point format whose first digit in the mantissa is equal to 0. Computation rules for these special numbers, such as NaN $\times$ any number = NaN, $\infty \times$ any positive normalized number = $\infty$, allow such "nonstop" arithmetic.

Although the IEEE nonstop arithmetic is useful in many contexts, the numerical analyst should be aware of it and be cautious in interpreting results. In particular, algorithms may not behave as expected if many intermediate results contain $\infty$ or NaN, and the accuracy is less than expected when denormalized numbers are used. In fact, many programming languages, by default or with a controllable option, stop if $\infty$ or NaN occurs, but implement IEEE nonstop arithmetic with an option.
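Python is one language that mixes both behaviors, which makes it a convenient place to see the rules above in action. The sketch below relies on CPython's float type being an IEEE double (true on common platforms, but an assumption here); the variable names are illustrative.

```python
import math
import sys

inf = math.inf
nan = math.nan

# Nonstop arithmetic: these operations return special values and continue.
overflowed = sys.float_info.max * 2.0     # overflow in a product gives inf
undefined = inf - inf                     # no meaningful value: NaN
propagated = nan * 3.0                    # NaN times any number is NaN
subnormal = sys.float_info.min / 4.0      # gradual underflow, not yet 0

# Stopping behavior: Python raises an exception for division by zero
# instead of returning inf.
try:
    1.0 / 0.0
    stopped = False
except ZeroDivisionError:
    stopped = True

print(overflowed, undefined, propagated, subnormal, stopped)
```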
Example 1.20
IEEE double precision floating point arithmetic underlies most computations in matlab. (This is true even if only four decimal digits are displayed.) One obtains the machine epsilon with the function eps, one obtains TINY with the function realmin, and one obtains HUGE with the function realmax. Observe the following matlab dialog:

    >> epsm = eps(1d0)
    epsm = 2.2204e-016
    >> TINY = realmin
    TINY = 2.2251e-308
    >> HUGE = realmax
    HUGE = 1.7977e+308
    >> 1/TINY
    ans = 4.4942e+307
    >> 1/HUGE
    ans = 5.5627e-309
    >> HUGE^2
    ans = Inf
    >> TINY^2
    ans = 0
    >> new_val = 1+epsm
    new_val = 1.0000
    >> new_val - 1
    ans = 2.2204e-016
    >> too_small = epsm/2
    too_small = 1.1102e-016
    >> not_new = 1+too_small
    not_new = 1
    >> not_new - 1
    ans = 0
    >>
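For readers without matlab at hand, roughly the same quantities can be inspected from Python's standard library. This is a sketch, assuming that Python floats are IEEE doubles; the names `epsm`, `TINY`, and `HUGE` simply mirror the dialog above.

```python
import sys

epsm = sys.float_info.epsilon   # spacing between 1 and the next double, 2**-52
TINY = sys.float_info.min       # smallest positive normalized double
HUGE = sys.float_info.max       # largest finite double

new_val = 1.0 + epsm            # distinguishable from 1
not_new = 1.0 + epsm / 2.0      # rounds back to 1 (tie goes to the even neighbor)

print("epsm    =", epsm)
print("TINY    =", TINY, "  1/TINY =", 1.0 / TINY)
print("HUGE    =", HUGE, "  1/HUGE =", 1.0 / HUGE)
print("new_val - 1 =", new_val - 1.0)
print("not_new - 1 =", not_new - 1.0)
```

Note that `1.0 / HUGE` is smaller than `TINY` but not zero: it is a denormalized number.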
Example 1.21
(Illustration of underflow and overflow) Suppose, for the purposes of illustration, we have a system with $\beta = 10$, $t = 2$, and one digit in the exponent, so that the positive numbers in the system range from $0.10 \times 10^{-9}$ to $0.99 \times 10^9$, and suppose we wish to compute $N = \sqrt{x_1^2 + x_2^2}$, where $x_1 = x_2 = 10^6$. Then both $x_1$ and $x_2$ are exactly represented in the system, and the nearest floating point number in the system to $N$ is $0.14 \times 10^7$, well within range. However, $x_1^2 = 10^{12}$, larger than the maximum floating point number in the system.

In older systems, an overflow usually would result in stopping the computation, while in IEEE arithmetic, the result would be assigned the symbol "Infinity." The result of adding "Infinity" to "Infinity," then taking the square root, would be "Infinity," so that $N$ would be assigned "Infinity." Similarly, if $x_1 = x_2 = 10^{-6}$, then $x_1^2 = 10^{-12}$, smaller than the smallest representable machine number, causing an "underflow." On older systems, the result is usually set to 0. On IEEE systems, if "gradual underflow" is switched on, the result either becomes a denormalized number, with less than full accuracy, or is set to 0; without gradual underflow on IEEE systems, the result is set to 0. When the result is set to 0, a value of 0 is stored in $N$, whereas the closest floating point number in the system is $0.14 \times 10^{-5}$, well within range. To avoid this type of catastrophic underflow and overflow in the computation of $N$, we may use the following scheme.

1. $s \leftarrow \max\{|x_1|, |x_2|\}$.

2. $\eta_1 \leftarrow x_1/s$; $\eta_2 \leftarrow x_2/s$.

3. $N \leftarrow s\sqrt{\eta_1^2 + \eta_2^2}$.
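The three-step scheme translates directly into code. The following is a minimal sketch in IEEE double precision rather than the 2-digit toy system, so the inputs $10^{\pm 200}$ are chosen here to provoke overflow and underflow; the function names are hypothetical, and the guard for $s = 0$ is an addition not spelled out in the scheme.

```python
import math


def norm2_naive(x1, x2):
    """sqrt(x1^2 + x2^2) computed directly; the squares may over/underflow."""
    return math.sqrt(x1 * x1 + x2 * x2)


def norm2_scaled(x1, x2):
    """The scaling scheme: divide by the larger magnitude before squaring."""
    s = max(abs(x1), abs(x2))
    if s == 0.0:                 # both inputs are zero; avoid dividing by 0
        return 0.0
    eta1, eta2 = x1 / s, x2 / s
    return s * math.sqrt(eta1 * eta1 + eta2 * eta2)


big, small = 1.0e200, 1.0e-200
print("naive :", norm2_naive(big, big), norm2_naive(small, small))
print("scaled:", norm2_scaled(big, big), norm2_scaled(small, small))
```

In practice the library routine `math.hypot` serves the same purpose; the hand-written version is shown only to mirror the steps above.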
1.2.2.1 Input and Output

For examining the output to large numerical computations arising from mathematical models, plots, graphs, and movies comprised of such plots and graphs are often preferred over tables of values. However, to develop such models and study numerical algorithms, it is necessary to examine individual numbers. Because humans are trained to comprehend decimal numbers more easily than binary numbers, the binary format used in the machine is usually converted to a decimal format for display or printing. In many programming languages and environments (such as all versions of Fortran, C, C++, and in matlab), the format is of a form similar to $\pm d_1.d_2 d_3 \ldots d_m \mathrm{e} \pm \delta_1 \delta_2 \delta_3$, or $\pm d_1.d_2 d_3 \ldots d_m \mathrm{E} \pm \delta_1 \delta_2 \delta_3$, where the "e" or "E" denotes the "exponent" of 10. For example, -1.00e+003 denotes $-1 \times 10^3 = -1000$. Numbers are usually also input either in a standard decimal form (such as 0.001), or in this exponential format (such as 1.0e-3). (This notation originates from the earliest computers, where the only output was a printer, and the printer could only print numerical digits and the 26 upper case letters in the Roman alphabet.)

Thus, for input, a decimal fraction needs to be converted to a binary floating point number, while, for output, a binary floating point number needs to be converted to a decimal fraction. This conversion necessarily is inexact. For example, the exact decimal fraction 0.1 converts to the infinitely repeating binary expansion $(0.0\overline{0011})_2$, which needs to be rounded into the binary floating point system. The IEEE 754 standard specifies that the result of a decimal to binary conversion, within a specified range of input formats, be the nearest floating point number to the exact result, over a specified range, and that, within a specified range of formats, a binary to decimal conversion be the nearest number in the specified format (which depends on the number $m$ of decimal digits requested to be printed).

Thus, the number that one sees as output is usually not exactly the number that is represented in the computer. Furthermore, while the floating point operations on binary numbers are usually implemented in hardware or "firmware" independently of the software system, the decimal to binary and binary to decimal conversions are usually implemented separately as part of the programming language (such as Fortran, C, C++, Java, etc.) or software system (such as matlab). The individual standards for these languages, if there are any, may not specify accuracy for such conversions, and the languages sometimes do not conform to the IEEE standard. That is, the number that one sees printed may not even be the closest number in that format to the actual number.

This inexactness in conversion usually does not cause a problem, but may cause much confusion in certain instances. In those instances (such as in "debugging," or finding programming blunders), one may need to examine the binary numbers directly. One way of doing this is in an "octal," or base-8 format, in which each digit (between 0 and 7) is interpreted as a group of three binary digits, or in hexadecimal format (where the digits are 0-9, A, B, C, D, E, F), in which each digit corresponds to a group of four binary digits.
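As a small illustration of examining a stored number directly, the sketch below looks at the double nearest to 0.1 in three ways. It assumes Python floats are IEEE doubles; the variable names are illustrative.

```python
from decimal import Decimal

x = 0.1                        # decimal input, converted to the nearest double

short_output = f"{x:.4e}"      # a typical printed form with 5 digits
exact_value = Decimal(x)       # the exact decimal value of the stored double
hex_digits = x.hex()           # hexadecimal mantissa with a power-of-2 exponent

print("printed :", short_output)
print("stored  :", exact_value)
print("hex     :", hex_digits)

# The hexadecimal form converts back to the stored number without any error.
round_trip = float.fromhex(hex_digits)
```

The repeating hexadecimal digit 9 is the pattern 1001, which is the repeating binary expansion of 0.1 mentioned above, cut off and rounded in the last digit.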
1.2.2.2 Standard Functions

To enable accurate computation of elementary functions such as sin, cos, and exp, IEEE 754 specifies that a "long" 80-bit register (with "guard digits") be available for intermediate computations. Furthermore, IEEE 754-2008, an official update to IEEE 754-1985, provides a list of functions it recommends be implemented, and specifies accuracy requirements (in terms of correct rounding), for those functions a programming language elects to implement.

REMARK 1.4 Alternative number systems, such as variable precision arithmetic, multiple precision arithmetic, rational arithmetic, and combinations of approximate and symbolic arithmetic have been investigated and implemented. These have various advantages over the traditional floating point arithmetic we have been discussing, but also have disadvantages, and usually require more time, more circuitry, or both. Eventually, with the advance of computer hardware and better understanding of these alternative systems, their use may become more ubiquitous. However, for the foreseeable future, traditional floating point number systems will be the primary tool in numerical computations.
1.3 Interval Computations

Interval computations are useful for two main purposes:

• to use floating point computations to compute mathematically rigorous bounds on an exact result (and hence, to rigorously bound roundoff error);

• to use floating point computations to compute mathematically rigorous bounds on the ranges of functions over boxes.

In complicated traditional floating point algorithms, naive arrangement of interval computations usually gives bounds that are too wide to be of practical use. For this reason, interval computations have been ignored by many. However, used cleverly and where appropriate, interval computations are powerful, and provide rigor and validation when other techniques cannot.

Interval computations are based on interval arithmetic.

1.3.1 Interval Arithmetic
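In interval arithmetic, we define operations on intervals, which can be considered as ordered pairs of real numbers. We can think of each interval as representing the range of possible values of a quantity. The result of an operation is then an interval that represents the range of possible results of the operation as the first argument ranges over all points in the first interval and the second argument ranges over all values in the second interval. To state this symbolically, let $\boldsymbol{x} = [\underline{x}, \overline{x}]$ and $\boldsymbol{y} = [\underline{y}, \overline{y}]$, and define the four elementary operations by

$$\boldsymbol{x} \odot \boldsymbol{y} = \{x \odot y \mid x \in \boldsymbol{x} \text{ and } y \in \boldsymbol{y}\} \quad \text{for } \odot \in \{+, -, \times, \div\}. \tag{1.1}$$

Interval arithmetic's usefulness derives from the fact that the mathematical characterization in Equation (1.1) is equivalent to the following operational definitions.

$$\begin{aligned}
\boldsymbol{x} + \boldsymbol{y} &= [\underline{x} + \underline{y},\ \overline{x} + \overline{y}], \\
\boldsymbol{x} - \boldsymbol{y} &= [\underline{x} - \overline{y},\ \overline{x} - \underline{y}], \\
\boldsymbol{x} \times \boldsymbol{y} &= [\min\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\},\ \max\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\}], \\
\frac{1}{\boldsymbol{x}} &= \left[\frac{1}{\overline{x}},\ \frac{1}{\underline{x}}\right] \quad \text{if } \underline{x} > 0 \text{ or } \overline{x} < 0, \\
\boldsymbol{x} \div \boldsymbol{y} &= \boldsymbol{x} \times \frac{1}{\boldsymbol{y}}.
\end{aligned} \tag{1.2}$$

The ranges of the four elementary interval arithmetic operations are exactly the ranges of the corresponding real operations, but, if such operations are composed, bounds on the ranges of real functions can be obtained. For example, if

$$f(x) = (x + 1)(x - 1), \tag{1.3}$$

then

$$f([-2, 2]) = \big([-2, 2] + 1\big)\big([-2, 2] - 1\big) = [-1, 3][-3, 1] = [-9, 3],$$

which contains the exact range $[-1, 3]$.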
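The operational definitions (1.2) are short enough to code directly. The sketch below is an illustration only: the class name `Interval` and its methods are hypothetical, end points are ordinary Python floats, and no outward rounding is done, so the results are rigorous only when every end point computation happens to be exact (as it is for the small integers used here).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    @staticmethod
    def _coerce(value):
        """Treat a plain number c as the degenerate interval [c, c]."""
        return value if isinstance(value, Interval) else Interval(value, value)

    def __add__(self, other):
        other = Interval._coerce(other)
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        other = Interval._coerce(other)
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        other = Interval._coerce(other)
        products = (self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi)
        return Interval(min(products), max(products))

    def reciprocal(self):
        if not (self.lo > 0 or self.hi < 0):
            raise ZeroDivisionError("interval contains 0")
        return Interval(1.0 / self.hi, 1.0 / self.lo)

    def __truediv__(self, other):
        return self * Interval._coerce(other).reciprocal()


x = Interval(-2.0, 2.0)
f_product = (x + 1) * (x - 1)      # f written as in (1.3)
print(f_product)
```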
REMARK 1.5 In some definitions of interval arithmetic, division by intervals containing 0 is defined, consistent with (1.1). For example,

$$\frac{[1, 2]}{[-3, 4]} = \left[-\infty, -\frac{1}{3}\right] \cup \left[\frac{1}{4}, \infty\right] = \mathbb{R}^* \setminus \left(-\frac{1}{3}, \frac{1}{4}\right),$$

where $\mathbb{R}^*$ is the extended real number system (see footnote 4), consisting of the real numbers with the two additional numbers $-\infty$ and $\infty$. This extended interval arithmetic (see footnote 5) was originally invented by William Kahan (see footnote 6) for computations with continued fractions, but has wider use than that. Although a closed system can be defined for the sets arising from this extended arithmetic, typically, the complements of intervals (i.e., the unions of two semi-infinite intervals) are immediately intersected with intervals, to obtain zero, one, or two intervals. Interval arithmetic can then proceed using (1.2).

Footnote 4: also known as the two-point compactification of the real numbers.
The power of interval arithmetic lies in its implementation on computers. In particular, outwardly rounded interval arithmetic allows rigorous enclosures for the ranges of operations and functions. This makes a qualitative difference in scientific computations, since the results are now intervals in which the exact result must lie. It also enables use of floating point computations for automated theorem proving.

Outward rounding can be implemented on any machine that has downward rounding and upward rounding, such as any machine that complies with the IEEE 754 standard. For example, take $\boldsymbol{x} + \boldsymbol{y} = [\underline{x} + \underline{y}, \overline{x} + \overline{y}]$. If $\underline{x} + \underline{y}$ is computed with downward rounding, and $\overline{x} + \overline{y}$ is computed with upward rounding, then the resulting interval $\boldsymbol{z} = [\underline{z}, \overline{z}]$ that is represented in the machine must contain the exact range of $x + y$ for $x \in \boldsymbol{x}$ and $y \in \boldsymbol{y}$. We call the expansion of the interval from rounding the lower end point down and the upper end point up roundout error.
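Standard Python does not expose the IEEE rounding modes for floats, so the sketch below only emulates outward rounding: it computes each end point in round to nearest and then steps one machine number outward with `math.nextafter` (available in Python 3.9 and later). This is an assumption-laden stand-in that may widen the result by one unit more than true directed rounding would; the function name `add_outward` is hypothetical.

```python
import math
from fractions import Fraction


def add_outward(x, y):
    """Enclose [x_lo + y_lo, x_hi + y_hi]; x and y are (lo, hi) tuples of floats."""
    lo = math.nextafter(x[0] + y[0], -math.inf)   # push the lower end point down
    hi = math.nextafter(x[1] + y[1], math.inf)    # push the upper end point up
    return lo, hi


# Degenerate intervals at the doubles nearest to 0.1 and 0.2.
x = (0.1, 0.1)
y = (0.2, 0.2)
z_lo, z_hi = add_outward(x, y)

# Exact sum of the two stored doubles, using rational arithmetic.
exact = Fraction(0.1) + Fraction(0.2)

print("enclosure:", (z_lo, z_hi))
print("contains exact sum:", Fraction(z_lo) <= exact <= Fraction(z_hi))
```

The ordinary floating point sum `0.1 + 0.2` is not equal to the exact sum of the two stored numbers, but the outwardly rounded interval is guaranteed to contain it.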
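Interval arithmetic is only subdistributive. That is, if $\boldsymbol{x}$, $\boldsymbol{y}$, and $\boldsymbol{z}$ are intervals, then

$$\boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) \subseteq \boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z}, \quad \text{but } \boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) \ne \boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z} \text{ in general.} \tag{1.4}$$

As a result, algebraic expressions that would be equivalent if real values are substituted for the variables are not equivalent if interval values are used. For example, instead of writing $(x - 1)(x + 1)$ for $f(x)$ in (1.3), suppose we write

$$f(x) = x^2 - 1, \tag{1.5}$$

and suppose we provide a routine that computes an enclosure for the range of $x^2$ that is the exact range to within roundoff error. Such a routine could be as follows: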
ALGORITHM 1.1
(Computing an interval whose end points are machine numbers and which encloses the range of $x^2$.)

INPUT: $\boldsymbol{x} = [\underline{x}, \overline{x}]$.

OUTPUT: a machine-representable interval that contains the range of $x^2$ over $\boldsymbol{x}$.

IF $\underline{x} \ge 0$ THEN
    RETURN $[\underline{x}^2, \overline{x}^2]$, where $\underline{x}^2$ is computed with downward rounding and $\overline{x}^2$ is computed with upward rounding.
ELSE IF $\overline{x} \le 0$ THEN
    RETURN $[\overline{x}^2, \underline{x}^2]$, where $\overline{x}^2$ is computed with downward rounding and $\underline{x}^2$ is computed with upward rounding.
ELSE
    1. Compute $\underline{x}^2$ and $\overline{x}^2$ with both downward and upward rounding; that is, compute $\underline{x}^2_l$ and $\underline{x}^2_u$ such that $\underline{x}^2_l$ and $\underline{x}^2_u$ are machine representable numbers and $\underline{x}^2 \in [\underline{x}^2_l, \underline{x}^2_u]$, and compute $\overline{x}^2_l$ and $\overline{x}^2_u$ such that $\overline{x}^2_l$ and $\overline{x}^2_u$ are machine representable numbers and $\overline{x}^2 \in [\overline{x}^2_l, \overline{x}^2_u]$.
    2. RETURN $[0, \max\{\underline{x}^2_u, \overline{x}^2_u\}]$.
END IF
END ALGORITHM 1.1

Footnote 5: There are small differences in current definitions of extended interval arithmetic. For example, in some systems, $-\infty$ and $\infty$ are not considered numbers, but just descriptive symbols. In those systems, $[1, 2]/[-3, 4] = (-\infty, -1/3] \cup [1/4, \infty) = \mathbb{R} \setminus (-1/3, 1/4)$. See [31] for a theoretical analysis of extended arithmetic.

Footnote 6: who also was a major contributor to the IEEE 754 standard.
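Algorithm 1.1 might be sketched in Python as follows. As an assumption of this sketch, directed rounding is emulated by computing in round to nearest and stepping one machine number outward with `math.nextafter` (Python 3.9 and later), which is slightly more pessimistic than true downward and upward rounding; the function and helper names are hypothetical.

```python
import math


def _down(v):
    """Next machine number below v (stand-in for downward rounding)."""
    return math.nextafter(v, -math.inf)


def _up(v):
    """Next machine number above v (stand-in for upward rounding)."""
    return math.nextafter(v, math.inf)


def interval_square(lo, hi):
    """Return (lower, upper) enclosing {x*x : lo <= x <= hi}."""
    if lo > hi:
        raise ValueError("need lo <= hi")
    if lo >= 0.0:                       # interval lies to the right of 0
        return max(0.0, _down(lo * lo)), _up(hi * hi)
    if hi <= 0.0:                       # interval lies to the left of 0
        return max(0.0, _down(hi * hi)), _up(lo * lo)
    # Interval contains 0 in its interior: the minimum of x^2 is exactly 0.
    return 0.0, max(_up(lo * lo), _up(hi * hi))


sq_lo, sq_hi = interval_square(-2.0, 2.0)
# f(x) = x^2 - 1 as in (1.5); subtracting 1 is widened outward as well.
f_lo, f_hi = _down(sq_lo - 1.0), _up(sq_hi - 1.0)
print("enclosure of x^2 over [-2, 2]:    ", (sq_lo, sq_hi))
print("enclosure of x^2 - 1 over [-2, 2]:", (f_lo, f_hi))
```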
With Algorithm 1.1 and rewriting $f(x)$ from (1.3) as in (1.5), we obtain

$$f([-2, 2]) = [-2, 2]^2 - 1 = [0, 4] - 1 = [-1, 3],$$

which, in this case, is equal to the exact range of $f$ over $[-2, 2]$.

In fact, this illustrates a general principle: If each variable in the expression occurs only once, then interval arithmetic gives the exact range, to within roundout error. We state this formally as
THEOREM 1.8
(Fundamental theorem of interval arithmetic.) Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with interval arithmetic. The algebraic expression or computer program can contain the four elementary operations and operations such as $x^n$, $\sin(x)$, $\exp(x)$, and $\log(x)$, etc., as long as the interval values of these functions contain their range over the input intervals. Then
1. The interval value $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ contains the range of $f$ over the interval vector (or box) $(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$.
2. If the single functions (the elementary operations and functions $x^n$, etc.) have interval values that represent their exact ranges, and if each variable $x_i$, $1 \le i \le n$, occurs only once in the expression for $f$, then the values of $f$ obtained by interval arithmetic represent the exact ranges of $f$ over the input intervals.
If the expression for $f$ contains one or more variables more than once, then overestimation of the range can occur due to interval dependency. For example, when we evaluate our example function $f([-2,2])$ according to (1.3), the first factor, $[-1,3]$, is the exact range of $x+1$ for $x \in [-2,2]$, while the second factor, $[-3,1]$, is the exact range of $x-1$ for $x \in [-2,2]$. Thus, $[-9,3]$ is the exact range of $\tilde{f}(x_1,x_2) = (x_1+1)(x_2-1)$ for $x_1$ and $x_2$ independent, $x_1 \in [-2,2]$, $x_2 \in [-2,2]$.
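The effect of interval dependency on this example can be checked with a few lines of Python. This is only a sketch: intervals are plain `(lo, hi)` tuples of integers, so no rounding is involved, and the helper names `add`, `mul`, and `sqr` are hypothetical.

```python
def add(a, b):
    """Interval sum."""
    return (a[0] + b[0], a[1] + b[1])


def mul(a, b):
    """Interval product: min and max over the four endpoint products."""
    products = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(products), max(products))


def sqr(a):
    """Exact range of x*x over the interval a (the three cases of Algorithm 1.1)."""
    if a[0] >= 0:
        return (a[0] ** 2, a[1] ** 2)
    if a[1] <= 0:
        return (a[1] ** 2, a[0] ** 2)
    return (0, max(a[0] ** 2, a[1] ** 2))


x = (-2, 2)

# x occurs twice: (x + 1)(x - 1) treats the two occurrences as independent.
factored = mul(add(x, (1, 1)), add(x, (-1, -1)))

# x occurs once: x^2 - 1 is a single use expression.
single_use = add(sqr(x), (-1, -1))

print(factored)    # (-9, 3)
print(single_use)  # (-1, 3)
```

Both results contain the true range, but only the form in which $x$ occurs once reproduces it exactly.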
We now present some definitions and theorems to clarify the practical consequences of interval dependency.
DEFINITION 1.6 An expression for $f(x_1, \ldots, x_n)$ which is written so that each variable occurs only once is called a single use expression, or SUE.
Fortunately, we do not need to transform every expression into a single use expression for interval computations to be of value. In particular, the overestimation due to interval dependency decreases as the widths of the input intervals decrease. The following formal definition will help us to describe this precisely.
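DEFINITION 1.7 Suppose an interval evaluation $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ gives $[a,b]$ as a result interval, but the exact range $\{f(x_1, \ldots, x_n) : x_i \in \boldsymbol{x}_i,\ 1 \le i \le n\}$ is $[c,d] \subseteq [a,b]$. We define the excess width $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ in the interval evaluation $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ by $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = (c-a) + (b-d)$.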
For example, the excess width in evaluating $f(x)$ represented as $(x+1)(x-1)$ over $\boldsymbol{x} = [-2,2]$ is $(-1-(-9)) + (3-3) = 8$. In general, we have
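THEOREM 1.9
Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with interval arithmetic, as in Theorem 1.8, to obtain an interval enclosure $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ to the range of $f$ for $x_i \in \boldsymbol{x}_i$, $1 \le i \le n$. Then, if $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ is as in Definition 1.7, we have
$$E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = O\left(\max_{1 \le i \le n} w(\boldsymbol{x}_i)\right),$$
where $w(\boldsymbol{x})$ denotes the width of the interval $\boldsymbol{x}$.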
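A small experiment makes the first-order behavior in Theorem 1.9 concrete. The Python sketch below evaluates $(x+1)(x-1)$ over intervals of shrinking width centered at $1/2$ (a center chosen here only for illustration) and measures the excess width of Definition 1.7. Exact rational arithmetic from the `fractions` module is used so that rounding plays no role, and the function name `excess_width` is hypothetical.

```python
from fractions import Fraction


def mul(a, b):
    """Interval product: min and max over the four endpoint products."""
    products = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(products), max(products))


def excess_width(center, width):
    """Excess width of the interval evaluation of (x + 1)(x - 1).

    Assumes the interval [center - width/2, center + width/2] lies in
    x >= 0, where x^2 - 1 is increasing, so its exact range is known.
    """
    lo, hi = center - width / 2, center + width / 2
    if lo < 0:
        raise ValueError("this sketch assumes a nonnegative interval")
    a, b = mul((lo + 1, hi + 1), (lo - 1, hi - 1))   # interval enclosure
    c, d = lo * lo - 1, hi * hi - 1                  # exact range
    return (c - a) + (b - d)


for k in range(1, 5):
    w = Fraction(1, 10 ** k)
    print(f"w = {float(w):.0e}, excess width = {float(excess_width(Fraction(1, 2), w)):.0e}")
```

For this particular expression and center, the excess width works out to be exactly equal to the input width, which is consistent with the $O(\max w(\boldsymbol{x}_i))$ bound.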
That is, the overestimation becomes less as the uncertainty in the arguments to the function becomes smaller.
Interval evaluations as in Theorem 1.9 are termed first-order interval extensions. It is not difficult to obtain second-order extensions, where required. (See Exercise below.)
1.3.2 Application of Interval Arithmetic: Examples
We give one such example here.
Example 1.22
Using 4-digit decimal floating point arithmetic, compute an interval enclosure for the first two digits of $e$, and prove that these two digits are correct.
Solution: The fifth degree Taylor polynomial representation for $e$ is
$$e = 1 + 1 + \frac{1}{2!} + \frac{1}{3!} + \frac{1}{4!} + \frac{1}{5!} + \frac{1}{6!}e^{\xi}$$
for some $\xi \in [0,1]$. If we assume we know $e < 3$ and we assume we know $e^x$ is an increasing function of $x$, then the error term is bounded by
$$\left| \frac{1}{6!} e^{\xi} \right| \le \frac{3}{6!} < 0.005,$$
so this fifth-degree polynomial representation should be adequate. We will evaluate each term with interval arithmetic, and we will replace $e^{\xi}$ with $[1,3]$. We obtain the following computation:
[1.000, 1.000] + [1.000, 1.000] → [2.000, 2.000]
[1.000, 1.000] / [2.000, 2.000] → [0.5000, 0.5000]
[2.000, 2.000] + [0.5000, 0.5000] → [2.500, 2.500]
[1.000, 1.000] / [6.000, 6.000] → [0.1666, 0.1667]
[2.500, 2.500] + [0.1666, 0.1667] → [2.666, 2.667]
[1.000, 1.000] / [24.00, 24.00] → [0.04166, 0.04167]
[2.666, 2.667] + [0.04166, 0.04167] → [2.707, 2.709]
[1.000, 1.000] / [120.0, 120.0] → [0.008333, 0.008334]
[2.707, 2.709] + [0.008333, 0.008334] → [2.715, 2.718]
[1.000, 1.000] / [720.0, 720.0] → [0.001388, 0.001389]
[0.001388, 0.001389] × [1, 3] → [0.001388, 0.004167]
[2.715, 2.718] + [0.001388, 0.004167] → [2.716, 2.723]
Since we used outward rounding in these computations, this constitutes a mathematical proof that $e \in [2.716, 2.723]$.
Note:
1. These computations can be done automatically on a computer, as simply as evaluating the function in floating point arithmetic. We will explain some programming techniques for this in Chapter 6, Section 6.2.
2. The solution is illustrative. More sophisticated methods, such as argument reduction, would be used in practice to bound values of $e^x$ more accurately and with fewer operations.
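The hand computation in Example 1.22 can be reproduced with Python's standard `decimal` module, which supports a chosen number of significant decimal digits and directed rounding. The sketch below is one possible arrangement, not the book's method: lower endpoints are rounded toward $-\infty$ and upper endpoints toward $+\infty$ with 4 significant digits, and the helper names (`add`, `reciprocal`, `mul_pos`, `enclose_e`) are hypothetical.

```python
from decimal import ROUND_CEILING, ROUND_FLOOR, Context, Decimal
from math import factorial

# 4-digit decimal arithmetic with downward and upward rounding.
DOWN = Context(prec=4, rounding=ROUND_FLOOR)
UP = Context(prec=4, rounding=ROUND_CEILING)


def add(a, b):
    """Outwardly rounded interval sum."""
    return (DOWN.add(a[0], b[0]), UP.add(a[1], b[1]))


def reciprocal(n):
    """Outwardly rounded enclosure of 1/n for a positive integer n."""
    one, d = Decimal(1), Decimal(n)
    return (DOWN.divide(one, d), UP.divide(one, d))


def mul_pos(a, b):
    """Outwardly rounded product of two intervals with positive endpoints."""
    return (DOWN.multiply(a[0], b[0]), UP.multiply(a[1], b[1]))


def enclose_e():
    """Enclose e using 1 + sum_{k=1}^{5} 1/k! + (1/6!) * [1, 3]."""
    total = (Decimal(1), Decimal(1))
    for k in range(1, 6):
        total = add(total, reciprocal(factorial(k)))
    remainder = mul_pos(reciprocal(factorial(6)), (Decimal(1), Decimal(3)))
    return add(total, remainder)


print(enclose_e())  # (Decimal('2.716'), Decimal('2.723'))
```

The intermediate intervals agree with those listed in the example, and the final enclosure is again $[2.716, 2.723]$.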
Proofs of the theorems, as well as greater detail, appear in various texts on interval arithmetic. A good book on interval arithmetic is R. E. Moore's classic text [27], although numerous more recent monographs and reviews are available. A World Wide Web search on the term "interval computations" will lead to some of these.
A general introduction to interval computations is [26]. That work gives not only a complete introduction, with numerous examples and explanation of pitfalls, but also provides examples with intlab, a free matlab toolbox for interval computations, and reference material for intlab. If you have matlab available, we recommend intlab for the exercises in this book involving interval computations.
1.4 Programming Environments
Modern scientific computing (with floating point numbers) is usually done with high-level "imperative" (as opposed to "functional") programming languages. Common programming environments in use for general scientific computing today are Fortran (or FORmula TRANslation), C/C++, and matlab. Fortran is the original such language, with its origins in the late 1950's. There is a large body of high-quality, publicly available software in Fortran for common computations in numerical analysis and scientific computing. Such software can be found, for example, on NETLIB at
http://www.netlib.org/
Fortran has evolved over the years, becoming a modern, multi-faceted language with the Fortran 2003 standard. Throughout, the emphasis by both the standardization committee and suppliers of compilers for Fortran has been features that simplify programming of solutions to large problems in numerical analysis and scientific computing, and features that enable high performance,
especially on computers that can process vectors and matrices efficiently.
The "C" language, originally developed in conjunction with the Unix operating system, was originally meant to be a higher-level language for designing and accessing the operating system, but has become more ubiquitous since then. C++, appearing in the late 1980's, was the first widely available language$^7$ to allow the object-oriented programming paradigm. In recent years, computer science departments have favored teaching C++ over teaching Fortran, and Fortran has fallen out of favor in relative terms. However, Fortran is still favored in certain large-scale applications such as fluid dynamics (e.g. in weather prediction and similar simulations), and some courses are still offered in it in engineering schools. Nonetheless, some people still think of Fortran as the now somewhat rudimentary language known as FORTRAN 77.
Reasonably high-quality compilers for both Fortran and C/C++ are available free of charge with Linux operating systems. Fortran 2003, largely implemented in these compilers, has a standardized interface to C, so functions written in C can be called from Fortran programs, and vice versa. These compilers include interactive graphical-user-interface-oriented debuggers, such as "insight," available with the Linux operating system. Commercially available compilation and debugging systems are also available under Windows.
The matlab system has become increasingly popular over the last two decades or so. matlab (or MATrix LABoratory) began in the early 1980's as a National Science Foundation project, written by Cleve Moler in FORTRAN 66, to provide an interactive environment for computing with matrices and vectors, but has since evolved to be both an interactive environment and full-featured programming language. matlab is highly favored in courses such as this, because of the ease of programming, debugging, and general use (such as graphing), and because of the numerous toolboxes, supplied by both Mathworks (Cleve Moler's company) and others, for many basic computing tasks and applications. The main drawback to use of matlab in all scientific computations is that the language is interpreted, that is, matlab translates each line of a program to machine language each time that it executes the line. This makes complicated programs that involve nested iterations much slower (a factor of 60 or more) than comparable programs written in a compiled language such as C or Fortran. However, functions compiled from Fortran or C/C++ can be called from matlab. A common strategy has been to initially develop algorithms in matlab, then translate all or part of the program to a compilable language, as necessary for efficiency.
One perceived disadvantage of matlab is that it is proprietary. Undesirable possible consequences are that it is not free, and there is no official guarantee that it will be available forever, unchanged. However, its use has become so widespread in recent years that these concerns do not seem to be major. Several projects, including "Octave" and "Scilab," have produced free products that partially support the matlab programming language. The most widely distributed of these, "Octave," is integrated into Linux systems. However, the object-oriented features of Octave are rudimentary compared to those of matlab, and some toolboxes, such as intlab (which we will mention later), will not function with Octave.
7 with others, including Fortran, to follow
Alternative systems sometimes used for scientific computing are computer algebra systems. Perhaps the most common of these are Mathematica and Maple, while a free such system under development is "SAGE." These systems admit a different way of thinking about programming, termed functional programming, in which rules are defined and available at all times to automatically simplify expressions that are presented to the system. (In contrast, in imperative programming, a sequence of commands is executed one after the other.) Although these systems have become comprehensive, they are based in computations of a different character, rather than in the floating point computations and linear algebra involved in numerical analysis and scientific computing.
We will use matlab in this book to illustrate the concepts, techniques, and applications. With newer versions of matlab, a student can study how to use the system and write programs largely by using the matlab help system. The first place to turn will be the "Getting started" demos, which in newer versions are presented as videos. There are also many books devoted to use of matlab. Furthermore, we will be giving examples throughout this book.
matlab programs can be written as matlab scripts and matlab functions.
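Example 1.23
The matlab script we used to produce the table following Example 1.7 (on page 8) is:
a = 2;
x = 2;
xold = x;
err_old = 1;
for k = 0:10
   k
   x
   err = x - sqrt(2);           % error in the current iterate
   err
   ratio = err/err_old^2        % ratio used to observe quadratic convergence
   err_old = err;
   x = x/2 + 1/x;               % x_{k+1} = x_k/2 + a/(2 x_k) with a = 2
end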
Example 1.24
The matlab script we used to produce the table in Example 1.8 (on page 9) is:
format long
a = 2;
x = 2;
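xold = x;
err_old = 1;
for k = 0:25
   k
   x
   err = x - sqrt(2);           % error in the current iterate
   err
   ratio = err/err_old          % ratio used to observe linear convergence
   err_old = err;
   x = x - x^2/3.5 + 2/3.5;     % the iteration of Example 1.8
end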
An excellent alternative text book that focuses on matlab functions is Cleve Moler's Numerical Computing with Matlab [25]. An on-line version, along with ".m" files, etc., is currently available at http://www.mathworks.com/moler/chapters.html.
1.5 Applications
The purpose of the methods and techniques in this book ultimately is to provide both accurate predictions and insight into practical problems. This includes understanding, predicting, and managing or controlling the evolution of ecological systems and epidemics; designing and constructing durable but inexpensive bridges, buildings, roads, and water control structures; understanding chemical and physical processes; designing chemical plants and electronic components and systems; minimizing costs or maximizing delivery of products or services within companies and governments; etc. To achieve these goals, the numerical methods are a small part of the overall modeling process, which can be viewed as consisting of the following steps.
Identify the problem: This is the first step in translating an often vague situation into a mathematical problem to be solved. What questions must be answered, and how can they be quantified?
Assumptions: Which factors are to be ignored and which are important? The real world is usually significantly more complicated than mathematical models of it, and simplifications must be made, because some factors are poorly understood, because there isn't enough data to determine some minor factors, or because it is not practical to accurately solve the resulting equations unless the model is simplified. For example, the theory of relativity and variations in the acceleration of gravity
due to the fact that the earth is not exactly round and due to the fact that the density differs from point to point on the surface of the earth in principle will affect the trajectory of a baseball as it leaves the bat. However, such effects can be ignored when we write down a model of the trajectory of the baseball. On the other hand, we need to include such effects if we are measuring the change in distance between two satellites in a tandem orbit to detect the location of mineral deposits on the surface of the earth.
Construction: In this step, we actually translate the problem into mathematical language.
Analysis: We solve the mathematical problem. Here is where the numerical techniques in this book come into play. With more complicated models, there is an interplay between the previous three steps and this solution process: We may need to simplify the process to enable practical solution. Also, presentation of the result is important here, to maximize the usefulness of the results. In the early days of scientific computing, printouts of numbers were used, but, increasingly, results are presented as two- and three-dimensional graphs and movies.
Interpretation: The numerical solution is compared to the original problem. If it does not make sense, go back and reformulate the assumptions.
Validation: Compare the model to real data. For example, in climate models, the model might be used to predict climate changes in past years, before it is used to predict future climate changes.
Note that there is an intrinsic error introduced in the modeling process (such as when certain phenomena are ignored) that is outside the scope of our study of numerical methods. Such error can only be measured indirectly, through the interpretation and validation steps. In the model solution process (the "analysis" step), errors are also introduced. We have seen that such error consists of approximation error and roundoff error. In a study of numerical methods and numerical analysis, we quantify and find bounds on such errors. Although this may not be the major source of error, it is important to know. Consequences of this type of error might be that a good model is rejected, that incorrect conclusions are deduced about the process being modeled, etc. The authors of this book have personal experience with these events.
Errors in the modeling process can sometimes be quantified in the solution process. If the model depends on parameters that are not known precisely, but bounds on those parameters are known, knowledge of these bounds can sometimes be incorporated into the mathematical equations, and the set of possible solutions can sometimes be computed or bounded. One tool that sometimes works is interval arithmetic. Other tools, less mathematically definite but
applicable in different situations, are statistical methods and computing solutions to the model for many different values of the parameter.
Throughout this book, we introduce applications from many areas.
Example 1.25
The formula for the net capacitance when two capacitors of values $x$ and $y$ are connected in series is
$$z = \frac{xy}{x+y}.$$
Suppose the measured values of $x$ and $y$ are $x = 1$ and $y = 2$, respectively. Estimate the range of possible values of $z$, given that the true values of $x$ and $y$ are known to be within ±10% of the measured value.
In this example, the identification, assumptions, and construction have already been done. (It is well known how capacitances in a linear electrical circuit behave.) We are asked to analyze the error in the output of the computation, due to errors in the data. We may proceed using interval arithmetic, relying on the accuracy assumptions for the measured values. In particular, these assumptions imply that $x \in [0.9, 1.1]$ and $y \in [1.8, 2.2]$. We will plug these intervals into the expression for $z$, but we first use Theorem 1.8, part (2) as a guide to rewrite the expression for $z$ so $x$ and $y$ only occur once. (We do this so we obtain sharp bounds on the range, without overestimation.) Dividing the numerator and denominator for $z$ by $xy$, we obtain
$$z = \frac{1}{\dfrac{1}{x} + \dfrac{1}{y}}.$$
We use the intlab toolbox$^8$ for matlab to evaluate $z$. We have the following dialog in matlab's command window:
>> intvalinit('DisplayInfsup')
===> Default display of intervals by infimum/supremum
>> x = intval('[0.9,1.1]')
intval x =
[ 0.8999, 1.1001]
>> y = intval('[1.8,2.2]')
intval y =
[ 1.7999, 2.2001]
>> z = 1/(1/x + 1/y)
intval z =
[ 0.5999, 0.7334]
>> format long
>> z
intval z =
[ 0.59999999999999, 0.73333333333334]
>>
8 If one has matlab, intlab is available free of charge for non-commercial use from http://www.ti3.tu-harburg.de/~rump/intlab/
Thus, the capacitance must lie between 0.5999 and 0.7334.
Note that $x$ and $y$ are input as strings. This is to assure that roundoff errors in converting the decimal expressions 0.9, 1.1, 1.8, and 2.2 into internal binary format are taken into account. See [26] for more examples of the use of intlab.
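For readers without intlab, the effect of rewriting $z$ as a single use expression in Example 1.25 can be seen with a short Python sketch. Exact rational arithmetic (`fractions.Fraction`) stands in for outwardly rounded floating point, so the endpoints below are exact rather than rounded as in the intlab output, and the helper names `add`, `mul`, and `inv` are hypothetical.

```python
from fractions import Fraction as Fr


def add(a, b):
    """Interval sum."""
    return (a[0] + b[0], a[1] + b[1])


def mul(a, b):
    """Interval product: min and max over the four endpoint products."""
    products = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(products), max(products))


def inv(a):
    """Interval reciprocal, defined only when 0 is not in the interval."""
    if a[0] <= 0 <= a[1]:
        raise ZeroDivisionError("interval contains zero")
    return (1 / a[1], 1 / a[0])


x = (Fr(9, 10), Fr(11, 10))   # 1 within 10%
y = (Fr(18, 10), Fr(22, 10))  # 2 within 10%

# x and y each occur twice: z = xy / (x + y)
naive = mul(mul(x, y), inv(add(x, y)))

# x and y each occur once: z = 1 / (1/x + 1/y)
sharp = inv(add(inv(x), inv(y)))

print([float(v) for v in naive])  # about [0.4909, 0.8963]
print([float(v) for v in sharp])  # about [0.6, 0.7333]
```

The single use form gives the exact range $[3/5, 11/15]$, which the intlab result above encloses after outward rounding, while the original form overestimates it noticeably.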
1.6 Exercises
1. Write down a polynomial $p(x)$ such that $|S(x) - p(x)| \le 10^{-10}$ for $-0.2 \le x \le 0.2$, where
$$S(x) = \begin{cases} \dfrac{\sin(x)}{x} & \text{if } x \ne 0, \\ 1 & \text{if } x = 0. \end{cases}$$
Note: $\mathrm{sinc}(x) = S(\pi x) = \sin(\pi x)/(\pi x)$ is the "sinc" function (well-known in signal processing, etc.).
(a) Show that your polynomial $p$ satisfies the condition $|\mathrm{sinc}(x) - p(x)| \le 10^{-10}$ for $x \in [-0.2, 0.2]$.
Hint: You can obtain polynomial approximations with error terms for sinc(x) by writing down Taylor polynomials and corresponding error terms for sin(x), then dividing these by x. This can be easier than trying to differentiate sinc(x). For the proof part, you can use, for example, the Taylor polynomial remainder formula or the alternating series test, and you can use interval arithmetic to obtain bounds.
(b) Plot your polynomial approximation and sinc(x) on the same graph,
(i) over the interval $[-0.2, 0.2]$,
(ii) over the interval $[-3, 3]$,
(iii) over the interval $[-10, 10]$.
2. Suppose $f$ has a continuous third derivative. Show that
$$\left| \frac{f(x+h) - f(x-h)}{2h} - f'(x) \right| = O(h^2).$$
3. Suppose $f$ has a continuous fourth derivative. Show that
$$\left| \frac{f(x+h) - 2f(x) + f(x-h)}{h^2} - f''(x) \right| = O(h^2).$$
4. Let $a = 0.41$, $b = 0.36$, and $c = 0.7$. Assuming a 2-digit decimal computer arithmetic with rounding, show that $\dfrac{a-b}{c} \ne \dfrac{a}{c} - \dfrac{b}{c}$ when using this arithmetic.
5. Write down a formula relating the unit roundoff $\delta$ of Definition 1.3 (page 14) and the machine epsilon $\epsilon_m$ defined on page 20.
6. Store and run the following matlab script. What are your results? What does the script compute? Can you say anything about the computer arithmetic underlying matlab?
eps = 1;
x = 1+eps;
while(x~=1)
   eps = eps/2;
   x = 1+eps;
end
eps = eps+(2*eps)^2
y = 1+eps;
y-1
7. Suppose, for illustration, we have a system with base $\beta = 10$, $t = 3$ decimal digits in the mantissa, and $L = -9$, $U = 9$ for the exponent. For example, $0.123 \times 10^4$, that is, 1230, is a machine number in this system. Suppose also that "round to nearest" is used in this system.
(a) What is HUGE for this system?
(b) What is TINY for this system?
(c) What is the machine epsilon $\epsilon_m$ for this system?
(d) Let $f(x) = \sin(x) + 1$.
i. Write down $\mathrm{fl}(f(0))$ and $\mathrm{fl}(f(0.0008))$ in normalized format for this toy system.
ii. Compute $\mathrm{fl}(\mathrm{fl}(f(0.0008)) - \mathrm{fl}(f(0)))$. On the other hand, what is the nearest machine number to the exact value of $f(0.0008) - f(0)$?
iii. Compute $\mathrm{fl}(\mathrm{fl}(f(0.0008)) - \mathrm{fl}(f(0)))/\mathrm{fl}(0.0008)$. Compare this to the nearest machine number to the exact value of $(f(0.0008) - f(0))/0.0008$ and to $f'(0)$.
8. Let $f(x) = \dfrac{\ln(x+1) - \ln(x)}{2}$.
(a) Use four-digit decimal arithmetic with rounding to evaluate $f(100{,}000)$.
(b) Use the Mean Value Theorem to approximate $f(x)$ in a form that avoids the loss of significant digits. Use this form to evaluate $f(x)$ for $x = 100{,}000$ once again.
(c) Compare the relative errors for the answers obtained in (a) and (b).
9. Compute the condition number of $f(x) = e^{\sqrt{x^2-1}}$, $x > 1$, and discuss any possible ill-conditioning.
10. Let $f(x) = (\sin(x))^2 + x^2$. Use interval arithmetic to prove that there are no solutions to $f(x) = 0$ for $x \in [-1, -0.8]$.
We have
$$\begin{aligned}
x_{k+1} - \sqrt{a} &= \frac{x_k}{2} + \frac{a}{2x_k} - \sqrt{a} \\
&= x_k - \frac{x_k^2 - a}{2x_k} - \sqrt{a} \\
&= (x_k - \sqrt{a}) - (x_k - \sqrt{a})\,\frac{x_k + \sqrt{a}}{2x_k} \\
&= (x_k - \sqrt{a})\left(1 - \frac{x_k + \sqrt{a}}{2x_k}\right) \\
&= (x_k - \sqrt{a})\,\frac{x_k - \sqrt{a}}{2x_k} \\
&= \frac{1}{2x_k}(x_k - \sqrt{a})^2 \\
&\approx \frac{1}{2\sqrt{a}}(x_k - \sqrt{a})^2
\end{aligned}$$
for $x_k$ near $\sqrt{a}$, thus showing that the convergence rate is quadratic.
Quadratic convergence is very fast. We can think of quadratic convergence, with $C \approx 1$, as doubling the number of significant figures on each iteration. (In contrast, linear convergence with $C = 0.1$ adds one decimal digit of accuracy to the approximation on each iteration.) For example, if we use the square root computation from Example 1.7 with $a = 2$, and starting with $x_0 = 2$, we obtain the following table:
k    x_k                  x_k - sqrt(2)        (x_k - sqrt(2)) / (x_{k-1} - sqrt(2))^2
0    2                    0.5858 × 10^0        -
1    1.5                  0.8579 × 10^-1       0.2500
2    1.416666666666667    0.2453 × 10^-2       0.3333
3    1.414215686274510    0.2123 × 10^-5       0.3529
4    1.414213562374690    0.1594 × 10^-11      0.3535
5    1.414213562373095    0.2204 × 10^-15      -
In this table, the correct digits are underlined. This table illustrates that the total number of correct digits more than doubles on each iteration. In fact, the multiplying factor $C$ for the quadratic convergence appears to be approaching 0.3535. (The last error ratio is not meaningful in this sense, because only roughly 16 digits were carried in the computation.) Based on our analysis, the limiting value of $C$ should be about $1/(2\sqrt{2}) \approx 0.353553390593274$. (We explain how we computed the table at the end of this chapter.)
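The table can also be regenerated with a few lines of Python (used here instead of matlab purely for illustration). This is a sketch in ordinary double precision; the function name `newton_sqrt_table` is hypothetical.

```python
import math


def newton_sqrt_table(a=2.0, x0=2.0, steps=5):
    """Rows (k, x_k, x_k - sqrt(a), error ratio) for x_{k+1} = x_k/2 + a/(2 x_k)."""
    root = math.sqrt(a)
    rows, x, prev_err = [], x0, None
    for k in range(steps + 1):
        err = x - root
        # Ratio err_k / err_{k-1}^2, which should approach 1/(2 sqrt(a)).
        ratio = None if prev_err is None else err / prev_err ** 2
        rows.append((k, x, err, ratio))
        prev_err = err
        x = x / 2 + a / (2 * x)
    return rows


for k, x, err, ratio in newton_sqrt_table():
    shown = "-" if ratio is None else f"{ratio:.4f}"
    print(f"{k}  {x:.15f}  {err:.4e}  {shown}")

print("limit 1/(2*sqrt(2)) =", 1 / (2 * math.sqrt(2)))
```

Once the error reaches the level of roundoff (about $10^{-16}$ here), the printed ratios stop being meaningful, as noted above.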
Example 1.8
As an example of linear convergence, consider the iteration
$$x_{k+1} = x_k - \frac{x_k^2}{3.5} + \frac{2}{3.5},$$
which converges to $\sqrt{2}$. We obtain the following table:
k     x_k                  x_k - sqrt(2)        (x_k - sqrt(2)) / (x_{k-1} - sqrt(2))
0     2                    0.5858 × 10^0        -
1     1.428571428571429    0.1436 × 10^-1       0.2451 × 10^-1
2     1.416909620991254    0.2696 × 10^-2       0.1878
3     1.414728799831946    0.5152 × 10^-3       0.1911
4     1.414312349239392    0.9879 × 10^-4       0.1917
5     1.414232514607664    0.1895 × 10^-4       0.1918
6     1.414217198786659    0.3636 × 10^-5       0.1919
7     1.414214260116949    0.6955 × 10^-6       0.1919
8     1.414213696254626    0.1339 × 10^-6       0.1919
19    1.414213562373097    0.1554 × 10^-14      -
Here, the constant $C$ in the linear convergence, to four significant digits, appears to be $0.1919 \approx 1/5$. That is, the error is reduced by approximately a factor of 5 each iteration. We can think of this as obtaining one more correct base-5 digit on each iteration.
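The observed constant can be compared with what a fixed point analysis predicts. Writing the iteration as $x_{k+1} = g(x_k)$ with $g(x) = x - x^2/3.5 + 2/3.5$, we have $g'(\sqrt{2}) = 1 - 2\sqrt{2}/3.5 \approx 0.1919$. The Python sketch below (an illustration only, with the hypothetical name `linear_iteration_ratios`) computes the successive error ratios in double precision.

```python
import math


def linear_iteration_ratios(x0=2.0, steps=8):
    """Error ratios err_k / err_{k-1} for x_{k+1} = x_k - x_k^2/3.5 + 2/3.5."""
    root = math.sqrt(2.0)
    x, prev_err, ratios = x0, None, []
    for _ in range(steps + 1):
        err = x - root
        if prev_err is not None:
            ratios.append(err / prev_err)
        prev_err = err
        x = x - x * x / 3.5 + 2 / 3.5
    return ratios


ratios = linear_iteration_ratios()
predicted = 1 - 2 * math.sqrt(2.0) / 3.5   # g'(sqrt(2))

for k, r in enumerate(ratios, start=1):
    print(k, f"{r:.4f}")
print("predicted constant:", f"{predicted:.4f}")
```

The ratios settle near the predicted value after a few iterations, in agreement with the last column of the table.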
1.2 Computer Arithmetic
In numerical solution of mathematical problems, two common types of error are:
1. Method (algorithm or truncation) error. This is the error due to approximations made in the numerical method.
2. Rounding error. This is the error made due to the finite number of digits available on a computer.
Example 1.9
By the mean value theorem for integrals (Theorem 1.2, as in Example 1.6 on page 7), if $f \in C^2[a,b]$, then
$$f'(x) = \frac{f(x+h) - f(x)}{h} - \frac{1}{h}\int_x^{x+h} f''(t)(x+h-t)\,dt$$
and
$$\left| \frac{1}{h}\int_x^{x+h} f''(t)(x+h-t)\,dt \right| \le ch.$$
Thus, $f'(x) \approx (f(x+h) - f(x))/h$, and the error is $O(h)$. We will call this the method error or truncation error, as opposed to roundoff errors due to using machine approximations.
Now consider $f(x) = \ln x$ and approximate $f'(3) \approx \dfrac{\ln(3+h) - \ln 3}{h}$ for $h$ small using a calculator having 11 digits. The following results were obtained.
h         (ln(3+h) - ln(3))/h    Error = 1/3 - (ln(3+h) - ln(3))/h = O(h)
10^-1     0.3278982              5.44 × 10^-3
10^-2     0.332779               5.54 × 10^-4
10^-3     0.3332778              5.55 × 10^-5
10^-4     0.333328               5.33 × 10^-6
10^-5     0.333330               3.33 × 10^-6
10^-6     0.333300               3.33 × 10^-5
10^-7     0.333                  3.33 × 10^-4
10^-8     0.33                   3.33 × 10^-3
10^-9     0.3                    3.33 × 10^-2
10^-10    0.0                    3.33 × 10^-1
One sees that, in the first four steps, the error decreases by a factor of 10 as $h$ is decreased by a factor of 10. (That is, the method error dominates.) However, starting with $h = 0.00001$, the error increases. (The error due to a finite number of digits, i.e., roundoff error, dominates.)
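The same experiment is easy to repeat in double precision. The Python sketch below is an illustration under that assumption (roughly 16 decimal digits instead of the calculator's 11), so the turning point occurs at a smaller $h$ than in the table, but the pattern is the same: the error first shrinks with $h$ and then grows again once roundoff dominates. The function name `forward_difference_error` is hypothetical.

```python
import math


def forward_difference_error(h, x=3.0):
    """Absolute error of (ln(x+h) - ln(x))/h as an approximation to 1/x."""
    approx = (math.log(x + h) - math.log(x)) / h
    return abs(1.0 / x - approx)


# Errors for h = 10^-1, 10^-2, ..., 10^-15, keyed by the exponent k.
errors = {k: forward_difference_error(10.0 ** -k) for k in range(1, 16)}

for k, err in errors.items():
    print(f"h = 1e-{k:02d}   error = {err:.3e}")
```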
There are two possible ways to reduce rounding error:
1. The method error can be reduced by using a more accurate method. This allows larger $h$ to be used, thus avoiding roundoff error. Consider
$$f'(x) = \frac{f(x+h) - f(x-h)}{2h} + \{\text{error}\}, \quad \text{where \{error\} is } O(h^2).$$
h        (ln(3+h) - ln(3-h))/(2h)    error
0.1      0.3334568                   1.24 × 10^-4
0.01     0.3333345                   1.23 × 10^-6
0.001    0.3333333                   1.91 × 10^-8
The error decreases by a factor of 100 as $h$ is decreased by a factor of 10.
2. Rounding error can be reduced by using more digits of accuracy, such as using double precision (or multiple precision) arithmetic.
To fully understand and avoid roundoff error, we should study some details of how computers and calculators represent and work with approximate numbers.
1.2.1 Floating Point Arithmetic and Rounding Error
Let $\beta$ be a positive integer, the base of the computer system. (Usually, $\beta = 2$ (binary) or $\beta = 16$ (hexadecimal).) Suppose a number $x$ has the exact base representation
$$x = (\pm 0.\alpha_1\alpha_2\alpha_3 \cdots \alpha_t\alpha_{t+1} \cdots)_\beta\, \beta^m = \pm q\beta^m,$$
where $q$ is the mantissa, $\beta$ is the base, $m$ is the exponent, $1 \le \alpha_1 \le \beta - 1$, and $0 \le \alpha_i \le \beta - 1$ for $i > 1$.
On a computer, we are restricted to a finite set of floating-point numbers $F = F(\beta, t, L, U)$ of the form $x^* = (\pm 0.a_1a_2 \cdots a_t)_\beta\, \beta^m$, where $1 \le a_1 \le \beta - 1$, $0 \le a_i \le \beta - 1$ for $2 \le i \le t$, $L \le m \le U$, and $t$ is the number of digits. (In most floating point systems, $L$ is about $-64$ to $-1000$ and $U$ is about 64 to 1000.)
Example 1.10
(binary) $\beta = 2$:
$$x^* = (0.1011)_2\, 2^3 = \left(1 \times \frac{1}{2} + 0 \times \frac{1}{4} + 1 \times \frac{1}{8} + 1 \times \frac{1}{16}\right) \times 8 = \frac{11}{2} = 5.5 \text{ (decimal)}.$$
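The evaluation in Example 1.10 follows directly from the definition $x^* = (\pm 0.a_1a_2 \cdots a_t)_\beta\, \beta^m$. As a small illustration, the Python sketch below evaluates such a representation exactly with rational arithmetic. The function name `float_value` and its argument order are hypothetical.

```python
from fractions import Fraction


def float_value(digits, base, exponent, sign=1):
    """Exact value of sign * (0.a1 a2 ... at)_base * base**exponent."""
    if not digits or not 1 <= digits[0] <= base - 1:
        raise ValueError("leading digit must satisfy 1 <= a1 <= base - 1")
    if any(not 0 <= d <= base - 1 for d in digits):
        raise ValueError("digits must satisfy 0 <= ai <= base - 1")
    # Mantissa: a1/base + a2/base^2 + ... + at/base^t
    mantissa = sum(Fraction(d, base ** (i + 1)) for i, d in enumerate(digits))
    return sign * mantissa * Fraction(base) ** exponent


print(float_value([1, 0, 1, 1], base=2, exponent=3))   # 11/2
print(float_value([1, 2, 3], base=10, exponent=4))     # 1230
```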
REMARK 1.1 Most numbers cannot be exactly represented on a computer. Consider $x = 10.1 = 1010.0001\,1001\,1001 \cdots$ ($\beta = 2$). If $L = -127$, $U = 127$, $t = 24$, and $\beta = 2$, then $x \approx x^* = (0.10100001\,1001\,1001\,1001\,1001)_2\, 2^4$.
Question: Given a real number $x$, how do we define a floating point number $\mathrm{fl}(x)$ in $F$, such that $\mathrm{fl}(x)$ is close to $x$?
On modern machines, one of the following four ways is used to approximate a real number $x$ by a machine-representable number $\mathrm{fl}(x)$.
round down: $\mathrm{fl}(x) = x\!\downarrow$, the nearest machine representable number to the real number $x$ that is less than or equal to $x$.
round up: $\mathrm{fl}(x) = x\!\uparrow$, the nearest machine number to the real number $x$ that is greater than or equal to $x$.
round to nearest: $\mathrm{fl}(x)$ is the nearest machine number to the real number $x$.
round to zero, or "chopping": $\mathrm{fl}(x)$ is the nearest machine number to the real number $x$ that is closer to 0 than $x$. The term "chopping" is because we simply "chop" the expansion of the real number, that is, we simply ignore the digits in the expansion of $x$ beyond the $t$-th one.
The default on modern systems is usually round to nearest, although chopping is faster or requires less circuitry. Round down and round up may be used, but with care, to produce results from a string of computations that are guaranteed to be less than or greater than the exact result.
Example 1.11
$\beta = 10$, $t = 5$, $x = 0.12345666 \cdots \times 10^7$. Then
$\mathrm{fl}(x) = 0.12345 \times 10^7$ (chopping),
$\mathrm{fl}(x) = 0.12346 \times 10^7$ (rounded to nearest).
(In this case, round down corresponds to chopping and round up corresponds to round to nearest.)
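The four rounding modes can be tried out in base 10 with Python's `decimal` module, whose contexts take a precision (the number of digits $t$) and a rounding rule. In the sketch below, `1234566.6` is a truncated stand-in for the $x$ of Example 1.11, ties in round to nearest are broken toward an even last digit (an assumption, since the text does not specify tie-breaking), and the function name `fl` is hypothetical.

```python
from decimal import (ROUND_CEILING, ROUND_DOWN, ROUND_FLOOR,
                     ROUND_HALF_EVEN, Context, Decimal)

MODES = {
    "round down": ROUND_FLOOR,           # toward -infinity
    "round up": ROUND_CEILING,           # toward +infinity
    "round to nearest": ROUND_HALF_EVEN,
    "chopping": ROUND_DOWN,              # toward zero
}


def fl(x, mode, digits=5):
    """Round x to the given number of significant decimal digits."""
    return Context(prec=digits, rounding=MODES[mode]).plus(x)


x = Decimal("1234566.6")
for mode in MODES:
    print(f"{mode:17s} {fl(x, mode)}   {fl(-x, mode)}")
```

For the positive value, chopping agrees with round down, as stated in the example. For the negative value, chopping instead agrees with round up, which is why the two pairs of modes must be kept distinct.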
See Figure 1.2 for an example with $\beta = 10$ and $t = 1$. In that figure, the exhibited floating point numbers are $(0.1) \times 10^1$, $(0.2) \times 10^1$, ..., $(0.9) \times 10^1$, $0.1 \times 10^2$.
[Figure 1.2: number line from $\beta^{m-1} = 1$ to $\beta^m = 10^1$, with successive floating point numbers spaced $\beta^{m-t} = 10^0 = 1$ apart.]
FIGURE 1.2: An example floating point system: $\beta = 10$, $t = 1$, and $m = 1$.
Example 1.12
Let $a = 0.410$, $b = 0.000135$, and $c = 0.000431$. Assuming 3-digit decimal computer arithmetic with rounding to nearest, does $a + (b + c) = (a + b) + c$ when using this arithmetic?
Following the "rounding to nearest" definition of fl, we emulate the operations a machine would do, as follows:
$$a \leftarrow 0.410 \times 10^0, \quad b \leftarrow 0.135 \times 10^{-3}, \quad c \leftarrow 0.431 \times 10^{-3},$$
and
$$\begin{aligned}
\mathrm{fl}(b + c) &= \mathrm{fl}(0.135 \times 10^{-3} + 0.431 \times 10^{-3}) \\
&= \mathrm{fl}(0.566 \times 10^{-3}) \\
&= 0.566 \times 10^{-3},
\end{aligned}$$
so
$$\begin{aligned}
\mathrm{fl}(a + 0.566 \times 10^{-3}) &= \mathrm{fl}(0.410 \times 10^0 + 0.566 \times 10^{-3}) \\
&= \mathrm{fl}(0.410 \times 10^0 + 0.000566 \times 10^0) \\
&= \mathrm{fl}(0.410566 \times 10^0) \\
&= 0.411 \times 10^0.
\end{aligned}$$
On the other hand,
$$\begin{aligned}
\mathrm{fl}(a + b) &= \mathrm{fl}(0.410 \times 10^0 + 0.135 \times 10^{-3}) \\
&= \mathrm{fl}(0.410000 \times 10^0 + 0.000135 \times 10^0) \\
&= \mathrm{fl}(0.410135 \times 10^0) \\
&= 0.410 \times 10^0,
\end{aligned}$$
so
$$\begin{aligned}
\mathrm{fl}(0.410 \times 10^0 + c) &= \mathrm{fl}(0.410 \times 10^0 + 0.431 \times 10^{-3}) \\
&= \mathrm{fl}(0.410 \times 10^0 + 0.000431 \times 10^0) \\
&= \mathrm{fl}(0.410431 \times 10^0) \\
&= 0.410 \times 10^0 \ne 0.411 \times 10^0.
\end{aligned}$$
Thus, the associative law does not hold for floating point arithmetic with "round to nearest." Furthermore, this illustrates that accuracy is improved if numbers of like magnitude are added first in a sum.
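The computation in Example 1.12 can be replayed with Python's `decimal` module, using a context with 3 significant digits and round to nearest (ties to even, an assumption that does not matter for these particular values). This is only an illustrative sketch of the toy arithmetic, not of binary floating point.

```python
from decimal import ROUND_HALF_EVEN, Context, Decimal

# 3-digit decimal arithmetic with round to nearest.
ctx = Context(prec=3, rounding=ROUND_HALF_EVEN)

a = Decimal("0.410")
b = Decimal("0.000135")
c = Decimal("0.000431")

# a + (b + c): the two small numbers are added first.
small_first = ctx.add(a, ctx.add(b, c))

# (a + b) + c: each small number is lost when added to a.
large_first = ctx.add(ctx.add(a, b), c)

print(small_first)  # 0.411
print(large_first)  # 0.410
```

The exact sum is 0.410566, so adding the two numbers of like magnitude first gives the more accurate result.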
The following error bound is useful in some analyses.
THEOREM 1.5
$$|x - \mathrm{fl}(x)| \le \frac{1}{2}|x|\beta^{1-t}p,$$
where $p = 1$ for rounding and $p = 2$ for chopping.
DEFINITION 1.3 $\delta = \dfrac{p}{2}\beta^{1-t}$ is called the unit roundoff error.
Let $\epsilon = \dfrac{\mathrm{fl}(x) - x}{x}$. Then $\mathrm{fl}(x) = (1 + \epsilon)x$, where $|\epsilon| \le \delta$. With this, we have the following.
THEOREM 1.6
Let $\odot$ denote the operation $+$, $-$, $\times$, or $\div$, and let $x$ and $y$ be machine numbers. Then
$$\mathrm{fl}(x \odot y) = (x \odot y)(1 + \epsilon), \quad \text{where } |\epsilon| \le \delta = \frac{p}{2}\beta^{1-t}.$$
Roundoff error that accumulates as a result of a sequence of arithmeticoperations can be analyzed using this theorem Such an analysis is called forward error analysis
Because of the properties of floating point arithmetic it is unreasonable todemand strict tolerances when the exact result is too large
Example 1.13
Suppose $\beta = 10$ and $t = 3$ (3-digit decimal arithmetic), and suppose we wish to compute $10^4\pi$ with a computed value $x$ such that $|10^4\pi - x| < 10^{-2}$. The closest floating point number in our system to $10^4\pi$ is $x = 0.314 \times 10^5 = 31400$. However, $|10^4\pi - x| = 15.926\ldots$. Hence, it is impossible to find a number $x$ in the system with $|10^4\pi - x| < 10^{-2}$.
The error $|10^4\pi - x|$ in this example is called the absolute error in approximating $10^4\pi$. We see that absolute error is not an appropriate measure of error when using floating point arithmetic. For this reason, we use relative error:
DEFINITION 1.4 Let $x^*$ be an approximation to $x$. Then $|x - x^*|$ is called the absolute error, and $\left|\frac{x - x^*}{x}\right|$ is called the relative error.
For example, $\left|\frac{x - fl(x)}{x}\right| \le \delta = \frac{p}{2}\beta^{1-t}$ (unit roundoff error).
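As a rough numerical illustration of Definition 1.4 applied to Example 1.13, the sketch below rounds $10^4\pi$ to three significant decimal digits and compares the absolute and relative errors with the unit roundoff. It assumes the `decimal` module as a model of the 3-digit system; the variable names are hypothetical.

```python
import math
from decimal import Context, ROUND_HALF_EVEN

ctx = Context(prec=3, rounding=ROUND_HALF_EVEN)   # beta = 10, t = 3

exact = 1.0e4 * math.pi                  # 31415.926...
x = ctx.create_decimal(repr(exact))      # nearest 3-digit decimal number
abs_err = abs(exact - float(x))          # absolute error
rel_err = abs_err / abs(exact)           # relative error
delta = 0.5 * 10.0 ** (1 - 3)            # unit roundoff with p = 1

print(x, abs_err, rel_err, delta)
```

The absolute error is far larger than $10^{-2}$, while the relative error stays below $\delta$, which is the behavior the definition is meant to capture.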
1.2.1.1 Expression Evaluation and Condition Numbers
We now examine some common situations in which roundoff error can become large, and explain how to avoid many of these situations.
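Example 1.14
$\beta = 10$, $t = 4$, $p = 1$. (Thus, $\delta = \frac{1}{2}10^{-3} = 0.0005$.) Let $x = 0.5795 \times 10^{5}$, $y = 0.6399 \times 10^{5}$. Then
\[ fl(x + y) = 0.1219 \times 10^{6} = (x + y)(1 + \epsilon_1), \quad \epsilon_1 \approx -3.28 \times 10^{-4}, \quad |\epsilon_1| < \delta, \]
and
\[ fl(xy) = 0.3708 \times 10^{10} = (xy)(1 + \epsilon_2), \quad \epsilon_2 \approx -5.95 \times 10^{-5}, \quad |\epsilon_2| < \delta. \]
(Note: $x + y = 0.12194 \times 10^{6}$, $xy = 0.37082205 \times 10^{10}$.)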
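Example 1.15
Suppose $\beta = 10$ and $t = 4$ (4 digit arithmetic), suppose $x_1 = 10000$ and $x_2 = x_3 = \cdots = x_{1001} = 1$. Then
\[ \begin{aligned}
fl(x_1 + x_2) &= 10000, \\
fl(x_1 + x_2 + x_3) &= 10000, \\
&\;\;\vdots \\
fl\left(\sum_{i=1}^{1001} x_i\right) &= 10000,
\end{aligned} \]
when we sum forward from $x_1$. But going backwards,
\[ \begin{aligned}
fl(x_{1001} + x_{1000}) &= 2, \\
fl(x_{1001} + x_{1000} + x_{999}) &= 3, \\
&\;\;\vdots \\
fl\left(\sum_{i=1001}^{1} x_i\right) &= 11000,
\end{aligned} \]
which is the correct sum.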
This example illustrates the point that large relative errors occur when a large number of small numbers is added to a large number, or when a very large number of small, almost-equal numbers is added. To avoid such large relative errors, one can sum from the smallest number to the largest number. However, this will not work if the numbers are all approximately equal. In such cases, one possibility is to group the numbers into sets of two adjacent numbers, summing two almost equal numbers together. One then groups those results into sets of two and sums these together, continuing until the total sum is reached. In this scheme, two almost equal numbers are always being summed, and the large relative error from repeatedly summing a small number to a large number is avoided.
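The effect of summation order, and the pairwise grouping just described, can be sketched as follows. The sketch assumes a 4-digit decimal machine modeled with Python's `decimal` module; the helper names `fl_sum` and `pairwise_sum` and the test data of 16000 ones are illustrative choices, not taken from the text.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

ctx = Context(prec=4, rounding=ROUND_HALF_EVEN)   # beta = 10, t = 4

def fl_sum(values):
    """Sum left to right, rounding after every addition."""
    total = Decimal(0)
    for v in values:
        total = ctx.add(total, v)
    return total

def pairwise_sum(values):
    """Repeatedly add adjacent pairs, rounding each addition."""
    values = list(values)
    while len(values) > 1:
        paired = [ctx.add(values[i], values[i + 1])
                  for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:            # carry an unpaired last element forward
            paired.append(values[-1])
        values = paired
    return values[0]

xs = [Decimal(10000)] + [Decimal(1)] * 1000    # data of Example 1.15
forward = fl_sum(xs)                # large number first
backward = fl_sum(reversed(xs))     # small numbers first

ones = [Decimal(1)] * 16000         # many equal numbers (hypothetical data)
naive_ones = fl_sum(ones)           # stalls once the partial sum reaches 10000
pair_ones = pairwise_sum(ones)      # operands stay of similar size

print(forward, backward, naive_ones, pair_ones)
```

For the equal-sized data, sorting cannot help, but the pairwise scheme keeps the two operands of each addition comparable in magnitude.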
Example 1.16
$x_1 = 15.314768$, $x_2 = 15.314899$, $\beta = 10$, $t = 6$ (6-digit decimal accuracy). Then $x_2 - x_1 \approx fl(x_2) - fl(x_1) = 15.3149 - 15.3148 = 0.0001$. Thus,
\[ \left| \frac{x_2 - x_1 - (fl(x_2) - fl(x_1))}{x_2 - x_1} \right| = \frac{0.000131 - 0.0001}{0.000131} = 0.237 = 23.7\% \text{ relative accuracy.} \]
This example illustrates that large relative errors can occur when two nearly equal numbers are subtracted on a computer. Sometimes, an algorithm can be modified to reduce rounding error occurring from this source, as the following example illustrates.
Example 1.17
Consider finding the roots of $ax^2 + bx + c = 0$, where $b^2$ is large compared with $|4ac|$. The most common formula for the roots is
\[ x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}. \]
Consider $x^2 + 100x + 1 = 0$, $\beta = 10$, $t = 4$, $p = 2$, and 4-digit chopped arithmetic. Then
\[ x_1 = \frac{-100 + \sqrt{9996}}{2}, \quad x_2 = \frac{-100 - \sqrt{9996}}{2}, \]
but $\sqrt{9996} \approx 99.97$ (4 digit arithmetic chopped). Thus,
\[ x_1 \approx \frac{-100 + 99.97}{2}, \quad x_2 \approx \frac{-100 - 99.97}{2}. \]
Hence, $x_1 \approx -0.015$, $x_2 \approx -99.98$, but $x_1 = -0.010001$ and $x_2 = -99.989999$, so the relative errors in $x_1$ and $x_2$ are 50% and 0.01%, respectively.
Let's change the algorithm. Assume $b \ge 0$ (can always make $b \ge 0$). Then
\[ x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \cdot \frac{-b - \sqrt{b^2 - 4ac}}{-b - \sqrt{b^2 - 4ac}} = \frac{4ac}{2a(-b - \sqrt{b^2 - 4ac})} = \frac{-2c}{b + \sqrt{b^2 - 4ac}}, \]
and
\[ x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a} \quad \text{(the same as before).} \]
Then, for the above values,
\[ x_1 = \frac{-2(1)}{100 + \sqrt{9996}} \approx \frac{-2}{100 + 99.97} = -0.0100. \]
Now, the relative error in $x_1$ is also 0.01%.
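The two formulas for $x_1$ in Example 1.17 can be compared in a small sketch of 4-digit chopped arithmetic. It assumes Python's `decimal` module with `ROUND_DOWN` as the model of chopping; because `Decimal.sqrt` always rounds to nearest, the square root is formed at higher precision and then chopped explicitly. All names are illustrative.

```python
from decimal import Context, Decimal, ROUND_DOWN

chop = Context(prec=4, rounding=ROUND_DOWN)   # beta = 10, t = 4, chopping
wide = Context(prec=28)                        # reference precision

a, b, c = Decimal(1), Decimal(100), Decimal(1)

# Discriminant b^2 - 4ac, chopped after every operation.
disc = chop.subtract(chop.multiply(b, b),
                     chop.multiply(chop.multiply(Decimal(4), a), c))
# Decimal's sqrt rounds to nearest regardless of the context's rounding
# mode, so compute it at high precision and then chop to 4 digits.
root = chop.create_decimal(disc.sqrt(wide))

two_a = chop.multiply(Decimal(2), a)
x1_naive = chop.divide(chop.add(-b, root), two_a)            # (-b + sqrt)/(2a)
x1_stable = chop.divide(chop.multiply(Decimal(-2), c),
                        chop.add(b, root))                   # -2c/(b + sqrt)

# Reference value computed with 28 digits.
x1_true = (-b + (b * b - 4 * a * c).sqrt(wide)) / (2 * a)

rel_naive = abs((x1_naive - x1_true) / x1_true)
rel_stable = abs((x1_stable - x1_true) / x1_true)
print(x1_naive, x1_stable, rel_naive, rel_stable)
```

The rewritten formula avoids subtracting the nearly equal quantities $100$ and $99.97$, which is where the first formula loses its digits.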
Let us now consider error in function evaluation. Consider a single valued function $f(x)$ and let $x^* = fl(x)$ be the floating point approximation of $x$. Therefore, the machine evaluates $f(x^*) = f(fl(x))$, which is an approximate value of $f(x)$ at $x = x^*$. Then the perturbation in $f(x)$ for small perturbations in $x$ can be computed via Taylor's formula. This is illustrated in the next theorem.
THEOREM 1.7 The relative error in functional evaluation is
\[ \left| \frac{f(x) - f(x^*)}{f(x)} \right| \approx \left| \frac{x f'(x)}{f(x)} \right| \left| \frac{x - x^*}{x} \right|. \]
PROOF The linear Taylor approximation of $f(x^*)$ about $x$ for small values of $|x - x^*|$ is given by $f(x^*) \approx f(x) + f'(x)(x^* - x)$. Rearranging the terms immediately yields the result.
This leads us to the following definition.
DEFINITION 1.5 The condition number of a function $f(x)$ is
\[ \kappa_f(x) = \left| \frac{x f'(x)}{f(x)} \right|. \]
The condition number describes how large the relative error in function evaluation is with respect to the relative error in the machine representation of $x$. In other words, $\kappa_f(x)$ is a measure of the degree of sensitivity of the function at $x$.
Example 1.18
Let $f(x) = \sqrt{x}$. The condition number of $f(x)$ about $x$ is
\[ \kappa_f(x) = \left| \frac{x \cdot \frac{1}{2\sqrt{x}}}{\sqrt{x}} \right| = \frac{1}{2}. \]
This suggests that $f(x)$ is well-conditioned.
Example 1.19
Let $f(x) = \sqrt{x - 2}$. The condition number of $f(x)$ about $x$ is
\[ \kappa_f(x) = \left| \frac{x}{2(x - 2)} \right|. \]
This is not defined at $x^* = 2$. Hence the function $f(x)$ is numerically unstable and ill-conditioned for values of $x$ close to 2.
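The condition numbers in Examples 1.18 and 1.19 can be evaluated numerically, and the estimate of Theorem 1.7 compared with an observed relative change. The following is only a sketch in ordinary double precision; the helper `condition_number` and the sample points are hypothetical choices.

```python
import math

def condition_number(f, df, x):
    """kappa_f(x) = |x f'(x) / f(x)|, as in Definition 1.5."""
    return abs(x * df(x) / f(x))

# f(x) = sqrt(x): the condition number is 1/2 for every x > 0.
k_sqrt = condition_number(math.sqrt, lambda x: 0.5 / math.sqrt(x), 3.0)

# f(x) = sqrt(x - 2): kappa = |x / (2 (x - 2))| grows without bound as x -> 2.
f = lambda x: math.sqrt(x - 2.0)
df = lambda x: 0.5 / math.sqrt(x - 2.0)
k_near_2 = condition_number(f, df, 2.0 + 1e-6)

# Observed relative change in f versus kappa times relative change in x.
x, x_star = 2.001, 2.001 * (1.0 + 1e-8)
observed = abs(f(x) - f(x_star)) / abs(f(x))
predicted = condition_number(f, df, x) * abs(x - x_star) / abs(x)

print(k_sqrt, k_near_2, observed, predicted)
```

Near $x = 2$, a relative perturbation of about $10^{-8}$ in $x$ is magnified roughly a thousandfold in $f$, in line with the size of $\kappa_f$ there.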
REMARK 1.2 If $x = f(x) = 0$, then the condition number is simply $|f'(x)|$. If $x = 0$, $f(x) \ne 0$ (or $f(x) = 0$, $x \ne 0$), then it is more useful to consider the relation between absolute errors than relative errors. The condition number then becomes $|f'(x)/f(x)|$.
REMARK 1.3 Generally, if a numerical approximation $\tilde{z}$ to a quantity $z$ is computed, the relative error is related to the number of digits after the decimal point that are correct. For example, if $z = 0.0000123453$ and $\tilde{z} = 0.00001234543$, we say that $\tilde{z}$ is correct to 5 significant digits. Expressing $z$ as $0.123453 \times 10^{-4}$ and $\tilde{z}$ as $0.123454 \times 10^{-4}$, we see that if we round $\tilde{z}$ to the nearest number with five digits in its mantissa, all of those digits are correct, whereas, if we do the same with six digits, the sixth digit is not correct. Significant digits is the more logical way to talk about accuracy in a floating point computation where we are interested in relative error, rather than "number of digits after the decimal point," which can have a different meaning. (Here, one might say that $\tilde{z}$ is correct to 9 digits after the decimal point.)
1.2.2 Practicalities and the IEEE Floating Point Standard
Prior to 1985, different machines used different word lengths and different bases, and different machines rounded, chopped, or did something else to form the internal representation $fl(x)$ for real numbers $x$. For example, IBM mainframes generally used hexadecimal arithmetic ($\beta = 16$), with 8 hexadecimal digits total (for the base, sign, and exponent) in "single precision" numbers and 16 hexadecimal digits total in "double precision" numbers. Machines such as the Univac 1108 and Honeywell Multics systems used base $\beta = 2$ and 36 binary digits (or "bits") total in single precision numbers and 72 binary digits total in double precision numbers. An unusual machine designed at Moscow State University from 1955-1965, the "Setun," even used base-3 ($\beta = 3$, or "ternary") numbers. Some computers had 32 bits total in single precision numbers and 64 bits total in double precision numbers, while some "supercomputers" (such as the Cray-1) had 64 bits total in single precision numbers and 128 bits total in double precision numbers.
Some handheld calculators in existence today (such as some Texas Instruments calculators) can be viewed as implementing decimal (base 10, $\beta = 10$) arithmetic, say, with $L = -999$ and $U = 999$, and $t = 14$ digits in the mantissa.
Except for the Setun (the value of whose ternary digits corresponded to "positive," "negative," and "neutral" in circuit elements or switches), digital computers are mostly based on binary switches or circuit elements (that is, "on" or "off"), so the base $\beta$ is usually 2 or a power of 2. For example, the IBM hexadecimal digit could be viewed as a group of 4 binary digits.²
Older floating point implementations did not even always fit exactly into the model we have previously described. For example, if $x$ is a number in the system, then $-x$ may not have been a number in the system, or, if $x$ were a number in the system, then $1/x$ may have been too large to be representable in the system.
To promote predictability, portability, reliability, and rigorous error bounding in floating point computations, the Institute of Electrical and Electronics Engineers (IEEE) and American National Standards Institute (ANSI) published a standard for binary floating point arithmetic in 1985: IEEE/ANSI 754-1985: Standard for Binary Floating Point Arithmetic, often referenced as "IEEE-754," or simply "the IEEE standard.³" Almost all computers in existence today, including personal computers and workstations based on Intel, AMD, Motorola, etc. chips, implement most of the IEEE standard.
In this standard, $\beta = 2$, 32 bits total are used in a single precision number (an "IEEE single"), and 64 bits total are used for a double precision number ("IEEE double"). In a single precision number, 1 bit is used for the sign, 8 bits are used for the exponent, and $t = 23$ bits are used for the mantissa. In double precision numbers, 1 bit is used for the sign, 11 bits are used for the exponent, and 52 bits are used for the mantissa. Thus, for single precision numbers, the exponent is between 0 and $(11111111)_2 = 255$, and 127 is subtracted from this, to get an exponent between $-127$ and $128$. In IEEE numbers, the minimum and maximum exponent are used to denote special symbols (such as infinity and "unnormalized" numbers), so the exponent in single precision represents magnitudes between $2^{-126} \approx 10^{-38}$ and $2^{127} \approx 10^{38}$. The mantissa for single precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{23} 2^{-i} = 2(1 - 2^{-24}) \approx 2$. Similarly, the exponent for double precision numbers is effectively between $2^{-1022} \approx 10^{-308}$ and $2^{1023} \approx 10^{308}$, while the mantissa for double precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{52} 2^{-i} \approx 2$.
Summarizing, the parameters for IEEE arithmetic appear in Table 1.1.
In many numerical computations, such as solving the large linear systems arising from partial differential equation models, more digits or a larger exponent range is required than is available with IEEE single precision. For
² An exception is in some systems for business calculations, where base 10 is implemented.
³ An update to the 1985 standard was made in 2008. This update gives clarifications of certain ambiguous points, provides certain extensions, and specifies a standard for decimal arithmetic.
TABLE 1.1: Parameters for IEEE arithmetic
precision   β   L       U      t
single      2   −126    127    24
double      2   −1022   1023   53
this reason, many numerical analysts at present have adopted IEEE double precision as the default precision. For example, underlying computations in the popular computational environment matlab are done in IEEE double precision.
IEEE arithmetic provides four ways of defining $fl(x)$, that is, four "rounding modes," namely, "round down," "round up," "round to nearest," and "round to zero." The four elementary operations $+$, $-$, $\times$, and $/$, as well as the square root, must be such that $fl(x \odot y)$ is implemented for all four rounding modes, for $\odot \in \{-, +, \times, /, \sqrt{\cdot}\}$.
The default mode (if the rounding mode is not explicitly set) is normally "round to nearest," to give an approximation after a long string of computations that is hopefully near the exact value. If the mode is set to "round down" and a string of computations is done, then the result is less than or equal to the exact result. Similarly, if the mode is set to "round up," then the result of a string of computations is greater than or equal to the exact result. In this way, mathematically rigorous bounds on an exact result can be obtained. (This technique must be used astutely, since naive use could result in bounds that are too large to be meaningful.)
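The bounding idea can be illustrated with a sketch that sums $1/k$ for $k = 1, \ldots, 10$ twice, once rounding every operation down and once rounding every operation up. It assumes a 4-digit decimal model built on Python's `decimal` module (`ROUND_FLOOR` and `ROUND_CEILING`) rather than the binary IEEE modes themselves, and it relies on all terms being positive; the function name `harmonic` is hypothetical.

```python
from decimal import Context, Decimal, ROUND_FLOOR, ROUND_CEILING
from fractions import Fraction

def harmonic(n, ctx):
    """Sum 1/1 + 1/2 + ... + 1/n, rounding every operation per ctx."""
    total = Decimal(0)
    for k in range(1, n + 1):
        total = ctx.add(total, ctx.divide(Decimal(1), Decimal(k)))
    return total

down = Context(prec=4, rounding=ROUND_FLOOR)     # "round down"
up = Context(prec=4, rounding=ROUND_CEILING)     # "round up"

lower = harmonic(10, down)
upper = harmonic(10, up)
exact = sum(Fraction(1, k) for k in range(1, 11))   # exact rational value

print(lower, float(exact), upper)
```

Because every operand here is positive and only additions and divisions occur, rounding each operation down (up) cannot overshoot (undershoot) the exact sum; expressions with subtraction need the end points handled as in interval arithmetic.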
Several parameters more directly related to numerical computations than $L$, $U$, and $t$ are associated with any floating point number system. These are

HUGE: the largest representable number in the floating point system.

TINY: the smallest positive representable number in the floating point system.

$\epsilon_m$: the machine epsilon, the smallest positive number which, when added to 1, gives something other than 1 when using the rounding mode "round to the nearest."

These so-called "machine constants" appear in Table 1.2 for the IEEE single and IEEE double precision number systems.
For IEEE arithmetic, 1/TINY < HUGE, but 1/HUGE < TINY. This brings up the question of what happens when the result of a computation has absolute value less than the smallest number representable in the system, or has absolute value greater than the largest number representable in the system. In the first case, an underflow occurs, while, in the second case, an overflow occurs. In floating point computations, it is usually (but not always) reasonable to replace the result of an underflow by 0, but it is usually more problematical
TABLE 1.2: Machine constants for IEEE arithmetic
Precision   HUGE                              TINY                               ε_m
single      ≈ 2^128 ≈ 3.40 · 10^38            2^−126 ≈ 1.18 · 10^−38             2^−24 + 2^−47 ≈ 5.96 · 10^−8
double      ≈ 2^1024 ≈ 1.79 · 10^308          2^−1022 ≈ 2.23 · 10^−308           2^−53 + 2^−105 ≈ 1.11 · 10^−16
when an overflow occurs. Many systems prior to the IEEE standard replaced an underflow by 0, but stopped when an overflow occurred.
The IEEE standard specifies representations for special numbers $\infty$, $-\infty$, $+0$, $-0$, and NaN, where the latter represents "not a number." The standard specifies that computations do not stop when an overflow or underflow occurs, or when quantities such as $\sqrt{-1}$, $1/0$, $-1/0$, etc. are encountered (although many programming languages by default or optionally do stop). For example, the result of an overflow is set to $\infty$, whereas the result of $\sqrt{-1}$ is set to NaN, and computation continues. The standard also specifies "gradual underflow," that is, setting the result to a "denormalized" number, or a number in the floating point format whose first digit in the mantissa is equal to 0. Computation rules for these special numbers, such as NaN $\times$ any number = NaN, $\infty \times$ any positive normalized number = $\infty$, allow such "nonstop" arithmetic.
Although the IEEE nonstop arithmetic is useful in many contexts, the numerical analyst should be aware of it and be cautious in interpreting results. In particular, algorithms may not behave as expected if many intermediate results contain $\infty$ or NaN, and the accuracy is less than expected when denormalized numbers are used. In fact, many programming languages, by default or with a controllable option, stop if $\infty$ or NaN occurs, but implement IEEE nonstop arithmetic with an option.
Example 1.20
IEEE double precision floating point arithmetic underlies most computations in matlab. (This is true even if only four decimal digits are displayed.) One obtains the machine epsilon with the function eps, one obtains TINY with the function realmin, and one obtains HUGE with the function realmax. Observe the following matlab dialog:
>> epsm = eps(1d0)
epsm = 2.2204e-016
>> TINY = realmin
TINY = 2.2251e-308
>> HUGE = realmax
HUGE = 1.7977e+308
>> 1/TINY
ans = 4.4942e+307
>> 1/HUGE
ans = 5.5627e-309
>> HUGE^2
ans = Inf
>> TINY^2
ans = 0
>> new_val = 1+epsm
new_val = 1.0000
>> new_val - 1
ans = 2.2204e-016
>> too_small = epsm/2
too_small = 1.1102e-016
>> not_new = 1+too_small
not_new = 1
>> not_new - 1
ans = 0
>>
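For comparison, a similar experiment can be sketched in Python, whose floats are IEEE doubles on essentially all current platforms. The correspondence of `sys.float_info.epsilon`, `.min`, and `.max` to eps, realmin, and realmax is the assumption here; overflow is provoked by a multiplication because Python's `**` operator raises an exception on float overflow instead of returning infinity.

```python
import sys

epsm = sys.float_info.epsilon    # analogue of eps
tiny = sys.float_info.min        # smallest positive normalized double
huge = sys.float_info.max        # largest finite double

inv_tiny = 1.0 / tiny            # finite, since 1/TINY < HUGE
inv_huge = 1.0 / huge            # denormalized, since 1/HUGE < TINY
overflowed = huge * 2.0          # overflow gives inf
underflowed = tiny * tiny        # underflow gives 0.0

new_val = 1.0 + epsm             # distinguishable from 1
not_new = 1.0 + epsm / 2.0       # rounds back to exactly 1.0

print(epsm, tiny, huge, inv_tiny, inv_huge, overflowed, underflowed)
```

Note that the value returned by eps (and by `sys.float_info.epsilon`) is the spacing $2^{-52}$ between 1 and the next larger double, which is not the same quantity as the $\epsilon_m$ defined above.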
Example 1.21
(Illustration of underflow and overflow) Suppose, for the purposes of illustration, we have a system with $\beta = 10$, $t = 2$ and one digit in the exponent, so that the positive numbers in the system range from $0.10 \times 10^{-9}$ to $0.99 \times 10^{9}$, and suppose we wish to compute $N = \sqrt{x_1^2 + x_2^2}$, where $x_1 = x_2 = 10^6$. Then both $x_1$ and $x_2$ are exactly represented in the system, and the nearest floating point number in the system to $N$ is $0.14 \times 10^7$, well within range. However, $x_1^2 = 10^{12}$, larger than the maximum floating point number in the system. In older systems, an overflow usually would result in stopping the computation, while in IEEE arithmetic, the result would be assigned the symbol "Infinity." The result of adding "Infinity" to "Infinity" then taking the square root would be "Infinity," so that $N$ would be assigned "Infinity." Similarly, if $x_1 = x_2 = 10^{-6}$, then $x_1^2 = 10^{-12}$, smaller than the smallest representable machine number, causing an "underflow." On older systems, the result is usually set to 0. On IEEE systems, if "gradual underflow" is switched on, the result either becomes a denormalized number, with less than full accuracy, or is set to 0; without gradual underflow on IEEE systems, the result is set to 0. When the result is set to 0, a value of 0 is stored in $N$, whereas the closest floating point number in the system is $0.14 \times 10^{-5}$, well within range. To avoid this type of catastrophic underflow and overflow in the computation of $N$, we may use the following scheme.
1. $s \leftarrow \max\{|x_1|, |x_2|\}$.
2. $\eta_1 \leftarrow x_1/s$; $\eta_2 \leftarrow x_2/s$.
3. $N \leftarrow s\sqrt{\eta_1^2 + \eta_2^2}$.
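The three-step scheme can be sketched in double precision, where the same overflow and underflow occur at the more extreme magnitudes $10^{200}$ and $10^{-200}$. The function names are hypothetical, and the guard for $s = 0$ is an addition not present in the scheme above.

```python
import math

def naive_norm(x1, x2):
    """sqrt(x1^2 + x2^2) computed directly; squares may overflow or underflow."""
    return math.sqrt(x1 * x1 + x2 * x2)

def scaled_norm(x1, x2):
    """The scaled scheme: divide by s = max(|x1|, |x2|) before squaring."""
    s = max(abs(x1), abs(x2))
    if s == 0.0:                 # added guard: both inputs are zero
        return 0.0
    eta1, eta2 = x1 / s, x2 / s
    return s * math.sqrt(eta1 * eta1 + eta2 * eta2)

big_naive = naive_norm(1e200, 1e200)        # x1*x1 overflows to inf
small_naive = naive_norm(1e-200, 1e-200)    # x1*x1 underflows to 0
big_scaled = scaled_norm(1e200, 1e200)
small_scaled = scaled_norm(1e-200, 1e-200)

print(big_naive, small_naive, big_scaled, small_scaled)
```

After scaling, the larger of $|\eta_1|$ and $|\eta_2|$ equals 1, so the sum of squares lies between 1 and 2 and cannot overflow.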
1.2.2.1 Input and Output
For examining the output from large numerical computations arising from mathematical models, plots, graphs, and movies comprised of such plots and graphs are often preferred over tables of values. However, to develop such models and study numerical algorithms, it is necessary to examine individual numbers. Because humans are trained to comprehend decimal numbers more easily than binary numbers, the binary format used in the machine is usually converted to a decimal format for display or printing. In many programming languages and environments (such as all versions of Fortran, C, C++, and in matlab), the format is of a form similar to $\pm d_1.d_2d_3 \ldots d_m\mathrm{e}{\pm}\delta_1\delta_2\delta_3$, or $\pm d_1.d_2d_3 \ldots d_m\mathrm{E}{\pm}\delta_1\delta_2\delta_3$, where the "e" or "E" denotes the "exponent" of 10. For example, -1.00e+003 denotes $-1 \times 10^3 = -1000$. Numbers are usually also input either in a standard decimal form (such as 0.001), or in this exponential format (such as 1.0e-3). (This notation originates from the earliest computers, where the only output was a printer, and the printer could only print numerical digits and the 26 upper case letters in the Roman alphabet.)
Thus, for input, a decimal fraction needs to be converted to a binary floating point number, while, for output, a binary floating point number needs to be converted to a decimal fraction. This conversion necessarily is inexact. For example, the exact decimal fraction 0.1 converts to the infinitely repeating binary expansion $(0.0\overline{0011})_2$, which needs to be rounded into the binary floating point system. The IEEE 754 standard specifies that the result of a decimal to binary conversion, within a specified range of input formats, be the nearest floating point number to the exact result, over a specified range, and that, within a specified range of formats, a binary to decimal conversion be the nearest number in the specified format (which depends on the number $m$ of decimal digits requested to be printed).
Thus, the number that one sees as output is usually not exactly the number that is represented in the computer. Furthermore, while the floating point operations on binary numbers are usually implemented in hardware or "firmware" independently of the software system, the decimal to binary and binary to decimal conversions are usually implemented separately as part of the programming language (such as Fortran, C, C++, Java, etc.) or software system (such as matlab). The individual standards for these languages, if there are any, may not specify accuracy for such conversions, and the languages sometimes do not conform to the IEEE standard. That is, the number that one sees printed may not even be the closest number in that format to the actual number.
This inexactness in conversion usually does not cause a problem, but may cause much confusion in certain instances. In those instances (such as in "debugging," or finding programming blunders), one may need to examine the binary numbers directly. One way of doing this is in an "octal," or base-8 format, in which each digit (between 0 and 7) is interpreted as a group of three binary digits, or in hexadecimal format (where the digits are 0-9, A, B, C, D, E, F), in which each digit corresponds to a group of four binary digits.
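The inexact conversion of 0.1, and the hexadecimal view of the stored number, can be inspected with a short Python sketch. It assumes IEEE double precision floats; `Decimal(0.1)` and `Fraction(0.1)` convert the stored binary value exactly, without further rounding.

```python
from decimal import Decimal
from fractions import Fraction

stored = Decimal(0.1)                       # exact decimal value of the double nearest 0.1
error = Fraction(0.1) - Fraction(1, 10)     # exact conversion error as a rational number
hex_form = (0.1).hex()                      # mantissa in hexadecimal, exponent as a power of 2

print(stored)
print(float(error))
print(hex_form)
```

In the hexadecimal form, the repeating digit 9 is the bit group 1001, which reflects the repeating pattern of the binary expansion, and the final digit shows where the expansion was rounded.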
1.2.2.2 Standard Functions
To enable accurate computation of elementary functions such as sin, cos, and exp, IEEE 754 specifies that a "long" 80-bit register (with "guard digits") be available for intermediate computations. Furthermore, IEEE 754-2008, an official update to IEEE 754-1985, provides a list of functions it recommends be implemented, and specifies accuracy requirements (in terms of correct rounding), for those functions a programming language elects to implement.
REMARK 1.4 Alternative number systems, such as variable precision arithmetic, multiple precision arithmetic, rational arithmetic, and combinations of approximate and symbolic arithmetic have been investigated and implemented. These have various advantages over the traditional floating point arithmetic we have been discussing, but also have disadvantages, and usually require more time, more circuitry, or both. Eventually, with the advance of computer hardware and better understanding of these alternative systems, their use may become more ubiquitous. However, for the foreseeable future, traditional floating point number systems will be the primary tool in numerical computations.
1.3 Interval Computations
Interval computations are useful for two main purposes:
• to use floating point computations to compute mathematically rigorous bounds on an exact result (and hence, to rigorously bound roundoff error);
• to use floating point computations to compute mathematically rigorous bounds on the ranges of functions over boxes.
In complicated traditional floating point algorithms, naive arrangement of interval computations usually gives bounds that are too wide to be of practical use. For this reason, interval computations have been ignored by many. However, used cleverly and where appropriate, interval computations are powerful, and provide rigor and validation when other techniques cannot.
Interval computations are based on interval arithmetic.
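1.3.1 Interval Arithmetic
In interval arithmetic, we define operations on intervals, which can be considered as ordered pairs of real numbers. We can think of each interval as representing the range of possible values of a quantity. The result of an operation is then an interval that represents the range of possible results of the operation, as the first argument ranges over all points in the first interval and the second argument ranges over all values in the second interval. To state this symbolically, let $\boldsymbol{x} = [\underline{x}, \overline{x}]$ and $\boldsymbol{y} = [\underline{y}, \overline{y}]$, and define the four elementary operations by
\[ \boldsymbol{x} \odot \boldsymbol{y} = \{ x \odot y \mid x \in \boldsymbol{x} \text{ and } y \in \boldsymbol{y} \} \quad \text{for } \odot \in \{+, -, \times, \div\}. \tag{1.1} \]
Interval arithmetic's usefulness derives from the fact that the mathematical characterization in Equation (1.1) is equivalent to the following operational definitions.
\[ \begin{aligned}
\boldsymbol{x} + \boldsymbol{y} &= [\underline{x} + \underline{y}, \; \overline{x} + \overline{y}], \\
\boldsymbol{x} - \boldsymbol{y} &= [\underline{x} - \overline{y}, \; \overline{x} - \underline{y}], \\
\boldsymbol{x} \times \boldsymbol{y} &= [\min\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\}, \; \max\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\}], \\
\frac{1}{\boldsymbol{x}} &= \left[\frac{1}{\overline{x}}, \frac{1}{\underline{x}}\right] \quad \text{if } \underline{x} > 0 \text{ or } \overline{x} < 0, \\
\boldsymbol{x} \div \boldsymbol{y} &= \boldsymbol{x} \times \frac{1}{\boldsymbol{y}}.
\end{aligned} \tag{1.2} \]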
The ranges of the four elementary interval arithmetic operations are exactly the ranges of the corresponding real operations, but, if such operations are composed, bounds on the ranges of real functions can be obtained. For example, if
\[ f(x) = (x + 1)(x - 1), \tag{1.3} \]
then
\[ f([-2, 2]) = \big([-2, 2] + 1\big)\big([-2, 2] - 1\big) = [-1, 3][-3, 1] = [-9, 3], \]
which contains the exact range $[-1, 3]$.
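The operational definitions (1.2) translate almost directly into code. The following is a minimal sketch in ordinary double precision, without the outward rounding a rigorous implementation would need; the class name `Interval` and the helper `_as_interval` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        other = _as_interval(other)
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        other = _as_interval(other)
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        other = _as_interval(other)
        products = (self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi)
        return Interval(min(products), max(products))

    def reciprocal(self):
        # Defined only when the interval does not contain 0.
        if not (self.lo > 0 or self.hi < 0):
            raise ZeroDivisionError("interval contains 0")
        return Interval(1.0 / self.hi, 1.0 / self.lo)

    def __truediv__(self, other):
        return self * _as_interval(other).reciprocal()

def _as_interval(v):
    """Treat a plain number c as the degenerate interval [c, c]."""
    return v if isinstance(v, Interval) else Interval(v, v)

x = Interval(-2.0, 2.0)
product_form = (x + 1) * (x - 1)                     # f from (1.3)
quotient = Interval(1.0, 2.0) / Interval(4.0, 5.0)

print(product_form, quotient)
```

Evaluating $f$ this way reproduces the enclosure $[-9, 3]$ computed above, which is wider than the exact range $[-1, 3]$.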
REMARK 1.5 In some definitions of interval arithmetic, division by intervals containing 0 is defined, consistent with (1.1). For example,
\[ \frac{[1, 2]}{[-3, 4]} = \left[-\infty, -\frac{1}{3}\right] \cup \left[\frac{1}{4}, \infty\right] = \mathbb{R}^* \setminus \left(-\frac{1}{3}, \frac{1}{4}\right), \]
where $\mathbb{R}^*$ is the extended real number system,⁴ consisting of the real numbers with the two additional numbers $-\infty$ and $\infty$. This extended interval arithmetic⁵ was originally invented by William Kahan⁶ for computations with continued fractions, but has wider use than that. Although a closed system can be defined for the sets arising from this extended arithmetic, typically, the complements of intervals (i.e., the unions of two semi-infinite intervals) are immediately intersected with intervals, to obtain zero, one, or two intervals. Interval arithmetic can then proceed using (1.2).
⁴ also known as the two-point compactification of the real numbers
The power of interval arithmetic lies in its implementation on computers. In particular, outwardly rounded interval arithmetic allows rigorous enclosures for the ranges of operations and functions. This makes a qualitative difference in scientific computations, since the results are now intervals in which the exact result must lie. It also enables use of floating point computations for automated theorem proving.
Outward rounding can be implemented on any machine that has downward rounding and upward rounding, such as any machine that complies with the IEEE 754 standard. For example, take $\boldsymbol{x} + \boldsymbol{y} = [\underline{x} + \underline{y}, \overline{x} + \overline{y}]$. If $\underline{x} + \underline{y}$ is computed with downward rounding, and $\overline{x} + \overline{y}$ is computed with upward rounding, then the resulting interval $\boldsymbol{z} = [\underline{z}, \overline{z}]$ that is represented in the machine must contain the exact range of $x + y$ for $x \in \boldsymbol{x}$ and $y \in \boldsymbol{y}$. We call the expansion of the interval from rounding the lower end point down and the upper end point up roundout error.
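Interval arithmetic is only subdistributive. That is, if $\boldsymbol{x}$, $\boldsymbol{y}$, and $\boldsymbol{z}$ are intervals, then
\[ \boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) \subseteq \boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z}, \quad \text{but } \boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) \ne \boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z} \text{ in general.} \tag{1.4} \]
As a result, algebraic expressions that would be equivalent if real values are substituted for the variables are not equivalent if interval values are used. For example, instead of writing $(x - 1)(x + 1)$ for $f(x)$ in (1.3), suppose we write
\[ f(x) = x^2 - 1, \tag{1.5} \]
and suppose we provide a routine that computes an enclosure for the range of $x^2$ that is the exact range to within roundoff error. Such a routine could be as follows: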
ALGORITHM 1.1
(Computing an interval whose end points are machine numbers and which encloses the range of $x^2$.)
⁵ There are small differences in current definitions of extended interval arithmetic. For example, in some systems, $-\infty$ and $\infty$ are not considered numbers, but just descriptive symbols. In those systems, $[1, 2]/[-3, 4] = (-\infty, -1/3] \cup [1/4, \infty) = \mathbb{R} \setminus (-1/3, 1/4)$. See [31] for a theoretical analysis of extended arithmetic.
⁶ who also was a major contributor to the IEEE 754 standard
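INPUT: $\boldsymbol{x} = [\underline{x}, \overline{x}]$.
OUTPUT: a machine-representable interval that contains the range of $x^2$ over $\boldsymbol{x}$.
IF $\underline{x} \ge 0$ THEN
RETURN $[\underline{x}^2, \overline{x}^2]$, where $\underline{x}^2$ is computed with downward rounding and $\overline{x}^2$ is computed with upward rounding.
ELSE IF $\overline{x} \le 0$ THEN
RETURN $[\overline{x}^2, \underline{x}^2]$, where $\overline{x}^2$ is computed with downward rounding and $\underline{x}^2$ is computed with upward rounding.
ELSE
1. Compute $\underline{x}^2$ and $\overline{x}^2$ with both downward and upward rounding; that is, compute $\underline{x}^2_l$ and $\underline{x}^2_u$ such that $\underline{x}^2_l$ and $\underline{x}^2_u$ are machine representable numbers and $\underline{x}^2 \in [\underline{x}^2_l, \underline{x}^2_u]$, and compute $\overline{x}^2_l$ and $\overline{x}^2_u$ such that $\overline{x}^2_l$ and $\overline{x}^2_u$ are machine representable numbers and $\overline{x}^2 \in [\overline{x}^2_l, \overline{x}^2_u]$.
2. RETURN $[0, \max\{\underline{x}^2_u, \overline{x}^2_u\}]$.
END IF
END ALGORITHM 1.1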
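Algorithm 1.1 can be sketched on a hypothetical 4-digit decimal machine, using the `ROUND_FLOOR` and `ROUND_CEILING` modes of Python's `decimal` module to stand in for downward and upward rounding. The function name and the sample intervals are illustrative only.

```python
from decimal import Context, Decimal, ROUND_FLOOR, ROUND_CEILING

DOWN = Context(prec=4, rounding=ROUND_FLOOR)     # downward rounding
UP = Context(prec=4, rounding=ROUND_CEILING)     # upward rounding

def interval_square(lo, hi):
    """Return machine numbers (a, b) with a <= x*x <= b for all lo <= x <= hi."""
    if lo >= 0:
        return DOWN.multiply(lo, lo), UP.multiply(hi, hi)
    if hi <= 0:
        return DOWN.multiply(hi, hi), UP.multiply(lo, lo)
    # 0 lies inside the interval, so the minimum of x*x is exactly 0;
    # only the upward-rounded squares of the end points are needed.
    return Decimal(0), max(UP.multiply(lo, lo), UP.multiply(hi, hi))

positive = interval_square(Decimal("1.234"), Decimal("1.235"))
negative = interval_square(Decimal("-1.235"), Decimal("-1.234"))
mixed = interval_square(Decimal(-2), Decimal(2))

print(positive, negative, mixed)
```

In the mixed-sign case, the downward-rounded squares of step 1 are not used in the returned interval, so the sketch does not compute them.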
With Algorithm 1.1 and rewriting $f(x)$ from (1.3) as in (1.5), we obtain
\[ f([-2, 2]) = [-2, 2]^2 - 1 = [0, 4] - 1 = [-1, 3], \]
which, in this case, is equal to the exact range of $f$ over $[-2, 2]$.
In fact, this illustrates a general principle: If each variable in the expression occurs only once, then interval arithmetic gives the exact range, to within roundout error. We state this formally as
THEOREM 1.8
(Fundamental theorem of interval arithmetic.) Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with interval arithmetic. The algebraic expression or computer program can contain the four elementary operations and operations such as $x^n$, $\sin(x)$, $\exp(x)$, and $\log(x)$, etc., as long as the interval values of these functions contain their range over the input intervals. Then
1. The interval value $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ contains the range of $f$ over the interval vector (or box) $(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$.
2. If the single functions (the elementary operations and functions $x^n$, etc.) have interval values that represent their exact ranges, and if each variable $x_i$, $1 \le i \le n$, occurs only once in the expression for $f$, then the values of $f$ obtained by interval arithmetic represent the exact ranges of $f$ over the input intervals.
If the expression for $f$ contains one or more variables more than once, then overestimation of the range can occur due to interval dependency. For example, when we evaluate our example function $f([-2, 2])$ according to (1.3), the first factor, $[-1, 3]$, is the exact range of $x + 1$ for $x \in [-2, 2]$, while the second factor, $[-3, 1]$, is the exact range of $x - 1$ for $x \in [-2, 2]$. Thus, $[-9, 3]$ is the exact range of $\tilde{f}(x_1, x_2) = (x_1 + 1)(x_2 - 1)$ for $x_1$ and $x_2$ independent, $x_1 \in [-2, 2]$, $x_2 \in [-2, 2]$.
We now present some definitions and theorems to clarify the practical consequences of interval dependency.
DEFINITION 1.6 An expression for $f(x_1, \ldots, x_n)$ which is written so that each variable occurs only once is called a single use expression, or SUE.
Fortunately, we do not need to transform every expression into a single use expression for interval computations to be of value. In particular, the interval dependency becomes less as the widths of the input intervals become smaller. The following formal definition will help us to describe this precisely.
DEFINITION 1.7 Suppose an interval evaluation $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ gives $[a, b]$ as a result interval, but the exact range $\{f(x_1, \ldots, x_n), x_i \in \boldsymbol{x}_i, 1 \le i \le n\}$ is $[c, d] \subseteq [a, b]$. We define the excess width $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ in the interval evaluation $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ by $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = (c - a) + (b - d)$.
For example, the excess width in evaluating $f(x)$ represented as $(x + 1)(x - 1)$ over $\boldsymbol{x} = [-2, 2]$ is $(-1 - (-9)) + (3 - 3) = 8$. In general, we have
THEOREM 1.9
Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with interval arithmetic, as in Theorem 1.8, to obtain an interval enclosure $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ to the range of $f$ for $x_i \in \boldsymbol{x}_i$, $1 \le i \le n$. Then, if $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ is as in Definition 1.7, we have
$$E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = O\left(\max_{1 \le i \le n} w(\boldsymbol{x}_i)\right),$$
where $w(\boldsymbol{x})$ denotes the width of the interval $\boldsymbol{x}$.
That is, the overestimation becomes less as the uncertainty in the arguments to the function becomes smaller.
Interval evaluations as in Theorem 1.9 are termed first-order interval extensions. It is not difficult to obtain second-order extensions, where required. (See Exercise below.)
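The first-order behavior of the excess width can be observed numerically. The sketch below, in Python, is only an illustration under stated assumptions: it uses the example expression $(x+1)(x-1)$, intervals centered at 0, hypothetical helper names, and ordinary floating point without outward rounding (the chosen endpoints are exactly representable in binary).

```python
def product_form(lo, hi):
    """Interval evaluation of (x + 1)(x - 1) over [lo, hi]."""
    f1 = (lo + 1.0, hi + 1.0)
    f2 = (lo - 1.0, hi - 1.0)
    products = [a * b for a in f1 for b in f2]
    return (min(products), max(products))

def exact_range(lo, hi):
    """Exact range of x^2 - 1 over [lo, hi]."""
    squares = [lo * lo, hi * hi]
    smallest = 0.0 if lo <= 0.0 <= hi else min(squares)
    return (smallest - 1.0, max(squares) - 1.0)

def excess_width(lo, hi):
    """E = (c - a) + (b - d), as in the definition of excess width."""
    a, b = product_form(lo, hi)
    c, d = exact_range(lo, hi)
    return (c - a) + (b - d)

# Halving the input width roughly halves the excess width.
for w in (1.0, 0.5, 0.25, 0.125):
    print(w, excess_width(-w / 2, w / 2))
```

For these symmetric intervals of width $w < 2$, a short calculation gives $E = 2w - w^2/4$, which is $O(w)$, consistent with the theorem.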
1.3.2 Application of Interval Arithmetic: Examples
We give one such example here.
Example 1.22
Using 4-digit decimal floating point arithmetic, compute an interval enclosure for the first two digits of $e$, and prove that these two digits are correct.
Solution: The fifth degree Taylor polynomial representation for $e$ is
$$e = 1 + 1 + \frac{1}{2!} + \frac{1}{3!} + \frac{1}{4!} + \frac{1}{5!} + \frac{1}{6!}e^{\xi}$$
for some $\xi \in [0, 1]$. If we assume we know $e < 3$ and we assume we know $e^x$ is an increasing function of $x$, then the error term is bounded by
$$\left|\frac{1}{6!}e^{\xi}\right| \le \frac{3}{6!} < 0.005,$$
so this fifth-degree polynomial representation should be adequate. We will evaluate each term with interval arithmetic, and we will replace $e^{\xi}$ with $[1, 3]$. We obtain the following computation:
[1.000, 1.000] + [1.000, 1.000] → [2.000, 2.000]
[1.000, 1.000] / [2.000, 2.000] → [0.5000, 0.5000]
[2.000, 2.000] + [0.5000, 0.5000] → [2.500, 2.500]
[1.000, 1.000] / [6.000, 6.000] → [0.1666, 0.1667]
[2.500, 2.500] + [0.1666, 0.1667] → [2.666, 2.667]
[1.000, 1.000] / [24.00, 24.00] → [0.04166, 0.04167]
[2.666, 2.667] + [0.04166, 0.04167] → [2.707, 2.709]
[1.000, 1.000] / [120.0, 120.0] → [0.008333, 0.008334]
[2.707, 2.709] + [0.008333, 0.008334] → [2.715, 2.718]
[1.000, 1.000] / [720.0, 720.0] → [0.001388, 0.001389]
[0.001388, 0.001389] × [1, 3] → [0.001388, 0.004167]
[2.715, 2.718] + [0.001388, 0.004167] → [2.716, 2.723]
Since we used outward rounding in these computations, this constitutes a mathematical proof that $e \in [2.716, 2.723]$.
Note:
1. These computations can be done automatically on a computer, as simply as evaluating the function in floating point arithmetic. We will explain some programming techniques for this in Chapter 6, Section 6.2.
2. The solution is illustrative. More sophisticated methods, such as argument reduction, would be used in practice to bound values of $e^x$ more accurately and with fewer operations.
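The hand computation in Example 1.22 can be emulated directly. The following is a minimal Python sketch (not the matlab/intlab approach used elsewhere in this book): it assumes Python's standard `decimal` module with 4 significant digits, rounding lower endpoints toward $-\infty$ and upper endpoints toward $+\infty$. The helper names `add`, `recip` and `times_1_3` are hypothetical.

```python
from decimal import Decimal, Context, ROUND_FLOOR, ROUND_CEILING

DOWN = Context(prec=4, rounding=ROUND_FLOOR)    # for lower endpoints
UP = Context(prec=4, rounding=ROUND_CEILING)    # for upper endpoints

def add(x, y):
    """Outwardly rounded interval sum."""
    return (DOWN.add(x[0], y[0]), UP.add(x[1], y[1]))

def recip(n):
    """Outwardly rounded enclosure of 1/n for a positive integer n."""
    return (DOWN.divide(Decimal(1), Decimal(n)), UP.divide(Decimal(1), Decimal(n)))

def times_1_3(x):
    """Outwardly rounded product of a positive interval x with [1, 3]."""
    return (DOWN.multiply(x[0], Decimal(1)), UP.multiply(x[1], Decimal(3)))

total = (Decimal(2), Decimal(2))       # 1 + 1
for factorial in (2, 6, 24, 120):      # 2!, 3!, 4!, 5!
    total = add(total, recip(factorial))

remainder = times_1_3(recip(720))      # (1/6!) * [1, 3]
enclosure = add(total, remainder)
print(enclosure)   # (Decimal('2.716'), Decimal('2.723'))
```

Each intermediate interval agrees with the corresponding line of the hand computation, and the final enclosure contains $e = 2.71828\ldots$.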
Proofs of the theorems, as well as greater detail, appear in various texts on interval arithmetic. A good book on interval arithmetic is R. E. Moore's classic text [27], although numerous more recent monographs and reviews are available. A World Wide Web search on the term "interval computations" will lead to some of these.
A general introduction to interval computations is [26]. That work not only gives a complete introduction, with numerous examples and explanation of pitfalls, but also provides examples with intlab, a free matlab toolbox for interval computations, and reference material for intlab. If you have matlab available, we recommend intlab for the exercises in this book involving interval computations.
1.4 Programming Environments
Modern scientific computing (with floating point numbers) is usually done with high-level "imperative" (as opposed to "functional") programming languages. Common programming environments in use for general scientific computing today are Fortran (or FORmula TRANslation), C/C++, and matlab. Fortran is the original such language, with its origins in the late 1950's. There is a large body of high-quality publicly available software in Fortran for common computations in numerical analysis and scientific computing. Such software can be found, for example, on NETLIB, at
http://www.netlib.org/
Fortran has evolved over the years, becoming a modern, multi-faceted language with the Fortran 2003 standard. Throughout, the emphasis by both the standardization committee and suppliers of compilers for Fortran has been on features that simplify programming of solutions to large problems in numerical analysis and scientific computing, and features that enable high performance, especially on computers that can process vectors and matrices efficiently.
The "C" language, originally developed in conjunction with the Unix operating system, was originally meant to be a higher-level language for designing and accessing the operating system, but has become more ubiquitous since then. C++, appearing in the late 1980's, was the first widely available language⁷ to allow the object-oriented programming paradigm. In recent years, computer science departments have favored teaching C++ over teaching Fortran, and Fortran has fallen out of favor in relative terms. However, Fortran is still favored in certain large-scale applications such as fluid dynamics (e.g., in weather prediction and similar simulations), and some courses are still offered in it in engineering schools. However, some people still think of Fortran as the now somewhat rudimentary language known as FORTRAN 77.
Reasonably high-quality compilers for both Fortran and C/C++ are available free of charge with Linux operating systems. Fortran 2003, largely implemented in these compilers, has a standardized interface to C, so functions written in C can be called from Fortran programs, and vice versa. These compilers include interactive graphical-user-interface-oriented debuggers, such as "insight," available with the Linux operating system. Commercially available compilation and debugging systems are also available under Windows.
The matlab system has become increasingly popular over the last two decades or so. The matlab (or MATrix LABoratory) system began in the early 1980's as a National Science Foundation project, written by Cleve Moler in FORTRAN 66, to provide an interactive environment for computing with matrices and vectors, but has since evolved to be both an interactive environment and a full-featured programming language. matlab is highly favored in courses such as this, because of the ease of programming, debugging, and general use (such as graphing), and because of the numerous toolboxes, supplied by both Mathworks (Cleve Moler's company) and others, for many basic computing tasks and applications. The main drawback to use of matlab in all scientific computations is that the language is interpretive, that is, matlab translates each line of a program to machine language each time that it executes the line. This makes complicated programs that involve nested iterations much slower (a factor of 60 or more) than comparable programs written in a compiled language such as C or Fortran. However, functions compiled from Fortran or C/C++ can be called from matlab. A common strategy has been to initially develop algorithms in matlab, then translate all or part of the program to a compilable language, as necessary for efficiency.
One perceived disadvantage of matlab is that it is proprietary. Undesirable possible consequences are that it is not free, and there is no official guarantee that it will be available forever, unchanged. However, its use has become so widespread in recent years that these concerns do not seem to be major. Several projects, including "Octave" and "Scilab," have produced free products that partially support the matlab programming language. The most widely distributed of these, "Octave," is integrated into Linux systems. However, the object-oriented features of Octave are rudimentary compared to those of matlab, and some toolboxes, such as intlab (which we will mention later), will not function with Octave.
⁷ With others, including Fortran, to follow.
Alternative systems sometimes used for scientific computing are computer algebra systems. Perhaps the most common of these are Mathematica and Maple, while a free such system under development is "SAGE." These systems admit a different way of thinking about programming, termed functional programming, in which rules are defined and available at all times to automatically simplify expressions that are presented to the system. (In contrast, in imperative programming, a sequence of commands is executed one after the other.) Although these systems have become comprehensive, they are based in computations of a different character, rather than in the floating point computations and linear algebra involved in numerical analysis and scientific computing.
We will use matlab in this book to illustrate the concepts, techniques, and applications. With newer versions of matlab, a student can study how to use the system and make programs largely by using the matlab help system. The first place to turn will be the "Getting started" demos, which in newer versions are presented as videos. There are also many books devoted to use of matlab. Furthermore, we will be giving examples throughout this book.
matlab programs can be written as matlab scripts and matlab functions.
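Example 1.23
The matlab script we used to produce the table following Example 1.7 (on page 8) is:
a = 2;
x = 2;
xold = x;
err_old = 1;
for k=0:10
   k                        % display the iteration number
   x                        % display the current iterate
   err = x - sqrt(2);
   err                      % display the current error
   ratio = err/err_old^2    % ratio for quadratic convergence
   err_old = err;
   x = x/2 + 1/x;           % Newton step for x^2 - 2 = 0
end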
Example 1.24
The matlab script we used to produce the table in Example 1.8 (on page 9) is:
format long
a = 2;
x = 2;
xold = x;
err_old = 1;
for k=0:25
   k                          % display the iteration number
   x                          % display the current iterate
   err = x - sqrt(2);
   err                        % display the current error
   ratio = err/err_old        % ratio for linear convergence
   err_old = err;
   x = x - x^2/3.5 + 2/3.5;   % iteration of Example 1.8
end
An excellent alternative textbook that focuses on matlab functions is Cleve Moler's Numerical Computing with Matlab [25]. An on-line version, along with ".m" files, etc., is currently available at http://www.mathworks.com/moler/chapters.html.
1.5 Applications
The purpose of the methods and techniques in this book ultimately is to provide both accurate predictions and insight into practical problems. This includes understanding and predicting, and managing or controlling, the evolution of ecological systems and epidemics; designing and constructing durable but inexpensive bridges, buildings, roads, and water control structures; understanding chemical and physical processes; designing chemical plants and electronic components and systems; minimizing costs or maximizing delivery of products or services within companies and governments; etc. To achieve these goals, the numerical methods are a small part of the overall modeling process, which can be viewed as consisting of the following steps.
Identify the problem: This is the first step in translating an often vague situation into a mathematical problem to be solved. What questions must be answered, and how can they be quantified?
Assumptions: Which factors are to be ignored, and which are important? The real world is usually significantly more complicated than mathematical models of it, and simplifications must be made, because some factors are poorly understood, because there isn't enough data to determine some minor factors, or because it is not practical to accurately solve the resulting equations unless the model is simplified. For example, the theory of relativity and variations in the acceleration of gravity, due to the fact that the earth is not exactly round and due to the fact that the density differs from point to point on the surface of the earth, in principle will affect the trajectory of a baseball as it leaves the bat. However, such effects can be ignored when we write down a model of the trajectory of the baseball. On the other hand, we need to include such effects if we are measuring the change in distance between two satellites in a tandem orbit to detect the location of mineral deposits on the surface of the earth.
Construction: In this step, we actually translate the problem into mathematical language.
Analysis: We solve the mathematical problem. Here is where the numerical techniques in this book come into play. With more complicated models, there is an interplay between the previous three steps and this solution process: We may need to simplify the process to enable practical solution. Also, presentation of the result is important here, to maximize the usefulness of the results. In the early days of scientific computing, printouts of numbers were used, but, increasingly, results are presented as two- and three-dimensional graphs and movies.
Interpretation: The numerical solution is compared to the original problem. If it does not make sense, go back and reformulate the assumptions.
Validation: Compare the model to real data. For example, in climate models, the model might be used to predict climate changes in past years, before it is used to predict future climate changes.
Note that there is an intrinsic error introduced in the modeling process (such as when certain phenomena are ignored) that is outside the scope of our study of numerical methods. Such error can only be measured indirectly, through the interpretation and validation steps. In the model solution process (the "analysis" step), errors are also introduced, due to roundoff error and the approximation process. We have seen that such error consists of approximation error and roundoff error. In a study of numerical methods and numerical analysis, we quantify and find bounds on such errors. Although this may not be the major source of error, it is important to know. Consequences of this type of error might be that a good model is rejected, that incorrect conclusions are deduced about the process being modeled, etc. The authors of this book have personal experience with these events.
Errors in the modeling process can sometimes be quantified in the solution process. If the model depends on parameters that are not known precisely, but bounds on those parameters are known, knowledge of these bounds can sometimes be incorporated into the mathematical equations, and the set of possible solutions can sometimes be computed or bounded. One tool that sometimes works is interval arithmetic. Other tools, less mathematically definite but applicable in different situations, are statistical methods and computing solutions to the model for many different values of the parameter.
Throughout this book, we introduce applications from many areas.
Example 1.25
The formula for the net capacitance when two capacitors of values $x$ and $y$ are connected in series is
$$z = \frac{xy}{x + y}.$$
Suppose the measured values of $x$ and $y$ are $x = 1$ and $y = 2$, respectively. Estimate the range of possible values of $z$, given that the true values of $x$ and $y$ are known to be within ±10% of the measured value.
In this example, the identification, assumptions, and construction have already been done. (It is well known how capacitances in a linear electrical circuit behave.) We are asked to analyze the error in the output of the computation, due to errors in the data. We may proceed using interval arithmetic, relying on the accuracy assumptions for the measured values. In particular, these assumptions imply that $x \in [0.9, 1.1]$ and $y \in [1.8, 2.2]$. We will plug these intervals into the expression for $z$, but we first use Theorem 1.8, part (2) as a guide to rewrite the expression for $z$ so $x$ and $y$ only occur once. (We do this so we obtain sharp bounds on the range, without overestimation.) Dividing the numerator and denominator for $z$ by $xy$, we obtain
$$z = \frac{1}{\dfrac{1}{x} + \dfrac{1}{y}}.$$
We use the intlab toolbox⁸ for matlab to evaluate $z$. We have the following dialog in matlab's command window.
>> intvalinit('DisplayInfsup')
===> Default display of intervals by infimum/supremum
>> x = intval('[0.9,1.1]')
intval x =
[ 0.8999, 1.1001]
>> y = intval('[1.8,2.2]')
intval y =
[ 1.7999, 2.2001]
>> z = 1/(1/x + 1/y)
intval z =
[ 0.5999, 0.7334]
>> format long
>> z
intval z =
[ 0.59999999999999, 0.73333333333334]
>>
Thus, the capacitance must lie between 0.5999 and 0.7334.
Note that x and y are input as strings. This is to assure that roundoff errors in converting the decimal expressions 0.9, 1.1, 1.8, and 2.2 into internal binary format are taken into account. See [26] for more examples of the use of intlab.
⁸ If one has matlab, intlab is available free of charge for non-commercial use from http://www.ti3.tu-harburg.de/~rump/intlab/
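To see why the rewriting step in Example 1.25 matters, one can compare the two forms of $z$ numerically. The sketch below is in Python rather than matlab and is only an illustration: intervals are `(lo, hi)` tuples, the helper names are hypothetical, and, unlike intlab, no outward rounding is applied, so the endpoints are accurate only to within ordinary roundoff.

```python
def add(x, y):
    """Interval sum."""
    return (x[0] + y[0], x[1] + y[1])

def mul(x, y):
    """Interval product: min and max over the four endpoint products."""
    products = [a * b for a in x for b in y]
    return (min(products), max(products))

def recip(x):
    """Interval reciprocal; only defined when 0 is not in x."""
    if x[0] <= 0.0 <= x[1]:
        raise ZeroDivisionError("interval contains zero")
    return (1.0 / x[1], 1.0 / x[0])

x = (0.9, 1.1)
y = (1.8, 2.2)

# z = xy / (x + y): x and y each occur twice, so the result overestimates.
naive = mul(mul(x, y), recip(add(x, y)))

# z = 1 / (1/x + 1/y): single use expression, sharp up to roundoff.
single_use = recip(add(recip(x), recip(y)))

print(naive)       # roughly (0.4909, 0.8963)
print(single_use)  # roughly (0.6, 0.7333)
```

The single use form reproduces the bounds found with intlab, while the original form gives a noticeably wider, though still valid, enclosure.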
1.6 Exercises
1. Write down a polynomial $p(x)$ such that $|S(x) - p(x)| \le 10^{-10}$ for $-0.2 \le x \le 0.2$, where
$$S(x) = \begin{cases} \dfrac{\sin(x)}{x} & \text{if } x \ne 0, \\ 1 & \text{if } x = 0. \end{cases}$$
Note: $\mathrm{sinc}(x) = S(\pi x) = \sin(\pi x)/(\pi x)$ is the "sinc" function (well-known in signal processing, etc.).
(a) Show that your polynomial $p$ satisfies the condition $|\mathrm{sinc}(x) - p(x)| \le 10^{-10}$ for $x \in [-0.2, 0.2]$.
Hint: You can obtain polynomial approximations with error terms for $\mathrm{sinc}(x)$ by writing down Taylor polynomials and corresponding error terms for $\sin(x)$, then dividing these by $x$. This can be easier than trying to differentiate $\mathrm{sinc}(x)$. For the proof part, you can use, for example, the Taylor polynomial remainder formula or the alternating series test, and you can use interval arithmetic to obtain bounds.
(b) Plot your polynomial approximation and $\mathrm{sinc}(x)$ on the same graph,
(i) over the interval $[-0.2, 0.2]$,
(ii) over the interval $[-3, 3]$,
(iii) over the interval $[-10, 10]$.
2. Suppose $f$ has a continuous third derivative. Show that
$$\left|\frac{f(x+h) - f(x-h)}{2h} - f'(x)\right| = O(h^2).$$
3. Suppose $f$ has a continuous fourth derivative. Show that
$$\left|\frac{f(x+h) - 2f(x) + f(x-h)}{h^2} - f''(x)\right| = O(h^2).$$
4. Let $a = 0.41$, $b = 0.36$, and $c = 0.7$. Assuming a 2-digit decimal computer arithmetic with rounding, show that $\dfrac{a - b}{c} \ne \dfrac{a}{c} - \dfrac{b}{c}$ when using this arithmetic.
5. Write down a formula relating the unit roundoff $\delta$ of Definition 1.3 (page 14) and the machine epsilon $\epsilon_m$ defined on page 20.
6. Store and run the following matlab script. What are your results? What does the script compute? Can you say anything about the computer arithmetic underlying matlab?
eps = 1;
x = 1+eps;
while(x~=1)
   eps = eps/2;
   x = 1+eps;
end
eps = eps+(2*eps)^2;
y = 1+eps;
y-1
7. Suppose, for illustration, we have a system with base $\beta = 10$, $t = 3$ decimal digits in the mantissa, and $L = -9$, $U = 9$ for the exponent. For example, $0.123 \times 10^4$, that is, 1230, is a machine number in this system. Suppose also that "round to nearest" is used in this system.
(a) What is HUGE for this system?
(b) What is TINY for this system?
(c) What is the machine epsilon $\epsilon_m$ for this system?
(d) Let $f(x) = \sin(x) + 1$.
i. Write down $fl(f(0))$ and $fl(f(0.0008))$ in normalized format for this toy system.
ii. Compute $fl(fl(f(0.0008)) - fl(f(0)))$. On the other hand, what is the nearest machine number to the exact value of $f(0.0008) - f(0)$?
iii. Compute $fl(fl(f(0.0008)) - fl(f(0)))/fl(0.0008)$. Compare this to the nearest machine number to the exact value of $(f(0.0008) - f(0))/0.0008$ and to $f'(0)$.
8. Let $f(x) = \dfrac{\ln(x + 1) - \ln(x)}{2}$.
(a) Use four-digit decimal arithmetic with rounding to evaluate $f(100{,}000)$.
(b) Use the Mean Value Theorem to approximate $f(x)$ in a form that avoids the loss of significant digits. Use this form to evaluate $f(x)$ for $x = 100{,}000$ once again.
(c) Compare the relative errors for the answers obtained in (a) and (b).
9. Compute the condition number of $f(x) = e^{\sqrt{x^2 - 1}}$, $x > 1$, and discuss any possible ill-conditioning.
10. Let $f(x) = (\sin(x))^2 + x/2$. Use interval arithmetic to prove that there are no solutions to $f(x) = 0$ for $x \in [-1, -0.8]$.
Example 1.8
As an example of linear convergence, consider the iteration
$$x_{k+1} = x_k - \frac{x_k^2}{3.5} + \frac{2}{3.5},$$
which converges to $\sqrt{2}$. We obtain the following table.
k     x_k                  x_k − √2            (x_k − √2)/(x_{k−1} − √2)
0     2                    0.5858 × 10^0       -
1     1.428571428571429    0.1436 × 10^-1      0.2451 × 10^-1
2     1.416909620991254    0.2696 × 10^-2      0.1878
3     1.414728799831946    0.5152 × 10^-3      0.1911
4     1.414312349239392    0.9879 × 10^-4      0.1917
5     1.414232514607664    0.1895 × 10^-4      0.1918
6     1.414217198786659    0.3636 × 10^-5      0.1919
7     1.414214260116949    0.6955 × 10^-6      0.1919
8     1.414213696254626    0.1339 × 10^-6      0.1919
19    1.414213562373097    0.1554 × 10^-14     -
Here, the constant $C$ in the linear convergence, to four significant digits, appears to be $0.1919 \approx 1/5$. That is, the error is reduced by approximately a factor of 5 each iteration. We can think of this as obtaining one more correct base-5 digit on each iteration.
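The observed constant can be checked against the derivative of the iteration function. The following Python sketch (an illustration only, in double precision, with hypothetical variable names) repeats the iteration of Example 1.8 and compares the error ratios with $g'(\sqrt{2}) = 1 - 2\sqrt{2}/3.5$, where $g(x) = x - x^2/3.5 + 2/3.5$.

```python
import math

root = math.sqrt(2.0)
x = 2.0
previous_error = None
ratios = []

for k in range(9):
    error = x - root
    if previous_error is not None:
        # Ratio of successive errors; tends to the convergence constant C.
        ratios.append(error / previous_error)
    previous_error = error
    x = x - x * x / 3.5 + 2.0 / 3.5

# For a fixed point iteration x_{k+1} = g(x_k), C = |g'(fixed point)|.
predicted_constant = 1.0 - 2.0 * root / 3.5

print(ratios[-1], predicted_constant)
```

The last computed ratio and the predicted constant agree to about four digits, matching the value 0.1919 in the table.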
1.2 Computer Arithmetic
In numerical solution of mathematical problems, two common types of error are:
1. Method (algorithm or truncation) error. This is the error due to approximations made in the numerical method.
2. Rounding error. This is the error made due to the finite number of digits available on a computer.
Example 1.9
By the mean value theorem for integrals (Theorem 1.2, as in Example 1.6 on page 7), if $f \in C^2[a, b]$, then
$$f'(x) = \frac{f(x+h) - f(x)}{h} - \frac{1}{h}\int_x^{x+h} f''(t)(x + h - t)\,dt$$
and
$$\left|\frac{1}{h}\int_x^{x+h} f''(t)(x + h - t)\,dt\right| \le ch.$$
Thus, $f'(x) \approx (f(x+h) - f(x))/h$, and the error is $O(h)$. We will call this the method error or truncation error, as opposed to roundoff errors due to using machine approximations.
Now consider $f(x) = \ln x$, and approximate $f'(3) \approx \dfrac{\ln(3+h) - \ln 3}{h}$ for $h$ small, using a calculator having 11 digits. The following results were obtained.
h         (ln(3+h) − ln(3))/h    Error = 1/3 − (ln(3+h) − ln(3))/h = O(h)
10^-1     0.3278982              5.44 × 10^-3
10^-2     0.332779               5.54 × 10^-4
10^-3     0.3332778              5.55 × 10^-5
10^-4     0.333328               5.33 × 10^-6
10^-5     0.333330               3.33 × 10^-6
10^-6     0.333300               3.33 × 10^-5
10^-7     0.333                  3.33 × 10^-4
10^-8     0.33                   3.33 × 10^-3
10^-9     0.3                    3.33 × 10^-2
10^-10    0.0                    3.33 × 10^-1
One sees that, in the first four steps, the error decreases by a factor of 10 as $h$ is decreased by a factor of 10. (That is, the method error dominates.) However, starting with $h = 0.00001$, the error increases. (The error due to a finite number of digits, i.e., roundoff error, dominates.)
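The same experiment can be repeated in double precision, where the crossover between method error and roundoff error occurs at a smaller $h$ than on the 11-digit calculator. The following Python sketch is an illustration under that assumption (IEEE double precision, standard `math.log`); the function name is hypothetical.

```python
import math

def forward_difference_error(h, x=3.0):
    """|f'(x) - (f(x+h) - f(x))/h| for f = ln, where f'(x) = 1/x."""
    approximation = (math.log(x + h) - math.log(x)) / h
    return abs(1.0 / x - approximation)

# Errors for h = 10^-1, 10^-2, ..., 10^-15, keyed by the exponent.
errors = {k: forward_difference_error(10.0 ** (-k)) for k in range(1, 16)}

for k, error in errors.items():
    print(f"h = 1e-{k:02d}   error = {error:.3e}")
```

For large $h$ the error shrinks in proportion to $h$; for very small $h$ the subtraction $\ln(3+h) - \ln 3$ loses nearly all significant digits and the error grows again.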
There are two possible ways to reduce rounding error:
1. The method error can be reduced by using a more accurate method. This allows larger $h$ to be used, thus avoiding roundoff error. Consider
$$f'(x) = \frac{f(x+h) - f(x-h)}{2h} + \text{error}, \quad \text{where error is } O(h^2).$$
h        (ln(3+h) − ln(3−h))/(2h)    error
0.1      0.3334568                   1.24 × 10^-4
0.01     0.3333345                   1.23 × 10^-6
0.001    0.3333333                   1.91 × 10^-8
The error decreases by a factor of 100 as $h$ is decreased by a factor of 10.
2. Rounding error can be reduced by using more digits of accuracy, such as using double precision (or multiple precision) arithmetic.
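Both remedies can be seen at once by evaluating the central difference in double precision. This Python sketch is an illustration only, with a hypothetical function name; it assumes IEEE double precision, in which roundoff is still negligible at $h = 0.001$.

```python
import math

def central_difference_error(h, x=3.0):
    """|f'(x) - (f(x+h) - f(x-h))/(2h)| for f = ln, where f'(x) = 1/x."""
    approximation = (math.log(x + h) - math.log(x - h)) / (2.0 * h)
    return abs(1.0 / x - approximation)

errors = [central_difference_error(h) for h in (0.1, 0.01, 0.001)]

# Each tenfold reduction in h should reduce the error about a hundredfold.
ratios = [errors[i] / errors[i + 1] for i in range(len(errors) - 1)]

print(errors)
print(ratios)
```

With more digits available, the third error is close to $1.23 \times 10^{-8}$, the value the $O(h^2)$ behavior predicts, rather than the roundoff-contaminated $1.91 \times 10^{-8}$ of the 11-digit calculation.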
To fully understand and avoid roundoff error, we should study some details of how computers and calculators represent and work with approximate numbers.
1.2.1 Floating Point Arithmetic and Rounding Error
Let $\beta$ be a positive integer, the base of the computer system. (Usually, $\beta = 2$ (binary) or $\beta = 16$ (hexadecimal).) Suppose a number $x$ has the exact base representation
$$x = (\pm 0.\alpha_1\alpha_2\alpha_3 \cdots \alpha_t\alpha_{t+1} \cdots)\beta^m = \pm q\beta^m,$$
where $q$ is the mantissa, $\beta$ is the base, $m$ is the exponent, $1 \le \alpha_1 \le \beta - 1$ and $0 \le \alpha_i \le \beta - 1$ for $i > 1$.
On a computer, we are restricted to a finite set of floating-point numbers $F = F(\beta, t, L, U)$ of the form $x^* = (\pm 0.a_1a_2 \cdots a_t)\beta^m$, where $1 \le a_1 \le \beta - 1$, $0 \le a_i \le \beta - 1$ for $2 \le i \le t$, $L \le m \le U$, and $t$ is the number of digits. (In most floating point systems, $L$ is about $-64$ to $-1000$ and $U$ is about 64 to 1000.)
Example 1.10
(binary) $\beta = 2$
$$x^* = (0.1011)_2 \, 2^3 = \left(1 \times \frac{1}{2} + 0 \times \frac{1}{4} + 1 \times \frac{1}{8} + 1 \times \frac{1}{16}\right) \times 8 = \frac{11}{2} = 5.5 \text{ (decimal)}.$$
REMARK 1.1 Most numbers cannot be exactly represented on a computer. Consider $x = 10.1 = 1010.0001\,1001\,1001\ldots$ ($\beta = 2$). If $L = -127$, $U = 127$, $t = 24$, and $\beta = 2$, then $x \approx x^* = (0.10100001\,1001\,1001\,1001\,1001)_2 \, 2^4$.
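Digit expansions such as those in Example 1.10 and Remark 1.1 can be evaluated exactly with rational arithmetic. The following Python sketch uses the standard `fractions` module; the function name `digits_to_value` is hypothetical, and the 24-bit string is the chopped expansion quoted in the remark.

```python
from fractions import Fraction

def digits_to_value(digits, base, exponent):
    """Exact value of (0.d1 d2 ... dt)_base * base**exponent."""
    mantissa = sum(Fraction(d, base ** (i + 1)) for i, d in enumerate(digits))
    return mantissa * Fraction(base) ** exponent

# Example 1.10: (0.1011)_2 * 2^3
x_star = digits_to_value([1, 0, 1, 1], 2, 3)
print(x_star)   # 11/2

# Remark 1.1: the 24-bit approximation to 10.1
bits = [int(ch) for ch in "101000011001100110011001"]
x24 = digits_to_value(bits, 2, 4)
gap = Fraction(101, 10) - x24
print(float(x24), float(gap))

# Even the 53-bit double closest to 10.1 is not exactly 10.1.
print(Fraction(10.1) == Fraction(101, 10))   # False
```

The gap between 10.1 and its 24-bit approximation is positive and smaller than $2^{4-24}$, one unit in the last retained binary place.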
Question: Given a real number $x$, how do we define a floating point number $fl(x)$ in $F$, such that $fl(x)$ is close to $x$?
On modern machines, one of the following four ways is used to approximate a real number $x$ by a machine-representable number $fl(x)$.
round down: $fl(x) = x\downarrow$, the nearest machine representable number to the real number $x$ that is less than or equal to $x$.
round up: $fl(x) = x\uparrow$, the nearest machine number to the real number $x$ that is greater than or equal to $x$.
round to nearest: $fl(x)$ is the nearest machine number to the real number $x$.
round to zero, or "chopping": $fl(x)$ is the nearest machine number to the real number $x$ that is no farther from 0 than $x$. The term "chopping" is because we simply "chop" the expansion of the real number, that is, we simply ignore the digits in the expansion of $x$ beyond the $t$-th one.
The default on modern systems is usually round to nearest, although chopping is faster or requires less circuitry. Round down and round up may be used, but with care, to produce results from a string of computations that are guaranteed to be less than or greater than the exact result.
Example 1.11
$\beta = 10$, $t = 5$, $x = 0.12345666\cdots \times 10^7$. Then
$fl(x) = 0.12345 \times 10^7$ (chopping),
$fl(x) = 0.12346 \times 10^7$ (rounded to nearest).
(In this case, round down corresponds to chopping and round up corresponds to round to nearest.)
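The four rounding modes can be tried out on the number from Example 1.11. This is a minimal Python sketch using the standard `decimal` module; the helper name `fl` is ours, and the finite decimal `1234566.6` is an assumed stand-in for the nonterminating $x = 0.12345666\cdots \times 10^7$.

```python
from decimal import (Decimal, Context, ROUND_DOWN, ROUND_HALF_EVEN,
                     ROUND_FLOOR, ROUND_CEILING)

def fl(x, rounding, t=5):
    """Round x to t significant decimal digits with the given mode."""
    return Context(prec=t, rounding=rounding).plus(x)

x = Decimal("1234566.6")

chopped = fl(x, ROUND_DOWN)          # round to zero
nearest = fl(x, ROUND_HALF_EVEN)     # round to nearest
down = fl(x, ROUND_FLOOR)            # round down
up = fl(x, ROUND_CEILING)            # round up
print(chopped, nearest, down, up)

# For a negative number, round down and chopping differ.
print(fl(-x, ROUND_FLOOR), fl(-x, ROUND_DOWN))
```

For positive $x$, round down coincides with chopping, as noted in the example; for negative $x$ it is round up that coincides with chopping.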
See Figure 1.2 for an example with $\beta = 10$ and $t = 1$. In that figure, the exhibited floating point numbers are $(0.1) \times 10^1$, $(0.2) \times 10^1$, ..., $(0.9) \times 10^1$, $0.1 \times 10^2$.
[Figure: number line showing the successive floating point numbers from $\beta^{m-1} = 1$ to $\beta^m = 10^1$, with spacing $\beta^{m-t} = 10^0 = 1$.]
FIGURE 1.2: An example floating point system: $\beta = 10$, $t = 1$, and $m = 1$.
Example 1.12
Let $a = 0.410$, $b = 0.000135$, and $c = 0.000431$. Assuming 3-digit decimal computer arithmetic with rounding to nearest, does $a + (b + c) = (a + b) + c$ when using this arithmetic?
Following the "rounding to nearest" definition of $fl$, we emulate the operations a machine would do, as follows:
$$a \leftarrow 0.410 \times 10^0, \quad b \leftarrow 0.135 \times 10^{-3}, \quad c \leftarrow 0.431 \times 10^{-3},$$
and
$$fl(b + c) = fl(0.135 \times 10^{-3} + 0.431 \times 10^{-3}) = fl(0.566 \times 10^{-3}) = 0.566 \times 10^{-3},$$
so
$$\begin{aligned} fl(a + 0.566 \times 10^{-3}) &= fl(0.410 \times 10^0 + 0.566 \times 10^{-3}) \\ &= fl(0.410 \times 10^0 + 0.000566 \times 10^0) \\ &= fl(0.410566 \times 10^0) \\ &= 0.411 \times 10^0. \end{aligned}$$
On the other hand,
$$\begin{aligned} fl(a + b) &= fl(0.410 \times 10^0 + 0.135 \times 10^{-3}) \\ &= fl(0.410000 \times 10^0 + 0.000135 \times 10^0) \\ &= fl(0.410135 \times 10^0) \\ &= 0.410 \times 10^0, \end{aligned}$$
so
$$\begin{aligned} fl(0.410 \times 10^0 + c) &= fl(0.410 \times 10^0 + 0.431 \times 10^{-3}) \\ &= fl(0.410 \times 10^0 + 0.000431 \times 10^0) \\ &= fl(0.410431 \times 10^0) \\ &= 0.410 \times 10^0 \ne 0.411 \times 10^0. \end{aligned}$$
Thus, the associative law does not hold for floating point arithmetic with "round to nearest." Furthermore, this illustrates that accuracy is improved if numbers of like magnitude are added first in a sum.
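The two orders of summation in Example 1.12 can be emulated mechanically. The following Python sketch assumes the standard `decimal` module with a 3-digit context; Python's round-half-even tie rule plays no role here, since no ties occur.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

# 3-digit decimal arithmetic with rounding to nearest.
ctx = Context(prec=3, rounding=ROUND_HALF_EVEN)

a = Decimal("0.410")
b = Decimal("0.000135")
c = Decimal("0.000431")

small_first = ctx.add(a, ctx.add(b, c))   # a + (b + c)
large_first = ctx.add(ctx.add(a, b), c)   # (a + b) + c

print(small_first)   # 0.411
print(large_first)   # 0.410
```

The exact sum is 0.410566, so adding the two small numbers first gives the correctly rounded result.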
The following error bound is useful in some analyses.
THEOREM 1.5
$$|x - fl(x)| \le \frac{1}{2}|x|\beta^{1-t}p,$$
where $p = 1$ for rounding and $p = 2$ for chopping.
DEFINITION 1.3 $\delta = \dfrac{p}{2}\beta^{1-t}$ is called the unit roundoff error.
Let $\epsilon = \dfrac{fl(x) - x}{x}$. Then $fl(x) = (1 + \epsilon)x$, where $|\epsilon| \le \delta$. With this, we have the following.
THEOREM 1.6
Let $\odot$ denote the operation $+$, $-$, $\times$, or $\div$, and let $x$ and $y$ be machine numbers. Then
$$fl(x \odot y) = (x \odot y)(1 + \epsilon), \quad \text{where } |\epsilon| \le \delta = \frac{p}{2}\beta^{1-t}.$$
Roundoff error that accumulates as a result of a sequence of arithmetic operations can be analyzed using this theorem. Such an analysis is called forward error analysis.
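These bounds can be checked on an actual machine. The Python sketch below is an illustration that assumes IEEE double precision with round to nearest, that is, $\beta = 2$, $t = 53$, $p = 1$ in the notation above; exact values are obtained with the standard `fractions` module, and the helper name is hypothetical.

```python
from fractions import Fraction

beta, t, p = 2, 53, 1                                 # assumed: IEEE double
delta = Fraction(p, 2) * Fraction(beta) ** (1 - t)    # unit roundoff, 2**-53

def relative_error(exact, computed):
    """|computed - exact| / |exact|, evaluated exactly."""
    return abs(Fraction(computed) - exact) / abs(exact)

# Theorem 1.5: storing a real (here, rational) number as a machine number.
storage_errors = [relative_error(Fraction(n, d), n / d)
                  for n, d in ((1, 10), (1, 3), (2, 7))]

# Theorem 1.6: one arithmetic operation on machine numbers x and y.
x, y = 0.1, 0.2
operation_errors = [
    relative_error(Fraction(x) + Fraction(y), x + y),
    relative_error(Fraction(x) * Fraction(y), x * y),
    relative_error(Fraction(x) / Fraction(y), x / y),
]

print(float(delta))
print([float(e) for e in storage_errors + operation_errors])
```

Every relative error printed is at most $\delta = 2^{-53} \approx 1.1 \times 10^{-16}$.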
Because of the properties of floating point arithmetic, it is unreasonable to demand strict tolerances when the exact result is too large.
Example 1.13
Suppose $\beta = 10$ and $t = 3$ (3-digit decimal arithmetic), and suppose we wish to compute $10^4\pi$ with a computed value $x$ such that $|10^4\pi - x| < 10^{-2}$. The closest floating point number in our system to $10^4\pi$ is $x = 0.314 \times 10^5 = 31400$. However, $|10^4\pi - x| = 15.926\ldots$. Hence, it is impossible to find a number $x$ in the system with $|10^4\pi - x| < 10^{-2}$.
The error $|10^4\pi - x|$ in this example is called the absolute error in approximating $10^4\pi$. We see that absolute error is not an appropriate measure of error when using floating point arithmetic. For this reason, we use relative error:
DEFINITION 1.4 Let $x^*$ be an approximation to $x$. Then $|x - x^*|$ is called the absolute error, and $\left|\dfrac{x - x^*}{x}\right|$ is called the relative error.
For example, $\left|\dfrac{x - fl(x)}{x}\right| \le \delta = \dfrac{p}{2}\beta^{1-t}$ (unit roundoff error).
1.2.1.1 Expression Evaluation and Condition Numbers
We now examine some common situations in which roundoff error can become large, and explain how to avoid many of these situations.
Example 1.14
$\beta = 10$, $t = 4$, $p = 1$. (Thus, $\delta = \frac{1}{2}10^{-3} = 0.0005$.) Let $x = 0.5795 \times 10^5$, $y = 0.6399 \times 10^5$. Then
$fl(x + y) = 0.1219 \times 10^6 = (x + y)(1 + \epsilon_1)$, $\epsilon_1 \approx -3.28 \times 10^{-4}$, $|\epsilon_1| < \delta$, and
$fl(xy) = 0.3708 \times 10^{10} = (xy)(1 + \epsilon_2)$, $\epsilon_2 \approx -5.95 \times 10^{-5}$, $|\epsilon_2| < \delta$.
(Note: $x + y = 0.12194 \times 10^6$, $xy = 0.37082205 \times 10^{10}$.)
Example 1.15
Suppose $\beta = 10$ and $t = 4$ (4 digit arithmetic), suppose $x_1 = 10000$ and $x_2 = x_3 = \cdots = x_{1001} = 1$. Then
$$fl(x_1 + x_2) = 10000, \quad fl(x_1 + x_2 + x_3) = 10000, \quad \ldots, \quad fl\left(\sum_{i=1}^{1001} x_i\right) = 10000$$
when we sum forward from $x_1$. But going backwards,
$$fl(x_{1001} + x_{1000}) = 2, \quad fl(x_{1001} + x_{1000} + x_{999}) = 3, \quad \ldots, \quad fl\left(\sum_{i=1001}^{1} x_i\right) = 11000,$$
which is the correct sum.
This example illustrates the point that large relative errors occur when a large number of small numbers is added to a large number, or when a very large number of small, almost-equal numbers is added. To avoid such large relative errors, one can sum from the smallest number to the largest number. However, this will not work if the numbers are all approximately equal. In such cases, one possibility is to group the numbers into sets of two adjacent numbers, summing two almost equal numbers together. One then groups those results into sets of two and sums these together, continuing until the total sum is reached. In this scheme, two almost equal numbers are always being summed, and the large relative error from repeatedly summing a small number to a large number is avoided.
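Both the forward/backward sums of Example 1.15 and the pairwise grouping just described can be emulated. The following Python sketch assumes the standard `decimal` module with a 4-digit round-to-nearest context; the function names and the second data set (20000 ones) are hypothetical choices made for illustration.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

ctx = Context(prec=4, rounding=ROUND_HALF_EVEN)   # 4-digit decimal arithmetic

def running_sum(values):
    """Add the values one at a time, rounding after every addition."""
    total = Decimal(0)
    for v in values:
        total = ctx.add(total, v)
    return total

def pairwise_sum(values):
    """Split the list in half, sum each half recursively, add the results."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return ctx.add(pairwise_sum(values[:mid]), pairwise_sum(values[mid:]))

# Example 1.15: one large number followed by 1000 ones.
data = [Decimal(10000)] + [Decimal(1)] * 1000
forward = running_sum(data)           # stays at 10000
backward = running_sum(data[::-1])    # reaches the correct 11000

# Many equal numbers: sorting does not help, pairwise grouping does.
ones = [Decimal(1)] * 20000
sequential_ones = running_sum(ones)   # stalls once the total reaches 10000
pairwise_ones = pairwise_sum(ones)    # the correct 20000

print(forward, backward, sequential_ones, pairwise_ones)
```

In the pairwise scheme every addition combines two partial sums of nearly equal size, so no small addend is lost against a large running total.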
Example 1.16
$x_1 = 15.314768$, $x_2 = 15.314899$, $\beta = 10$, $t = 6$ (6-digit decimal accuracy). Then $x_2 - x_1 \approx fl(x_2) - fl(x_1) = 15.3149 - 15.3148 = 0.0001$. Thus,
$$\left|\frac{x_2 - x_1 - (fl(x_2) - fl(x_1))}{x_2 - x_1}\right| = \frac{0.000131 - 0.0001}{0.000131} = 0.237,$$
that is, a relative error of 23.7%.
This example illustrates that large relative errors can occur when two nearly equal numbers are subtracted on a computer. Sometimes, an algorithm can be modified to reduce rounding error occurring from this source, as the following example illustrates.
Example 1.17
Consider finding the roots of $ax^2 + bx + c = 0$, where $b^2$ is large compared with $|4ac|$. The most common formula for the roots is
$$x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.$$
Consider $x^2 + 100x + 1 = 0$, $\beta = 10$, $t = 4$, $p = 2$, and 4-digit chopped arithmetic. Then
$$x_1 = \frac{-100 + \sqrt{9996}}{2}, \quad x_2 = \frac{-100 - \sqrt{9996}}{2},$$
but $\sqrt{9996} \approx 99.97$ (4-digit arithmetic, chopped). Thus,
$$x_1 \approx \frac{-100 + 99.97}{2}, \quad x_2 \approx \frac{-100 - 99.97}{2}.$$
Hence, $x_1 \approx -0.015$, $x_2 \approx -99.98$, but $x_1 = -0.010001$ and $x_2 = -99.989999$, so the relative errors in $x_1$ and $x_2$ are 50% and 0.01%, respectively.
Let's change the algorithm. Assume $b \ge 0$ (can always make $b \ge 0$). Then
$$x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \cdot \frac{-b - \sqrt{b^2 - 4ac}}{-b - \sqrt{b^2 - 4ac}} = \frac{4ac}{2a\left(-b - \sqrt{b^2 - 4ac}\right)} = \frac{-2c}{b + \sqrt{b^2 - 4ac}}$$
and
$$x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a} \quad \text{(the same as before)}.$$
Then, for the above values,
$$x_1 = \frac{-2(1)}{100 + \sqrt{9996}} \approx \frac{-2}{100 + 99.97} = -0.0100.$$
Now, the relative error in $x_1$ is also 0.01%.
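The same rearrangement helps in IEEE double precision. The following Python sketch compares the usual formula with the rearranged one on a test equation of our own choosing ($x^2 + 10^8 x + 1 = 0$, whose roots are very nearly $-10^{-8}$ and $-10^{8}$). The function names are illustrative, and the code assumes $a \ne 0$ and $b^2 - 4ac > 0$; the use of `copysign` is our way of covering the case $b < 0$ without first changing the sign of $b$.

```python
import math

def naive_roots(a, b, c):
    """Usual formula; one root suffers cancellation when b*b is much larger than |4ac|."""
    d = math.sqrt(b * b - 4.0 * a * c)
    return (-b + d) / (2.0 * a), (-b - d) / (2.0 * a)

def stable_roots(a, b, c):
    """Rearranged formula that never subtracts nearly equal quantities.

    Assumes a != 0 and b*b - 4ac > 0 (two distinct real roots).
    """
    d = math.sqrt(b * b - 4.0 * a * c)
    q = -0.5 * (b + math.copysign(d, b))   # b and the signed square root have the same sign
    # For b >= 0: c / q = -2c / (b + d) and q / a = (-b - d) / (2a).
    return c / q, q / a

a, b, c = 1.0, 1.0e8, 1.0
x1_naive, x2_naive = naive_roots(a, b, c)
x1_stable, x2_stable = stable_roots(a, b, c)

print(x1_naive, x1_stable)
```

The small root from the usual formula is wrong already in its first significant digit, while the rearranged formula gives it to nearly full precision.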
Let us now consider error in function evaluation. Consider a single-valued function $f(x)$ and let $x^* = fl(x)$ be the floating point approximation of $x$. Therefore, the machine evaluates $f(x^*) = f(fl(x))$, which is an approximate value of $f(x)$ at $x = x^*$. Then the perturbation in $f(x)$ for small perturbations in $x$ can be computed via Taylor's formula. This is illustrated in the next theorem.
THEOREM 1.7 The relative error in functional evaluation is
$$\left| \frac{f(x) - f(x^*)}{f(x)} \right| \approx \left| \frac{x f'(x)}{f(x)} \right| \left| \frac{x - x^*}{x} \right|.$$
PROOF The linear Taylor approximation of $f(x^*)$ about $x$, for small values of $|x - x^*|$, is given by $f(x^*) \approx f(x) + f'(x)(x^* - x)$. Rearranging the terms immediately yields the result.
This leads us to the following definition.
DEFINITION 1.5 The condition number of a function $f(x)$ is
$$\kappa_f(x) = \left| \frac{x f'(x)}{f(x)} \right|.$$
The condition number describes how large the relative error in function evaluation is with respect to the relative error in the machine representation of $x$. In other words, $\kappa_f(x)$ is a measure of the degree of sensitivity of the function at $x$.
Example 1.18
Let $f(x) = \sqrt{x}$. The condition number of $f(x)$ about $x$ is
$$\kappa_f(x) = \left| \frac{x \cdot \frac{1}{2\sqrt{x}}}{\sqrt{x}} \right| = \frac{1}{2}.$$
This suggests that $f(x)$ is well-conditioned.
Example 1.19
Let $f(x) = \sqrt{x - 2}$. The condition number of $f(x)$ about $x$ is
$$\kappa_f(x) = \left| \frac{x}{2(x - 2)} \right|.$$
This is not defined at $x^* = 2$. Hence the function $f(x)$ is numerically unstable and ill-conditioned for values of $x$ close to 2.
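To make Definition 1.5 and the two examples concrete, the sketch below evaluates the condition number numerically and compares it with an observed ratio of relative errors, as in Theorem 1.7. The helper `condition_number`, the evaluation points, and the size of the perturbation are all our own illustrative choices.

```python
import math

def condition_number(f, fprime, x):
    """Return |x f'(x) / f(x)|, the quantity in Definition 1.5."""
    return abs(x * fprime(x) / f(x))

# Example 1.18: f(x) = sqrt(x), evaluated at x = 9.
kappa_sqrt = condition_number(math.sqrt, lambda x: 0.5 / math.sqrt(x), 9.0)

# Example 1.19: f(x) = sqrt(x - 2), evaluated close to x = 2.
def g(x):
    return math.sqrt(x - 2.0)

def gprime(x):
    return 0.5 / math.sqrt(x - 2.0)

x = 2.0 + 1e-6
kappa_near_2 = condition_number(g, gprime, x)

# Empirical check of Theorem 1.7: perturb x slightly and compare relative errors.
x_star = x + 1e-9
rel_in = abs((x - x_star) / x)
rel_out = abs((g(x) - g(x_star)) / g(x))

print(kappa_sqrt, kappa_near_2, rel_out / rel_in)
```

The observed amplification `rel_out / rel_in` is close to `kappa_near_2`, which is about a million at this point, while the square root itself has condition number one half everywhere.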
REMARK 1.2 If $x = f(x) = 0$, then the condition number is simply $|f'(x)|$. If $x = 0$, $f(x) \ne 0$ (or $f(x) = 0$, $x \ne 0$), then it is more useful to consider the relation between absolute errors than relative errors. The condition number then becomes $|f'(x)/f(x)|$.
REMARK 1.3 Generally, if a numerical approximation $\tilde{z}$ to a quantity $z$ is computed, the relative error is related to the number of digits after the decimal point that are correct. For example, if $z = 0.0000123453$ and $\tilde{z} = 0.00001234543$, we say that $\tilde{z}$ is correct to 5 significant digits. Expressing $z$ as $0.123453 \times 10^{-4}$ and $\tilde{z}$ as $0.123454 \times 10^{-4}$, we see that if we round $\tilde{z}$ to the nearest number with five digits in its mantissa, all of those digits are correct, whereas, if we do the same with six digits, the sixth digit is not correct. Significant digits is the more logical way to talk about accuracy in a floating point computation where we are interested in relative error, rather than "number of digits after the decimal point," which can have a different meaning. (Here, one might say that $\tilde{z}$ is correct to 9 digits after the decimal point.)
1.2.2 Practicalities and the IEEE Floating Point Standard
Prior to 1985, different machines used different word lengths and different bases, and different machines rounded, chopped, or did something else to form the internal representation $fl(x)$ for real numbers $x$. For example, IBM mainframes generally used hexadecimal arithmetic ($\beta = 16$), with 8 hexadecimal digits total (for the base, sign, and exponent) in "single precision" numbers and 16 hexadecimal digits total in "double precision" numbers. Machines such as the Univac 1108 and Honeywell Multics systems used base $\beta = 2$ and 36 binary digits (or "bits") total in single precision numbers and 72 binary digits total in double precision numbers. An unusual machine designed at Moscow State University from 1955 to 1965, the "Setun," even used base-3 ($\beta = 3$, or "ternary") numbers. Some computers had 32 bits total in single precision numbers and 64 bits total in double precision numbers, while some "supercomputers" (such as the Cray-1) had 64 bits total in single precision numbers and 128 bits total in double precision numbers.
Some hand-held calculators in existence today (such as some Texas Instruments calculators) can be viewed as implementing decimal (base 10, $\beta = 10$) arithmetic, say, with $L = -999$ and $U = 999$, and $t = 14$ digits in the mantissa.
Except for the Setun (the value of whose ternary digits corresponded to "positive," "negative," and "neutral" in circuit elements or switches), digital computers are mostly based on binary switches or circuit elements (that is, "on" or "off"), so the base $\beta$ is usually 2 or a power of 2. For example, the IBM hexadecimal digit could be viewed as a group of 4 binary digits.²
Older floating point implementations did not even always fit exactly into the model we have previously described. For example, if $x$ is a number in the system, then $-x$ may not have been a number in the system, or, if $x$ were a number in the system, then $1/x$ may have been too large to be representable in the system.
To promote predictability, portability, reliability, and rigorous error bounding in floating point computations, the Institute of Electrical and Electronics Engineers (IEEE) and American National Standards Institute (ANSI) published a standard for binary floating point arithmetic in 1985: IEEE/ANSI 754-1985: Standard for Binary Floating Point Arithmetic, often referenced as "IEEE-754," or simply "the IEEE standard."³ Almost all computers in existence today, including personal computers and workstations based on Intel, AMD, Motorola, etc. chips, implement most of the IEEE standard.
In this standard, $\beta = 2$, 32 bits total are used in a single precision number (an "IEEE single"), and 64 bits total are used for a double precision number ("IEEE double"). In a single precision number, 1 bit is used for the sign, 8 bits are used for the exponent, and $t = 23$ bits are used for the mantissa. In double precision numbers, 1 bit is used for the sign, 11 bits are used for the exponent, and 52 bits are used for the mantissa. Thus, for single precision numbers, the exponent is between 0 and $(11111111)_2 = 255$, and 127 is subtracted from this, to get an exponent between $-127$ and $128$. In IEEE numbers, the minimum and maximum exponent are used to denote special symbols (such as infinity and "unnormalized" numbers), so the exponent in single precision represents magnitudes between $2^{-126} \approx 10^{-38}$ and $2^{127} \approx 10^{38}$. The mantissa for single precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{23} 2^{-i} = 2(1 - 2^{-24}) \approx 2$. Similarly, the exponent for double precision numbers is effectively between $2^{-1022} \approx 10^{-308}$ and $2^{1023} \approx 10^{308}$, while the mantissa for double precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{52} 2^{-i} \approx 2$.
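The bit layout just described can be inspected directly. The following Python sketch uses the standard `struct` module to split a number into its sign, stored (biased) exponent, and mantissa bits; the function names are ours.

```python
import struct

def decode_double(x):
    """Split an IEEE double into (sign bit, stored exponent, 52 mantissa bits)."""
    bits = int.from_bytes(struct.pack(">d", x), "big")
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

def decode_single(x):
    """Split an IEEE single into (sign bit, stored exponent, 23 mantissa bits)."""
    bits = int.from_bytes(struct.pack(">f", x), "big")
    return bits >> 31, (bits >> 23) & 0xFF, bits & ((1 << 23) - 1)

print(decode_single(1.0))   # stored exponent 127 represents 2^0
print(decode_double(1.0))   # stored exponent 1023 represents 2^0
print(decode_double(-2.0))  # sign bit set, stored exponent 1024 represents 2^1
```

Subtracting 127 (single) or 1023 (double) from the stored exponent gives the actual power of 2.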
Summarizing, the parameters for IEEE arithmetic appear in Table 1.1.

TABLE 1.1: Parameters for IEEE arithmetic

precision    β    L        U       t
single       2    -126     127     24
double       2    -1022    1023    53

In many numerical computations, such as solving the large linear systems arising from partial differential equation models, more digits or a larger exponent range is required than is available with IEEE single precision. For this reason, many numerical analysts at present have adopted IEEE double precision as the default precision. For example, underlying computations in the popular computational environment matlab are done in IEEE double precision.

² An exception is in some systems for business calculations, where base 10 is implemented.
³ An update to the 1985 standard was made in 2008. This update gives clarifications of certain ambiguous points, provides certain extensions, and specifies a standard for decimal arithmetic.
IEEE arithmetic provides four ways of defining $fl(x)$, that is, four "rounding modes," namely, "round down," "round up," "round to nearest," and "round to zero." These are specified as follows. The four elementary operations $+$, $-$, $\times$, and $/$ must be such that $fl(x \odot y)$ is implemented for all four rounding modes, for $\odot \in \{-, +, \times, /, \sqrt{\cdot}\}$.
The default mode (if the rounding mode is not explicitly set) is normally "round to nearest," to give an approximation after a long string of computations that is hopefully near the exact value. If the mode is set to "round down" and a string of computations is done, then the result is less than or equal to the exact result. Similarly, if the mode is set to "round up," then the result of a string of computations is greater than or equal to the exact result. In this way, mathematically rigorous bounds on an exact result can be obtained. (This technique must be used astutely, since naive use could result in bounds that are too large to be meaningful.)
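Standard Python does not expose the hardware rounding mode for binary floating point, so the sketch below illustrates the idea in emulated 4-digit decimal arithmetic instead, using the rounding options of the `decimal` module. It bounds the sum $1 + 1/2 + \cdots + 1/10$ from below and from above; the choice of sum and the function name are ours. The bounds are valid here because only additions and divisions of positive numbers occur.

```python
from decimal import Decimal, Context, ROUND_FLOOR, ROUND_CEILING
from fractions import Fraction

down = Context(prec=4, rounding=ROUND_FLOOR)     # "round down"
up = Context(prec=4, rounding=ROUND_CEILING)     # "round up"

def bounded_harmonic_sum(n):
    """Return (lower, upper) bounds on 1 + 1/2 + ... + 1/n in 4-digit arithmetic."""
    lower = Decimal(0)
    upper = Decimal(0)
    for k in range(1, n + 1):
        lower = down.add(lower, down.divide(Decimal(1), Decimal(k)))
        upper = up.add(upper, up.divide(Decimal(1), Decimal(k)))
    return lower, upper

lower, upper = bounded_harmonic_sum(10)
exact = sum(Fraction(1, k) for k in range(1, 11))   # exact rational value

print(lower, float(exact), upper)
```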
Several parameters more directly related to numerical computations than $L$, $U$, and $t$ are associated with any floating point number system. These are
HUGE: the largest representable number in the floating point system,
TINY: the smallest positive representable number in the floating point system,
$\epsilon_m$: the machine epsilon, the smallest positive number which, when added to 1, gives something other than 1 when using the rounding mode "round to the nearest."
These so-called "machine constants" appear in Table 1.2 for the IEEE single and IEEE double precision number systems.
For IEEE arithmetic, $1/\mathrm{TINY} < \mathrm{HUGE}$, but $1/\mathrm{HUGE} < \mathrm{TINY}$. This brings up the question of what happens when the result of a computation has absolute value less than the smallest number representable in the system, or has absolute value greater than the largest number representable in the system. In the first case, an underflow occurs, while, in the second case, an overflow occurs. In floating point computations, it is usually (but not always) reasonable to replace the result of an underflow by 0, but it is usually more problematical when an overflow occurs. Many systems prior to the IEEE standard replaced an underflow by 0 but stopped when an overflow occurred.

TABLE 1.2: Machine constants for IEEE arithmetic

Precision    HUGE                                                       TINY                                          $\epsilon_m$
single       $2^{127}(2 - 2^{-23}) \approx 3.40 \cdot 10^{38}$         $2^{-126} \approx 1.18 \cdot 10^{-38}$       $2^{-24} + 2^{-47} \approx 5.96 \cdot 10^{-8}$
double       $2^{1023}(2 - 2^{-52}) \approx 1.80 \cdot 10^{308}$       $2^{-1022} \approx 2.23 \cdot 10^{-308}$     $2^{-53} + 2^{-105} \approx 1.11 \cdot 10^{-16}$
The IEEE standard specifies representations for special numbers $\infty$, $-\infty$, $+0$, $-0$, and NaN, where the latter represents "not a number." The standard specifies that computations do not stop when an overflow or underflow occurs, or when quantities such as $\sqrt{-1}$, $1/0$, $-1/0$, etc. are encountered (although many programming languages by default or optionally do stop). For example, the result of an overflow is set to $\infty$, whereas the result of $\sqrt{-1}$ is set to NaN, and computation continues. The standard also specifies "gradual underflow," that is, setting the result to a "denormalized" number, or a number in the floating point format whose first digit in the mantissa is equal to 0. Computation rules for these special numbers, such as NaN $\times$ any number = NaN, $\infty \times$ any positive normalized number = $\infty$, allow such "nonstop" arithmetic.
Although the IEEE nonstop arithmetic is useful in many contexts, the numerical analyst should be aware of it and be cautious in interpreting results. In particular, algorithms may not behave as expected if many intermediate results contain $\infty$ or NaN, and the accuracy is less than expected when denormalized numbers are used. In fact, many programming languages, by default or with a controllable option, stop if $\infty$ or NaN occurs, but implement IEEE nonstop arithmetic with an option.
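A few of these rules can be observed in Python, whose `float` type is an IEEE double on common platforms. The sketch is only illustrative. Note that Python is itself one of the languages that stops for some operations: `math.sqrt(-1.0)` raises an exception rather than returning NaN.

```python
import math

inf = float("inf")
nan = float("nan")

overflowed = 1e308 * 10.0          # overflow in a multiplication gives infinity
nan_product = nan * 3.0            # NaN times any number is NaN
inf_product = inf * 2.5            # infinity times a positive normalized number
indeterminate = inf - inf          # no meaningful value, so NaN

tiny = 2.2250738585072014e-308     # smallest positive normalized double
subnormal = tiny / 4.0             # gradual underflow: a denormalized number
flushed = 5e-324 / 4.0             # below the smallest denormalized number: 0

try:
    math.sqrt(-1.0)
    sqrt_stopped = False
except ValueError:
    sqrt_stopped = True            # Python stops here instead of returning NaN

print(overflowed, nan_product, inf_product, indeterminate, subnormal, flushed, sqrt_stopped)
```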
Example 1.20
IEEE double precision floating point arithmetic underlies most computations in matlab. (This is true even if only four decimal digits are displayed.) One obtains the machine epsilon with the function eps, one obtains TINY with the function realmin, and one obtains HUGE with the function realmax. Observe the following matlab dialog:
>> epsm = eps(1d0)
epsm = 2.2204e-016
>> TINY = realmin
TINY = 2.2251e-308
>> HUGE = realmax
HUGE = 1.7977e+308
>> 1/TINY
ans = 4.4942e+307
>> 1/HUGE
ans = 5.5627e-309
>> HUGE^2
ans = Inf
>> TINY^2
ans = 0
>> new_val = 1+epsm
new_val = 1.0000
>> new_val - 1
ans = 2.2204e-016
>> too_small = epsm/2
too_small = 1.1102e-016
>> not_new = 1+too_small
not_new = 1
>> not_new - 1
ans = 0
>>
Example 1.21
(Illustration of underflow and overflow) Suppose, for the purposes of illustration, we have a system with $\beta = 10$, $t = 2$, and one digit in the exponent, so that the positive numbers in the system range from $0.10 \times 10^{-9}$ to $0.99 \times 10^{9}$, and suppose we wish to compute $N = \sqrt{x_1^2 + x_2^2}$, where $x_1 = x_2 = 10^6$. Then both $x_1$ and $x_2$ are exactly represented in the system, and the nearest floating point number in the system to $N$ is $0.14 \times 10^{7}$, well within range. However, $x_1^2 = 10^{12}$, larger than the maximum floating point number in the system. In older systems, an overflow usually would result in stopping the computation, while in IEEE arithmetic, the result would be assigned the symbol "Infinity." The result of adding "Infinity" to "Infinity," then taking the square root, would be "Infinity," so that $N$ would be assigned "Infinity." Similarly, if $x_1 = x_2 = 10^{-6}$, then $x_1^2 = 10^{-12}$, smaller than the smallest representable machine number, causing an "underflow." On older systems, the result is usually set to 0. On IEEE systems, if "gradual underflow" is switched on, the result either becomes a denormalized number, with less than full accuracy, or is set to 0; without gradual underflow on IEEE systems, the result is set to 0. When the result is set to 0, a value of 0 is stored in $N$, whereas the closest floating point number in the system is $0.14 \times 10^{-5}$, well within range. To avoid this type of catastrophic underflow and overflow in the computation of $N$, we may use the following scheme.
1. $s \leftarrow \max\{|x_1|, |x_2|\}$.
2. $\eta_1 \leftarrow x_1/s$; $\eta_2 \leftarrow x_2/s$.
3. $N \leftarrow s \sqrt{\eta_1^2 + \eta_2^2}$.
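The three-step scheme above can be sketched in Python as follows, with IEEE double precision in place of the two-digit toy system. The test values $10^{200}$ and $10^{-200}$ are our own, chosen so that their squares overflow and underflow in double precision; the guard for $s = 0$ is an addition of ours to avoid dividing zero by zero.

```python
import math

def naive_norm(x1, x2):
    """Direct evaluation of sqrt(x1^2 + x2^2); the squares may overflow or underflow."""
    return math.sqrt(x1 * x1 + x2 * x2)

def scaled_norm(x1, x2):
    s = max(abs(x1), abs(x2))                         # step 1
    if s == 0.0:
        return 0.0                                    # both inputs are zero
    eta1, eta2 = x1 / s, x2 / s                       # step 2
    return s * math.sqrt(eta1 * eta1 + eta2 * eta2)   # step 3

big, small = 1e200, 1e-200
print(naive_norm(big, big), scaled_norm(big, big))
print(naive_norm(small, small), scaled_norm(small, small))
```

The direct formula returns infinity and 0 for these inputs, while the scaled version returns values close to $\sqrt{2} \times 10^{200}$ and $\sqrt{2} \times 10^{-200}$. Python's standard library function `math.hypot` computes the same quantity.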
1.2.2.1 Input and Output
For examining the output to large numerical computations arising from mathematical models, plots, graphs, and movies comprised of such plots and graphs are often preferred over tables of values. However, to develop such models and study numerical algorithms, it is necessary to examine individual numbers. Because humans are trained to comprehend decimal numbers more easily than binary numbers, the binary format used in the machine is usually converted to a decimal format for display or printing. In many programming languages and environments (such as all versions of Fortran, C, C++, and in matlab), the format is of a form similar to $\pm d_1.d_2 d_3 \ldots d_m \mathrm{e} \pm \delta_1 \delta_2 \delta_3$, or $\pm d_1.d_2 d_3 \ldots d_m \mathrm{E} \pm \delta_1 \delta_2 \delta_3$, where the "e" or "E" denotes the "exponent" of 10. For example, -1.00e+003 denotes $-1 \times 10^{3} = -1000$. Numbers are usually also input either in a standard decimal form (such as 0.001), or in this exponential format (such as 1.0e-3). (This notation originates from the earliest computers, where the only output was a printer, and the printer could only print numerical digits and the 26 upper case letters in the Roman alphabet.)
Thus, for input, a decimal fraction needs to be converted to a binary floating point number, while, for output, a binary floating point number needs to be converted to a decimal fraction. This conversion necessarily is inexact. For example, the exact decimal fraction 0.1 converts to the infinitely repeating binary expansion $(0.0\overline{0011})_2$, which needs to be rounded into the binary floating point system. The IEEE 754 standard specifies that the result of a decimal to binary conversion, within a specified range of input formats, be the nearest floating point number to the exact result, over a specified range, and that, within a specified range of formats, a binary to decimal conversion be the nearest number in the specified format (which depends on the number $m$ of decimal digits requested to be printed).
Thus, the number that one sees as output is usually not exactly the number that is represented in the computer. Furthermore, while the floating point operations on binary numbers are usually implemented in hardware or "firmware" independently of the software system, the decimal to binary and binary to decimal conversions are usually implemented separately as part of the programming language (such as Fortran, C, C++, Java, etc.) or software system (such as matlab). The individual standards for these languages, if there are any, may not specify accuracy for such conversions, and the languages sometimes do not conform to the IEEE standard. That is, the number that one sees printed may not even be the closest number in that format to the actual number.
This inexactness in conversion usually does not cause a problem, but may cause much confusion in certain instances. In those instances (such as in "debugging," or finding programming blunders), one may need to examine the binary numbers directly. One way of doing this is in an "octal," or base-8 format, in which each digit (between 0 and 7) is interpreted as a group of three binary digits, or in hexadecimal format (where the digits are 0-9, A, B, C, D, E, F), in which each digit corresponds to a group of four binary digits.
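As an illustration of these conversions, the following Python sketch shows the value actually stored for the decimal fraction 0.1 in IEEE double precision, the same value in hexadecimal format, and a rounded decimal display of it. Only standard library calls are used; the variable names are ours.

```python
from decimal import Decimal

stored = Decimal(0.1)          # exact decimal expansion of the double nearest to 0.1
hex_form = (0.1).hex()         # hexadecimal view of the same binary number
short_print = f"{0.1:.4e}"     # what one sees when five digits are requested

print(stored)
print(hex_form)
print(short_print)
```

The repeating hexadecimal digit 9 (binary 1001) in the mantissa, rounded in the last place to a, reflects the repeating binary expansion of 0.1.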
1.2.2.2 Standard Functions
To enable accurate computation of elementary functions such as sin, cos, and exp, IEEE 754 specifies that a "long" 80-bit register (with "guard digits") be available for intermediate computations. Furthermore, IEEE 754-2008, an official update to IEEE 754-1985, provides a list of functions it recommends be implemented, and specifies accuracy requirements (in terms of correct rounding), for those functions a programming language elects to implement.
REMARK 1.4 Alternative number systems, such as variable precision arithmetic, multiple precision arithmetic, rational arithmetic, and combinations of approximate and symbolic arithmetic have been investigated and implemented. These have various advantages over the traditional floating point arithmetic we have been discussing, but also have disadvantages, and usually require more time, more circuitry, or both. Eventually, with the advance of computer hardware and better understanding of these alternative systems, their use may become more ubiquitous. However, for the foreseeable future, traditional floating point number systems will be the primary tool in numerical computations.
1.3 Interval Computations
Interval computations are useful for two main purposes:
• to use floating point computations to compute mathematically rigorous bounds on an exact result (and hence to rigorously bound roundoff error);
• to use floating point computations to compute mathematically rigorous bounds on the ranges of functions over boxes.
In complicated traditional floating point algorithms, naive arrangement of interval computations usually gives bounds that are too wide to be of practical use. For this reason, interval computations have been ignored by many. However, used cleverly and where appropriate, interval computations are powerful, and provide rigor and validation when other techniques cannot.
Interval computations are based on interval arithmetic.
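1.3.1 Interval Arithmetic
In interval arithmetic, we define operations on intervals, which can be considered as ordered pairs of real numbers. We can think of each interval as representing the range of possible values of a quantity. The result of an operation is then an interval that represents the range of possible results of the operation, as the first argument ranges over all points in the first interval and the second argument ranges over all values in the second interval. To state this symbolically, let $\mathbf{x} = [\underline{x}, \overline{x}]$ and $\mathbf{y} = [\underline{y}, \overline{y}]$, and define the four elementary operations by
$$\mathbf{x} \odot \mathbf{y} = \{ x \odot y \mid x \in \mathbf{x} \text{ and } y \in \mathbf{y} \} \quad \text{for } \odot \in \{+, -, \times, \div\}. \quad (1.1)$$
Interval arithmetic's usefulness derives from the fact that the mathematical characterization in Equation (1.1) is equivalent to the following operational definitions.
$$\begin{aligned}
\mathbf{x} + \mathbf{y} &= [\underline{x} + \underline{y}, \; \overline{x} + \overline{y}], \\
\mathbf{x} - \mathbf{y} &= [\underline{x} - \overline{y}, \; \overline{x} - \underline{y}], \\
\mathbf{x} \times \mathbf{y} &= [\min\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\}, \; \max\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\}], \\
\frac{1}{\mathbf{x}} &= \left[\frac{1}{\overline{x}}, \frac{1}{\underline{x}}\right] \quad \text{if } \underline{x} > 0 \text{ or } \overline{x} < 0, \\
\mathbf{x} \div \mathbf{y} &= \mathbf{x} \times \frac{1}{\mathbf{y}}.
\end{aligned} \quad (1.2)$$
The ranges of the four elementary interval arithmetic operations are exactly the ranges of the corresponding real operations, but, if such operations are composed, bounds on the ranges of real functions can be obtained. For example, if
$$f(x) = (x + 1)(x - 1), \quad (1.3)$$
then
$$f([-2, 2]) = \big([-2, 2] + 1\big)\big([-2, 2] - 1\big) = [-1, 3][-3, 1] = [-9, 3],$$
which contains the exact range $[-1, 3]$.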
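The operational definitions (1.2) translate almost line by line into code. The following Python class is a minimal sketch of our own, without the outward rounding a rigorous implementation needs, so it is exact only when the end point computations happen to be exact, as they are in this example. The names `Interval`, `as_interval`, and `reciprocal` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        other = as_interval(other)
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        other = as_interval(other)
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        other = as_interval(other)
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

    def reciprocal(self):
        if not (self.lo > 0 or self.hi < 0):
            raise ZeroDivisionError("interval contains 0")
        return Interval(1 / self.hi, 1 / self.lo)

    def __truediv__(self, other):
        return self * as_interval(other).reciprocal()

def as_interval(value):
    """Treat a plain number c as the degenerate interval [c, c]."""
    return value if isinstance(value, Interval) else Interval(value, value)

def f(x):
    return (x + 1) * (x - 1)       # the expression in (1.3)

result = f(Interval(-2, 2))
print(result)
```

Evaluating `f` on `Interval(-2, 2)` reproduces the enclosure $[-9, 3]$ computed above.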
REMARK 1.5 In some definitions of interval arithmetic, division by intervals containing 0 is defined, consistent with (1.1). For example,
$$\frac{[1, 2]}{[-3, 4]} = \left[-\infty, -\frac{1}{3}\right] \cup \left[\frac{1}{4}, \infty\right] = \mathbb{R}^* \setminus \left(-\frac{1}{3}, \frac{1}{4}\right),$$
where $\mathbb{R}^*$ is the extended real number system,⁴ consisting of the real numbers with the two additional numbers $-\infty$ and $\infty$. This extended interval arithmetic⁵ was originally invented by William Kahan⁶ for computations with continued fractions, but has wider use than that. Although a closed system can be defined for the sets arising from this extended arithmetic, typically, the complements of intervals (i.e., the unions of two semi-infinite intervals) are immediately intersected with intervals, to obtain zero, one, or two intervals. Interval arithmetic can then proceed using (1.2).
⁴ also known as the two-point compactification of the real numbers
The power of interval arithmetic lies in its implementation on computers. In particular, outwardly rounded interval arithmetic allows rigorous enclosures for the ranges of operations and functions. This makes a qualitative difference in scientific computations, since the results are now intervals in which the exact result must lie. It also enables use of floating point computations for automated theorem proving.
Outward rounding can be implemented on any machine that has downward rounding and upward rounding, such as any machine that complies with the IEEE 754 standard. For example, take $\mathbf{x} + \mathbf{y} = [\underline{x} + \underline{y}, \overline{x} + \overline{y}]$. If $\underline{x} + \underline{y}$ is computed with downward rounding, and $\overline{x} + \overline{y}$ is computed with upward rounding, then the resulting interval $\mathbf{z} = [\underline{z}, \overline{z}]$ that is represented in the machine must contain the exact range of $x + y$ for $x \in \mathbf{x}$ and $y \in \mathbf{y}$. We call the expansion of the interval from rounding the lower end point down and the upper end point up roundout error.
Interval arithmetic is only subdistributive. That is, if $\mathbf{x}$, $\mathbf{y}$, and $\mathbf{z}$ are intervals, then
$$\mathbf{x}(\mathbf{y} + \mathbf{z}) \subseteq \mathbf{x}\mathbf{y} + \mathbf{x}\mathbf{z}, \quad \text{but } \mathbf{x}(\mathbf{y} + \mathbf{z}) \ne \mathbf{x}\mathbf{y} + \mathbf{x}\mathbf{z} \text{ in general.} \quad (1.4)$$
As a result, algebraic expressions that would be equivalent if real values are substituted for the variables are not equivalent if interval values are used. For example, instead of writing $(x - 1)(x + 1)$ for $f(x)$ in (1.3), suppose we write
$$f(x) = x^2 - 1, \quad (1.5)$$
and suppose we provide a routine that computes an enclosure for the range of $x^2$ that is the exact range to within roundoff error. Such a routine could be as follows.
ALGORITHM 1.1
(Computing an interval whose end points are machine numbers and which encloses the range of $x^2$.)
INPUT: $\mathbf{x} = [\underline{x}, \overline{x}]$.
OUTPUT: a machine-representable interval that contains the range of $x^2$ over $\mathbf{x}$.
IF $\underline{x} \ge 0$ THEN
RETURN $[\underline{x}^2, \overline{x}^2]$, where $\underline{x}^2$ is computed with downward rounding and $\overline{x}^2$ is computed with upward rounding.
ELSE IF $\overline{x} \le 0$ THEN
RETURN $[\overline{x}^2, \underline{x}^2]$, where $\overline{x}^2$ is computed with downward rounding and $\underline{x}^2$ is computed with upward rounding.
ELSE
1. Compute $\underline{x}^2$ and $\overline{x}^2$ with both downward and upward rounding; that is, compute $\underline{x}^2_l$ and $\underline{x}^2_u$ such that $\underline{x}^2_l$ and $\underline{x}^2_u$ are machine representable numbers and $\underline{x}^2 \in [\underline{x}^2_l, \underline{x}^2_u]$, and compute $\overline{x}^2_l$ and $\overline{x}^2_u$ such that $\overline{x}^2_l$ and $\overline{x}^2_u$ are machine representable numbers and $\overline{x}^2 \in [\overline{x}^2_l, \overline{x}^2_u]$.
2. RETURN $[0, \max\{\underline{x}^2_u, \overline{x}^2_u\}]$.
END IF
END ALGORITHM 1.1.
⁵ There are small differences in current definitions of extended interval arithmetic. For example, in some systems, $-\infty$ and $\infty$ are not considered numbers, but just descriptive symbols. In those systems, $[1, 2]/[-3, 4] = (-\infty, -1/3] \cup [1/4, \infty) = \mathbb{R} \setminus (-1/3, 1/4)$. See [31] for a theoretical analysis of extended arithmetic.
⁶ who also was a major contributor to the IEEE 754 standard
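Algorithm 1.1 might be sketched in Python as follows. Standard Python cannot switch the hardware rounding mode, so, as a stand-in of our own, each computed end point is moved one floating point number outward with `math.nextafter` (available from Python 3.9). This is slightly more pessimistic than true directed rounding but still encloses the exact range. The function name `interval_square` is hypothetical.

```python
import math

def interval_square(lo, hi):
    """Enclose {x*x : lo <= x <= hi}, following the three cases of Algorithm 1.1."""
    def down(v):
        # Stand-in for downward rounding; a square is never negative.
        return max(0.0, math.nextafter(v, -math.inf))

    def up(v):
        # Stand-in for upward rounding.
        return math.nextafter(v, math.inf)

    if lo >= 0:
        return down(lo * lo), up(hi * hi)
    if hi <= 0:
        return down(hi * hi), up(lo * lo)
    return 0.0, up(max(lo * lo, hi * hi))    # 0 lies inside the input interval

# f(x) = x^2 - 1 over [-2, 2], with the subtraction also rounded outward.
sq_lo, sq_hi = interval_square(-2.0, 2.0)
f_lo = math.nextafter(sq_lo - 1.0, -math.inf)
f_hi = math.nextafter(sq_hi - 1.0, math.inf)

print((sq_lo, sq_hi), (f_lo, f_hi))
```

The computed enclosure of $x^2 - 1$ over $[-2, 2]$ contains $[-1, 3]$ and exceeds it only by a few units in the last place.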
With Algorithm 1.1 and rewriting $f(x)$ from (1.3) as in (1.5), we obtain
$$f([-2, 2]) = [-2, 2]^2 - 1 = [0, 4] - 1 = [-1, 3],$$
which, in this case, is equal to the exact range of $f$ over $[-2, 2]$.
In fact, this illustrates a general principle: If each variable in the expression occurs only once, then interval arithmetic gives the exact range, to within roundout error. We state this formally as
THEOREM 1.8
(Fundamental theorem of interval arithmetic.) Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with interval arithmetic. The algebraic expression or computer program can contain the four elementary operations and operations such as $x^n$, $\sin(x)$, $\exp(x)$, and $\log(x)$, etc., as long as the interval values of these functions contain their range over the input intervals. Then
1. The interval value $\mathbf{f}(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ contains the range of $f$ over the interval vector (or box) $(\mathbf{x}_1, \ldots, \mathbf{x}_n)$.
2. If the single functions (the elementary operations and functions $x^n$, etc.) have interval values that represent their exact ranges, and if each variable $x_i$, $1 \le i \le n$, occurs only once in the expression for $f$, then the values of $\mathbf{f}$ obtained by interval arithmetic represent the exact ranges of $f$ over the input intervals.
If the expression for $f$ contains one or more variables more than once, then overestimation of the range can occur due to interval dependency. For example, when we evaluate our example function $f([-2, 2])$ according to (1.3), the first factor, $[-1, 3]$, is the exact range of $x + 1$ for $x \in [-2, 2]$, while the second factor, $[-3, 1]$, is the exact range of $x - 1$ for $x \in [-2, 2]$. Thus, $[-9, 3]$ is the exact range of $f(x_1, x_2) = (x_1 + 1)(x_2 - 1)$ for $x_1$ and $x_2$ independent, $x_1 \in [-2, 2]$, $x_2 \in [-2, 2]$.
We now present some definitions and theorems to clarify the practical consequences of interval dependency.
DEFINITION 1.6 An expression for $f(x_1, \ldots, x_n)$ which is written so that each variable occurs only once is called a single use expression, or SUE.
Fortunately, we do not need to transform every expression into a single use expression for interval computations to be of value. In particular, the interval dependency becomes less as the widths of the input intervals become smaller. The following formal definition will help us to describe this precisely.
DEFINITION 1.7 Suppose an interval evaluation $\mathbf{f}(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ gives $[a, b]$ as a result interval, but the exact range $\{f(x_1, \ldots, x_n), x_i \in \mathbf{x}_i, 1 \le i \le n\}$ is $[c, d] \subseteq [a, b]$. We define the excess width $E(f; \mathbf{x}_1, \ldots, \mathbf{x}_n)$ in the interval evaluation $\mathbf{f}(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ by $E(f; \mathbf{x}_1, \ldots, \mathbf{x}_n) = (c - a) + (b - d)$.
For example, the excess width in evaluating $f(x)$ represented as $(x + 1)(x - 1)$ over $\mathbf{x} = [-2, 2]$ is $(-1 - (-9)) + (3 - 3) = 8$. In general, we have
THEOREM 1.9
Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with interval arithmetic, as in Theorem 1.8, to obtain an interval enclosure $\mathbf{f}(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ to the range of $f$ for $x_i \in \mathbf{x}_i$, $1 \le i \le n$. Then, if $E(f; \mathbf{x}_1, \ldots, \mathbf{x}_n)$ is as in Definition 1.7, we have
$$E(f; \mathbf{x}_1, \ldots, \mathbf{x}_n) = O\left(\max_{1 \le i \le n} w(\mathbf{x}_i)\right),$$
where $w(\mathbf{x})$ denotes the width of the interval $\mathbf{x}$.
That is, the overestimation becomes less as the uncertainty in the arguments to the function becomes smaller.
Interval evaluations as in Theorem 1.9 are termed first-order interval extensions. It is not difficult to obtain second-order extensions, where required. (See Exercise below.)
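The first-order behavior in Theorem 1.9 can be observed on our running example. The sketch below evaluates $(x + 1)(x - 1)$ over $[-h, h]$ for decreasing $h$, using exact rational arithmetic so that no rounding interferes, and computes the excess width of Definition 1.7. The helper names and the choice of intervals centered at 0 are ours.

```python
from fractions import Fraction

def mul(x, y):
    """Interval product from (1.2); intervals are (lo, hi) pairs."""
    p = [x[0] * y[0], x[0] * y[1], x[1] * y[0], x[1] * y[1]]
    return min(p), max(p)

def excess_width(h):
    """Excess width of (x + 1)(x - 1) evaluated over [-h, h], for h > 0."""
    a, b = mul((-h + 1, h + 1), (-h - 1, h - 1))   # interval evaluation [a, b]
    c, d = Fraction(-1), h * h - 1                   # exact range of x^2 - 1 over [-h, h]
    return (c - a) + (b - d)

widths = [Fraction(1, 2 ** k) for k in range(1, 6)]
ratios = [excess_width(h) / (2 * h) for h in widths]   # E divided by w(x) = 2h

print([str(r) for r in ratios])
```

The ratio of excess width to interval width stays bounded (here it stays below 2) as $h$ decreases, so the excess width shrinks in proportion to the width. For $h = 2$ the function returns 8, the value computed after Definition 1.7.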
1.3.2 Application of Interval Arithmetic: Examples
We give one such example here.
Example 1.22
Using 4-digit decimal floating point arithmetic, compute an interval enclosure for the first two digits of $e$, and prove that these two digits are correct.
Solution: The fifth degree Taylor polynomial representation for $e$ is
$$e = 1 + 1 + \frac{1}{2!} + \frac{1}{3!} + \frac{1}{4!} + \frac{1}{5!} + \frac{1}{6!} e^{\xi}$$
for some $\xi \in [0, 1]$. If we assume we know $e < 3$ and we assume we know $e^x$ is an increasing function of $x$, then the error term is bounded by
$$\left| \frac{1}{6!} e^{\xi} \right| \le \frac{3}{6!} < 0.005,$$
so this fifth-degree polynomial representation should be adequate. We will evaluate each term with interval arithmetic, and we will replace $e^{\xi}$ with $[1, 3]$. We obtain the following computation:
[1.000, 1.000] + [1.000, 1.000] → [2.000, 2.000]
[1.000, 1.000]/[2.000, 2.000] → [0.5000, 0.5000]
[2.000, 2.000] + [0.5000, 0.5000] → [2.500, 2.500]
[1.000, 1.000]/[6.000, 6.000] → [0.1666, 0.1667]
[2.500, 2.500] + [0.1666, 0.1667] → [2.666, 2.667]
[1.000, 1.000]/[24.00, 24.00] → [0.04166, 0.04167]
[2.666, 2.667] + [0.04166, 0.04167] → [2.707, 2.709]
[1.000, 1.000]/[120.0, 120.0] → [0.008333, 0.008334]
[2.707, 2.709] + [0.008333, 0.008334] → [2.715, 2.718]
[1.000, 1.000]/[720.0, 720.0] → [0.001388, 0.001389]
[0.001388, 0.001389] × [1, 3] → [0.001388, 0.004167]
[2.715, 2.718] + [0.001388, 0.004167] → [2.716, 2.723]
Since we used outward rounding in these computations, this constitutes a mathematical proof that $e \in [2.716, 2.723]$.
Note:
1. These computations can be done automatically on a computer, as simply as evaluating the function in floating point arithmetic. We will explain some programming techniques for this in Chapter 6, Section 6.2.
2. The solution is illustrative. More sophisticated methods, such as argument reduction, would be used in practice to bound values of $e^x$ more accurately and with fewer operations.
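As a sketch of how the computation in Example 1.22 can be automated, the following Python code emulates outwardly rounded 4-digit decimal interval arithmetic with the `decimal` module, rounding lower end points toward $-\infty$ and upper end points toward $+\infty$. The helper functions are our own and handle only the positive intervals that occur in this example; this is not the programming technique of Chapter 6.

```python
import math
from decimal import Decimal, Context, ROUND_FLOOR, ROUND_CEILING

DOWN = Context(prec=4, rounding=ROUND_FLOOR)     # for lower end points
UP = Context(prec=4, rounding=ROUND_CEILING)     # for upper end points

def add(x, y):
    return DOWN.add(x[0], y[0]), UP.add(x[1], y[1])

def mul_pos(x, y):
    """Product of two intervals with positive end points."""
    return DOWN.multiply(x[0], y[0]), UP.multiply(x[1], y[1])

def div_pos(x, y):
    """Quotient of two intervals with positive end points."""
    return DOWN.divide(x[0], y[1]), UP.divide(x[1], y[0])

def point(n):
    """Degenerate interval [n, n]."""
    return Decimal(n), Decimal(n)

total = add(point(1), point(1))                          # 1 + 1
for k in range(2, 6):                                    # 1/2!, 1/3!, 1/4!, 1/5!
    total = add(total, div_pos(point(1), point(math.factorial(k))))

# Remainder term (1/6!) e^xi, with e^xi replaced by [1, 3].
remainder = mul_pos(div_pos(point(1), point(720)), (Decimal(1), Decimal(3)))
enclosure = add(total, remainder)

print(enclosure)
```

The intermediate intervals agree with the hand computation above, and the final enclosure is $[2.716, 2.723]$.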
Proofs of the theorems, as well as greater detail, appear in various texts on interval arithmetic. A good book on interval arithmetic is R. E. Moore's classic text [27], although numerous more recent monographs and reviews are available. A World Wide Web search on the term "interval computations" will lead to some of these.
A general introduction to interval computations is [26]. That work gives not only a complete introduction, with numerous examples and explanation of pitfalls, but also provides examples with intlab, a free matlab toolbox for interval computations, and reference material for intlab. If you have matlab available, we recommend intlab for the exercises in this book involving interval computations.
1.4 Programming Environments
Modern scientific computing (with floating point numbers) is usually done with high-level "imperative" (as opposed to "functional") programming languages. Common programming environments in use for general scientific computing today are Fortran (or FORmula TRANslation), C/C++, and matlab. Fortran is the original such language, with its origins in the late 1950s. There is a large body of high-quality, publicly available software in Fortran for common computations in numerical analysis and scientific computing. Such software can be found, for example, on NETLIB at
http://www.netlib.org/
Fortran has evolved over the years, becoming a modern, multi-faceted language with the Fortran 2003 standard. Throughout, the emphasis by both the standardization committee and suppliers of compilers for Fortran has been on features that simplify programming of solutions to large problems in numerical analysis and scientific computing, and on features that enable high performance, especially on computers that can process vectors and matrices efficiently.
The "C" language, originally developed in conjunction with the Unix operating system, was originally meant to be a higher-level language for designing and accessing the operating system, but has become more ubiquitous since then. C++, appearing in the late 1980s, was the first widely available language⁷ to allow the object-oriented programming paradigm. In recent years, computer science departments have favored teaching C++ over teaching Fortran, and Fortran has fallen out of favor in relative terms. However, Fortran is still favored in certain large-scale applications such as fluid dynamics (e.g. in weather prediction and similar simulations), and some courses are still offered in it in engineering schools. Even so, some people still think of Fortran as the now somewhat rudimentary language known as FORTRAN 77.
Reasonably high-quality compilers for both Fortran and C/C++ are available free of charge with Linux operating systems. Fortran 2003, largely implemented in these compilers, has a standardized interface to C, so functions written in C can be called from Fortran programs, and vice versa. These compilers include interactive graphical-user-interface-oriented debuggers, such as "insight," available with the Linux operating system. Commercially available compilation and debugging systems are also available under Windows.
The matlab system has become increasingly popular over the last two decades or so. matlab (or MATrix LABoratory) began in the early 1980s as a National Science Foundation project, written by Cleve Moler in FORTRAN 66, to provide an interactive environment for computing with matrices and vectors, but has since evolved to be both an interactive environment and a full-featured programming language. matlab is highly favored in courses such as this because of the ease of programming, debugging, and general use (such as graphing), and because of the numerous toolboxes, supplied by both Mathworks (Cleve Moler's company) and others, for many basic computing tasks and applications. The main drawback to use of matlab in all scientific computations is that the language is interpreted; that is, matlab translates each line of a program to machine language each time that it executes the line. This makes complicated programs that involve nested iterations much slower (a factor of 60 or more) than comparable programs written in a compiled language such as C or Fortran. However, functions compiled from Fortran or C/C++ can be called from matlab. A common strategy has been to initially develop algorithms in matlab, then translate all or part of the program to a compilable language, as necessary for efficiency.
One perceived disadvantage of matlab is that it is proprietary. Undesirable possible consequences are that it is not free, and there is no official guarantee that it will be available forever, unchanged. However, its use has become so widespread in recent years that these concerns do not seem to be major. Several projects, including "Octave" and "Scilab," have produced free products that partially support the matlab programming language. The most widely distributed of these, "Octave," is integrated into Linux systems. However, the object-oriented features of Octave are rudimentary compared to those of matlab, and some toolboxes, such as intlab (which we will mention later), will not function with Octave.
⁷ With others, including Fortran, to follow.
Alternative systems sometimes used for scientific computing are computer algebra systems. Perhaps the most common of these are Mathematica and Maple, while a free such system under development is "SAGE." These systems admit a different way of thinking about programming, termed functional programming, in which rules are defined and available at all times to automatically simplify expressions that are presented to the system. (In contrast, in imperative programming, a sequence of commands is executed one after the other.) Although these systems have become comprehensive, they are based in computations of a different character, rather than in the floating point computations and linear algebra involved in numerical analysis and scientific computing.
We will use matlab in this book to illustrate the concepts, techniques, and applications. With newer versions of matlab, a student can study how to use the system and make programs largely by using the matlab help system. The first place to turn will be the "Getting started" demos, which in newer versions are presented as videos. There are also many books devoted to use of matlab. Furthermore, we will be giving examples throughout this book.
matlab programs can be written as matlab scripts and matlab functions.
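Example 1.23
The matlab script we used to produce the table following Example 1.7 (on page 8) is
a = 2;
x = 2;
xold = x;
err_old = 1;
for k = 0:10
   k                          % display the iteration number
   x                          % display the current iterate
   err = x - sqrt(2);
   err                        % display the current error
   ratio = err/err_old^2      % error ratio for quadratic convergence
   err_old = err;
   x = x/2 + 1/x;             % Newton step for x^2 - 2 = 0
end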
Example 1.24
The matlab script we used to produce the table in Example 1.8 (on page 9) is
format long
a = 2;
x = 2;
xold = x;
err_old = 1;
for k = 0:25
   k                          % display the iteration number
   x                          % display the current iterate
   err = x - sqrt(2);
   err                        % display the current error
   ratio = err/err_old        % error ratio for linear convergence
   err_old = err;
   x = x - x^2/3.5 + 2/3.5;   % fixed point iteration step
end
An excellent alternative text book that focuses on matlab functions is Cleve Moler's Numerical Computing with Matlab [25]. An on-line version, along with "m" files, etc., is currently available at http://www.mathworks.com/moler/chapters.html.
1.5 Applications
The purpose of the methods and techniques in this book ultimately is to provide both accurate predictions and insight into practical problems. This includes understanding and predicting, and managing or controlling, the evolution of ecological systems and epidemics; designing and constructing durable but inexpensive bridges, buildings, roads, and water control structures; understanding chemical and physical processes; designing chemical plants and electronic components and systems; minimizing costs or maximizing delivery of products or services within companies and governments; etc. To achieve these goals, the numerical methods are a small part of the overall modeling process, which can be viewed as consisting of the following steps.
Identify the problem: This is the first step in translating an often vague situation into a mathematical problem to be solved. What questions must be answered, and how can they be quantified?
Assumptions: Which factors are to be ignored, and which are important? The real world is usually significantly more complicated than mathematical models of it, and simplifications must be made, because some factors are poorly understood, because there isn't enough data to determine some minor factors, or because it is not practical to accurately solve the resulting equations unless the model is simplified. For example, the theory of relativity and variations in the acceleration of gravity, due to the fact that the earth is not exactly round and due to the fact that the density differs from point to point on the surface of the earth, in principle will affect the trajectory of a baseball as it leaves the bat. However, such effects can be ignored when we write down a model of the trajectory of the baseball. On the other hand, we need to include such effects if we are measuring the change in distance between two satellites in a tandem orbit to detect the location of mineral deposits on the surface of the earth.
Construction: In this step, we actually translate the problem into mathematical language.
Analysis: We solve the mathematical problem. Here is where the numerical techniques in this book come into play. With more complicated models, there is an interplay between the previous three steps and this solution process: we may need to simplify the process to enable practical solution. Also, presentation of the result is important here, to maximize the usefulness of the results. In the early days of scientific computing, printouts of numbers were used, but increasingly, results are presented as two- and three-dimensional graphs and movies.
Interpretation: The numerical solution is compared to the original problem. If it does not make sense, go back and reformulate the assumptions.
Validation: Compare the model to real data. For example, in climate models, the model might be used to predict climate changes in past years, before it is used to predict future climate changes.
Note that there is an intrinsic error introduced in the modeling process (such as when certain phenomena are ignored) that is outside the scope of our study of numerical methods. Such error can only be measured indirectly, through the interpretation and validation steps. In the model solution process (the "analysis" step), errors are also introduced, due to roundoff error and the approximation process. We have seen that such error consists of approximation error and roundoff error. In a study of numerical methods and numerical analysis, we quantify and find bounds on such errors. Although this may not be the major source of error, it is important to know. Consequences of this type of error might be that a good model is rejected, that incorrect conclusions are deduced about the process being modeled, etc. The authors of this book have personal experience with these events.
Errors in the modeling process can sometimes be quantified in the solution process. If the model depends on parameters that are not known precisely, but bounds on those parameters are known, knowledge of these bounds can sometimes be incorporated into the mathematical equations, and the set of possible solutions can sometimes be computed or bounded. One tool that sometimes works is interval arithmetic. Other tools, less mathematically definite but applicable in different situations, are statistical methods and computing solutions to the model for many different values of the parameter.
Throughout this book, we introduce applications from many areas.
Example 1.25
The formula for the net capacitance when two capacitors of values x and y are connected in series is
$$z = \frac{xy}{x + y}.$$
Suppose the measured values of x and y are x = 1 and y = 2, respectively. Estimate the range of possible values of z, given that the true values of x and y are known to be within ±10% of the measured value.
In this example, the identification, assumptions, and construction have already been done. (It is well known how capacitances in a linear electrical circuit behave.) We are asked to analyze the error in the output of the computation, due to errors in the data. We may proceed using interval arithmetic, relying on the accuracy assumptions for the measured values. In particular, these assumptions imply that x ∈ [0.9, 1.1] and y ∈ [1.8, 2.2]. We will plug these intervals into the expression for z, but we first use Theorem 1.8, part (2) as a guide to rewrite the expression for z so x and y only occur once. (We do this so we obtain sharp bounds on the range, without overestimation.) Dividing the numerator and denominator for z by xy, we obtain
$$z = \frac{1}{\frac{1}{x} + \frac{1}{y}}.$$
We use the intlab toolbox⁸ for matlab to evaluate z. We have the following dialog in matlab's command window.
>> intvalinit('DisplayInfsup')
===> Default display of intervals by infimum/supremum
>> x = intval('[0.9,1.1]')
intval x =
[ 0.8999, 1.1001]
>> y = intval('[1.8,2.2]')
intval y =
[ 1.7999, 2.2001]
>> z = 1/(1/x + 1/y)
intval z =
[ 0.5999, 0.7334]
>> format long
>> z
intval z =
[ 0.59999999999999, 0.73333333333334]
>>
Thus, the capacitance must lie between 0.5999 and 0.7334.
Note that x and y are input as strings. This is to assure that roundoff errors in converting the decimal expressions 0.9, 1.1, 1.8, and 2.2 into internal binary format are taken into account. See [26] for more examples of the use of intlab.
⁸ If one has matlab, intlab is available free of charge for non-commercial use from http://www.ti3.tu-harburg.de/~rump/intlab/
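The benefit of rewriting z so that x and y each occur once can be checked without intlab. The sketch below is only an illustration under stated assumptions: it uses Python's exact rational type `fractions.Fraction`, so no outward rounding is needed, whereas intlab works with outwardly rounded floating point numbers. The helper functions are hypothetical names and assume intervals of positive numbers.

```python
from fractions import Fraction

def reciprocal(a):
    """1 / [lo, hi] for an interval of positive numbers."""
    return (1 / a[1], 1 / a[0])

def add(a, b):
    return (a[0] + b[0], a[1] + b[1])

def mul_pos(a, b):
    return (a[0] * b[0], a[1] * b[1])

def div_pos(a, b):
    return (a[0] / b[1], a[1] / b[0])

x = (Fraction(9, 10), Fraction(11, 10))     # [0.9, 1.1]
y = (Fraction(18, 10), Fraction(22, 10))    # [1.8, 2.2]

# x and y each occur once: the bounds are the exact range of z.
sharp = reciprocal(add(reciprocal(x), reciprocal(y)))

# x and y each occur twice: the result still encloses z, but overestimates.
naive = div_pos(mul_pos(x, y), add(x, y))

print("1/(1/x + 1/y):", [float(v) for v in sharp])
print("xy/(x + y):   ", [float(v) for v in naive])
```

The single-occurrence form gives exactly [3/5, 11/15], matching the intlab result up to its outward rounding, while the original form xy/(x + y) gives the wider interval [27/55, 121/135], roughly [0.491, 0.896].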
1.6 Exercises
1. Write down a polynomial p(x) such that |S(x) − p(x)| ≤ 10^{-10} for −0.2 ≤ x ≤ 0.2, where
$$S(x) = \begin{cases} \dfrac{\sin(x)}{x} & \text{if } x \ne 0, \\ 1 & \text{if } x = 0. \end{cases}$$
Note: sinc(x) = S(πx) = sin(πx)/(πx) is the "sinc" function (well-known in signal processing, etc.).
(a) Show that your polynomial p satisfies the condition |sinc(x) − p(x)| ≤ 10^{-10} for x ∈ [−0.2, 0.2].
Hint: You can obtain polynomial approximations with error terms for sinc(x) by writing down Taylor polynomials and corresponding error terms for sin(x), then dividing these by x. This can be easier than trying to differentiate sinc(x). For the proof part, you can use, for example, the Taylor polynomial remainder formula or the alternating series test, and you can use interval arithmetic to obtain bounds.
(b) Plot your polynomial approximation and sinc(x) on the same graph,
(i) over the interval [−0.2, 0.2],
(ii) over the interval [−3, 3],
(iii) over the interval [−10, 10].
2. Suppose f has a continuous third derivative. Show that
$$\left| \frac{f(x+h) - f(x-h)}{2h} - f'(x) \right| = O(h^2).$$
3. Suppose f has a continuous fourth derivative. Show that
$$\left| \frac{f(x+h) - 2f(x) + f(x-h)}{h^2} - f''(x) \right| = O(h^2).$$
4. Let a = 0.41, b = 0.36, and c = 0.7. Assuming a 2-digit decimal computer arithmetic with rounding, show that
$$\frac{a - b}{c} \ne \frac{a}{c} - \frac{b}{c}$$
when using this arithmetic.
5. Write down a formula relating the unit roundoff δ of Definition 1.3 (page 14) and the machine epsilon ε_m defined on page 20.
6. Store and run the following matlab script. What are your results? What does the script compute? Can you say anything about the computer arithmetic underlying matlab?
eps = 1;
x = 1+eps;
while(x~=1)
   eps = eps/2;
   x = 1+eps;
end
eps = eps+(2*eps)^2;
y = 1+eps;
y-1
7. Suppose, for illustration, we have a system with base β = 10, t = 3 decimal digits in the mantissa, and L = −9, U = 9 for the exponent. For example, 0.123 × 10^4, that is, 1230, is a machine number in this system. Suppose also that "round to nearest" is used in this system.
(a) What is HUGE for this system?
(b) What is TINY for this system?
(c) What is the machine epsilon ε_m for this system?
(d) Let f(x) = sin(x) + 1.
i. Write down fl(f(0)) and fl(f(0.0008)) in normalized format for this toy system.
ii. Compute fl( fl(f(0.0008)) − fl(f(0)) ). On the other hand, what is the nearest machine number to the exact value of f(0.0008) − f(0)?
iii. Compute fl( fl(f(0.0008)) − fl(f(0)) ) / fl(0.0008). Compare this to the nearest machine number to the exact value of (f(0.0008) − f(0))/0.0008 and to f'(0).
8. Let f(x) = (ln(x + 1) − ln(x))/2.
(a) Use four-digit decimal arithmetic with rounding to evaluate f(100,000).
(b) Use the Mean Value Theorem to approximate f(x) in a form that avoids the loss of significant digits. Use this form to evaluate f(x) for x = 100,000 once again.
(c) Compare the relative errors for the answers obtained in (a) and (b).
9. Compute the condition number of f(x) = e^{\sqrt{x^2 − 1}}, x > 1, and discuss any possible ill-conditioning.
10. Let f(x) = (sin(x))² + x². Use interval arithmetic to prove that there are no solutions to f(x) = 0 for x ∈ [−1, −0.8].
Example 1.9
By the mean value theorem for integrals (Theorem 1.2, as in Example 1.6 on page 7), if f ∈ C²[a, b], then
$$f'(x) = \frac{f(x+h) - f(x)}{h} - \frac{1}{h} \int_x^{x+h} f''(t)(x + h - t)\,dt$$
and
$$\left| \frac{1}{h} \int_x^{x+h} f''(t)(x + h - t)\,dt \right| \le ch.$$
Thus, f'(x) ≈ (f(x + h) − f(x))/h, and the error is O(h). We will call this the method error or truncation error, as opposed to roundoff errors due to using machine approximations.
Now consider f(x) = ln x, and approximate f'(3) ≈ (ln(3 + h) − ln 3)/h for h small, using a calculator having 11 digits. The following results were obtained.
h         (ln(3+h) − ln(3))/h    Error = 1/3 − (ln(3+h) − ln(3))/h = O(h)
10^-1     0.3278982              5.44 × 10^-3
10^-2     0.332779               5.54 × 10^-4
10^-3     0.3332778              5.55 × 10^-5
10^-4     0.333328               5.33 × 10^-6
10^-5     0.333330               3.33 × 10^-6
10^-6     0.333300               3.33 × 10^-5
10^-7     0.333                  3.33 × 10^-4
10^-8     0.33                   3.33 × 10^-3
10^-9     0.3                    3.33 × 10^-2
10^-10    0.0                    3.33 × 10^-1
One sees that, in the first four steps, the error decreases by a factor of 10 as h is decreased by a factor of 10. (That is, the method error dominates.) However, starting with h = 0.00001, the error increases. (The error due to a finite number of digits, i.e., roundoff error, dominates.)
There are two possible ways to reduce rounding error:
1. The method error can be reduced by using a more accurate method. This allows larger h to be used, thus avoiding roundoff error. Consider
$$f'(x) = \frac{f(x+h) - f(x-h)}{2h} + \text{error}, \quad \text{where error is } O(h^2).$$
h        (ln(3+h) − ln(3−h))/(2h)    error
0.1      0.3334568                   1.24 × 10^-4
0.01     0.3333345                   1.23 × 10^-6
0.001    0.3333333                   1.91 × 10^-8
The error decreases by a factor of 100 as h is decreased by a factor of 10.
2. Rounding error can be reduced by using more digits of accuracy, such as using double precision (or multiple precision) arithmetic.
To fully understand and avoid roundoff error, we should study some details of how computers and calculators represent and work with approximate numbers.
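The interplay between method error and roundoff error in Example 1.9 can be reproduced on any machine. The following is a minimal sketch in Python (chosen for illustration; the function names are not from the text). It uses IEEE double precision rather than an 11-digit calculator, so the value of h at which roundoff starts to dominate differs from the tables above, but the pattern is the same.

```python
import math

def forward_diff(f, x, h):
    """One-sided difference quotient; method error is O(h)."""
    return (f(x + h) - f(x)) / h

def central_diff(f, x, h):
    """Central difference quotient; method error is O(h^2)."""
    return (f(x + h) - f(x - h)) / (2 * h)

X0 = 3.0
EXACT = 1.0 / X0        # derivative of ln(x) at x = 3

def fwd_err(h):
    return abs(EXACT - forward_diff(math.log, X0, h))

def cen_err(h):
    return abs(EXACT - central_diff(math.log, X0, h))

print(f"{'h':>8} {'forward error':>15} {'central error':>15}")
for k in range(1, 14):
    h = 10.0 ** (-k)
    print(f"{h:8.0e} {fwd_err(h):15.3e} {cen_err(h):15.3e}")
```

For the larger values of h the printed errors follow the O(h) and O(h²) rates from the tables; for very small h both columns grow again because subtracting nearly equal logarithms loses most of the significant digits.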
1.2.1 Floating Point Arithmetic and Rounding Error
Let β be a positive integer, the base of the computer system. (Usually, β = 2 (binary) or β = 16 (hexadecimal).) Suppose a number x has the exact base representation
$$x = (\pm 0.\alpha_1 \alpha_2 \alpha_3 \cdots \alpha_t \alpha_{t+1} \cdots)_\beta\, \beta^m = \pm q\, \beta^m,$$
where q is the mantissa, β is the base, m is the exponent, 1 ≤ α_1 ≤ β − 1, and 0 ≤ α_i ≤ β − 1 for i > 1.
On a computer, we are restricted to a finite set of floating-point numbers F = F(β, t, L, U) of the form x* = (±0.a_1 a_2 ··· a_t)_β β^m, where 1 ≤ a_1 ≤ β − 1, 0 ≤ a_i ≤ β − 1 for 2 ≤ i ≤ t, L ≤ m ≤ U, and t is the number of digits. (In most floating point systems, L is about −64 to −1000 and U is about 64 to 1000.)
Example 1.10
(binary) β = 2:
$$x^* = (0.1011)_2\, 2^3 = \left( 1 \times \frac{1}{2} + 0 \times \frac{1}{4} + 1 \times \frac{1}{8} + 1 \times \frac{1}{16} \right) \times 8 = \frac{11}{2} = 5.5 \text{ (decimal)}.$$
REMARK 1.1 Most numbers cannot be exactly represented on a computer. Consider x = 10.1 = 1010.0001 1001 1001 ... (β = 2). If L = −127, U = 127, t = 24, and β = 2, then x ≈ x* = (0.10100001 1001 1001 1001 1001)_2 2^4.
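The 24-bit mantissa quoted in Remark 1.1 can be checked with exact rational arithmetic. The sketch below is an illustration in Python, not part of the original text; it assumes chopping (the remark does not say which rounding is used) and the variable names are arbitrary.

```python
import math
from fractions import Fraction

x = Fraction(101, 10)     # the real number 10.1, held exactly
t, m = 24, 4              # x = (0.a1 a2 ... a24)_2 * 2**4

# Shift the t mantissa bits to the left of the binary point, then chop.
mantissa = math.floor(x * 2 ** (t - m))
bits = format(mantissa, "b")
x_star = Fraction(mantissa, 2 ** (t - m))

print("mantissa bits:", bits)
print("x* =", float(x_star), " x - x* =", float(x - x_star))

# The same effect in IEEE double precision: the stored value is not 10.1.
print("double nearest to 10.1 equals 101/10:", Fraction(10.1) == x)
```

The bit string is 1010 0001 1001 1001 1001 1001, as in the remark, and x − x* is positive but smaller than 2^(m−t) = 2^(−20).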
Question: Given a real number x, how do we define a floating point number fl(x) in F such that fl(x) is close to x?
On modern machines, one of the following four ways is used to approximate a real number x by a machine-representable number fl(x).
round down: fl(x) = x↓, the nearest machine representable number to the real number x that is less than or equal to x.
round up: fl(x) = x↑, the nearest machine number to the real number x that is greater than or equal to x.
round to nearest: fl(x) is the nearest machine number to the real number x.
round to zero, or "chopping": fl(x) is the nearest machine number to the real number x that is closer to 0 than x. The term "chopping" is because we simply "chop" the expansion of the real number, that is, we simply ignore the digits in the expansion of x beyond the t-th one.
The default on modern systems is usually round to nearest, although chopping is faster or requires less circuitry. Round down and round up may be used, but with care, to produce results from a string of computations that are guaranteed to be less than or greater than the exact result.
Example 1.11
β = 10, t = 5, x = 0.12345666··· × 10^7. Then
fl(x) = 0.12345 × 10^7 (chopping),
fl(x) = 0.12346 × 10^7 (rounded to nearest).
(In this case, round down corresponds to chopping and round up corresponds to round to nearest.)
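The four rounding modes can be tried on the number from Example 1.11. This is a minimal sketch using Python's standard `decimal` module, chosen because it supports decimal arithmetic with a selectable number of digits and rounding mode. Two assumptions are made: x is truncated to 0.12345666 × 10^7, and ties in "round to nearest" go to the even digit, which the text does not specify.

```python
from decimal import (Context, Decimal, ROUND_CEILING, ROUND_DOWN,
                     ROUND_FLOOR, ROUND_HALF_EVEN)

MODES = {
    "round down": ROUND_FLOOR,
    "round up": ROUND_CEILING,
    "round to nearest": ROUND_HALF_EVEN,
    "chopping": ROUND_DOWN,
}

def fl_all(x, t=5):
    """Round x to t significant decimal digits under each mode."""
    return {name: Context(prec=t, rounding=mode).plus(x)
            for name, mode in MODES.items()}

x = Decimal("1234566.6")        # 0.12345666 x 10^7
positive = fl_all(x)
negative = fl_all(-x)

for name in MODES:
    print(f"{name:17s} fl(x) = {positive[name]}   fl(-x) = {negative[name]}")
```

For the positive x of the example, chopping agrees with round down; for −x, chopping agrees with round up instead, since chopping always moves toward zero.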
See Figure 1.2 for an example with β = 10 and t = 1. In that figure, the exhibited floating point numbers are (0.1) × 10^1, (0.2) × 10^1, ..., (0.9) × 10^1, 0.1 × 10^2.
[Figure: number line of successive floating point numbers from β^{m−1} = 1 to β^m = 10^1, with spacing β^{m−t} = 10^0 = 1.]
FIGURE 1.2: An example floating point system: β = 10, t = 1, and m = 1.
Example 1.12
Let a = 0.410, b = 0.000135, and c = 0.000431. Assuming 3-digit decimal computer arithmetic with rounding to nearest, does a + (b + c) = (a + b) + c when using this arithmetic?
Following the "rounding to nearest" definition of fl, we emulate the operations a machine would do, as follows:
a ← 0.410 × 10^0, b ← 0.135 × 10^-3, c ← 0.431 × 10^-3,
and
fl(b + c) = fl(0.135 × 10^-3 + 0.431 × 10^-3)
          = fl(0.566 × 10^-3)
          = 0.566 × 10^-3,
so
fl(a + 0.566 × 10^-3) = fl(0.410 × 10^0 + 0.566 × 10^-3)
                      = fl(0.410 × 10^0 + 0.000566 × 10^0)
                      = fl(0.410566 × 10^0)
                      = 0.411 × 10^0.
On the other hand,
fl(a + b) = fl(0.410 × 10^0 + 0.135 × 10^-3)
          = fl(0.410000 × 10^0 + 0.000135 × 10^0)
          = fl(0.410135 × 10^0)
          = 0.410 × 10^0,
so
fl(0.410 × 10^0 + c) = fl(0.410 × 10^0 + 0.431 × 10^-3)
                     = fl(0.410 × 10^0 + 0.000431 × 10^0)
                     = fl(0.410431 × 10^0)
                     = 0.410 × 10^0 ≠ 0.411 × 10^0.
Thus, the associative law of addition does not hold for floating point arithmetic with "round to nearest." Furthermore, this illustrates that accuracy is improved if numbers of like magnitude are added first in a sum.
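Example 1.12 can be replayed in a few lines. The sketch below uses Python's `decimal` module with a 3-digit context to stand in for the toy machine; ties-to-even is assumed for "round to nearest," although no ties occur in this computation.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# 3-digit decimal arithmetic with round to nearest.
fl = Context(prec=3, rounding=ROUND_HALF_EVEN)

a = Decimal("0.410")
b = Decimal("0.000135")
c = Decimal("0.000431")

small_first = fl.add(a, fl.add(b, c))    # a + (b + c)
large_first = fl.add(fl.add(a, b), c)    # (a + b) + c

print("a + (b + c) =", small_first)
print("(a + b) + c =", large_first)
print("exact sum   =", a + b + c)
```

Adding the two small numbers first gives 0.411, the correctly rounded value of the exact sum 0.410566, whereas adding them to a one at a time gives 0.410.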
The following error bound is useful in some analyses.
THEOREM 1.5
$$|x - fl(x)| \le \frac{1}{2} |x|\, \beta^{1-t}\, p,$$
where p = 1 for rounding and p = 2 for chopping.
DEFINITION 1.3 δ = (p/2) β^{1−t} is called the unit roundoff error.
Let ε = (fl(x) − x)/x. Then fl(x) = (1 + ε)x, where |ε| ≤ δ. With this, we have the following.
THEOREM 1.6
Let ⊙ denote the operation +, −, ×, or ÷, and let x and y be machine numbers. Then
$$fl(x \odot y) = (x \odot y)(1 + \epsilon), \quad \text{where } |\epsilon| \le \delta = \frac{p}{2}\, \beta^{1-t}.$$
Roundoff error that accumulates as a result of a sequence of arithmetic operations can be analyzed using this theorem. Such an analysis is called forward error analysis.
Because of the properties of floating point arithmetic, it is unreasonable to demand strict tolerances when the exact result is too large.
Example 1.13
Suppose β = 10 and t = 3 (3-digit decimal arithmetic), and suppose we wish to compute 10^4 π with a computed value x such that |10^4 π − x| < 10^-2. The closest floating point number in our system to 10^4 π is x = 0.314 × 10^5 = 31400. However, |10^4 π − x| = 15.926.... Hence, it is impossible to find a number x in the system with |10^4 π − x| < 10^-2.
The error |10^4 π − x| in this example is called the absolute error in approximating 10^4 π. We see that absolute error is not an appropriate measure of error when using floating point arithmetic. For this reason, we use relative error:
DEFINITION 1.4 Let x* be an approximation to x. Then |x − x*| is called the absolute error, and |(x − x*)/x| is called the relative error.
For example,
$$\left| \frac{x - fl(x)}{x} \right| \le \delta = \frac{p}{2}\, \beta^{1-t} \quad \text{(unit roundoff error)}.$$
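The contrast between absolute and relative error in Example 1.13 can be verified numerically. The following sketch is an illustration only: it represents 10^4 π by a 17-digit decimal string (an approximation, labeled as such in the code) and uses exact fractions for the error computations so that the check itself introduces no rounding.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN
from fractions import Fraction

beta, t, p = 10, 3, 1                                  # 3-digit decimal, rounding
delta = Fraction(p, 2) * Fraction(beta) ** (1 - t)     # unit roundoff, 0.005

x = Decimal("31415.926535897932")    # 10^4 * pi to 17 digits (approximation)
fl_x = Context(prec=t, rounding=ROUND_HALF_EVEN).plus(x)

abs_err = abs(Fraction(x) - Fraction(fl_x))
rel_err = abs_err / abs(Fraction(x))

print("fl(x)          =", fl_x)
print("absolute error =", float(abs_err))
print("relative error =", float(rel_err), "<= delta =", float(delta))
```

The absolute error is about 15.9, far above the requested 10^-2, while the relative error, about 5.1 × 10^-4, is well within the unit roundoff δ = 0.005.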
1.2.1.1 Expression Evaluation and Condition Numbers
We now examine some common situations in which roundoff error can become large, and explain how to avoid many of these situations.
Example 1.14
β = 10, t = 4, p = 1. (Thus, δ = (1/2) × 10^-3 = 0.0005.) Let x = 0.5795 × 10^5, y = 0.6399 × 10^5. Then
fl(x + y) = 0.1219 × 10^6 = (x + y)(1 + ε_1), ε_1 ≈ −3.28 × 10^-4, |ε_1| < δ, and
fl(xy) = 0.3708 × 10^10 = (xy)(1 + ε_2), ε_2 ≈ −5.95 × 10^-5, |ε_2| < δ.
(Note: x + y = 0.12194 × 10^6, xy = 0.37082205 × 10^10.)
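The two relative errors in Example 1.14 can be recomputed directly. This sketch uses a 4-digit `decimal` context for the machine operations and Python's default 28-digit decimal arithmetic as the "exact" reference; ties-to-even is an assumption that does not affect these particular values.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

fl = Context(prec=4, rounding=ROUND_HALF_EVEN)   # beta = 10, t = 4, p = 1
delta = Decimal("0.0005")                        # unit roundoff (1/2) * 10^(1-4)

x = Decimal("57950")     # 0.5795 x 10^5
y = Decimal("63990")     # 0.6399 x 10^5

fl_sum = fl.add(x, y)
fl_prod = fl.multiply(x, y)

# Relative errors against the exact results (default 28-digit context).
eps1 = (fl_sum - (x + y)) / (x + y)
eps2 = (fl_prod - (x * y)) / (x * y)

print("fl(x + y) =", fl_sum, " eps1 =", eps1)
print("fl(x y)   =", fl_prod, " eps2 =", eps2)
```

Both |ε_1| and |ε_2| come out below δ, as Theorem 1.6 requires.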
Example 1.15
Suppose β = 10 and t = 4 (4-digit arithmetic), and suppose x_1 = 10000 and x_2 = x_3 = ··· = x_1001 = 1. Then
fl(x_1 + x_2) = 10000,
fl(x_1 + x_2 + x_3) = 10000,
...
$$fl\left( \sum_{i=1}^{1001} x_i \right) = 10000$$
when we sum forward from x_1. But going backwards,
fl(x_1001 + x_1000) = 2,
fl(x_1001 + x_1000 + x_999) = 3,
...
$$fl\left( \sum_{i=1001}^{1} x_i \right) = 11000,$$
which is the correct sum.
This example illustrates the point that large relative errors occur when a large number of small numbers is added to a large number, or when a very large number of small, almost-equal numbers is added. To avoid such large relative errors, one can sum from the smallest number to the largest number. However, this will not work if the numbers are all approximately equal. In such cases, one possibility is to group the numbers into sets of two adjacent numbers, summing two almost equal numbers together. One then groups those results into sets of two and sums these together, continuing until the total sum is reached. In this scheme, two almost equal numbers are always being summed, and the large relative error from repeatedly summing a small number to a large number is avoided.
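Both remedies described above can be tried in emulated 4-digit arithmetic. The sketch below is illustrative: `fl_sum` and `pairwise_sum` are hypothetical helper names, the pairwise scheme is implemented by recursive halving (one of several ways to group the terms into pairs), and the second data set of 20000 ones is an added test case, not from the text.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

fl = Context(prec=4, rounding=ROUND_HALF_EVEN)   # 4-digit decimal arithmetic

def fl_sum(values):
    """Left-to-right summation, rounding after every addition."""
    total = Decimal(0)
    for v in values:
        total = fl.add(total, v)
    return total

def pairwise_sum(values):
    """Sum two halves separately, then add the two partial sums."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return fl.add(pairwise_sum(values[:mid]), pairwise_sum(values[mid:]))

# Example 1.15: one large number followed by 1000 ones.
xs = [Decimal(10000)] + [Decimal(1)] * 1000
forward = fl_sum(xs)
backward = fl_sum(reversed(xs))

# Added case: many equal numbers, where ordering alone cannot help.
ones = [Decimal(1)] * 20000
running = fl_sum(ones)
paired = pairwise_sum(ones)

print("forward:", forward, " backward:", backward)
print("running sum of 20000 ones:", running, " pairwise:", paired)
```

Forward summation stays at 10000 while backward summation reaches 11000; for the 20000 ones the running sum stalls at 10000, whereas the pairwise scheme returns 20000.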
Example 1.16
x_1 = 15.314768, x_2 = 15.314899, β = 10, t = 6 (6-digit decimal accuracy). Then x_2 − x_1 ≈ fl(x_2) − fl(x_1) = 15.3149 − 15.3148 = 0.0001. Thus,
$$\left| \frac{x_2 - x_1 - (fl(x_2) - fl(x_1))}{x_2 - x_1} \right| = \frac{0.000131 - 0.0001}{0.000131} = 0.237 = 23.7\% \text{ relative error}.$$
This example illustrates that large relative errors can occur when two nearly equal numbers are subtracted on a computer. Sometimes, an algorithm can be modified to reduce rounding error occurring from this source, as the following example illustrates.
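The cancellation in Example 1.16 is easy to reproduce. The following sketch emulates the 6-digit machine with Python's `decimal` module (ties-to-even assumed for rounding) and compares against the exact difference.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

fl = Context(prec=6, rounding=ROUND_HALF_EVEN)   # 6-digit decimal arithmetic

x1 = Decimal("15.314768")
x2 = Decimal("15.314899")

fl_x1 = fl.plus(x1)                     # stored value of x1
fl_x2 = fl.plus(x2)                     # stored value of x2
computed = fl.subtract(fl_x2, fl_x1)    # the subtraction itself is exact here

exact = x2 - x1
rel_err = abs(exact - computed) / exact

print("fl(x1) =", fl_x1, " fl(x2) =", fl_x2)
print("computed difference =", computed, " exact difference =", exact)
print("relative error =", rel_err)
```

Note that the subtraction introduces no new rounding error; it only exposes the rounding errors already present in fl(x_1) and fl(x_2), which are tiny relative to x_1 and x_2 but large relative to their difference.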
Example 1.17
Consider finding the roots of ax² + bx + c = 0, where b² is large compared with |4ac|. The most common formula for the roots is
$$x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.$$
Consider x² + 100x + 1 = 0, β = 10, t = 4, p = 2, and 4-digit chopped arithmetic. Then
$$x_1 = \frac{-100 + \sqrt{9996}}{2}, \qquad x_2 = \frac{-100 - \sqrt{9996}}{2},$$
but √9996 ≈ 99.97 (4-digit arithmetic, chopped). Thus,
$$x_1 \approx \frac{-100 + 99.97}{2}, \qquad x_2 \approx \frac{-100 - 99.97}{2}.$$
Hence, x_1 ≈ −0.015, x_2 ≈ −99.98, but x_1 = −0.010001 and x_2 = −99.989999, so the relative errors in x_1 and x_2 are 50% and 0.01%, respectively.
Let's change the algorithm. Assume b ≥ 0 (can always make b ≥ 0). Then
$$x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \cdot \frac{-b - \sqrt{b^2 - 4ac}}{-b - \sqrt{b^2 - 4ac}} = \frac{4ac}{2a\left(-b - \sqrt{b^2 - 4ac}\right)} = \frac{-2c}{b + \sqrt{b^2 - 4ac}},$$
and
$$x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a} \quad \text{(the same as before)}.$$
Then, for the above values,
$$x_1 = \frac{-2(1)}{100 + \sqrt{9996}} \approx \frac{-2}{100 + 99.97} = -0.0100.$$
Now, the relative error in x_1 is also 0.01%.
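The two formulas for x_1 in Example 1.17 can be compared in emulated 4-digit chopped arithmetic. This is a sketch under stated assumptions: every operation is chopped to 4 digits via a `decimal` context with `ROUND_DOWN`, the square root is taken at high precision and then chopped (so that it is chopped rather than rounded), and only the small root x_1 is computed, since chopping every intermediate result for x_2 would not reproduce the text's hand computation digit for digit.

```python
from decimal import Context, Decimal, ROUND_DOWN

chop4 = Context(prec=4, rounding=ROUND_DOWN)   # 4-digit chopped arithmetic
wide = Context(prec=30)                        # high-precision reference

a, b, c = Decimal(1), Decimal(100), Decimal(1)

disc = chop4.subtract(chop4.multiply(b, b),
                      chop4.multiply(chop4.multiply(Decimal(4), a), c))
root = chop4.plus(wide.sqrt(disc))             # wide sqrt, then chop: 99.97

# Textbook formula: subtracts two nearly equal numbers.
x1_naive = chop4.divide(chop4.add(-b, root), chop4.multiply(Decimal(2), a))

# Rearranged formula: -2c / (b + sqrt(b^2 - 4ac)), no cancellation.
x1_stable = chop4.divide(chop4.multiply(Decimal(-2), c), chop4.add(b, root))

x1_true = wide.divide(wide.add(-b, wide.sqrt(disc)), Decimal(2))
rel_naive = abs((x1_naive - x1_true) / x1_true)
rel_stable = abs((x1_stable - x1_true) / x1_true)

print("naive :", x1_naive, " relative error:", float(rel_naive))
print("stable:", x1_stable, " relative error:", float(rel_stable))
```

The naive formula returns −0.015 with a relative error of about 50%, while the rearranged formula returns −0.01000 with a relative error of about 0.01%.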
Let us now consider error in function evaluation. Consider a single-valued function f(x), and let x* = fl(x) be the floating point approximation of x. Therefore, the machine evaluates f(x*) = f(fl(x)), which is an approximate value of f(x) at x = x*. Then the perturbation in f(x) for small perturbations in x can be computed via Taylor's formula. This is illustrated in the next theorem.
THEOREM 1.7 The relative error in functional evaluation is
$$\left| \frac{f(x) - f(x^*)}{f(x)} \right| \approx \left| \frac{x f'(x)}{f(x)} \right| \left| \frac{x - x^*}{x} \right|.$$
PROOF The linear Taylor approximation of f(x*) about x, for small values of |x − x*|, is given by f(x*) ≈ f(x) + f'(x)(x* − x). Rearranging the terms immediately yields the result.
This leads us to the following definition.
DEFINITION 1.5 The condition number of a function f(x) is
$$\kappa_f(x) = \left| \frac{x f'(x)}{f(x)} \right|.$$
The condition number describes how large the relative error in function evaluation is with respect to the relative error in the machine representation of x. In other words, κ_f(x) is a measure of the degree of sensitivity of the function at x.
Example 1.18
Let f(x) = √x. The condition number of f(x) about x is
$$\kappa_f(x) = \left| \frac{x \cdot \frac{1}{2\sqrt{x}}}{\sqrt{x}} \right| = \frac{1}{2}.$$
This suggests that f(x) is well-conditioned.
Example 1.19
Let f(x) = √(x − 2). The condition number of f(x) about x is
$$\kappa_f(x) = \left| \frac{x}{2(x - 2)} \right|.$$
This is not defined at x* = 2. Hence, the function f(x) is numerically unstable and ill-conditioned for values of x close to 2.
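Definition 1.5 and Theorem 1.7 can be checked numerically for the functions of Examples 1.18 and 1.19. The sketch below is illustrative (the function names and the evaluation points 4, 10, and 2.001 are choices made here, not from the text); it evaluates κ_f(x) from the formula and also measures how much a small relative perturbation of x is amplified in f.

```python
import math

def condition_number(f, df, x):
    """kappa_f(x) = |x f'(x) / f(x)|, as in Definition 1.5."""
    return abs(x * df(x) / f(x))

def g(x):
    return math.sqrt(x - 2.0)

def dg(x):
    return 0.5 / math.sqrt(x - 2.0)

# Example 1.18: f(x) = sqrt(x) has condition number 1/2 everywhere.
kappa_sqrt = condition_number(math.sqrt, lambda x: 0.5 / math.sqrt(x), 4.0)

# Example 1.19: f(x) = sqrt(x - 2) far from, and close to, x = 2.
kappa_far = condition_number(g, dg, 10.0)
kappa_near = condition_number(g, dg, 2.001)

# Theorem 1.7: observed amplification of a relative perturbation in x.
x, rel_dx = 2.001, 1e-8
observed = abs(g(x * (1.0 + rel_dx)) - g(x)) / g(x) / rel_dx

print("kappa for sqrt(x) at 4:        ", kappa_sqrt)
print("kappa for sqrt(x-2) at 10:     ", kappa_far)
print("kappa for sqrt(x-2) at 2.001:  ", kappa_near)
print("observed amplification at 2.001:", observed)
```

Near x = 2 the condition number is about 1000, so a relative error of 10^-8 in x becomes a relative error of about 10^-5 in f(x), in line with Theorem 1.7.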
REMARK 1.2 If x = f(x) = 0, then the condition number is simply |f'(x)|. If x = 0, f(x) ≠ 0 (or f(x) = 0, x ≠ 0), then it is more useful to consider the relation between absolute errors than relative errors. The condition number then becomes |f'(x)/f(x)|.
REMARK 1.3 Generally, if a numerical approximation z̃ to a quantity z is computed, the relative error is related to the number of digits after the decimal point that are correct. For example, if z = 0.0000123453 and z̃ = 0.00001234543, we say that z̃ is correct to 5 significant digits. Expressing z as 0.123453 × 10^-4 and z̃ as 0.123454 × 10^-4, we see that if we round z̃ to the nearest number with five digits in its mantissa, all of those digits are correct, whereas, if we do the same with six digits, the sixth digit is not correct. Significant digits is the more logical way to talk about accuracy in a floating point computation where we are interested in relative error, rather than "number of digits after the decimal point," which can have a different meaning. (Here, one might say that z̃ is correct to 9 digits after the decimal point.)
1.2.2 Practicalities and the IEEE Floating Point Standard
Prior to 1985, different machines used different word lengths and different bases, and different machines rounded, chopped, or did something else to form the internal representation fl(x) for real numbers x. For example, IBM mainframes generally used hexadecimal arithmetic (β = 16), with 8 hexadecimal digits total (for the base, sign, and exponent) in "single precision" numbers and 16 hexadecimal digits total in "double precision" numbers. Machines such as the Univac 1108 and Honeywell Multics systems used base β = 2 and 36 binary digits (or "bits") total in single precision numbers and 72 binary digits total in double precision numbers. An unusual machine designed at Moscow State University from 1955 to 1965, the "Setun," even used base-3 (β = 3, or "ternary") numbers. Some computers had 32 bits total in single precision numbers and 64 bits total in double precision numbers, while some "supercomputers" (such as the Cray-1) had 64 bits total in single precision numbers and 128 bits total in double precision numbers.
Some hand-held calculators in existence today (such as some Texas Instruments calculators) can be viewed as implementing decimal (base 10, β = 10) arithmetic, say, with L = −999 and U = 999, and t = 14 digits in the mantissa.
Except for the Setun (the value of whose ternary digits corresponded to "positive," "negative," and "neutral" in circuit elements or switches), digital computers are mostly based on binary switches or circuit elements (that is, "on" or "off"), so the base β is usually 2 or a power of 2. For example, the IBM hexadecimal digit could be viewed as a group of 4 binary digits.²
Older floating point implementations did not even always fit exactly into the model we have previously described. For example, if x is a number in the system, then −x may not have been a number in the system, or, if x were a number in the system, then 1/x may have been too large to be representable in the system.
To promote predictability, portability, reliability, and rigorous error bounding in floating point computations, the Institute of Electrical and Electronics Engineers (IEEE) and American National Standards Institute (ANSI) published a standard for binary floating point arithmetic in 1985: IEEE/ANSI 754-1985, Standard for Binary Floating Point Arithmetic, often referenced as "IEEE-754," or simply "the IEEE standard."³ Almost all computers in existence today, including personal computers and workstations based on Intel, AMD, Motorola, etc. chips, implement most of the IEEE standard.
In this standard, $\beta = 2$, 32 bits total are used in a single precision number (an "IEEE single"), and 64 bits total are used for a double precision number ("IEEE double"). In a single precision number, 1 bit is used for the sign, 8 bits are used for the exponent, and $t = 23$ bits are used for the mantissa. In double precision numbers, 1 bit is used for the sign, 11 bits are used for the exponent, and 52 bits are used for the mantissa. Thus, for single precision numbers, the exponent is between 0 and $(11111111)_2 = 255$, and 127 is subtracted from this, to get an exponent between $-127$ and $128$. In IEEE numbers, the minimum and maximum exponent are used to denote special symbols (such as infinity and "unnormalized" numbers), so the exponent in single precision represents magnitudes between $2^{-126} \approx 10^{-38}$ and $2^{127} \approx 10^{38}$. The mantissa for single precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{23} 2^{-i} = 2(1 - 2^{-24}) \approx 2$. Similarly, the exponent for double precision numbers is effectively between $2^{-1022} \approx 10^{-308}$ and $2^{1023} \approx 10^{308}$, while the mantissa for double precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{52} 2^{-i} \approx 2$.
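To make the single precision layout concrete, the following matlab sketch pulls the sign, exponent, and mantissa fields out of one number. The test value $-6.5$ and all variable names are our own illustrative choices; the sketch assumes a normalized number (it does not handle zero, denormalized numbers, infinities, or NaN).

```matlab
x = single(-6.5);                                   % illustrative value: -1.625 * 2^2

% Reinterpret the 32 bits as an unsigned integer and print them as a string.
bits = dec2bin(double(typecast(x, 'uint32')), 32);

sign_bit   = bits(1);                   % 1 bit for the sign
exp_field  = bin2dec(bits(2:9));        % 8 bits: stored exponent, 0..255
frac_field = bin2dec(bits(10:32));      % 23 bits: stored part of the mantissa

exponent = exp_field - 127;             % remove the offset of 127
mantissa = 1 + frac_field / 2^23;       % implicit leading 1 (normalized case only)
sgn      = 1 - 2 * (sign_bit == '1');   % +1 or -1

value = sgn * mantissa * 2^exponent;    % reassembled number
fprintf('%s -> exponent %d, mantissa %.4f, value %g\n', bits, exponent, mantissa, value);
```

For this value the stored exponent field is 129, so the actual exponent is 2 and the mantissa is 1.625.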
Summarizing, the parameters for IEEE arithmetic appear in Table 1.1.
In many numerical computations, such as solving the large linear systems arising from partial differential equation models, more digits or a larger exponent range is required than is available with IEEE single precision. For this reason, many numerical analysts at present have adopted IEEE double precision as the default precision. For example, underlying computations in the popular computational environment matlab are done in IEEE double precision.

TABLE 1.1: Parameters for IEEE arithmetic
precision    β     L        U       t
single       2    -126     127     24
double       2    -1022    1023    53

² An exception is in some systems for business calculations, where base 10 is implemented.
³ An update to the 1985 standard was made in 2008. This update gives clarifications of certain ambiguous points, provides certain extensions, and specifies a standard for decimal arithmetic.
IEEE arithmetic provides four ways of defining $fl(x)$, that is, four "rounding modes," namely, "round down," "round up," "round to nearest," and "round to zero." The four elementary operations $+$, $-$, $\times$, and $/$, together with the square root, must be such that $fl(x \circ y)$ is implemented for all four rounding modes, for $\circ \in \{-, +, \times, /, \sqrt{\cdot}\}$.
The default mode (if the rounding mode is not explicitly set) is normally "round to nearest," to give an approximation after a long string of computations that is hopefully near the exact value. If the mode is set to "round down" and a string of computations is done, then the result is less than or equal to the exact result. Similarly, if the mode is set to "round up," then the result of a string of computations is greater than or equal to the exact result. In this way, mathematically rigorous bounds on an exact result can be obtained. (This technique must be used astutely, since naive use could result in bounds that are too large to be meaningful.)
Several parameters more directly related to numerical computations than $L$, $U$, and $t$ are associated with any floating point number system. These are
HUGE: the largest representable number in the floating point system,
TINY: the smallest positive (normalized) representable number in the floating point system,
$\epsilon_m$: the machine epsilon, the smallest positive number which, when added to 1, gives something other than 1 when using the rounding mode "round to nearest."
These so-called "machine constants" appear in Table 1.2 for the IEEE single and IEEE double precision number systems.

TABLE 1.2: Machine constants for IEEE arithmetic
Precision   HUGE                                          TINY                               $\epsilon_m$
single      $(2 - 2^{-23}) 2^{127} \approx 3.40 \cdot 10^{38}$     $2^{-126} \approx 1.18 \cdot 10^{-38}$     $2^{-24} + 2^{-47} \approx 5.96 \cdot 10^{-8}$
double      $(2 - 2^{-52}) 2^{1023} \approx 1.80 \cdot 10^{308}$   $2^{-1022} \approx 2.23 \cdot 10^{-308}$   $2^{-53} + 2^{-105} \approx 1.11 \cdot 10^{-16}$

For IEEE arithmetic, $1/\mathrm{TINY} < \mathrm{HUGE}$, but $1/\mathrm{HUGE} < \mathrm{TINY}$. This brings up the question of what happens when the result of a computation has absolute value less than the smallest number representable in the system, or has absolute value greater than the largest number representable in the system. In the first case, an underflow occurs, while in the second case, an overflow occurs. In floating point computations, it is usually (but not always) reasonable to replace the result of an underflow by 0, but it is usually more problematical when an overflow occurs. Many systems prior to the IEEE standard replaced an underflow by 0 but stopped when an overflow occurred.
The IEEE standard specifies representations for special numbers $\infty$, $-\infty$, $+0$, $-0$, and NaN, where the latter represents "not a number." The standard specifies that computations do not stop when an overflow or underflow occurs, or when quantities such as $\sqrt{-1}$, $1/0$, $-1/0$, etc. are encountered (although many programming languages by default or optionally do stop). For example, the result of an overflow is set to $\infty$, whereas the result of $\sqrt{-1}$ is set to NaN, and computation continues. The standard also specifies "gradual underflow," that is, setting the result to a "denormalized" number, or a number in the floating point format whose first digit in the mantissa is equal to 0. Computation rules for these special numbers, such as NaN $\times$ any number = NaN, $\infty \times$ any positive normalized number = $\infty$, allow such "nonstop" arithmetic.
Although the IEEE nonstop arithmetic is useful in many contexts, the numerical analyst should be aware of it and be cautious in interpreting results. In particular, algorithms may not behave as expected if many intermediate results contain $\infty$ or NaN, and the accuracy is less than expected when denormalized numbers are used. In fact, many programming languages, by default or with a controllable option, stop if $\infty$ or NaN occurs, but implement IEEE nonstop arithmetic with an option.
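The nonstop rules can be observed directly in matlab, which works in IEEE double precision. The short sketch below uses variable names of our own choosing. One caveat we should state plainly: matlab returns a complex number for `sqrt(-1)` rather than NaN, so the sketch uses `0/0` and `Inf - Inf` to produce NaN instead.

```matlab
big      = realmax * 2;        % overflow: the result is Inf and execution continues
neg      = -1 / 0;             % nonzero divided by zero: -Inf
undef    = 0 / 0;              % no meaningful value: NaN
also_nan = Inf - Inf;          % another operation with no meaningful value: NaN
spreads  = undef * 5;          % NaN times any number is NaN

sub  = realmin / 4;            % gradual underflow: a nonzero denormalized number
gone = realmin * eps / 4;      % below the smallest denormalized number: set to 0

disp([big, neg, undef, also_nan, spreads]);
disp([sub, gone]);
```

Because `sub` is denormalized, it carries fewer significant bits than a normalized number, which is the loss of accuracy mentioned above.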
Example 1.20
IEEE double precision floating point arithmetic underlies most computations in matlab. (This is true even if only four decimal digits are displayed.) One obtains the machine epsilon with the function eps, one obtains TINY with the function realmin, and one obtains HUGE with the function realmax. Observe the following matlab dialog:
>> epsm = eps(1d0)
epsm = 2.2204e-016
>> TINY = realmin
TINY = 2.2251e-308
>> HUGE = realmax
HUGE = 1.7977e+308
>> 1/TINY
ans = 4.4942e+307
>> 1/HUGE
ans = 5.5627e-309
>> HUGE^2
ans = Inf
>> TINY^2
ans = 0
>> new_val = 1+epsm
new_val = 1.0000
>> new_val - 1
ans = 2.2204e-016
>> too_small = epsm/2
too_small = 1.1102e-016
>> not_new = 1+too_small
not_new = 1
>> not_new - 1
ans = 0
>>
Example 1.21
(Illustration of underflow and overflow) Suppose, for the purposes of illustration, we have a system with $\beta = 10$, $t = 2$, and one digit in the exponent, so that the positive numbers in the system range from $0.10 \times 10^{-9}$ to $0.99 \times 10^{9}$, and suppose we wish to compute $N = \sqrt{x_1^2 + x_2^2}$, where $x_1 = x_2 = 10^6$. Then both $x_1$ and $x_2$ are exactly represented in the system, and the nearest floating point number in the system to $N$ is $0.14 \times 10^7$, well within range. However, $x_1^2 = 10^{12}$, larger than the maximum floating point number in the system.
In older systems, an overflow usually would result in stopping the computation, while in IEEE arithmetic, the result would be assigned the symbol "Infinity." The result of adding "Infinity" to "Infinity," then taking the square root, would be "Infinity," so that $N$ would be assigned "Infinity." Similarly, if $x_1 = x_2 = 10^{-6}$, then $x_1^2 = 10^{-12}$, smaller than the smallest representable machine number, causing an "underflow." On older systems, the result is usually set to 0. On IEEE systems, if "gradual underflow" is switched on, the result either becomes a denormalized number, with less than full accuracy, or is set to 0; without gradual underflow on IEEE systems, the result is set to 0. When the result is set to 0, a value of 0 is stored in $N$, whereas the closest floating point number in the system is $0.14 \times 10^{-5}$, well within range. To avoid this type of catastrophic underflow and overflow in the computation of $N$, we may use the following scheme.
1. $s \leftarrow \max\{|x_1|, |x_2|\}$.
2. $\eta_1 \leftarrow x_1/s$; $\eta_2 \leftarrow x_2/s$.
3. $N \leftarrow s\sqrt{\eta_1^2 + \eta_2^2}$.
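The three steps above can be sketched in matlab as follows. The inputs $10^{200}$ and $10^{-200}$ are our own choices, picked so that the squares overflow and underflow in IEEE double precision in the same way that $10^{6}$ and $10^{-6}$ do in the toy decimal system. The sketch assumes that $x_1$ and $x_2$ are not both zero, since otherwise step 2 divides by zero.

```matlab
inputs = [1e200, 1e-200];          % illustrative values of x1 = x2
naive  = zeros(size(inputs));
scaled = zeros(size(inputs));

for j = 1:numel(inputs)
    x1 = inputs(j);
    x2 = inputs(j);

    naive(j) = sqrt(x1^2 + x2^2);          % squares overflow to Inf or underflow to 0

    s    = max(abs(x1), abs(x2));          % step 1
    eta1 = x1 / s;                         % step 2
    eta2 = x2 / s;
    scaled(j) = s * sqrt(eta1^2 + eta2^2); % step 3
end

disp(naive);     % Inf and 0
disp(scaled);    % approximately 1.4142e+200 and 1.4142e-200
```

The built-in matlab function `hypot` is documented as avoiding this kind of overflow and underflow as well, so in practice `hypot(x1, x2)` is the more idiomatic choice.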
1.2.2.1 Input and Output
For examining the output to large numerical computations arising from mathematical models, plots, graphs, and movies comprised of such plots and graphs are often preferred over tables of values. However, to develop such models and study numerical algorithms, it is necessary to examine individual numbers. Because humans are trained to comprehend decimal numbers more easily than binary numbers, the binary format used in the machine is usually converted to a decimal format for display or printing. In many programming languages and environments (such as all versions of Fortran, C, C++, and in matlab), the format is of a form similar to $\pm d_1.d_2d_3 \ldots d_m\mathrm{e}{\pm}\delta_1\delta_2\delta_3$ or $\pm d_1.d_2d_3 \ldots d_m\mathrm{E}{\pm}\delta_1\delta_2\delta_3$, where the "e" or "E" denotes the "exponent" of 10. For example, -1.00e+003 denotes $-1 \times 10^3 = -1000$. Numbers are usually also input either in a standard decimal form (such as 0.001), or in this exponential format (such as 1.0e-3). (This notation originates from the earliest computers, where the only output was a printer, and the printer could only print numerical digits and the 26 upper case letters in the Roman alphabet.)
Thus, for input, a decimal fraction needs to be converted to a binary floating point number, while, for output, a binary floating point number needs to be converted to a decimal fraction. This conversion necessarily is inexact. For example, the exact decimal fraction 0.1 converts to the infinitely repeating binary expansion $(0.0\overline{0011})_2$, which needs to be rounded into the binary floating point system. The IEEE 754 standard specifies that the result of a decimal to binary conversion, within a specified range of input formats, be the nearest floating point number to the exact result, over a specified range, and that, within a specified range of formats, a binary to decimal conversion be the nearest number in the specified format (which depends on the number $m$ of decimal digits requested to be printed).
Thus, the number that one sees as output is usually not exactly the number that is represented in the computer. Furthermore, while the floating point operations on binary numbers are usually implemented in hardware or "firmware" independently of the software system, the decimal to binary and binary to decimal conversions are usually implemented separately as part of the programming language (such as Fortran, C, C++, Java, etc.) or software system (such as matlab). The individual standards for these languages, if there are any, may not specify accuracy for such conversions, and the languages sometimes do not conform to the IEEE standard. That is, the number that one sees printed may not even be the closest number in that format to the actual number.
This inexactness in conversion usually does not cause a problem, but may cause much confusion in certain instances. In those instances (such as in "debugging," or finding programming blunders), one may need to examine the binary numbers directly. One way of doing this is in an "octal," or base-8, format, in which each digit (between 0 and 7) is interpreted as a group of three binary digits, or in hexadecimal format (where the digits are 0-9, A, B, C, D, E, F), in which each digit corresponds to a group of four binary digits.
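In matlab, one way to look at the stored bits is the function `num2hex`, which returns the 64 bits of a double as 16 hexadecimal digits. The sketch below, with variable names of our own, applies it to the decimal fraction 0.1 discussed above.

```matlab
x = 0.1;

hex_bits = num2hex(x);          % the 64 stored bits, as 16 hexadecimal digits
shown    = sprintf('%.4f', x);  % a short decimal display of the same number
more     = sprintf('%.20f', x); % more printed digits expose the conversion error
back     = hex2num(hex_bits);   % converting the bit pattern back is exact

fprintf('stored bits: %s\n', hex_bits);
fprintf('short form:  %s\n', shown);
fprintf('long form:   %s\n', more);
```

The hexadecimal string is `3fb999999999999a`: the repeating digits 9 (binary 1001) reflect the repeating binary expansion of 0.1, and the final `a` shows where the expansion was rounded.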
1.2.2.2 Standard Functions
To enable accurate computation of elementary functions such as sin, cos, and exp, IEEE 754 specifies that a "long" 80-bit register (with "guard digits") be available for intermediate computations. Furthermore, IEEE 754-2008, an official update to IEEE 754-1985, provides a list of functions it recommends be implemented, and specifies accuracy requirements (in terms of correct rounding) for those functions a programming language elects to implement.
REMARK 1.4 Alternative number systems, such as variable precision arithmetic, multiple precision arithmetic, rational arithmetic, and combinations of approximate and symbolic arithmetic have been investigated and implemented. These have various advantages over the traditional floating point arithmetic we have been discussing, but also have disadvantages, and usually require more time, more circuitry, or both. Eventually, with the advance of computer hardware and better understanding of these alternative systems, their use may become more ubiquitous. However, for the foreseeable future, traditional floating point number systems will be the primary tool in numerical computations.
1.3 Interval Computations
Interval computations are useful for two main purposes:
• to use floating point computations to compute mathematically rigorous bounds on an exact result (and hence to rigorously bound roundoff error);
• to use floating point computations to compute mathematically rigorous bounds on the ranges of functions over boxes.
In complicated traditional floating point algorithms, naive arrangement of interval computations usually gives bounds that are too wide to be of practical use. For this reason, interval computations have been ignored by many. However, used cleverly and where appropriate, interval computations are powerful, and provide rigor and validation when other techniques cannot.
Interval computations are based on interval arithmetic.
1.3.1 Interval Arithmetic
In interval arithmetic, we define operations on intervals, which can be considered as ordered pairs of real numbers. We can think of each interval as representing the range of possible values of a quantity. The result of an operation is then an interval that represents the range of possible results of the operation as the first argument ranges over all points in the first interval and the second argument ranges over all values in the second interval. To state this symbolically, let $\boldsymbol{x} = [\underline{x}, \overline{x}]$ and $\boldsymbol{y} = [\underline{y}, \overline{y}]$, and define the four elementary operations by
$$\boldsymbol{x} \circ \boldsymbol{y} = \{ x \circ y \mid x \in \boldsymbol{x} \text{ and } y \in \boldsymbol{y} \} \quad \text{for } \circ \in \{+, -, \times, \div\}. \tag{1.1}$$
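Interval arithmetic's usefulness derives from the fact that the mathematical characterization in Equation (1.1) is equivalent to the following operational definitions:
$$\begin{aligned}
\boldsymbol{x} + \boldsymbol{y} &= [\underline{x} + \underline{y},\ \overline{x} + \overline{y}], \\
\boldsymbol{x} - \boldsymbol{y} &= [\underline{x} - \overline{y},\ \overline{x} - \underline{y}], \\
\boldsymbol{x} \times \boldsymbol{y} &= [\min\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\},\ \max\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\}], \\
\frac{1}{\boldsymbol{x}} &= \left[\frac{1}{\overline{x}},\ \frac{1}{\underline{x}}\right] \quad \text{if } \underline{x} > 0 \text{ or } \overline{x} < 0, \\
\boldsymbol{x} \div \boldsymbol{y} &= \boldsymbol{x} \times \frac{1}{\boldsymbol{y}}.
\end{aligned} \tag{1.2}$$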
The ranges of the four elementary interval arithmetic operations are exactly the ranges of the corresponding real operations, but, if such operations are composed, bounds on the ranges of real functions can be obtained. For example, if
$$f(x) = (x + 1)(x - 1), \tag{1.3}$$
then
$$f([-2, 2]) = \bigl([-2, 2] + 1\bigr)\bigl([-2, 2] - 1\bigr) = [-1, 3][-3, 1] = [-9, 3],$$
which contains the exact range $[-1, 3]$.
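The operational definitions (1.2) translate almost line for line into matlab. The following is a minimal sketch, not a production interval package: intervals are stored as two-element row vectors, the function names `iadd`, `isub`, `imul`, and `idiv` are our own, the arithmetic is ordinary round-to-nearest (so the results are not rigorous enclosures), and `idiv` assumes the divisor does not contain 0.

```matlab
% An interval [lower, upper] is stored as a 1-by-2 row vector.
iadd   = @(x, y) [x(1) + y(1), x(2) + y(2)];
isub   = @(x, y) [x(1) - y(2), x(2) - y(1)];
iprods = @(x, y) [x(1)*y(1), x(1)*y(2), x(2)*y(1), x(2)*y(2)];
imul   = @(x, y) [min(iprods(x, y)), max(iprods(x, y))];
idiv   = @(x, y) imul(x, [1/y(2), 1/y(1)]);   % assumes 0 is not in y

X   = [-2, 2];
one = [1, 1];                                  % the number 1 as an interval

F = imul(iadd(X, one), isub(X, one));          % (x + 1)(x - 1) over [-2, 2]
Q = idiv([1, 2], [2, 4]);                      % a simple quotient

disp(F);    % -9  3
disp(Q);    % 0.25  1
```

The value of `F` reproduces the enclosure $[-9, 3]$ computed above.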
REMARK 1.5 In some definitions of interval arithmetic, division by intervals containing 0 is defined, consistent with (1.1). For example,
$$\frac{[1, 2]}{[-3, 4]} = \left[-\infty, -\frac{1}{3}\right] \cup \left[\frac{1}{4}, \infty\right] = \mathbb{R}^* \setminus \left(-\frac{1}{3}, \frac{1}{4}\right),$$
where $\mathbb{R}^*$ is the extended real number system,⁴ consisting of the real numbers with the two additional numbers $-\infty$ and $\infty$. This extended interval arithmetic⁵ was originally invented by William Kahan⁶ for computations with continued fractions, but has wider use than that. Although a closed system can be defined for the sets arising from this extended arithmetic, typically, the complements of intervals (i.e., the unions of two semi-infinite intervals) are immediately intersected with intervals, to obtain zero, one, or two intervals. Interval arithmetic can then proceed using (1.2).

⁴ Also known as the two-point compactification of the real numbers.
The power of interval arithmetic lies in its implementation on computers. In particular, outwardly rounded interval arithmetic allows rigorous enclosures for the ranges of operations and functions. This makes a qualitative difference in scientific computations, since the results are now intervals in which the exact result must lie. It also enables use of floating point computations for automated theorem proving.
Outward rounding can be implemented on any machine that has downward rounding and upward rounding, such as any machine that complies with the IEEE 754 standard. For example, take $\boldsymbol{x} + \boldsymbol{y} = [\underline{x} + \underline{y}, \overline{x} + \overline{y}]$. If $\underline{x} + \underline{y}$ is computed with downward rounding, and $\overline{x} + \overline{y}$ is computed with upward rounding, then the resulting interval $\boldsymbol{z} = [\underline{z}, \overline{z}]$ that is represented in the machine must contain the exact range of $x + y$ for $x \in \boldsymbol{x}$ and $y \in \boldsymbol{y}$. We call the expansion of the interval from rounding the lower end point down and the upper end point up roundout error.
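As far as we know, matlab does not document a portable command for switching the rounding mode, so the sketch below imitates outward rounding in a cruder way: it computes each end point in round-to-nearest and then moves it outward by the spacing `eps(v)` of the floating point numbers near `v`. This is an assumption-laden substitute for true directed rounding; it gives a valid but slightly wider enclosure, and it does not handle infinite end points. The names `down`, `up`, and `iadd_out` are ours.

```matlab
down = @(v) v - eps(v);     % a machine number at or below the exact result
up   = @(v) v + eps(v);     % a machine number at or above the exact result

% Outwardly widened interval addition.
iadd_out = @(x, y) [down(x(1) + y(1)), up(x(2) + y(2))];

x = [0.1, 0.1];             % the machine number nearest 0.1, as a point interval
y = [0.2, 0.2];
z = iadd_out(x, y);

fprintf('[%.20f, %.20f]\n', z(1), z(2));
fprintf('width: %g\n', z(2) - z(1));
```

The exact sum of the two machine numbers is not itself a machine number, but it is guaranteed to lie in `z`; the nonzero width of `z` is the roundout error (here somewhat larger than true directed rounding would give).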
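Interval arithmetic is only subdistributive. That is, if $\boldsymbol{x}$, $\boldsymbol{y}$, and $\boldsymbol{z}$ are intervals, then
$$\boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) \subseteq \boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z}, \quad \text{but} \quad \boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) \neq \boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z} \text{ in general.} \tag{1.4}$$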
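A small numerical instance makes (1.4) concrete. The intervals below are our own illustrative choices, and the helper functions are an ordinary round-to-nearest sketch of the sum and product rules, which is exact for these small integers.

```matlab
% Intervals as 1-by-2 row vectors [lower, upper].
iadd   = @(x, y) [x(1) + y(1), x(2) + y(2)];
iprods = @(x, y) [x(1)*y(1), x(1)*y(2), x(2)*y(1), x(2)*y(2)];
imul   = @(x, y) [min(iprods(x, y)), max(iprods(x, y))];

x = [1, 2];
y = [1, 2];
z = [-2, -1];

lhs = imul(x, iadd(y, z));              % x(y + z)  = [1,2] * [-1,1]
rhs = iadd(imul(x, y), imul(x, z));     % xy + xz   = [1,4] + [-4,-1]

disp(lhs);    % -2  2
disp(rhs);    % -3  3
```

Here $\boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) = [-2, 2]$ is strictly contained in $\boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z} = [-3, 3]$, because $\boldsymbol{x}$ appears twice in the second expression and its two occurrences are treated as independent.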
As a result, algebraic expressions that would be equivalent if real values are substituted for the variables are not equivalent if interval values are used. For example, instead of writing $(x - 1)(x + 1)$ for $f(x)$ in (1.3), suppose we write
$$f(x) = x^2 - 1, \tag{1.5}$$
and suppose we provide a routine that computes an enclosure for the range of $x^2$ that is the exact range to within roundoff error. Such a routine could be as follows:
ALGORITHM 1.1
(Computing an interval whose end points are machine numbers and which encloses the range of $x^2$.)
⁵ There are small differences in current definitions of extended interval arithmetic. For example, in some systems, $-\infty$ and $\infty$ are not considered numbers, but just descriptive symbols. In those systems, $[1, 2]/[-3, 4] = (-\infty, -1/3] \cup [1/4, \infty) = \mathbb{R} \setminus (-1/3, 1/4)$. See [31] for a theoretical analysis of extended arithmetic.
⁶ Who also was a major contributor to the IEEE 754 standard.
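INPUT: $\boldsymbol{x} = [\underline{x}, \overline{x}]$.
OUTPUT: a machine-representable interval that contains the range of $x^2$ over $\boldsymbol{x}$.
IF $\underline{x} \ge 0$ THEN
    RETURN $[\underline{x}^2, \overline{x}^2]$, where $\underline{x}^2$ is computed with downward rounding and $\overline{x}^2$ is computed with upward rounding.
ELSE IF $\overline{x} \le 0$ THEN
    RETURN $[\overline{x}^2, \underline{x}^2]$, where $\overline{x}^2$ is computed with downward rounding and $\underline{x}^2$ is computed with upward rounding.
ELSE
    1. Compute $\underline{x}^2$ and $\overline{x}^2$ with both downward and upward rounding; that is, compute $\underline{x}^2_l$ and $\underline{x}^2_u$ such that $\underline{x}^2_l$ and $\underline{x}^2_u$ are machine representable numbers and $\underline{x}^2 \in [\underline{x}^2_l, \underline{x}^2_u]$, and compute $\overline{x}^2_l$ and $\overline{x}^2_u$ such that $\overline{x}^2_l$ and $\overline{x}^2_u$ are machine representable numbers and $\overline{x}^2 \in [\overline{x}^2_l, \overline{x}^2_u]$.
    2. RETURN $[0, \max\{\underline{x}^2_u, \overline{x}^2_u\}]$.
END IF
END ALGORITHM 1.1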
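The three cases of Algorithm 1.1 can be sketched in matlab as below. Two assumptions should be kept in mind: the input interval `X` is our own example, and, because we know of no documented portable way to select the rounding mode in matlab, the sketch replaces downward and upward rounding by moving each round-to-nearest result outward by `eps` of that result. The enclosure is therefore valid but a little wider than true directed rounding would give.

```matlab
X = [-2, 2];                          % input interval [lower, upper]

down_sq = @(v) max(v - eps(v), 0);    % lower bound for a square (never below 0)
up      = @(v) v + eps(v);            % upper bound
down    = @(v) v - eps(v);            % lower bound for a general result

lo2 = X(1)^2;                         % square of the lower end point
hi2 = X(2)^2;                         % square of the upper end point

if X(1) >= 0                          % interval lies in x >= 0
    S = [down_sq(lo2), up(hi2)];
elseif X(2) <= 0                      % interval lies in x <= 0
    S = [down_sq(hi2), up(lo2)];
else                                  % interval contains 0 in its interior
    S = [0, up(max(lo2, hi2))];
end

F = [down(S(1) - 1), up(S(2) - 1)];   % enclosure of x^2 - 1 over X

fprintf('x^2     in [%.17g, %.17g]\n', S(1), S(2));
fprintf('x^2 - 1 in [%.17g, %.17g]\n', F(1), F(2));
```

For `X = [-2, 2]` the third branch is taken, and the enclosure of $x^2$ has lower end point exactly 0.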
With Algorithm 1.1 and rewriting $f(x)$ from (1.3) as in (1.5), we obtain
$$f([-2, 2]) = [-2, 2]^2 - 1 = [0, 4] - 1 = [-1, 3],$$
which, in this case, is equal to the exact range of $f$ over $[-2, 2]$.
In fact, this illustrates a general principle: If each variable in the expression occurs only once, then interval arithmetic gives the exact range, to within roundout error. We state this formally as
THEOREM 1.8
(Fundamental theorem of interval arithmetic.) Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with interval arithmetic. The algebraic expression or computer program can contain the four elementary operations and operations such as $x^n$, $\sin(x)$, $\exp(x)$, and $\log(x)$, etc., as long as the interval values of these functions contain their range over the input intervals. Then
1. The interval value $f(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ contains the range of $f$ over the interval vector (or box) $(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$.
2. If the single functions (the elementary operations and functions $x^n$, etc.) have interval values that represent their exact ranges, and if each variable $x_i$, $1 \le i \le n$, occurs only once in the expression for $f$, then the values of $f$ obtained by interval arithmetic represent the exact ranges of $f$ over the input intervals.
If the expression for $f$ contains one or more variables more than once, then overestimation of the range can occur due to interval dependency. For example, when we evaluate our example function $f([-2, 2])$ according to (1.3), the first factor, $[-1, 3]$, is the exact range of $x + 1$ for $x \in [-2, 2]$, while the second factor, $[-3, 1]$, is the exact range of $x - 1$ for $x \in [-2, 2]$. Thus, $[-9, 3]$ is the exact range of $f(x_1, x_2) = (x_1 + 1)(x_2 - 1)$ for $x_1$ and $x_2$ independent, $x_1 \in [-2, 2]$, $x_2 \in [-2, 2]$.
We now present some definitions and theorems to clarify the practical consequences of interval dependency.
DEFINITION 1.6 An expression for $f(x_1, \ldots, x_n)$ which is written so that each variable occurs only once is called a single use expression, or SUE.
Fortunately, we do not need to transform every expression into a single use expression for interval computations to be of value. In particular, the interval dependency becomes less as the widths of the input intervals become smaller. The following formal definition will help us to describe this precisely.
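DEFINITION 1.7 Suppose an interval evaluation $f(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ gives $[a, b]$ as a result interval, but the exact range $\{f(x_1, \ldots, x_n),\ x_i \in \boldsymbol{x}_i,\ 1 \le i \le n\}$ is $[c, d] \subseteq [a, b]$. We define the excess width $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ in the interval evaluation $f(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ by $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = (c - a) + (b - d)$.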
For example, the excess width in evaluating $f(x)$ represented as $(x + 1)(x - 1)$ over $\boldsymbol{x} = [-2, 2]$ is $(-1 - (-9)) + (3 - 3) = 8$. In general, we have
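THEOREM 1.9
Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with interval arithmetic, as in Theorem 1.8, to obtain an interval enclosure $f(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ to the range of $f$ for $x_i \in \boldsymbol{x}_i$, $1 \le i \le n$. Then, if $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ is as in Definition 1.7, we have
$$E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = O\left(\max_{1 \le i \le n} w(\boldsymbol{x}_i)\right),$$
where $w(\boldsymbol{x})$ denotes the width of the interval $\boldsymbol{x}$.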
That is, the overestimation becomes less as the uncertainty in the arguments to the function becomes smaller.
Interval evaluations as in Theorem 1.9 are termed first-order interval extensions. It is not difficult to obtain second-order extensions, where required. (See Exercise below.)
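The first-order behavior in Theorem 1.9 can be observed numerically. The sketch below evaluates $(x + 1)(x - 1)$ over the shrinking intervals $[-h, h]$ and computes the excess width of Definition 1.7; the choice of intervals, the half-widths $h$, and the helper names are our own, and ordinary round-to-nearest arithmetic is used, so the numbers are illustrative rather than rigorous.

```matlab
iprods = @(x, y) [x(1)*y(1), x(1)*y(2), x(2)*y(1), x(2)*y(2)];
imul   = @(x, y) [min(iprods(x, y)), max(iprods(x, y))];

h = [1, 0.1, 0.01];                  % half-widths: x = [-h, h] has width w = 2h
E = zeros(size(h));

for j = 1:numel(h)
    X = [-h(j), h(j)];
    enc   = imul([X(1) + 1, X(2) + 1], [X(1) - 1, X(2) - 1]);  % (x+1)(x-1) = [a, b]
    exact = [-1, h(j)^2 - 1];        % exact range [c, d] of x^2 - 1 over [-h, h]
    E(j)  = (exact(1) - enc(1)) + (enc(2) - exact(2));         % (c - a) + (b - d)
end

ratio = E ./ (2 * h);                % excess width divided by interval width
disp([2 * h; E; ratio]);
```

For these intervals the excess width works out to $4h - h^2$, so the ratio of excess width to interval width stays bounded (it approaches 2) as the interval shrinks, which is the $O(w)$ behavior the theorem describes.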
1.3.2 Application of Interval Arithmetic: Examples
We give one such example here.
Example 1.22
Using 4-digit decimal floating point arithmetic, compute an interval enclosure for the first two digits of $e$, and prove that these two digits are correct.
Solution: The fifth degree Taylor polynomial representation for $e$ is
$$e = 1 + 1 + \frac{1}{2!} + \frac{1}{3!} + \frac{1}{4!} + \frac{1}{5!} + \frac{1}{6!} e^{\xi},$$
for some $\xi \in [0, 1]$. If we assume we know $e < 3$ and we assume we know $e^x$ is an increasing function of $x$, then the error term is bounded by
$$\left| \frac{1}{6!} e^{\xi} \right| \le \frac{3}{6!} < 0.005,$$
so this fifth-degree polynomial representation should be adequate. We will evaluate each term with interval arithmetic, and we will replace $e^{\xi}$ with $[1, 3]$. We obtain the following computation:
[1.000, 1.000] + [1.000, 1.000] → [2.000, 2.000]
[1.000, 1.000]/[2.000, 2.000] → [0.5000, 0.5000]
[2.000, 2.000] + [0.5000, 0.5000] → [2.500, 2.500]
[1.000, 1.000]/[6.000, 6.000] → [0.1666, 0.1667]
[2.500, 2.500] + [0.1666, 0.1667] → [2.666, 2.667]
[1.000, 1.000]/[24.00, 24.00] → [0.04166, 0.04167]
[2.666, 2.667] + [0.04166, 0.04167] → [2.707, 2.709]
[1.000, 1.000]/[120.0, 120.0] → [0.008333, 0.008334]
[2.707, 2.709] + [0.008333, 0.008334] → [2.715, 2.718]
[1.000, 1.000]/[720.0, 720.0] → [0.001388, 0.001389]
[0.001388, 0.001389] × [1, 3] → [0.001388, 0.004167]
[2.715, 2.718] + [0.001388, 0.004167] → [2.716, 2.723]
Since we used outward rounding in these computations, this constitutes a mathematical proof that $e \in [2.716, 2.723]$.
Note:
1. These computations can be done automatically on a computer, as simply as evaluating the function in floating point arithmetic. We will explain some programming techniques for this in Chapter 6, Section 6.2.
2. The solution is illustrative. More sophisticated methods, such as argument reduction, would be used in practice to bound values of $e^x$ more accurately and with fewer operations.
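The computation in Example 1.22 can be imitated in matlab's double precision. The sketch below is not the 4-digit decimal arithmetic of the example, and it is not true outward rounding: as an assumption, each round-to-nearest result is simply pushed outward by `eps` of that result, which yields a valid but slightly wider enclosure. The variable names are ours.

```matlab
down = @(v) v - eps(v);          % crude stand-in for downward rounding
up   = @(v) v + eps(v);          % crude stand-in for upward rounding

S    = [2, 2];                   % 1 + 1, exact
fact = 1;
for k = 2:5
    fact = fact * k;                             % k!, exact for these small k
    T = [down(1 / fact), up(1 / fact)];          % encloses 1/k!
    S = [down(S(1) + T(1)), up(S(2) + T(2))];    % running enclosure of the sum
end

% Remainder term e^xi / 6! with e^xi replaced by the interval [1, 3].
R = [down(1 / 720), up(3 / 720)];
S = [down(S(1) + R(1)), up(S(2) + R(2))];

fprintf('e is enclosed in [%.6f, %.6f]\n', S(1), S(2));
```

The resulting enclosure is narrower than $[2.716, 2.723]$ at its lower end only because double precision carries more digits; its width is still dominated by the remainder interval, just as in the hand computation.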
Proofs of the theorems, as well as greater detail, appear in various texts on interval arithmetic. A good book on interval arithmetic is R. E. Moore's classic text [27], although numerous more recent monographs and reviews are available. A World Wide Web search on the term "interval computations" will lead to some of these.
A general introduction to interval computations is [26]. That work gives not only a complete introduction, with numerous examples and explanation of pitfalls, but also provides examples with intlab, a free matlab toolbox for interval computations, and reference material for intlab. If you have matlab available, we recommend intlab for the exercises in this book involving interval computations.
1.4 Programming Environments
Modern scientific computing (with floating point numbers) is usually done with high-level "imperative" (as opposed to "functional") programming languages. Common programming environments in use for general scientific computing today are Fortran (or FORmula TRANslation), C/C++, and matlab. Fortran is the original such language, with its origins in the late 1950's. There is a large body of high-quality publicly available software in Fortran for common computations in numerical analysis and scientific computing. Such software can be found, for example, on NETLIB at
http://www.netlib.org/
Fortran has evolved over the years, becoming a modern, multi-faceted language with the Fortran 2003 standard. Throughout, the emphasis by both the standardization committee and suppliers of compilers for Fortran has been on features that simplify programming of solutions to large problems in numerical analysis and scientific computing, and features that enable high performance, especially on computers that can process vectors and matrices efficiently.
The "C" language, originally developed in conjunction with the Unix operating system, was originally meant to be a higher-level language for designing and accessing the operating system, but has become more ubiquitous since then. C++, appearing in the late 1980's, was the first widely available language⁷ to allow the object-oriented programming paradigm. In recent years, computer science departments have favored teaching C++ over teaching Fortran, and Fortran has fallen out of favor in relative terms. However, Fortran is still favored in certain large-scale applications such as fluid dynamics (e.g. in weather prediction and similar simulations), and some courses are still offered in it in engineering schools. Even so, some people still think of Fortran as the now somewhat rudimentary language known as FORTRAN 77.
Reasonably high-quality compilers for both Fortran and C/C++ are available free of charge with Linux operating systems. Fortran 2003, largely implemented in these compilers, has a standardized interface to C, so functions written in C can be called from Fortran programs, and vice versa. These compilers include interactive graphical-user-interface-oriented debuggers, such as "insight," available with the Linux operating system. Commercially available compilation and debugging systems are also available under Windows.
The matlab system has become increasingly popular over the last two decades or so. matlab (or MATrix LABoratory) began in the early 1980's as a National Science Foundation project, written by Cleve Moler in FORTRAN 66, to provide an interactive environment for computing with matrices and vectors, but has since evolved to be both an interactive environment and full-featured programming language. matlab is highly favored in courses such as this because of the ease of programming, debugging, and general use (such as graphing), and because of the numerous toolboxes, supplied by both Mathworks (Cleve Moler's company) and others, for many basic computing tasks and applications. The main drawback to use of matlab in all scientific computations is that the language is interpretive, that is, matlab translates each line of a program to machine language each time that it executes the line. This makes complicated programs that involve nested iterations much slower (a factor of 60 or more) than comparable programs written in a compiled language such as C or Fortran. However, functions compiled from Fortran or C/C++ can be called from matlab. A common strategy has been to initially develop algorithms in matlab, then translate all or part of the program to a compilable language, as necessary for efficiency.
One perceived disadvantage of matlab is that it is proprietary. Undesirable possible consequences are that it is not free, and there is no official guarantee that it will be available forever, unchanged. However, its use has become so widespread in recent years that these concerns do not seem to be major. Several projects, including "Octave" and "Scilab," have produced free products that partially support the matlab programming language. The most widely distributed of these, "Octave," is integrated into Linux systems. However, the object-oriented features of Octave are rudimentary compared to those of matlab, and some toolboxes, such as intlab (which we will mention later), will not function with Octave.
⁷ With others, including Fortran, to follow.
Alternative systems sometimes used for scientific computing are computer algebra systems. Perhaps the most common of these are Mathematica and Maple, while a free such system under development is "SAGE." These systems admit a different way of thinking about programming, termed functional programming, in which rules are defined and available at all times to automatically simplify expressions that are presented to the system. (In contrast, in imperative programming, a sequence of commands is executed one after the other.) Although these systems have become comprehensive, they are based in computations of a different character, rather than in the floating point computations and linear algebra involved in numerical analysis and scientific computing.
We will use matlab in this book to illustrate the concepts, techniques, and applications. With newer versions of matlab, a student can study how to use the system and make programs largely by using the matlab help system. The first place to turn will be the "Getting started" demos, which in newer versions are presented as videos. There are also many books devoted to use of matlab. Furthermore, we will be giving examples throughout this book.

matlab programs can be written as matlab scripts and matlab functions.
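Example 1.23
The matlab script we used to produce the table following Example 1.7 (on page 8) is

a = 2;
x = 2;
xold = x;
err_old = 1;
for k=0:10
   k
   x
   err = x - sqrt(2);       % error in the current iterate
   err
   ratio = err/err_old^2    % roughly constant for quadratic convergence
   err_old = err;
   x = x/2 + 1/x;           % Newton step for x^2 = 2
end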
Example 1.24
The matlab script we used to produce the table in Example 1.8 (on page 9) is

format long
a = 2;
x = 2;
xold = x;
err_old = 1;
for k=0:25
   k
   x
   err = x - sqrt(2);       % error in the current iterate
   err
   ratio = err/err_old      % roughly constant for linear convergence
   err_old = err;
   x = x - x^2/3.5 + 2/3.5;
end
An excellent alternative text book that focuses on matlab functions is Cleve Moler's Numerical Computing with Matlab [25]. An online version, along with "m" files, etc., is currently available at http://www.mathworks.com/moler/chapters.html.
1.5 Applications

The purpose of the methods and techniques in this book ultimately is to provide both accurate predictions and insight into practical problems. This includes understanding and predicting, and managing or controlling, the evolution of ecological systems and epidemics; designing and constructing durable but inexpensive bridges, buildings, roads, and water control structures; understanding chemical and physical processes; designing chemical plants and electronic components and systems; minimizing costs or maximizing delivery of products or services within companies and governments; etc. To achieve these goals, the numerical methods are a small part of the overall modeling process, which can be viewed as consisting of the following steps.
Identify the problem: This is the first step in translating an often vague situation into a mathematical problem to be solved. What questions must be answered, and how can they be quantified?

Assumptions: Which factors are to be ignored and which are important? The real world is usually significantly more complicated than mathematical models of it, and simplifications must be made, because some factors are poorly understood, because there isn't enough data to determine some minor factors, or because it is not practical to accurately solve the resulting equations unless the model is simplified. For example, the theory of relativity and variations in the acceleration of gravity, due to the fact that the earth is not exactly round and due to the fact that the density differs from point to point on the surface of the earth, in principle will affect the trajectory of a baseball as it leaves the bat. However, such effects can be ignored when we write down a model of the trajectory of the baseball. On the other hand, we need to include such effects if we are measuring the change in distance between two satellites in a tandem orbit to detect the location of mineral deposits on the surface of the earth.
Construction: In this step, we actually translate the problem into mathematical language.

Analysis: We solve the mathematical problem. Here is where the numerical techniques in this book come into play. With more complicated models, there is an interplay between the previous three steps and this solution process: we may need to simplify the process to enable practical solution. Also, presentation of the result is important here, to maximize the usefulness of the results. In the early days of scientific computing, printouts of numbers were used, but increasingly, results are presented as two- and three-dimensional graphs and movies.

Interpretation: The numerical solution is compared to the original problem. If it does not make sense, go back and reformulate the assumptions.

Validation: Compare the model to real data. For example, in climate models, the model might be used to predict climate changes in past years, before it is used to predict future climate changes.
Note that there is an intrinsic error introduced in the modeling process (such as when certain phenomena are ignored) that is outside the scope of our study of numerical methods. Such error can only be measured indirectly, through the interpretation and validation steps. In the model solution process (the "analysis" step), errors are also introduced, due to roundoff error and the approximation process. We have seen that such error consists of approximation error and roundoff error. In a study of numerical methods and numerical analysis, we quantify and find bounds on such errors. Although this may not be the major source of error, it is important to know. Consequences of this type of error might be that a good model is rejected, that incorrect conclusions are deduced about the process being modeled, etc. The authors of this book have personal experience with these events.

Errors in the modeling process can sometimes be quantified in the solution process. If the model depends on parameters that are not known precisely, but bounds on those parameters are known, knowledge of these bounds can sometimes be incorporated into the mathematical equations, and the set of possible solutions can sometimes be computed or bounded. One tool that sometimes works is interval arithmetic. Other tools, less mathematically definite but applicable in different situations, are statistical methods and computing solutions to the model for many different values of the parameter.

Throughout this book, we introduce applications from many areas.
Example 1.25
The formula for the net capacitance when two capacitors of values $x$ and $y$ are connected in series is
$$z = \frac{xy}{x + y}.$$
Suppose the measured values of $x$ and $y$ are $x = 1$ and $y = 2$, respectively. Estimate the range of possible values of $z$, given that the true values of $x$ and $y$ are known to be within $\pm 10\%$ of the measured value.

In this example, the identification, assumptions, and construction have already been done. (It is well known how capacitances in a linear electrical circuit behave.) We are asked to analyze the error in the output of the computation, due to errors in the data. We may proceed using interval arithmetic, relying on the accuracy assumptions for the measured values. In particular, these assumptions imply that $x \in [0.9, 1.1]$ and $y \in [1.8, 2.2]$. We will plug these intervals into the expression for $z$, but we first use Theorem 1.8, part (2) as a guide to rewrite the expression for $z$ so $x$ and $y$ only occur once. (We do this so we obtain sharp bounds on the range, without overestimation.) Dividing the numerator and denominator for $z$ by $xy$, we obtain
$$z = \frac{1}{\dfrac{1}{x} + \dfrac{1}{y}}.$$
We use the intlab toolbox⁸ for matlab to evaluate $z$. We have the following dialog in matlab's command window.

>> intvalinit('DisplayInfsup')
===> Default display of intervals by infimum/supremum
>> x = intval('[0.9,1.1]')
intval x =
[ 0.8999, 1.1001]
>> y = intval('[1.8,2.2]')
intval y =
[ 1.7999, 2.2001]
>> z = 1/(1/x + 1/y)
intval z =
[ 0.5999, 0.7334]
>> format long
>> z
intval z =
[ 0.59999999999999, 0.73333333333334]
>>

Thus, the capacitance must lie between 0.5999 and 0.7334.

Note that x and y are input as strings. This is to assure that roundoff errors in converting the decimal expressions 0.9, 1.1, 1.8, and 2.2 into internal binary format are taken into account. See [26] for more examples of the use of intlab.

⁸ If one has matlab, intlab is available free of charge for non-commercial use from http://www.ti3.tu-harburg.de/~rump/intlab
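As a cross-check that does not require intlab, the range in the capacitance example can be sketched in plain matlab. For positive arguments, $z = 1/(1/x + 1/y)$ is increasing in both $x$ and $y$, so its extreme values over the box occur at two corners. The names `z_lo` and `z_hi` are illustrative, and, unlike interval arithmetic, this sketch does not account for roundoff, so it is not a rigorous enclosure.

```matlab
% Endpoint evaluation of z = 1/(1/x + 1/y) for x in [0.9,1.1], y in [1.8,2.2].
% Assumption: x, y > 0, so z is increasing in each argument.
z = @(x, y) 1 ./ (1 ./ x + 1 ./ y);
z_lo = z(0.9, 1.8);    % smallest x and y give the smallest z
z_hi = z(1.1, 2.2);    % largest x and y give the largest z
fprintf('z lies in approximately [%.4f, %.4f]\n', z_lo, z_hi);
```

The two values agree with the interval computed by intlab to the digits displayed; the intlab result is slightly wider because it is rounded outward.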
1.6 Exercises

1. Write down a polynomial $p(x)$ such that $|S(x) - p(x)| \le 10^{-10}$ for $-0.2 \le x \le 0.2$, where
$$S(x) = \begin{cases} \dfrac{\sin(x)}{x} & \text{if } x \ne 0, \\ 1 & \text{if } x = 0. \end{cases}$$
Note: $\mathrm{sinc}(x) = S(\pi x) = \sin(\pi x)/(\pi x)$ is the "sinc" function (well-known in signal processing, etc.).

(a) Show that your polynomial $p$ satisfies the condition $|\mathrm{sinc}(x) - p(x)| \le 10^{-10}$ for $x \in [-0.2, 0.2]$.
Hint: You can obtain polynomial approximations with error terms for $\mathrm{sinc}(x)$ by writing down Taylor polynomials and corresponding error terms for $\sin(x)$, then dividing these by $x$. This can be easier than trying to differentiate $\mathrm{sinc}(x)$. For the proof part, you can use, for example, the Taylor polynomial remainder formula or the alternating series test, and you can use interval arithmetic to obtain bounds.

(b) Plot your polynomial approximation and $\mathrm{sinc}(x)$ on the same graph,
(i) over the interval $[-0.2, 0.2]$,
(ii) over the interval $[-3, 3]$,
(iii) over the interval $[-10, 10]$.
2. Suppose $f$ has a continuous third derivative. Show that
$$\left| \frac{f(x+h) - f(x-h)}{2h} - f'(x) \right| = O(h^2).$$

3. Suppose $f$ has a continuous fourth derivative. Show that
$$\left| \frac{f(x+h) - 2f(x) + f(x-h)}{h^2} - f''(x) \right| = O(h^2).$$
4. Let $a = 0.41$, $b = 0.36$, and $c = 0.7$. Assuming a 2-digit decimal computer arithmetic with rounding, show that $\dfrac{a - b}{c} \ne \dfrac{a}{c} - \dfrac{b}{c}$ when using this arithmetic.

5. Write down a formula relating the unit roundoff $\delta$ of Definition 1.3 (page 14) and the machine epsilon $\epsilon_m$ defined on page 20.
6. Store and run the following matlab script. What are your results? What does the script compute? Can you say anything about the computer arithmetic underlying matlab?

eps = 1;
x = 1+eps;
while(x~=1)
   eps = eps/2;
   x = 1+eps;
end
eps = eps+(2*eps)^2
y = 1+eps;
y-1
7. Suppose, for illustration, we have a system with base $\beta = 10$, $t = 3$ decimal digits in the mantissa, and $L = -9$, $U = 9$ for the exponent. For example, $0.123 \times 10^4$, that is, 1230, is a machine number in this system. Suppose also that "round to nearest" is used in this system.

(a) What is HUGE for this system?
(b) What is TINY for this system?
(c) What is the machine epsilon $\epsilon_m$ for this system?
(d) Let $f(x) = \sin(x) + 1$.
i. Write down $fl(f(0))$ and $fl(f(0.0008))$ in normalized format for this toy system.
ii. Compute $fl(fl(f(0.0008)) - fl(f(0)))$. On the other hand, what is the nearest machine number to the exact value of $f(0.0008) - f(0)$?
iii. Compute $fl(fl(f(0.0008)) - fl(f(0)))/fl(0.0008)$. Compare this to the nearest machine number to the exact value of $(f(0.0008) - f(0))/0.0008$ and to $f'(0)$.
8. Let $f(x) = \dfrac{\ln(x+1) - \ln(x)}{2}$.

(a) Use four-digit decimal arithmetic with rounding to evaluate $f(100\,000)$.
(b) Use the Mean Value Theorem to approximate $f(x)$ in a form that avoids the loss of significant digits. Use this form to evaluate $f(x)$ for $x = 100\,000$ once again.
(c) Compare the relative errors for the answers obtained in (a) and (b).

9. Compute the condition number of $f(x) = e^{\sqrt{x^2 - 1}}$, $x > 1$, and discuss any possible ill-conditioning.

10. Let $f(x) = (\sin(x))^2 + x^2$. Use interval arithmetic to prove that there are no solutions to $f(x) = 0$ for $x \in [-1, -0.8]$.
The error decreases by a factor of 100 as h is decreased by a factor of 10.

2. Rounding error can be reduced by using more digits of accuracy, such as using double precision (or multiple precision) arithmetic.

To fully understand and avoid roundoff error, we should study some details of how computers and calculators represent and work with approximate numbers.
1.2.1 Floating Point Arithmetic and Rounding Error

Let $\beta$ = a positive integer, the base of the computer system. (Usually, $\beta = 2$ (binary) or $\beta = 16$ (hexadecimal).) Suppose a number $x$ has the exact base representation
$$x = (\pm 0.\alpha_1 \alpha_2 \alpha_3 \cdots \alpha_t \alpha_{t+1} \cdots)_\beta \, \beta^m = \pm q \beta^m,$$
where $q$ is the mantissa, $\beta$ is the base, $m$ is the exponent, $1 \le \alpha_1 \le \beta - 1$, and $0 \le \alpha_i \le \beta - 1$ for $i > 1$.

On a computer, we are restricted to a finite set of floating-point numbers $F = F(\beta, t, L, U)$ of the form $x^* = (\pm 0.a_1 a_2 \cdots a_t)_\beta \, \beta^m$, where $1 \le a_1 \le \beta - 1$, $0 \le a_i \le \beta - 1$ for $2 \le i \le t$, $L \le m \le U$, and $t$ is the number of digits. (In most floating point systems, $L$ is about $-64$ to $-1000$ and $U$ is about 64 to 1000.)
Example 1.10
(binary) $\beta = 2$
$$x^* = (0.1011)_2 \, 2^3 = \left( 1 \times \frac{1}{2} + 0 \times \frac{1}{4} + 1 \times \frac{1}{8} + 1 \times \frac{1}{16} \right) \times 8 = \frac{11}{2} = 5.5 \text{ (decimal)}.$$

REMARK 1.1 Most numbers cannot be exactly represented on a computer. Consider $x = 10.1 = 1010.0001\,1001\,1001 \ldots$ ($\beta = 2$). If $L = -127$, $U = 127$, $t = 24$, and $\beta = 2$, then $x \approx x^* = (0.10100001\,1001\,1001\,1001\,1001)_2 \, 2^4$.
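The remark can be observed directly in matlab. The following is a small sketch, assuming IEEE double precision for ordinary matlab numbers and using `single` to obtain a 24-digit binary mantissa; the variable names are illustrative.

```matlab
% Decimal fractions such as 0.1 and 10.1 have no finite binary expansion,
% so the stored values differ slightly from the decimal ones.
x = 10.1;
fprintf('%.20f\n', x);              % digits of the double nearest to 10.1
xs = single(10.1);                  % t = 24 binary digits, as in the remark
rel_err = abs(double(xs) - x) / x;  % difference between the two stored values
fprintf('relative difference, single vs. double: %.3e\n', rel_err);
same = (0.1 + 0.2 == 0.3);          % false in IEEE double arithmetic
disp(same);
```

The comparison in the last lines is false because each of 0.1, 0.2, and 0.3 is replaced by a nearby binary number before the addition is done.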
Question: Given a real number $x$, how do we define a floating point number $fl(x)$ in $F$, such that $fl(x)$ is close to $x$?

On modern machines, one of the following four ways is used to approximate a real number $x$ by a machine-representable number $fl(x)$.

round down: $fl(x) = x\downarrow$, the nearest machine representable number to the real number $x$ that is less than or equal to $x$.

round up: $fl(x) = x\uparrow$, the nearest machine number to the real number $x$ that is greater than or equal to $x$.

round to nearest: $fl(x)$ is the nearest machine number to the real number $x$.

round to zero, or "chopping": $fl(x)$ is the nearest machine number to the real number $x$ that is closer to 0 than $x$. The term "chopping" is because we simply "chop" the expansion of the real number, that is, we simply ignore the digits in the expansion of $x$ beyond the $t$-th one.

The default on modern systems is usually round to nearest, although chopping is faster or requires less circuitry. Round down and round up may be used, but with care, to produce results from a string of computations that are guaranteed to be less than or greater than the exact result.
Example 1.11
$\beta = 10$, $t = 5$, $x = 0.12345666 \cdots \times 10^7$. Then

$fl(x) = 0.12345 \times 10^7$ (chopping),
$fl(x) = 0.12346 \times 10^7$ (rounded to nearest).

(In this case, round down corresponds to chopping and round up corresponds to round to nearest.)
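The two roundings in this example can be emulated in matlab with a few lines. This is only a sketch: `expo`, `fl_nearest`, and `fl_chop` are hypothetical helper names (not built-in functions), they assume a nonzero scalar argument, and they work in binary double precision underneath, so exact decimal ties may not be handled as a true decimal machine would handle them.

```matlab
% Emulate t-digit decimal rounding of a nonzero scalar x.
% expo, fl_nearest and fl_chop are illustrative names, not built-in functions.
expo       = @(x) floor(log10(abs(x))) + 1;   % m such that x = 0.d1d2... * 10^m
fl_nearest = @(x, t) round(x * 10^(t - expo(x))) / 10^(t - expo(x));
fl_chop    = @(x, t) fix(x * 10^(t - expo(x))) / 10^(t - expo(x));

x  = 0.12345666e7;
xr = fl_nearest(x, 5);      % round to nearest with t = 5 digits
xc = fl_chop(x, 5);         % chopping with t = 5 digits
fprintf('nearest: %.0f, chopped: %.0f\n', xr, xc);
```

Here `fix` discards the fractional part, which is exactly the "chop" operation on the scaled mantissa, while `round` implements round to nearest.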
See Figure 1.2 for an example with $\beta = 10$ and $t = 1$. In that figure, the exhibited floating point numbers are $(0.1) \times 10^1, (0.2) \times 10^1, \ldots, (0.9) \times 10^1, 0.1 \times 10^2$.

[Figure: number line with tick marks at successive floating point numbers, running from $\beta^{m-1} = 1$ to $\beta^m = 10^1$, with spacing $\beta^{m-t} = 10^0 = 1$.]

FIGURE 1.2: An example floating point system: $\beta = 10$, $t = 1$, and $m = 1$.
Example 1.12
Let $a = 0.410$, $b = 0.000135$, and $c = 0.000431$. Assuming 3-digit decimal computer arithmetic with rounding to nearest, does $a + (b + c) = (a + b) + c$ when using this arithmetic?

Following the "rounding to nearest" definition of $fl$, we emulate the operations a machine would do, as follows:
$$a \leftarrow 0.410 \times 10^0, \quad b \leftarrow 0.135 \times 10^{-3}, \quad c \leftarrow 0.431 \times 10^{-3},$$
and
$$\begin{aligned} fl(b + c) &= fl(0.135 \times 10^{-3} + 0.431 \times 10^{-3}) \\ &= fl(0.566 \times 10^{-3}) \\ &= 0.566 \times 10^{-3}, \end{aligned}$$
so
$$\begin{aligned} fl(a + 0.566 \times 10^{-3}) &= fl(0.410 \times 10^0 + 0.566 \times 10^{-3}) \\ &= fl(0.410 \times 10^0 + 0.000566 \times 10^0) \\ &= fl(0.410566 \times 10^0) \\ &= 0.411 \times 10^0. \end{aligned}$$
On the other hand,
$$\begin{aligned} fl(a + b) &= fl(0.410 \times 10^0 + 0.135 \times 10^{-3}) \\ &= fl(0.410000 \times 10^0 + 0.000135 \times 10^0) \\ &= fl(0.410135 \times 10^0) \\ &= 0.410 \times 10^0, \end{aligned}$$
so
$$\begin{aligned} fl(0.410 \times 10^0 + c) &= fl(0.410 \times 10^0 + 0.431 \times 10^{-3}) \\ &= fl(0.410 \times 10^0 + 0.000431 \times 10^0) \\ &= fl(0.410431 \times 10^0) \\ &= 0.410 \times 10^0 \ne 0.411 \times 10^0. \end{aligned}$$
Thus, the associative law does not hold for floating point addition with "round to nearest". Furthermore, this illustrates that accuracy is improved if numbers of like magnitude are added first in a sum.
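The hand computation above can be checked with a short matlab sketch. The helper `fl` below is a hypothetical name for an emulation of 3-digit decimal round to nearest; it assumes nonzero scalar arguments and is carried out in double precision, so it is only an emulation of a decimal machine, not the real thing.

```matlab
% Emulated 3-digit decimal arithmetic: compare a + (b + c) with (a + b) + c.
% expo and fl are illustrative helper names, not built-in functions.
expo = @(x) floor(log10(abs(x))) + 1;                       % decimal exponent
fl   = @(x, t) round(x * 10^(t - expo(x))) / 10^(t - expo(x));
t = 3;
a = 0.410;  b = 0.000135;  c = 0.000431;
left  = fl(a + fl(b + c, t), t);    % small numbers added first
right = fl(fl(a + b, t) + c, t);    % small numbers absorbed one at a time
fprintf('a + (b + c) = %.3f, (a + b) + c = %.3f\n', left, right);
```

Adding the two small numbers first preserves their combined contribution, which is why the first grouping gives the more accurate result.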
The following error bound is useful in some analyses.

THEOREM 1.5
$$|x - fl(x)| \le \frac{1}{2} |x| \beta^{1-t} p,$$
where $p = 1$ for rounding and $p = 2$ for chopping.

DEFINITION 1.3 $\delta = \dfrac{p}{2} \beta^{1-t}$ is called the unit roundoff error.

Let $\epsilon = \dfrac{fl(x) - x}{x}$. Then $fl(x) = (1 + \epsilon)x$, where $|\epsilon| \le \delta$. With this, we have the following.

THEOREM 1.6
Let $\odot$ denote the operation $+$, $-$, $\times$, or $\div$, and let $x$ and $y$ be machine numbers. Then
$$fl(x \odot y) = (x \odot y)(1 + \epsilon), \quad \text{where } |\epsilon| \le \delta = \frac{p}{2} \beta^{1-t}.$$
Roundoff error that accumulates as a result of a sequence of arithmetic operations can be analyzed using this theorem. Such an analysis is called forward error analysis.

Because of the properties of floating point arithmetic, it is unreasonable to demand strict tolerances when the exact result is too large.
Example 1.13
Suppose $\beta = 10$ and $t = 3$ (3-digit decimal arithmetic), and suppose we wish to compute $10^4 \pi$ with a computed value $x$ such that $|10^4 \pi - x| < 10^{-2}$. The closest floating point number in our system to $10^4 \pi$ is $x = 0.314 \times 10^5 = 31400$. However, $|10^4 \pi - x| = 15.926 \ldots$. Hence, it is impossible to find a number $x$ in the system with $|10^4 \pi - x| < 10^{-2}$.

The error $|10^4 \pi - x|$ in this example is called the absolute error in approximating $10^4 \pi$. We see that absolute error is not an appropriate measure of error when using floating point arithmetic. For this reason, we use relative error.
DEFINITION 1.4 Let $x^*$ be an approximation to $x$. Then $|x - x^*|$ is called the absolute error, and $\left| \dfrac{x - x^*}{x} \right|$ is called the relative error.

For example,
$$\left| \frac{x - fl(x)}{x} \right| \le \delta = \frac{p}{2} \beta^{1-t} \quad \text{(unit roundoff error)}.$$
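The bound on the relative error can be observed numerically. The sketch below swaps the 3-digit decimal system for IEEE single precision ($\beta = 2$, $t = 24$, $p = 1$, so $\delta = 2^{-24}$), and takes the double precision value of $10^4 \pi$ as a stand-in for the exact number; both choices are assumptions made for illustration.

```matlab
% Absolute and relative error of rounding x = 1e4*pi to IEEE single precision.
x  = 1e4 * pi;                 % double precision stand-in for the exact value
xs = double(single(x));        % fl(x) in single precision, round to nearest
abs_err = abs(x - xs);         % depends on the size of x
rel_err = abs_err / abs(x);    % bounded by the unit roundoff, whatever the size of x
delta = 2^-24;                 % (p/2)*beta^(1-t) with beta = 2, t = 24, p = 1
fprintf('abs. error %.3e, rel. error %.3e, delta %.3e\n', abs_err, rel_err, delta);
```

Scaling `x` by a large factor would scale the absolute error with it, while the relative error stays below `delta`.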
1.2.1.1 Expression Evaluation and Condition Numbers

We now examine some common situations in which roundoff error can become large, and explain how to avoid many of these situations.
Example 1.14
$\beta = 10$, $t = 4$, $p = 1$. (Thus, $\delta = \frac{1}{2} 10^{-3} = 0.0005$.) Let $x = 0.5795 \times 10^5$, $y = 0.6399 \times 10^5$. Then
$$fl(x + y) = 0.1219 \times 10^6 = (x + y)(1 + \epsilon_1), \quad \epsilon_1 \approx -3.28 \times 10^{-4}, \quad |\epsilon_1| < \delta,$$
and
$$fl(xy) = 0.3708 \times 10^{10} = (xy)(1 + \epsilon_2), \quad \epsilon_2 \approx -5.95 \times 10^{-5}, \quad |\epsilon_2| < \delta.$$
(Note: $x + y = 0.12194 \times 10^6$, $xy = 0.37082205 \times 10^{10}$.)
Example 1.15
Suppose $\beta = 10$ and $t = 4$ (4 digit arithmetic), suppose $x_1 = 10000$ and $x_2 = x_3 = \cdots = x_{1001} = 1$. Then
$$fl(x_1 + x_2) = 10000, \quad fl(x_1 + x_2 + x_3) = 10000, \quad \ldots, \quad fl\left( \sum_{i=1}^{1001} x_i \right) = 10000$$
when we sum forward from $x_1$. But going backwards,
$$fl(x_{1001} + x_{1000}) = 2, \quad fl(x_{1001} + x_{1000} + x_{999}) = 3, \quad \ldots, \quad fl\left( \sum_{i=1001}^{1} x_i \right) = 11000,$$
which is the correct sum.

This example illustrates the point that large relative errors occur when a large number of small numbers is added to a large number, or when a very large number of small, almost-equal numbers is added. To avoid such large relative errors, one can sum from the smallest number to the largest number. However, this will not work if the numbers are all approximately equal. In such cases, one possibility is to group the numbers into sets of two adjacent numbers, summing two almost equal numbers together. One then groups those results into sets of two and sums these together, continuing until the total sum is reached. In this scheme, two almost equal numbers are always being summed, and the large relative error from repeatedly summing a small number to a large number is avoided.
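The grouping scheme just described (often called pairwise summation) can be sketched in matlab as follows. To keep the sketch short, it replaces 4-digit decimal arithmetic by IEEE single precision and assumes the number of terms is a power of 2, so that no padding is needed; the variable names are illustrative.

```matlab
% Naive versus pairwise summation of n equal numbers in IEEE single precision.
% Assumption: n is a power of 2, so the pairwise loop needs no padding.
n = 2^20;
v = single(0.1) * ones(1, n, 'single');
exact = n * double(v(1));            % reference value, computed in double

naive = single(0);
for i = 1:n
    naive = naive + v(i);            % small term added to a growing sum
end

while numel(v) > 1
    v = v(1:2:end) + v(2:2:end);     % add adjacent pairs of like magnitude
end
pairwise = v;

rel_naive    = abs(double(naive) - exact) / exact;
rel_pairwise = abs(double(pairwise) - exact) / exact;
fprintf('naive: %.3e, pairwise: %.3e\n', rel_naive, rel_pairwise);
```

With all terms equal and a power-of-2 count, every pairwise addition is a doubling and is therefore exact, which makes the contrast with the running sum especially clear; with general data the pairwise error is small but not zero.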
Example 1.16
$x_1 = 15.314768$, $x_2 = 15.314899$, $\beta = 10$, $t = 6$ (6-digit decimal accuracy). Then $x_2 - x_1 \approx fl(x_2) - fl(x_1) = 15.3149 - 15.3148 = 0.0001$. Thus,
$$\left| \frac{x_2 - x_1 - (fl(x_2) - fl(x_1))}{x_2 - x_1} \right| = \frac{0.000131 - 0.0001}{0.000131} = 0.237 = 23.7\% \text{ relative error}.$$

This example illustrates that large relative errors can occur when two nearly equal numbers are subtracted on a computer. Sometimes, an algorithm can be modified to reduce rounding error occurring from this source, as the following example illustrates.
Example 1.17
Consider finding the roots of $ax^2 + bx + c = 0$, where $b^2$ is large compared with $|4ac|$. The most common formula for the roots is
$$x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.$$
Consider $x^2 + 100x + 1 = 0$, $\beta = 10$, $t = 4$, $p = 2$, and 4-digit chopped arithmetic. Then
$$x_1 = \frac{-100 + \sqrt{9996}}{2}, \quad x_2 = \frac{-100 - \sqrt{9996}}{2},$$
but $\sqrt{9996} \approx 99.97$ (4 digit arithmetic, chopped). Thus,
$$x_1 \approx \frac{-100 + 99.97}{2}, \quad x_2 \approx \frac{-100 - 99.97}{2}.$$
Hence, $x_1 \approx -0.015$, $x_2 \approx -99.98$, but $x_1 = -0.010001 \ldots$ and $x_2 = -99.989999 \ldots$, so the relative errors in $x_1$ and $x_2$ are 50% and 0.01%, respectively.

Let's change the algorithm. Assume $b \ge 0$ (can always make $b \ge 0$). Then
$$x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \left( \frac{-b - \sqrt{b^2 - 4ac}}{-b - \sqrt{b^2 - 4ac}} \right) = \frac{4ac}{2a(-b - \sqrt{b^2 - 4ac})} = \frac{-2c}{b + \sqrt{b^2 - 4ac}},$$
and
$$x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a} \quad \text{(the same as before)}.$$
Then, for the above values,
$$x_1 = \frac{-2(1)}{100 + \sqrt{9996}} \approx \frac{-2}{100 + 99.97} = -0.0100.$$
Now, the relative error in $x_1$ is also 0.01%.
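The same effect appears in IEEE double precision when $b^2$ dominates $4ac$ strongly enough. The following sketch uses the illustrative coefficients $a = 1$, $b = 10^8$, $c = 1$ (chosen for this sketch, not taken from the example above) and compares the common formula with the rewritten one; the reference value $-10^{-8}$ for the small root is accurate here to well beyond double precision.

```matlab
% Roots of a*x^2 + b*x + c = 0 with b > 0 and b^2 much larger than 4*a*c.
a = 1;  b = 1e8;  c = 1;
d = sqrt(b^2 - 4*a*c);
x1_naive  = (-b + d) / (2*a);     % subtracts two nearly equal numbers
x1_stable = -2*c / (b + d);       % rewritten form, no cancellation
x2        = (-b - d) / (2*a);     % no cancellation when b > 0
x1_ref    = -1e-8;                % small root, correct to about 16 digits
rel_naive  = abs(x1_naive  - x1_ref) / abs(x1_ref);
rel_stable = abs(x1_stable - x1_ref) / abs(x1_ref);
fprintf('naive: %.2e, rewritten: %.2e\n', rel_naive, rel_stable);
```

For $b < 0$ the roles of the two roots are exchanged, which is why the text assumes $b \ge 0$ before rewriting the formula.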
Let us now consider error in function evaluation. Consider a single valued function $f(x)$ and let $x^* = fl(x)$ be the floating point approximation of $x$. Therefore, the machine evaluates $f(x^*) = f(fl(x))$, which is an approximate value of $f(x)$ at $x = x^*$. Then the perturbation in $f(x)$ for small perturbations in $x$ can be computed via Taylor's formula. This is illustrated in the next theorem.

THEOREM 1.7 The relative error in functional evaluation is
$$\left| \frac{f(x) - f(x^*)}{f(x)} \right| \approx \left| \frac{x f'(x)}{f(x)} \right| \left| \frac{x - x^*}{x} \right|.$$

PROOF The linear Taylor approximation of $f(x^*)$ about $x$ for small values of $|x - x^*|$ is given by $f(x^*) \approx f(x) + f'(x)(x^* - x)$. Rearranging the terms immediately yields the result.

This leads us to the following definition.

DEFINITION 1.5 The condition number of a function $f(x)$ is
$$\kappa_f(x) = \left| \frac{x f'(x)}{f(x)} \right|.$$

The condition number describes how large the relative error in function evaluation is with respect to the relative error in the machine representation of $x$. In other words, $\kappa_f(x)$ is a measure of the degree of sensitivity of the function at $x$.
Example 1.18
Let $f(x) = \sqrt{x}$. The condition number of $f(x)$ about $x$ is
$$\kappa_f(x) = \left| \frac{x \frac{1}{2\sqrt{x}}}{\sqrt{x}} \right| = \frac{1}{2}.$$
This suggests that $f(x)$ is well-conditioned.

Example 1.19
Let $f(x) = \sqrt{x - 2}$. The condition number of $f(x)$ about $x$ is
$$\kappa_f(x) = \left| \frac{x}{2(x - 2)} \right|.$$
This is not defined at $x^* = 2$. Hence the function $f(x)$ is numerically unstable and ill-conditioned for values of $x$ close to 2.
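The two condition numbers above can be estimated numerically by perturbing $x$ by a small relative amount and comparing relative changes. The following is a sketch only: `kappa_est` is a hypothetical helper name, and the relative perturbation `r` and the evaluation points 3 and 2.001 are arbitrary choices made for illustration.

```matlab
% Estimate kappa = |x f'(x) / f(x)| from a small relative perturbation of x.
% kappa_est is an illustrative helper name, not a built-in function.
kappa_est = @(f, x, r) abs((f(x*(1 + r)) - f(x)) / f(x)) / abs(r);
r = 1e-6;                                          % relative perturbation of x
k_sqrt  = kappa_est(@(x) sqrt(x), 3, r);           % formula gives 1/2
k_near2 = kappa_est(@(x) sqrt(x - 2), 2.001, r);   % formula gives 2.001/0.002 = 1000.5
fprintf('sqrt(x) at 3: %.4f, sqrt(x-2) at 2.001: %.1f\n', k_sqrt, k_near2);
```

The estimate for $\sqrt{x - 2}$ grows without bound as the evaluation point approaches 2, in agreement with the formula $|x/(2(x-2))|$.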
REMARK 1.2 If $x = f(x) = 0$, then the condition number is simply $|f'(x)|$. If $x = 0$, $f(x) \ne 0$ (or $f(x) = 0$, $x \ne 0$), then it is more useful to consider the relation between absolute errors than relative errors. The condition number then becomes $|f'(x)/f(x)|$.

REMARK 1.3 Generally, if a numerical approximation $\tilde{z}$ to a quantity $z$ is computed, the relative error is related to the number of digits after the decimal point that are correct. For example, if $z = 0.0000123453$ and $\tilde{z} = 0.00001234543$, we say that $\tilde{z}$ is correct to 5 significant digits. Expressing $z$ as $0.123453 \times 10^{-4}$ and $\tilde{z}$ as $0.123454 \times 10^{-4}$, we see that if we round $\tilde{z}$ to the nearest number with five digits in its mantissa, all of those digits are correct, whereas, if we do the same with six digits, the sixth digit is not correct. Significant digits is the more logical way to talk about accuracy in a floating point computation where we are interested in relative error, rather than "number of digits after the decimal point", which can have a different meaning. (Here, one might say that $\tilde{z}$ is correct to 9 digits after the decimal point.)
1.2.2 Practicalities and the IEEE Floating Point Standard

Prior to 1985, different machines used different word lengths and different bases, and different machines rounded, chopped, or did something else to form the internal representation $fl(x)$ for real numbers $x$. For example, IBM mainframes generally used hexadecimal arithmetic ($\beta = 16$), with 8 hexadecimal digits total (for the base, sign, and exponent) in "single precision" numbers and 16 hexadecimal digits total in "double precision" numbers. Machines such as the Univac 1108 and Honeywell Multics systems used base $\beta = 2$ and 36 binary digits (or "bits") total in single precision numbers and 72 binary digits total in double precision numbers. An unusual machine designed at Moscow State University from 1955-1965, the "Setun", even used base-3 ($\beta = 3$, or "ternary") numbers. Some computers had 32 bits total in single precision numbers and 64 bits total in double precision numbers, while some "supercomputers" (such as the Cray-1) had 64 bits total in single precision numbers and 128 bits total in double precision numbers.

Some hand-held calculators in existence today (such as some Texas Instruments calculators) can be viewed as implementing decimal (base 10, $\beta = 10$) arithmetic, say, with $L = -999$ and $U = 999$, and $t = 14$ digits in the mantissa.

Except for the Setun (the value of whose ternary digits corresponded to "positive", "negative", and "neutral" in circuit elements or switches), digital computers are mostly based on binary switches or circuit elements (that is, "on" or "off"), so the base $\beta$ is usually 2 or a power of 2. For example, the IBM hexadecimal digit could be viewed as a group of 4 binary digits.²

Older floating point implementations did not even always fit exactly into the model we have previously described. For example, if $x$ is a number in the system, then $-x$ may not have been a number in the system, or, if $x$ were a number in the system, then $1/x$ may have been too large to be representable in the system.
To promote predictability, portability, reliability, and rigorous error bounding in floating point computations, the Institute of Electrical and Electronics Engineers (IEEE) and American National Standards Institute (ANSI) published a standard for binary floating point arithmetic in 1985: IEEE/ANSI 754-1985: Standard for Binary Floating Point Arithmetic, often referenced as "IEEE-754", or simply "the IEEE standard³". Almost all computers in existence today, including personal computers and workstations based on Intel, AMD, Motorola, etc. chips, implement most of the IEEE standard.

In this standard, $\beta = 2$, 32 bits total are used in a single precision number (an "IEEE single"), and 64 bits total are used for a double precision number ("IEEE double"). In a single precision number, 1 bit is used for the sign, 8 bits are used for the exponent, and $t = 23$ bits are used for the mantissa. In double precision numbers, 1 bit is used for the sign, 11 bits are used for the exponent, and 52 bits are used for the mantissa. Thus, for single precision numbers, the exponent is between 0 and $(11111111)_2 = 255$, and 127 is subtracted from this, to get an exponent between $-127$ and 128. In IEEE numbers, the minimum and maximum exponent are used to denote special symbols (such as infinity and "unnormalized" numbers), so the exponent in single precision represents magnitudes between $2^{-126} \approx 10^{-38}$ and $2^{127} \approx 10^{38}$. The mantissa for single precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{23} 2^{-i} = 2(1 - 2^{-24}) \approx 2$. Similarly, the exponent for double precision numbers is effectively between $2^{-1022} \approx 10^{-308}$ and $2^{1023} \approx 10^{308}$, while the mantissa for double precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{52} 2^{-i} \approx 2$.
Summarizing, the parameters for IEEE arithmetic appear in Table 1.1.

TABLE 1.1: Parameters for IEEE arithmetic
precision   β    L        U      t
single      2    -126     127    24
double      2    -1022    1023   53

In many numerical computations, such as solving the large linear systems arising from partial differential equation models, more digits or a larger exponent range is required than is available with IEEE single precision. For this reason, many numerical analysts at present have adopted IEEE double precision as the default precision. For example, underlying computations in the popular computational environment matlab are done in IEEE double precision.

² An exception is in some systems for business calculations, where base 10 is implemented.
³ An update to the 1985 standard was made in 2008. This update gives clarifications of certain ambiguous points, provides certain extensions, and specifies a standard for decimal arithmetic.
IEEE arithmetic provides four ways of defining $fl(x)$, that is, four "rounding modes", namely, "round down", "round up", "round to nearest", and "round to zero", are specified as follows. The four elementary operations $+$, $-$, $\times$, and $/$ must be such that $fl(x \odot y)$ is implemented for all four rounding modes, for $\odot \in \{-, +, \times, /, \sqrt{\cdot}\}$.

The default mode (if the rounding mode is not explicitly set) is normally "round to nearest", to give an approximation after a long string of computations that is hopefully near the exact value. If the mode is set to "round down" and a string of computations is done, then the result is less than or equal to the exact result. Similarly, if the mode is set to "round up", then the result of a string of computations is greater than or equal to the exact result. In this way, mathematically rigorous bounds on an exact result can be obtained. (This technique must be used astutely, since naive use could result in bounds that are too large to be meaningful.)
Several parameters more directly related to numerical computations than $L$, $U$, and $t$ are associated with any floating point number system. These are

HUGE: the largest representable number in the floating point system,

TINY: the smallest positive representable number in the floating point system,

$\epsilon_m$: the machine epsilon, the smallest positive number which, when added to 1, gives something other than 1 when using the rounding mode "round to the nearest".

These so-called "machine constants" appear in Table 1.2 for the IEEE single and IEEE double precision number systems.
For IEEE arithmetic, 1/TINY < HUGE, but 1/HUGE < TINY. This brings up the question of what happens when the result of a computation has absolute value less than the smallest number representable in the system, or has absolute value greater than the largest number representable in the system. In the first case, an underflow occurs, while, in the second case, an overflow occurs. In floating point computations, it is usually (but not always) reasonable to replace the result of an underflow by 0, but it is usually more problematical when an overflow occurs. Many systems prior to the IEEE standard replaced an underflow by 0 but stopped when an overflow occurred.

TABLE 1.2: Machine constants for IEEE arithmetic
Precision   HUGE                              TINY                                 ε_m
single      $2^{127} \approx 3.40 \cdot 10^{38}$      $2^{-126} \approx 1.18 \cdot 10^{-38}$      $2^{-24} + 2^{-45} \approx 5.96 \cdot 10^{-8}$
double      $2^{1023} \approx 1.79 \cdot 10^{308}$    $2^{-1022} \approx 2.23 \cdot 10^{-308}$    $2^{-53} + 2^{-105} \approx 1.11 \cdot 10^{-16}$
The IEEE standard specifies representations for special numbers $\infty$, $-\infty$, $+0$, $-0$, and NaN, where the latter represents “not a number.” The standard specifies that computations do not stop when an overflow or underflow occurs, or when quantities such as $\sqrt{-1}$, $1/0$, $-1/0$, etc. are encountered (although many programming languages by default or optionally do stop). For example, the result of an overflow is set to $\infty$, whereas the result of $\sqrt{-1}$ is set to NaN, and computation continues. The standard also specifies “gradual underflow,” that is, setting the result to a “denormalized” number, or a number in the floating point format whose first digit in the mantissa is equal to 0. Computation rules for these special numbers, such as NaN $\times$ any number = NaN and $\infty$ $\times$ any positive normalized number = $\infty$, allow such “nonstop” arithmetic.
Although the IEEE nonstop arithmetic is useful in many contexts, the numerical analyst should be aware of it and be cautious in interpreting results. In particular, algorithms may not behave as expected if many intermediate results contain $\infty$ or NaN, and the accuracy is less than expected when denormalized numbers are used. In fact, many programming languages, by default or with a controllable option, stop if $\infty$ or NaN occurs, but implement IEEE nonstop arithmetic with an option.
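The propagation rules for the special numbers can be observed directly in matlab. The following is a minimal sketch, assuming IEEE double precision with gradual underflow enabled (the usual default); all variable names are illustrative only. Note that matlab evaluates sqrt(-1) as the complex number 0 + 1i rather than NaN, so 0/0 is used here to produce a NaN.

```matlab
% Nonstop arithmetic with IEEE special values in double precision.
pos_inf  = 2 * realmax;      % overflow: the result is set to Inf
neg_inf  = -1 / 0;           % negative number divided by zero: -Inf
not_num  = 0 / 0;            % undefined quotient: NaN
nan_prop = not_num * 5;      % NaN times any number is NaN
inf_prop = pos_inf * 3;      % Inf times a positive number is Inf

% Gradual underflow: results below realmin lose accuracy but are not 0.
denorm   = realmin / 1024;   % a denormalized number, still positive
smallest = realmin * eps;    % 2^(-1074), the smallest positive denormal
gone     = smallest / 2;     % nothing smaller is representable: 0

disp([pos_inf, neg_inf, not_num, nan_prop, inf_prop])
disp([denorm, smallest, gone])
disp(not_num == not_num)     % NaN compares unequal even to itself
```

Because a NaN or Inf produced early can silently propagate to the final answer, a check such as any(isnan(v)) or any(isinf(v)) on intermediate results is a cheap safeguard.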
Example 1.20

IEEE double precision floating point arithmetic underlies most computations in matlab. (This is true even if only four decimal digits are displayed.) One obtains the machine epsilon with the function eps, one obtains TINY with the function realmin, and one obtains HUGE with the function realmax. (The value returned by eps is $2^{-52}$, the spacing between 1 and the next larger floating point number.) Observe the following matlab dialog:

>> epsm = eps(1d0)
epsm = 2.2204e-016
>> TINY = realmin
TINY = 2.2251e-308
>> HUGE = realmax
HUGE = 1.7977e+308
>> 1/TINY
ans = 4.4942e+307
>> 1/HUGE
ans = 5.5627e-309
>> HUGE^2
ans = Inf
>> TINY^2
ans = 0
>> new_val = 1+epsm
new_val = 1.0000
>> new_val - 1
ans = 2.2204e-016
>> too_small = epsm/2
too_small = 1.1102e-016
>> not_new = 1+too_small
not_new = 1
>> not_new - 1
ans = 0
>>
Example 1.21

(Illustration of underflow and overflow) Suppose, for the purposes of illustration, we have a system with $\beta = 10$, $t = 2$, and one digit in the exponent, so that the positive numbers in the system range from $0.10 \times 10^{-9}$ to $0.99 \times 10^{9}$, and suppose we wish to compute $N = \sqrt{x_1^2 + x_2^2}$, where $x_1 = x_2 = 10^6$. Then both $x_1$ and $x_2$ are exactly represented in the system, and the nearest floating point number in the system to $N$ is $0.14 \times 10^7$, well within range. However, $x_1^2 = 10^{12}$, larger than the maximum floating point number in the system.

In older systems, an overflow usually would result in stopping the computation, while in IEEE arithmetic, the result would be assigned the symbol “Infinity.” The result of adding “Infinity” to “Infinity,” then taking the square root, would be “Infinity,” so that $N$ would be assigned “Infinity.” Similarly, if $x_1 = x_2 = 10^{-6}$, then $x_1^2 = 10^{-12}$, smaller than the smallest representable machine number, causing an “underflow.” On older systems, the result is usually set to 0. On IEEE systems, if “gradual underflow” is switched on, the result either becomes a denormalized number, with less than full accuracy, or is set to 0; without gradual underflow on IEEE systems, the result is set to 0. When the result is set to 0, a value of 0 is stored in $N$, whereas the closest floating point number in the system is $0.14 \times 10^{-5}$, well within range. To avoid this type of catastrophic underflow and overflow in the computation of $N$, we may use the following scheme:
1. $s \leftarrow \max\{|x_1|, |x_2|\}$.
2. $\eta_1 \leftarrow x_1/s$; $\eta_2 \leftarrow x_2/s$.
3. $N \leftarrow s\sqrt{\eta_1^2 + \eta_2^2}$.
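The three-step scheme above can be sketched in matlab as follows. This is an illustration in IEEE double precision rather than in the two-digit decimal system of the example, so the inputs are chosen as $10^{200}$ (an assumption of this sketch) to force the overflow; the guard for $s = 0$ is also an addition, needed to avoid 0/0. The built-in function hypot is printed for comparison.

```matlab
% Naive and scaled evaluation of N = sqrt(x1^2 + x2^2) in double precision.
x1 = 1e200;  x2 = 1e200;            % N is about 1.4e200, well within range

naive = sqrt(x1^2 + x2^2);          % x1^2 overflows, so naive is Inf

s = max(abs(x1), abs(x2));          % step 1
if s == 0
    scaled = 0;                     % both inputs are 0; avoid 0/0
else
    eta1 = x1 / s;  eta2 = x2 / s;  % step 2: now |eta1|, |eta2| <= 1
    scaled = s * sqrt(eta1^2 + eta2^2);   % step 3
end

fprintf('naive = %g, scaled = %g, hypot = %g\n', naive, scaled, hypot(x1, x2));
```

Setting x1 = x2 = 1e-200 instead shows the underflow case: the naive formula returns 0, while the scaled formula returns a value near 1.4e-200.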
1.2.2.1 Input and Output

For examining the output to large numerical computations arising from mathematical models, plots, graphs, and movies composed of such plots and graphs are often preferred over tables of values. However, to develop such models and study numerical algorithms, it is necessary to examine individual numbers. Because humans are trained to comprehend decimal numbers more easily than binary numbers, the binary format used in the machine is usually converted to a decimal format for display or printing. In many programming languages and environments (such as all versions of Fortran, C, C++, and in matlab), the format is of a form similar to $\pm d_1.d_2d_3 \ldots d_m\mathrm{e}{\pm}\delta_1\delta_2\delta_3$ or $\pm d_1.d_2d_3 \ldots d_m\mathrm{E}{\pm}\delta_1\delta_2\delta_3$, where the “e” or “E” denotes the “exponent” of 10. For example, -1.00e+003 denotes $-1 \times 10^3 = -1000$. Numbers are usually also input either in a standard decimal form (such as 0.001) or in this exponential format (such as 1.0e-3). (This notation originates from the earliest computers, where the only output was a printer, and the printer could only print numerical digits and the 26 upper case letters in the Roman alphabet.)
Thus, for input, a decimal fraction needs to be converted to a binary floating point number, while for output, a binary floating point number needs to be converted to a decimal fraction. This conversion necessarily is inexact. For example, the exact decimal fraction 0.1 converts to the infinitely repeating binary expansion $(0.0\overline{0011})_2$, which needs to be rounded into the binary floating point system. The IEEE 754 standard specifies that the result of a decimal to binary conversion, within a specified range of input formats, be the nearest floating point number to the exact result, over a specified range, and that, within a specified range of formats, a binary to decimal conversion be the nearest number in the specified format (which depends on the number $m$ of decimal digits requested to be printed).
Thus, the number that one sees as output is usually not exactly the number that is represented in the computer. Furthermore, while the floating point operations on binary numbers are usually implemented in hardware or “firmware” independently of the software system, the decimal to binary and binary to decimal conversions are usually implemented separately as part of the programming language (such as Fortran, C, C++, Java, etc.) or software system (such as matlab). The individual standards for these languages, if there are any, may not specify accuracy for such conversions, and the languages sometimes do not conform to the IEEE standard. That is, the number that one sees printed may not even be the closest number in that format to the actual number.
This inexactness in conversion usually does not cause a problem, but may cause much confusion in certain instances. In those instances (such as in “debugging,” or finding programming blunders), one may need to examine the binary numbers directly. One way of doing this is in an “octal,” or base-8 format, in which each digit (between 0 and 7) is interpreted as a group of three binary digits, or in hexadecimal format (where the digits are 0-9, A, B, C, D, E, F), in which each digit corresponds to a group of four binary digits.
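The conversion error for 0.1, and the hexadecimal view of the stored number, can be inspected in matlab. The following is a small sketch, assuming IEEE double precision; num2hex returns the 64-bit pattern as 16 hexadecimal digits, and the number of decimal digits requested in the format strings is an arbitrary choice.

```matlab
% The decimal fraction 0.1 is not a binary floating point number.
x = 0.1;
fprintf('%.25f\n', x);        % decimal digits of the double actually stored
disp(num2hex(x))              % 3fb999999999999a: the group 1001 repeats,
                              % and the last hex digit is rounded up to a

% A visible consequence: both sides are rounded, but not to the same double.
is_equal = (0.1 + 0.2 == 0.3);
disp(is_equal)                % 0 (false)
fprintf('%.17g  versus  %.17g\n', 0.1 + 0.2, 0.3);
```

For this reason, floating point results are normally compared with a tolerance, for instance abs(a - b) <= tol, rather than with ==.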
1.2.2.2 Standard Functions

To enable accurate computation of elementary functions such as sin, cos, and exp, IEEE 754 specifies that a “long” 80-bit register (with “guard digits”) be available for intermediate computations. Furthermore, IEEE 754-2008, an official update to IEEE 754-1985, provides a list of functions it recommends be implemented, and specifies accuracy requirements (in terms of correct rounding) for those functions a programming language elects to implement.
REMARK 1.4 Alternative number systems, such as variable precision arithmetic, multiple precision arithmetic, rational arithmetic, and combinations of approximate and symbolic arithmetic have been investigated and implemented. These have various advantages over the traditional floating point arithmetic we have been discussing, but also have disadvantages, and usually require more time, more circuitry, or both. Eventually, with the advance of computer hardware and better understanding of these alternative systems, their use may become more ubiquitous. However, for the foreseeable future, traditional floating point number systems will be the primary tool in numerical computations.
1.3 Interval Computations

Interval computations are useful for two main purposes:

• to use floating point computations to compute mathematically rigorous bounds on an exact result (and hence to rigorously bound roundoff error);

• to use floating point computations to compute mathematically rigorous bounds on the ranges of functions over boxes.

In complicated traditional floating point algorithms, naive arrangement of interval computations usually gives bounds that are too wide to be of practical use. For this reason, interval computations have been ignored by many. However, used cleverly and where appropriate, interval computations are powerful, and provide rigor and validation when other techniques cannot.

Interval computations are based on interval arithmetic.
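1.3.1 Interval Arithmetic

In interval arithmetic, we define operations on intervals, which can be considered as ordered pairs of real numbers. We can think of each interval as representing the range of possible values of a quantity. The result of an operation is then an interval that represents the range of possible results of the operation, as the first argument ranges over all points in the first interval and the second argument ranges over all values in the second interval. To state this symbolically, let $\boldsymbol{x} = [\underline{x}, \overline{x}]$ and $\boldsymbol{y} = [\underline{y}, \overline{y}]$, and define the four elementary operations by

$$\boldsymbol{x} \odot \boldsymbol{y} = \{\, x \odot y \mid x \in \boldsymbol{x} \text{ and } y \in \boldsymbol{y} \,\} \quad \text{for } \odot \in \{+, -, \times, \div\}. \tag{1.1}$$

Interval arithmetic's usefulness derives from the fact that the mathematical characterization in Equation (1.1) is equivalent to the following operational definitions:

$$\begin{aligned}
\boldsymbol{x} + \boldsymbol{y} &= [\underline{x} + \underline{y},\ \overline{x} + \overline{y}], \\
\boldsymbol{x} - \boldsymbol{y} &= [\underline{x} - \overline{y},\ \overline{x} - \underline{y}], \\
\boldsymbol{x} \times \boldsymbol{y} &= [\min\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\},\ \max\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\}], \\
\frac{1}{\boldsymbol{x}} &= \left[\frac{1}{\overline{x}},\ \frac{1}{\underline{x}}\right] \quad \text{if } \underline{x} > 0 \text{ or } \overline{x} < 0, \\
\boldsymbol{x} \div \boldsymbol{y} &= \boldsymbol{x} \times \frac{1}{\boldsymbol{y}}.
\end{aligned} \tag{1.2}$$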
The ranges of the four elementary interval arithmetic operations are exactly the ranges of the corresponding real operations, but, if such operations are composed, bounds on the ranges of real functions can be obtained. For example, if

$$f(x) = (x + 1)(x - 1), \tag{1.3}$$

then

$$f([-2, 2]) = \bigl([-2, 2] + 1\bigr)\bigl([-2, 2] - 1\bigr) = [-1, 3][-3, 1] = [-9, 3],$$

which contains the exact range $[-1, 3]$.
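The operational definitions (1.2) translate almost literally into matlab. The following is a minimal sketch in which an interval is stored as a row vector [lower, upper]; the function names iadd, isub, iprods, and imul are hypothetical (they are not intlab functions), and no outward rounding is performed, so the results are rigorous only when, as here, every operation happens to be exact.

```matlab
% Intervals are stored as 1-by-2 row vectors [lower, upper].
iadd   = @(x, y) [x(1) + y(1), x(2) + y(2)];
isub   = @(x, y) [x(1) - y(2), x(2) - y(1)];
iprods = @(x, y) [x(1)*y(1), x(1)*y(2), x(2)*y(1), x(2)*y(2)];
imul   = @(x, y) [min(iprods(x, y)), max(iprods(x, y))];

x   = [-2, 2];
one = [1, 1];                                 % the number 1 as an interval

f_val = imul(iadd(x, one), isub(x, one));     % (x + 1)(x - 1) over [-2, 2]
disp(f_val)                                   % the enclosure [-9, 3]
```

The enclosure [-9, 3] agrees with the hand computation for (1.3), and it strictly contains the exact range [-1, 3].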
REMARK 1.5 In some definitions of interval arithmetic, division by intervals containing 0 is defined, consistent with (1.1). For example,

$$\frac{[1, 2]}{[-3, 4]} = \left[-\infty, -\frac{1}{3}\right] \cup \left[\frac{1}{4}, \infty\right] = \mathbb{R}^* \setminus \left(-\frac{1}{3}, \frac{1}{4}\right),$$

where $\mathbb{R}^*$ is the extended real number system,⁴ consisting of the real numbers with the two additional numbers $-\infty$ and $\infty$. This extended interval arithmetic⁵ was originally invented by William Kahan⁶ for computations with continued fractions, but has wider use than that. Although a closed system can be defined for the sets arising from this extended arithmetic, typically the complements of intervals (i.e., the unions of two semi-infinite intervals) are immediately intersected with intervals, to obtain zero, one, or two intervals. Interval arithmetic can then proceed using (1.2).

⁴ Also known as the two-point compactification of the real numbers.
The power of interval arithmetic lies in its implementation on computers. In particular, outwardly rounded interval arithmetic allows rigorous enclosures for the ranges of operations and functions. This makes a qualitative difference in scientific computations, since the results are now intervals in which the exact result must lie. It also enables use of floating point computations for automated theorem proving.
Outward rounding can be implemented on any machine that has downward rounding and upward rounding, such as any machine that complies with the IEEE 754 standard. For example, take $\boldsymbol{x} + \boldsymbol{y} = [\underline{x} + \underline{y}, \overline{x} + \overline{y}]$. If $\underline{x} + \underline{y}$ is computed with downward rounding, and $\overline{x} + \overline{y}$ is computed with upward rounding, then the resulting interval $\boldsymbol{z} = [\underline{z}, \overline{z}]$ that is represented in the machine must contain the exact range of $x + y$ for $x \in \boldsymbol{x}$ and $y \in \boldsymbol{y}$. We call the expansion of the interval from rounding the lower end point down and the upper end point up roundout error.
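Interval arithmetic is only subdistributive. That is, if $\boldsymbol{x}$, $\boldsymbol{y}$, and $\boldsymbol{z}$ are intervals, then

$$\boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) \subseteq \boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z}, \quad \text{but} \quad \boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) \ne \boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z} \text{ in general.} \tag{1.4}$$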
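A small case in which the inclusion in (1.4) is strict can be checked in matlab. The sketch below uses the same hypothetical row-vector representation [lower, upper] and helper names (iadd, iprods, imul) as an illustration; the particular intervals are chosen for this sketch so that every operation is exact.

```matlab
% Subdistributivity: x(y + z) can be strictly narrower than xy + xz.
iadd   = @(x, y) [x(1) + y(1), x(2) + y(2)];
iprods = @(x, y) [x(1)*y(1), x(1)*y(2), x(2)*y(1), x(2)*y(2)];
imul   = @(x, y) [min(iprods(x, y)), max(iprods(x, y))];

x = [-1, 1];  y = [1, 1];  z = [-1, -1];

lhs = imul(x, iadd(y, z));            % x(y + z) = [-1, 1][0, 0] = [0, 0]
rhs = iadd(imul(x, y), imul(x, z));   % xy + xz = [-1, 1] + [-1, 1] = [-2, 2]

disp(lhs)
disp(rhs)
```

In the expanded form, x occurs twice and the two occurrences are treated as if they varied independently, which is the source of the wider result.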
As a result, algebraic expressions that would be equivalent if real values are substituted for the variables are not equivalent if interval values are used. For example, instead of writing $(x - 1)(x + 1)$ for $f(x)$ as in (1.3), suppose we write

$$f(x) = x^2 - 1, \tag{1.5}$$

and suppose we provide a routine that computes an enclosure for the range of $x^2$ that is the exact range to within roundoff error. Such a routine could be as follows:
⁵ There are small differences in current definitions of extended interval arithmetic. For example, in some systems, $-\infty$ and $\infty$ are not considered numbers, but just descriptive symbols. In those systems, $[1, 2]/[-3, 4] = (-\infty, -1/3] \cup [1/4, \infty) = \mathbb{R} \setminus (-1/3, 1/4)$. See [31] for a theoretical analysis of extended arithmetic.

⁶ Who also was a major contributor to the IEEE 754 standard.

ALGORITHM 1.1

(Computing an interval whose end points are machine numbers and which encloses the range of $x^2$.)

INPUT: $\boldsymbol{x} = [\underline{x}, \overline{x}]$.

OUTPUT: a machine-representable interval that contains the range of $x^2$ over $\boldsymbol{x}$.

IF $\underline{x} \ge 0$ THEN

   RETURN $[\underline{x}^2, \overline{x}^2]$, where $\underline{x}^2$ is computed with downward rounding and $\overline{x}^2$ is computed with upward rounding.

ELSE IF $\overline{x} \le 0$ THEN

   RETURN $[\overline{x}^2, \underline{x}^2]$, where $\overline{x}^2$ is computed with downward rounding and $\underline{x}^2$ is computed with upward rounding.

ELSE

   1. Compute $\underline{x}^2$ and $\overline{x}^2$ with both downward and upward rounding; that is, compute $(\underline{x}^2)_l$ and $(\underline{x}^2)_u$ such that $(\underline{x}^2)_l$ and $(\underline{x}^2)_u$ are machine representable numbers and $\underline{x}^2 \in [(\underline{x}^2)_l, (\underline{x}^2)_u]$, and compute $(\overline{x}^2)_l$ and $(\overline{x}^2)_u$ such that $(\overline{x}^2)_l$ and $(\overline{x}^2)_u$ are machine representable numbers and $\overline{x}^2 \in [(\overline{x}^2)_l, (\overline{x}^2)_u]$.

   2. RETURN $[0, \max\{(\underline{x}^2)_u, (\overline{x}^2)_u\}]$.

END IF

END ALGORITHM 1.1
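Algorithm 1.1 might be sketched in matlab as the function below, to be saved as isqr.m (a hypothetical name). Switching the hardware rounding mode is not done here; as a stated substitute, each end point computed in round-to-nearest is widened by one unit in the last place using eps, which encloses the exact value (barring overflow) at the cost of a slightly wider interval than true directed rounding would give.

```matlab
function z = isqr(x)
%ISQR  Enclosure of the range of x^2 over the interval x = [lo, hi].
%   Directed rounding is imitated by widening each computed end point by
%   one unit in the last place (an assumption of this sketch, not the
%   rounding-mode switch that Algorithm 1.1 calls for).
    lo = x(1);  hi = x(2);
    down = @(v) max(v - eps(v), 0);    % a square is never negative
    up   = @(v) v + eps(v);
    if lo >= 0                         % interval is to the right of 0
        z = [down(lo^2), up(hi^2)];
    elseif hi <= 0                     % interval is to the left of 0
        z = [down(hi^2), up(lo^2)];
    else                               % 0 lies strictly inside x
        z = [0, up(max(lo^2, hi^2))];
    end
end
```

For example, isqr([-2, 2]) returns an interval whose lower end point is 0 and whose upper end point is just above 4, whereas multiplying [-2, 2] by itself with the product rule in (1.2) gives [-4, 4].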
With Algorithm 1.1, and rewriting $f(x)$ from (1.3) as in (1.5), we obtain

$$f([-2, 2]) = [-2, 2]^2 - 1 = [0, 4] - 1 = [-1, 3],$$

which, in this case, is equal to the exact range of $f$ over $[-2, 2]$.

In fact, this illustrates a general principle: If each variable in the expression occurs only once, then interval arithmetic gives the exact range, to within roundout error. We state this formally as
THEOREM 1.8

(Fundamental theorem of interval arithmetic.) Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with interval arithmetic. The algebraic expression or computer program can contain the four elementary operations and operations such as $x^n$, $\sin(x)$, $\exp(x)$, and $\log(x)$, etc., as long as the interval values of these functions contain their range over the input intervals. Then

1. The interval value $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ contains the range of $f$ over the interval vector (or box) $(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$.

2. If the single functions (the elementary operations and functions $x^n$, etc.) have interval values that represent their exact ranges, and if each variable $x_i$, $1 \le i \le n$, occurs only once in the expression for $f$, then the values of $\boldsymbol{f}$ obtained by interval arithmetic represent the exact ranges of $f$ over the input intervals.
If the expression for $f$ contains one or more variables more than once, then overestimation of the range can occur, due to interval dependency. For example, when we evaluate our example function $f([-2, 2])$ according to (1.3), the first factor, $[-1, 3]$, is the exact range of $x + 1$ for $x \in [-2, 2]$, while the second factor, $[-3, 1]$, is the exact range of $x - 1$ for $x \in [-2, 2]$. Thus, $[-9, 3]$ is the exact range of $f(x_1, x_2) = (x_1 + 1)(x_2 - 1)$ for $x_1$ and $x_2$ independent, $x_1 \in [-2, 2]$, $x_2 \in [-2, 2]$.

We now present some definitions and theorems to clarify the practical consequences of interval dependency.

DEFINITION 1.6 An expression for $f(x_1, \ldots, x_n)$ which is written so that each variable occurs only once is called a single use expression, or SUE.

Fortunately, we do not need to transform every expression into a single use expression for interval computations to be of value. In particular, the interval dependency becomes less as the widths of the input intervals become smaller. The following formal definition will help us to describe this precisely.

DEFINITION 1.7 Suppose an interval evaluation $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ gives $[a, b]$ as a result interval, but the exact range $\{f(x_1, \ldots, x_n) : x_i \in \boldsymbol{x}_i, 1 \le i \le n\}$ is $[c, d] \subseteq [a, b]$. We define the excess width $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ in the interval evaluation $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ by $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = (c - a) + (b - d)$.

For example, the excess width in evaluating $f(x)$ represented as $(x + 1)(x - 1)$ over $\boldsymbol{x} = [-2, 2]$ is $(-1 - (-9)) + (3 - 3) = 8$. In general, we have
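THEOREM 1.9

Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with interval arithmetic, as in Theorem 1.8, to obtain an interval enclosure $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ to the range of $f$ for $x_i \in \boldsymbol{x}_i$, $1 \le i \le n$. Then, if $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ is as in Definition 1.7, we have

$$E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = O\left(\max_{1 \le i \le n} w(\boldsymbol{x}_i)\right),$$

where $w(\boldsymbol{x})$ denotes the width of the interval $\boldsymbol{x}$.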
That is, the overestimation becomes less as the uncertainty in the arguments to the function becomes smaller.

Interval evaluations as in Theorem 1.9 are termed first-order interval extensions. It is not difficult to obtain second-order extensions, where required. (See Exercise below.)
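The first-order behavior in Theorem 1.9 can be observed numerically. The sketch below evaluates $(x + 1)(x - 1)$ over $[-h, h]$ for decreasing $h$, using hypothetical row-vector helpers without outward rounding; the values of $h$ are powers of 2 (a choice made for this sketch) so that every operation is exact. Over $[-h, h]$ with $0 < h \le 1$, the exact range of $x^2 - 1$ is $[-1, h^2 - 1]$.

```matlab
% Excess width of the interval evaluation of (x+1)(x-1) over [-h, h].
iadd   = @(x, y) [x(1) + y(1), x(2) + y(2)];
isub   = @(x, y) [x(1) - y(2), x(2) - y(1)];
iprods = @(x, y) [x(1)*y(1), x(1)*y(2), x(2)*y(1), x(2)*y(2)];
imul   = @(x, y) [min(iprods(x, y)), max(iprods(x, y))];
f_int  = @(x) imul(iadd(x, [1, 1]), isub(x, [1, 1]));

hs = [1, 0.5, 0.25, 0.125];
Es = zeros(size(hs));
for k = 1:numel(hs)
    h     = hs(k);
    enc   = f_int([-h, h]);              % interval enclosure [a, b]
    exact = [-1, h^2 - 1];               % exact range [c, d]
    Es(k) = (exact(1) - enc(1)) + (enc(2) - exact(2));
    fprintf('w = %5.3f   E = %8.6f   E/w = %6.4f\n', 2*h, Es(k), Es(k)/(2*h));
end
```

The ratio E/w stays bounded (here it equals 2 - h/2), which is what E = O(w) asserts: halving the width of the input interval roughly halves the overestimation.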
1.3.2 Application of Interval Arithmetic: Examples

We give one such example here.
Example 1.22

Using 4-digit decimal floating point arithmetic, compute an interval enclosure for the first two digits of $e$, and prove that these two digits are correct.

Solution: The fifth degree Taylor polynomial representation for $e$ is

$$e = 1 + 1 + \frac{1}{2!} + \frac{1}{3!} + \frac{1}{4!} + \frac{1}{5!} + \frac{1}{6!}e^{\xi}$$

for some $\xi \in [0, 1]$. If we assume we know $e < 3$ and we assume we know $e^x$ is an increasing function of $x$, then the error term is bounded by

$$\left|\frac{1}{6!}e^{\xi}\right| \le \frac{3}{6!} < 0.005,$$

so this fifth-degree polynomial representation should be adequate. We will evaluate each term with interval arithmetic, and we will replace $e^{\xi}$ with $[1, 3]$. We obtain the following computation:
[1.000, 1.000] + [1.000, 1.000] → [2.000, 2.000]
[1.000, 1.000]/[2.000, 2.000] → [0.5000, 0.5000]
[2.000, 2.000] + [0.5000, 0.5000] → [2.500, 2.500]
[1.000, 1.000]/[6.000, 6.000] → [0.1666, 0.1667]
[2.500, 2.500] + [0.1666, 0.1667] → [2.666, 2.667]
[1.000, 1.000]/[24.00, 24.00] → [0.04166, 0.04167]
[2.666, 2.667] + [0.04166, 0.04167] → [2.707, 2.709]
[1.000, 1.000]/[120.0, 120.0] → [0.008333, 0.008334]
[2.707, 2.709] + [0.008333, 0.008334] → [2.715, 2.718]
[1.000, 1.000]/[720.0, 720.0] → [0.001388, 0.001389]
[0.001388, 0.001389] × [1, 3] → [0.001388, 0.004167]
[2.715, 2.718] + [0.001388, 0.004167] → [2.716, 2.723]
Since we used outward rounding in these computations, this constitutes a mathematical proof that $e \in [2.716, 2.723]$.

Note:

1. These computations can be done automatically on a computer, as simply as evaluating the function in floating point arithmetic. We will explain some programming techniques for this in Chapter 6, Section 6.2.

2. The solution is illustrative. More sophisticated methods, such as argument reduction, would be used in practice to bound values of $e^x$ more accurately and with fewer operations.
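The same Taylor argument can be carried out in double precision rather than in 4-digit decimal arithmetic. The following matlab sketch is an illustration under stated assumptions: the rounding mode is not switched, and instead each computed lower bound is decreased, and each upper bound increased, by one unit in the last place (via eps), which is enough to absorb the round-to-nearest error of each operation. The helper names down and up are hypothetical.

```matlab
% Enclosure of e from the degree-5 Taylor polynomial plus remainder.
down = @(v) v - eps(v);          % move a positive bound down by one ulp
up   = @(v) v + eps(v);          % move a positive bound up by one ulp

lo = 0;  hi = 0;  fact = 1;
for k = 0:5
    if k > 0
        fact = fact * k;         % k! (exact for these small integers)
    end
    lo = down(lo + down(1 / fact));
    hi = up(hi + up(1 / fact));
end

% Remainder term e^xi / 6! with e^xi replaced by the interval [1, 3].
lo = down(lo + down(1 / 720));
hi = up(hi + up(3 / 720));

fprintf('e lies in [%.10f, %.10f]\n', lo, hi);
```

The resulting interval is narrower than [2.716, 2.723] because far more digits are carried; its width is now dominated by the crude bound [1, 3] on $e^{\xi}$ rather than by rounding.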
Proofs of the theorems, as well as greater detail, appear in various texts on interval arithmetic. A good book on interval arithmetic is R. E. Moore's classic text [27], although numerous more recent monographs and reviews are available. A World Wide Web search on the term “interval computations” will lead to some of these.

A general introduction to interval computations is [26]. That work not only gives a complete introduction, with numerous examples and explanation of pitfalls, but also provides examples with intlab, a free matlab toolbox for interval computations, and reference material for intlab. If you have matlab available, we recommend intlab for the exercises in this book involving interval computations.
1.4 Programming Environments

Modern scientific computing (with floating point numbers) is usually done with high-level “imperative” (as opposed to “functional”) programming languages. Common programming environments in use for general scientific computing today are Fortran (or FORmula TRANslation), C/C++, and matlab. Fortran is the original such language, with its origins in the late 1950's. There is a large body of high-quality publicly available software in Fortran for common computations in numerical analysis and scientific computing. Such software can be found, for example, on NETLIB at

http://www.netlib.org

Fortran has evolved over the years, becoming a modern, multifaceted language with the Fortran 2003 standard. Throughout, the emphasis by both the standardization committee and suppliers of compilers for Fortran has been features that simplify programming of solutions to large problems in numerical analysis and scientific computing, and features that enable high performance, especially on computers that can process vectors and matrices efficiently.

The “C” language, originally developed in conjunction with the Unix operating system, was originally meant to be a higher-level language for designing and accessing the operating system, but has become more ubiquitous since then. C++, appearing in the late 1980's, was the first widely available language⁷ to allow the object-oriented programming paradigm. In recent years, computer science departments have favored teaching C++ over teaching Fortran, and Fortran has fallen out of favor in relative terms. However, Fortran is still favored in certain large-scale applications, such as fluid dynamics (e.g., in weather prediction and similar simulations), and some courses are still offered in it in engineering schools. However, some people still think of Fortran as the now somewhat rudimentary language known as FORTRAN 77.

Reasonably high-quality compilers for both Fortran and C/C++ are available free of charge with Linux operating systems. Fortran 2003, largely implemented in these compilers, has a standardized interface to C, so functions written in C can be called from Fortran programs, and vice versa. These compilers include interactive graphical-user-interface-oriented debuggers, such as “insight,” available with the Linux operating system. Commercial compilation and debugging systems are also available under Windows.
The matlab system has become increasingly popular over the last two decades or so. matlab (or MATrix LABoratory) began in the early 1980's as a National Science Foundation project, written by Cleve Moler in FORTRAN 66, to provide an interactive environment for computing with matrices and vectors, but has since evolved to be both an interactive environment and a full-featured programming language. matlab is highly favored in courses such as this because of the ease of programming, debugging, and general use (such as graphing), and because of the numerous toolboxes, supplied by both Mathworks (Cleve Moler's company) and others, for many basic computing tasks and applications. The main drawback to use of matlab in all scientific computations is that the language is interpretive, that is, matlab translates each line of a program to machine language each time that it executes the line. This makes complicated programs that involve nested iterations much slower (a factor of 60 or more) than comparable programs written in a compiled language such as C or Fortran. However, functions compiled from Fortran or C/C++ can be called from matlab. A common strategy has been to initially develop algorithms in matlab, then translate all or part of the program to a compilable language, as necessary for efficiency.

One perceived disadvantage of matlab is that it is proprietary. Undesirable possible consequences are that it is not free, and there is no official guarantee that it will be available forever, unchanged. However, its use has become so widespread in recent years that these concerns do not seem to be major. Several projects, including “Octave” and “Scilab,” have produced free products that partially support the matlab programming language. The most widely distributed of these, “Octave,” is integrated into Linux systems. However, the object-oriented features of Octave are rudimentary compared to those of matlab, and some toolboxes, such as intlab (which we will mention later), will not function with Octave.
⁷ With others, including Fortran, to follow.
Alternative systems sometimes used for scientific computing are computer algebra systems. Perhaps the most common of these are Mathematica and Maple, while a free such system under development is “SAGE.” These systems admit a different way of thinking about programming, termed functional programming, in which rules are defined and available at all times to automatically simplify expressions that are presented to the system. (In contrast, in imperative programming, a sequence of commands is executed one after the other.) Although these systems have become comprehensive, they are based in computations of a different character, rather than in the floating point computations and linear algebra involved in numerical analysis and scientific computing.

We will use matlab in this book to illustrate the concepts, techniques, and applications. With newer versions of matlab, a student can study how to use the system and make programs largely by using the matlab help system. The first place to turn will be the “Getting started” demos, which in newer versions are presented as videos. There are also many books devoted to use of matlab. Furthermore, we will be giving examples throughout this book.

matlab programs can be written as matlab scripts and matlab functions.
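Example 1.23

The matlab script we used to produce the table following Example 1.7 (on page 8) is

a = 2;
x = 2;                       % starting guess
xold = x;
err_old = 1;
for k = 0:10
   k                         % no semicolon: the value is displayed
   x
   err = x - sqrt(2);        % error in the current iterate
   err
   ratio = err/err_old^2     % roughly constant for quadratic convergence
   err_old = err;
   x = x/2 + 1/x;            % Newton step for x^2 - 2 = 0
end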
Example 1.24

The matlab script we used to produce the table in Example 1.8 (on page 9) is

format long
a = 2;
x = 2;                       % starting guess
xold = x;
err_old = 1;
for k = 0:25
   k                         % no semicolon: the value is displayed
   x
   err = x - sqrt(2);        % error in the current iterate
   err
   ratio = err/err_old       % roughly constant for linear convergence
   err_old = err;
   x = x - x^2/3.5 + 2/3.5;  % fixed point iteration with fixed point sqrt(2)
end
An excellent alternative text book that focuses on matlab functions is Cleve Moler's Numerical Computing with Matlab [25]. An on-line version, along with “m” files, etc., is currently available at http://www.mathworks.com/moler/chapters.html.
1.5 Applications

The purpose of the methods and techniques in this book ultimately is to provide both accurate predictions and insight into practical problems. This includes understanding and predicting, and managing or controlling, the evolution of ecological systems and epidemics; designing and constructing durable but inexpensive bridges, buildings, roads, and water control structures; understanding chemical and physical processes; designing chemical plants and electronic components and systems; minimizing costs or maximizing delivery of products or services within companies and governments; etc. To achieve these goals, the numerical methods are a small part of the overall modeling process, which can be viewed as consisting of the following steps.
Identify the problem: This is the first step in translating an often vague situation into a mathematical problem to be solved. What questions must be answered, and how can they be quantified?

Assumptions: Which factors are to be ignored and which are important? The real world is usually significantly more complicated than mathematical models of it, and simplifications must be made, because some factors are poorly understood, because there isn't enough data to determine some minor factors, or because it is not practical to accurately solve the resulting equations unless the model is simplified. For example, the theory of relativity and variations in the acceleration of gravity, due to the fact that the earth is not exactly round and due to the fact that the density differs from point to point on the surface of the earth, in principle will affect the trajectory of a baseball as it leaves the bat. However, such effects can be ignored when we write down a model of the trajectory of the baseball. On the other hand, we need to include such effects if we are measuring the change in distance between two satellites in a tandem orbit to detect the location of mineral deposits on the surface of the earth.

Construction: In this step, we actually translate the problem into mathematical language.

Analysis: We solve the mathematical problem. Here is where the numerical techniques in this book come into play. With more complicated models, there is an interplay between the previous three steps and this solution process: We may need to simplify the process to enable practical solution. Also, presentation of the result is important here, to maximize the usefulness of the results. In the early days of scientific computing, printouts of numbers were used, but, increasingly, results are presented as two- and three-dimensional graphs and movies.

Interpretation: The numerical solution is compared to the original problem. If it does not make sense, go back and reformulate the assumptions.

Validation: Compare the model to real data. For example, in climate models, the model might be used to predict climate changes in past years, before it is used to predict future climate changes.
Note that there is an intrinsic error introduced in the modeling process (such as when certain phenomena are ignored) that is outside the scope of our study of numerical methods. Such error can only be measured indirectly, through the interpretation and validation steps. In the model solution process (the “analysis” step), errors are also introduced; we have seen that such error consists of approximation error and roundoff error. In a study of numerical methods and numerical analysis, we quantify and find bounds on such errors. Although this may not be the major source of error, it is important to know. Consequences of this type of error might be that a good model is rejected, that incorrect conclusions are deduced about the process being modeled, etc. The authors of this book have personal experience with these events.
Errors in the modeling process can sometimes be quantified in the solution process. If the model depends on parameters that are not known precisely, but bounds on those parameters are known, knowledge of these bounds can sometimes be incorporated into the mathematical equations, and the set of possible solutions can sometimes be computed or bounded. One tool that sometimes works is interval arithmetic. Other tools, less mathematically definite but applicable in different situations, are statistical methods and computing solutions to the model for many different values of the parameter.

Throughout this book, we introduce applications from many areas.
Example 1.25
The formula for the net capacitance when two capacitors of values $x$ and $y$ are connected in series is
\[ z = \frac{xy}{x + y}. \]
Suppose the measured values of $x$ and $y$ are $x = 1$ and $y = 2$, respectively. Estimate the range of possible values of $z$, given that the true values of $x$ and $y$ are known to be within $\pm 10\%$ of the measured value.
In this example, the identification, assumptions, and construction have already been done. (It is well known how capacitances in a linear electrical circuit behave.) We are asked to analyze the error in the output of the computation due to errors in the data. We may proceed using interval arithmetic, relying on the accuracy assumptions for the measured values. In particular, these assumptions imply that $x \in [0.9, 1.1]$ and $y \in [1.8, 2.2]$. We will plug these intervals into the expression for $z$, but we first use Theorem 1.8, part (2) as a guide to rewrite the expression for $z$ so $x$ and $y$ only occur once. (We do this so we obtain sharp bounds on the range, without overestimation.) Dividing the numerator and denominator for $z$ by $xy$, we obtain
\[ z = \frac{1}{\frac{1}{x} + \frac{1}{y}}. \]
We use the intlab toolbox⁸ for matlab to evaluate $z$. We have the following dialog in matlab's command window.
>> intvalinit('DisplayInfsup')
===> Default display of intervals by infimum/supremum
>> x = intval('[0.9,1.1]')
intval x =
[ 0.8999, 1.1001]
>> y = intval('[1.8,2.2]')
intval y =
[ 1.7999, 2.2001]
>> z = 1/(1/x + 1/y)
intval z =
[ 0.5999, 0.7334]
>> format long
>> z
intval z =
[ 0.59999999999999, 0.73333333333334]
>>
⁸ If one has matlab, intlab is available free of charge for noncommercial use from http://www.ti3.tu-harburg.de/~rump/intlab/
Thus, the capacitance must lie between 0.5999 and 0.7334.
Note that x and y are input as strings. This is to assure that roundoff errors in converting the decimal expressions 0.9, 1.1, 1.8, and 2.2 into internal binary format are taken into account. See [26] for more examples of the use of intlab.
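The effect of rewriting $z$ so that $x$ and $y$ occur only once can be checked by hand with a few lines of code. The following is a minimal sketch in Python rather than intlab; the helper names `add`, `mul`, and `recip` are illustrative, and exact rational arithmetic (`fractions.Fraction`) stands in for intlab's outward rounding, so the endpoints here are exact rather than rounded outward.

```python
from fractions import Fraction

# An interval is a (lower, upper) pair of exact rationals. Using Fraction means
# this sketch introduces no roundoff of its own (intlab rounds outward instead).

def add(a, b):
    return (a[0] + b[0], a[1] + b[1])

def mul(a, b):
    products = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(products), max(products))

def recip(a):
    if not (a[0] > 0 or a[1] < 0):
        raise ZeroDivisionError("interval contains zero")
    return (1 / a[1], 1 / a[0])

x = (Fraction(9, 10), Fraction(11, 10))   # 1 within +/- 10%
y = (Fraction(9, 5), Fraction(11, 5))     # 2 within +/- 10%

naive = mul(mul(x, y), recip(add(x, y)))   # xy/(x + y): x and y each occur twice
sharp = recip(add(recip(x), recip(y)))     # 1/(1/x + 1/y): x and y each occur once

print("xy/(x+y)      :", [float(v) for v in naive])
print("1/(1/x + 1/y) :", [float(v) for v in sharp])
```

The rewritten form reproduces the range [0.6, 0.7333...] found above, while the original form gives a strictly wider enclosure of the same range.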
1.6 Exercises

1. Write down a polynomial $p(x)$ such that $|S(x) - p(x)| \le 10^{-10}$ for $-0.2 \le x \le 0.2$, where
\[ S(x) = \begin{cases} \dfrac{\sin(x)}{x} & \text{if } x \neq 0, \\ 1 & \text{if } x = 0. \end{cases} \]
Note: $\mathrm{sinc}(x) = S(\pi x) = \sin(\pi x)/(\pi x)$ is the "sinc" function (well known in signal processing, etc.).
  (a) Show that your polynomial $p$ satisfies the condition $|\mathrm{sinc}(x) - p(x)| \le 10^{-10}$ for $x \in [-0.2, 0.2]$.
  Hint: You can obtain polynomial approximations with error terms for $\mathrm{sinc}(x)$ by writing down Taylor polynomials and corresponding error terms for $\sin(x)$, then dividing these by $x$. This can be easier than trying to differentiate $\mathrm{sinc}(x)$. For the proof part, you can use, for example, the Taylor polynomial remainder formula or the alternating series test, and you can use interval arithmetic to obtain bounds.
  (b) Plot your polynomial approximation and $\mathrm{sinc}(x)$ on the same graph
    (i) over the interval $[-0.2, 0.2]$,
    (ii) over the interval $[-3, 3]$,
    (iii) over the interval $[-10, 10]$.

2. Suppose $f$ has a continuous third derivative. Show that
\[ \left| \frac{f(x + h) - f(x - h)}{2h} - f'(x) \right| = O(h^2). \]

3. Suppose $f$ has a continuous fourth derivative. Show that
\[ \left| \frac{f(x + h) - 2f(x) + f(x - h)}{h^2} - f''(x) \right| = O(h^2). \]

4. Let $a = 0.41$, $b = 0.36$, and $c = 0.7$. Assuming a 2-digit decimal computer arithmetic with rounding, show that $\frac{a - b}{c} \neq \frac{a}{c} - \frac{b}{c}$ when using this arithmetic.

5. Write down a formula relating the unit roundoff $\delta$ of Definition 1.3 (page 14) and the machine epsilon $\epsilon_m$ defined on page 20.

6. Store and run the following matlab script. What are your results? What does the script compute? Can you say anything about the computer arithmetic underlying matlab?
    eps = 1;
    x = 1+eps;
    while(x~=1)
      eps = eps/2;
      x = 1+eps;
    end
    eps = eps+(2*eps)^2
    y = 1+eps
    y-1

7. Suppose, for illustration, we have a system with base $\beta = 10$, $t = 3$ decimal digits in the mantissa, and $L = -9$, $U = 9$ for the exponent. For example, $0.123 \times 10^4$, that is, 1230, is a machine number in this system. Suppose also that "round to nearest" is used in this system.
  (a) What is HUGE for this system?
  (b) What is TINY for this system?
  (c) What is the machine epsilon $\epsilon_m$ for this system?
  (d) Let $f(x) = \sin(x) + 1$.
    i. Write down $\mathrm{fl}(f(0))$ and $\mathrm{fl}(f(0.0008))$ in normalized format for this toy system.
    ii. Compute $\mathrm{fl}(\mathrm{fl}(f(0.0008)) - \mathrm{fl}(f(0)))$. On the other hand, what is the nearest machine number to the exact value of $f(0.0008) - f(0)$?
    iii. Compute $\mathrm{fl}(\mathrm{fl}(f(0.0008)) - \mathrm{fl}(f(0))) / \mathrm{fl}(0.0008)$. Compare this to the nearest machine number to the exact value of $(f(0.0008) - f(0))/0.0008$ and to $f'(0)$.

8. Let $f(x) = \dfrac{\ln(x + 1) - \ln(x)}{2}$.
  (a) Use four-digit decimal arithmetic with rounding to evaluate $f(100\,000)$.
  (b) Use the Mean Value Theorem to approximate $f(x)$ in a form that avoids the loss of significant digits. Use this form to evaluate $f(x)$ for $x = 100\,000$ once again.
  (c) Compare the relative errors for the answers obtained in (a) and (b).

9. Compute the condition number of $f(x) = e^{\sqrt{x^2 - 1}}$, $x > 1$, and discuss any possible ill-conditioning.

10. Let $f(x) = (\sin(x))^2 + x/2$. Use interval arithmetic to prove that there are no solutions to $f(x) = 0$ for $x \in [-1, -0.8]$.
round up: $\mathrm{fl}(x) = x\!\uparrow$, the nearest machine number to the real number $x$ that is greater than or equal to $x$.
round to nearest: $\mathrm{fl}(x)$ is the nearest machine number to the real number $x$.
round to zero, or "chopping": $\mathrm{fl}(x)$ is the nearest machine number to the real number $x$ that is closer to 0 than $x$. The term "chopping" is because we simply "chop" the expansion of the real number, that is, we simply ignore the digits in the expansion of $x$ beyond the $t$-th one.
The default on modern systems is usually round to nearest, although chopping is faster or requires less circuitry. Round down and round up may be used, but with care, to produce results from a string of computations that are guaranteed to be less than or greater than the exact result.
Example 1.11
$\beta = 10$, $t = 5$, $x = 0.12345666\cdots \times 10^7$. Then
\[ \mathrm{fl}(x) = 0.12345 \times 10^7 \quad \text{(chopping)}, \]
\[ \mathrm{fl}(x) = 0.12346 \times 10^7 \quad \text{(rounded to nearest)}. \]
(In this case, round down corresponds to chopping and round up corresponds to round to nearest.)
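The four rounding modes can be tried out on this number with a short sketch. It is written in Python, whose standard `decimal` module lets us pick the number of decimal digits and the rounding mode; the dictionary name `modes` is illustrative, and `ROUND_HALF_EVEN` is an assumption about how ties are broken in "round to nearest" (no tie occurs for this particular x).

```python
from decimal import (Context, ROUND_CEILING, ROUND_DOWN, ROUND_FLOOR,
                     ROUND_HALF_EVEN)

x = "0.12345666e7"   # the real number to be represented, beta = 10

# One 5-digit decimal context per rounding mode described in the text.
modes = {
    "round down": ROUND_FLOOR,
    "round up": ROUND_CEILING,
    "round to nearest": ROUND_HALF_EVEN,
    "round to zero (chopping)": ROUND_DOWN,
}

fl = {name: Context(prec=5, rounding=mode).create_decimal(x)
      for name, mode in modes.items()}

for name, value in fl.items():
    print(f"{name:26s} fl(x) = {value}")
```

For a negative x the roles would swap: chopping would then agree with round up instead of round down.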
See Figure 1.2 for an example with $\beta = 10$ and $t = 1$. In that figure, the exhibited floating point numbers are $(0.1) \times 10^1$, $(0.2) \times 10^1$, $\ldots$, $(0.9) \times 10^1$, $0.1 \times 10^2$.
[Figure: number line from $\beta^{m-1} = 1$ to $\beta^m = 10^1$; the spacing between successive floating point numbers is $\beta^{m-t} = 10^0 = 1$.]
FIGURE 1.2: An example floating point system: $\beta = 10$, $t = 1$, and $m = 1$.
Example 1.12
Let $a = 0.410$, $b = 0.000135$, and $c = 0.000431$. Assuming 3-digit decimal computer arithmetic with rounding to nearest, does $a + (b + c) = (a + b) + c$ when using this arithmetic?
Following the "rounding to nearest" definition of fl, we emulate the operations a machine would do, as follows:
\[ a \leftarrow 0.410 \times 10^0, \quad b \leftarrow 0.135 \times 10^{-3}, \quad c \leftarrow 0.431 \times 10^{-3}, \]
and
\begin{align*}
\mathrm{fl}(b + c) &= \mathrm{fl}(0.135 \times 10^{-3} + 0.431 \times 10^{-3}) \\
&= \mathrm{fl}(0.566 \times 10^{-3}) \\
&= 0.566 \times 10^{-3},
\end{align*}
so
\begin{align*}
\mathrm{fl}(a + 0.566 \times 10^{-3}) &= \mathrm{fl}(0.410 \times 10^0 + 0.566 \times 10^{-3}) \\
&= \mathrm{fl}(0.410 \times 10^0 + 0.000566 \times 10^0) \\
&= \mathrm{fl}(0.410566 \times 10^0) \\
&= 0.411 \times 10^0.
\end{align*}
On the other hand,
\begin{align*}
\mathrm{fl}(a + b) &= \mathrm{fl}(0.410 \times 10^0 + 0.135 \times 10^{-3}) \\
&= \mathrm{fl}(0.410000 \times 10^0 + 0.000135 \times 10^0) \\
&= \mathrm{fl}(0.410135 \times 10^0) \\
&= 0.410 \times 10^0,
\end{align*}
so
\begin{align*}
\mathrm{fl}(0.410 \times 10^0 + c) &= \mathrm{fl}(0.410 \times 10^0 + 0.431 \times 10^{-3}) \\
&= \mathrm{fl}(0.410 \times 10^0 + 0.000431 \times 10^0) \\
&= \mathrm{fl}(0.410431 \times 10^0) \\
&= 0.410 \times 10^0 \neq 0.411 \times 10^0.
\end{align*}
Thus, the associative law of addition does not hold for floating point arithmetic with "round to nearest." Furthermore, this illustrates that accuracy is improved if numbers of like magnitude are added first in a sum.
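The two orders of summation in this example can be replayed mechanically. The sketch below uses Python's `decimal` module with a 3-digit context as a stand-in for the toy machine; `ROUND_HALF_EVEN` is an assumed tie-breaking rule (no ties occur here), and the names `left` and `right` are illustrative.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# 3-digit decimal arithmetic with round to nearest.
ctx = Context(prec=3, rounding=ROUND_HALF_EVEN)

a, b, c = Decimal("0.410"), Decimal("0.000135"), Decimal("0.000431")

left = ctx.add(a, ctx.add(b, c))    # a + (b + c): small numbers added first
right = ctx.add(ctx.add(a, b), c)   # (a + b) + c: b and c are each lost

print("a + (b + c) =", left)
print("(a + b) + c =", right)
```

Every call to `ctx.add` rounds its result to three significant digits, which is exactly the $\mathrm{fl}(\cdot)$ applied after each operation in the hand computation.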
The following error bound is useful in some analyses.

THEOREM 1.5
\[ |x - \mathrm{fl}(x)| \le \frac{1}{2} |x| \beta^{1-t} p, \]
where $p = 1$ for rounding and $p = 2$ for chopping.

DEFINITION 1.3 $\delta = \frac{p}{2} \beta^{1-t}$ is called the unit roundoff error.

Let $\epsilon = \dfrac{\mathrm{fl}(x) - x}{x}$. Then $\mathrm{fl}(x) = (1 + \epsilon)x$, where $|\epsilon| \le \delta$. With this, we have the following.

THEOREM 1.6
Let $\odot$ denote the operation $+$, $-$, $\times$, or $\div$, and let $x$ and $y$ be machine numbers. Then
\[ \mathrm{fl}(x \odot y) = (x \odot y)(1 + \epsilon), \quad \text{where } |\epsilon| \le \delta = \frac{p}{2} \beta^{1-t}. \]
Roundoff error that accumulates as a result of a sequence of arithmetic operations can be analyzed using this theorem. Such an analysis is called forward error analysis.
Because of the properties of floating point arithmetic, it is unreasonable to demand strict tolerances when the exact result is too large.
Example 1.13
Suppose $\beta = 10$ and $t = 3$ (3-digit decimal arithmetic), and suppose we wish to compute $10^4\pi$ with a computed value $x$ such that $|10^4\pi - x| < 10^{-2}$. The closest floating point number in our system to $10^4\pi$ is $x = 0.314 \times 10^5 = 31400$. However, $|10^4\pi - x| = 15.926\ldots$. Hence, it is impossible to find a number $x$ in the system with $|10^4\pi - x| < 10^{-2}$.
The error $|10^4\pi - x|$ in this example is called the absolute error in approximating $10^4\pi$. We see that absolute error is not an appropriate measure of error when using floating point arithmetic. For this reason, we use relative error.

DEFINITION 1.4 Let $x^*$ be an approximation to $x$. Then $|x - x^*|$ is called the absolute error, and $\left| \dfrac{x - x^*}{x} \right|$ is called the relative error.
For example,
\[ \left| \frac{x - \mathrm{fl}(x)}{x} \right| \le \delta = \frac{p}{2} \beta^{1-t} \quad \text{(unit roundoff error)}. \]
1.2.1.1 Expression Evaluation and Condition Numbers
We now examine some common situations in which roundoff error can become large, and explain how to avoid many of these situations.
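Example 1.14
$\beta = 10$, $t = 4$, $p = 1$. (Thus, $\delta = \frac{1}{2} 10^{-3} = 0.0005$.) Let $x = 0.5795 \times 10^5$, $y = 0.6399 \times 10^5$. Then
\[ \mathrm{fl}(x + y) = 0.1219 \times 10^6 = (x + y)(1 + \epsilon_1), \quad \epsilon_1 \approx -3.28 \times 10^{-4}, \quad |\epsilon_1| < \delta, \]
and
\[ \mathrm{fl}(xy) = 0.3708 \times 10^{10} = (xy)(1 + \epsilon_2), \quad \epsilon_2 \approx -5.95 \times 10^{-5}, \quad |\epsilon_2| < \delta. \]
(Note: $x + y = 0.12194 \times 10^6$, $xy = 0.37082205 \times 10^{10}$.)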
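The values of $\epsilon_1$ and $\epsilon_2$ can be recomputed exactly as a check on Theorem 1.6. This is a sketch in Python: a 4-digit `decimal` context plays the role of the machine, exact rationals (`fractions.Fraction`) give the true results, and the helper `rel_err` is an illustrative name rather than anything from the text.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN
from fractions import Fraction

beta, t, p = 10, 4, 1
ctx = Context(prec=t, rounding=ROUND_HALF_EVEN)   # 4-digit round to nearest
delta = Fraction(p, 2) * Fraction(beta) ** (1 - t)  # unit roundoff (p/2) beta^(1-t)

x, y = Decimal("0.5795e5"), Decimal("0.6399e5")

def rel_err(computed, exact):
    """Signed relative error epsilon with computed = exact * (1 + epsilon)."""
    return (Fraction(computed) - exact) / exact

eps1 = rel_err(ctx.add(x, y), Fraction(x) + Fraction(y))
eps2 = rel_err(ctx.multiply(x, y), Fraction(x) * Fraction(y))

print("delta =", float(delta))
print("eps1  =", float(eps1))
print("eps2  =", float(eps2))
```

Both magnitudes fall below $\delta = 0.0005$, as the theorem requires for a single operation on machine numbers.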
Example 1.15
Suppose $\beta = 10$ and $t = 4$ (4-digit arithmetic), and suppose $x_1 = 10000$ and $x_2 = x_3 = \cdots = x_{1001} = 1$. Then
\begin{align*}
\mathrm{fl}(x_1 + x_2) &= 10000, \\
\mathrm{fl}(x_1 + x_2 + x_3) &= 10000, \\
&\;\;\vdots \\
\mathrm{fl}\left( \sum_{i=1}^{1001} x_i \right) &= 10000
\end{align*}
when we sum forward from $x_1$. But going backwards,
\begin{align*}
\mathrm{fl}(x_{1001} + x_{1000}) &= 2, \\
\mathrm{fl}(x_{1001} + x_{1000} + x_{999}) &= 3, \\
&\;\;\vdots \\
\mathrm{fl}\left( \sum_{i=1001}^{1} x_i \right) &= 11000,
\end{align*}
which is the correct sum.
This example illustrates the point that large relative errors occur when a large number of small numbers is added to a large number, or when a very large number of small, almost-equal numbers is added. To avoid such large relative errors, one can sum from the smallest number to the largest number. However, this will not work if the numbers are all approximately equal. In such cases, one possibility is to group the numbers into sets of two adjacent numbers, summing two almost equal numbers together. One then groups those results into sets of two and sums these together, continuing until the total sum is reached. In this scheme, two almost equal numbers are always being summed, and the large relative error from repeatedly summing a small number to a large number is avoided.
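Both remedies described above can be tried in a 4-digit decimal toy system. The sketch below is in Python, with the `decimal` module emulating the arithmetic; `running_sum` and `pairwise_sum` are illustrative names, and `pairwise_sum` is a recursive variant of the pairing scheme (it splits the list in half instead of pairing adjacent entries bottom-up), which is an assumption of this sketch and not the text's exact procedure.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

ctx = Context(prec=4, rounding=ROUND_HALF_EVEN)   # beta = 10, t = 4

def running_sum(values):
    """Add the values one at a time, rounding after every addition."""
    total = Decimal(0)
    for v in values:
        total = ctx.add(total, v)
    return total

def pairwise_sum(values):
    """Sum two halves separately, so the operands of each addition are similar in size."""
    if not values:
        return Decimal(0)
    if len(values) == 1:
        return ctx.plus(values[0])
    mid = len(values) // 2
    return ctx.add(pairwise_sum(values[:mid]), pairwise_sum(values[mid:]))

xs = [Decimal(10000)] + [Decimal(1)] * 1000
forward = running_sum(xs)              # each 1 is lost against 10000
backward = running_sum(reversed(xs))   # the 1s accumulate to 1000 first

ones = [Decimal(1)] * 20000            # all numbers equal: ordering cannot help
stalled = running_sum(ones)
paired = pairwise_sum(ones)

print(forward, backward, stalled, paired)
```

With 20000 equal terms, the running sum stops growing once it reaches 10000, because adding 1 no longer changes a 4-digit number of that size; the pairwise version never adds a small number to a much larger one.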
Example 1.16
$x_1 = 15.314768$, $x_2 = 15.314899$, $\beta = 10$, $t = 6$ (6-digit decimal accuracy). Then $x_2 - x_1 \approx \mathrm{fl}(x_2) - \mathrm{fl}(x_1) = 15.3149 - 15.3148 = 0.0001$. Thus,
\[ \left| \frac{x_2 - x_1 - (\mathrm{fl}(x_2) - \mathrm{fl}(x_1))}{x_2 - x_1} \right| = \frac{0.000131 - 0.0001}{0.000131} = 0.237, \]
that is, a relative error of 23.7%.
This example illustrates that large relative errors can occur when two nearly equal numbers are subtracted on a computer. Sometimes, an algorithm can be modified to reduce rounding error occurring from this source, as the following example illustrates.
Example 1.17
Consider finding the roots of $ax^2 + bx + c = 0$, where $b^2$ is large compared with $|4ac|$. The most common formula for the roots is
\[ x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}. \]
Consider $x^2 + 100x + 1 = 0$, $\beta = 10$, $t = 4$, $p = 2$, and 4-digit chopped arithmetic. Then
\[ x_1 = \frac{-100 + \sqrt{9996}}{2}, \quad x_2 = \frac{-100 - \sqrt{9996}}{2}, \]
but $\sqrt{9996} \approx 99.97$ (4-digit arithmetic, chopped). Thus,
\[ x_1 \approx \frac{-100 + 99.97}{2}, \quad x_2 \approx \frac{-100 - 99.97}{2}. \]
Hence, $x_1 \approx -0.015$, $x_2 \approx -99.98$, but $x_1 = -0.010001\ldots$ and $x_2 = -99.989999\ldots$, so the relative errors in $x_1$ and $x_2$ are 50% and 0.01%, respectively.
Let's change the algorithm. Assume $b \ge 0$ (can always make $b \ge 0$). Then
\[ x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \cdot \frac{-b - \sqrt{b^2 - 4ac}}{-b - \sqrt{b^2 - 4ac}} = \frac{4ac}{2a\left(-b - \sqrt{b^2 - 4ac}\right)} = \frac{-2c}{b + \sqrt{b^2 - 4ac}}, \]
and
\[ x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a} \quad \text{(the same as before)}. \]
Then, for the above values,
\[ x_1 = \frac{-2(1)}{100 + \sqrt{9996}} \approx \frac{-2}{100 + 99.97} = -0.0100. \]
Now the relative error in $x_1$ is also 0.01%.
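The two formulas for $x_1$ can be compared in emulated 4-digit chopped arithmetic. This is a sketch in Python using the `decimal` module; `chopped_sqrt` is an illustrative helper that forms the square root to extra digits and then chops it, so that the square root is truncated in the same way as the other operations. Unlike the hand computation, every intermediate result is chopped here, including $b + \sqrt{b^2 - 4ac}$.

```python
from decimal import Context, Decimal, ROUND_DOWN

ctx = Context(prec=4, rounding=ROUND_DOWN)   # beta = 10, t = 4, chopping
wide = Context(prec=30)                      # used only to form the square root

def chopped_sqrt(v):
    """Square root computed to 30 digits, then chopped to 4 digits."""
    return ctx.create_decimal(v.sqrt(wide))

a, b, c = Decimal(1), Decimal(100), Decimal(1)

disc = ctx.subtract(ctx.multiply(b, b), ctx.multiply(4 * a, c))   # 9996
root = chopped_sqrt(disc)                                         # 99.97

# Textbook formula: subtracts two nearly equal numbers.
x1_naive = ctx.divide(ctx.add(-b, root), 2 * a)
# Rearranged formula: only an addition of like-signed numbers.
x1_stable = ctx.divide(-2 * c, ctx.add(b, root))

print("sqrt(9996) chopped:", root)
print("x1 (naive)  =", x1_naive)
print("x1 (stable) =", x1_stable)
```

The naive value is off by 50%, while the rearranged value agrees with the true root $-0.010001\ldots$ to the four digits carried.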
Let us now consider error in function evaluation. Consider a single-valued function $f(x)$, and let $x^* = \mathrm{fl}(x)$ be the floating point approximation of $x$. Therefore, the machine evaluates $f(x^*) = f(\mathrm{fl}(x))$, which is an approximate value of $f(x)$. Then the perturbation in $f(x)$ for small perturbations in $x$ can be computed via Taylor's formula. This is illustrated in the next theorem.

THEOREM 1.7 The relative error in functional evaluation is
\[ \left| \frac{f(x) - f(x^*)}{f(x)} \right| \approx \left| \frac{x f'(x)}{f(x)} \right| \left| \frac{x - x^*}{x} \right|. \]
PROOF The linear Taylor approximation of $f(x^*)$ about $x$, for small values of $|x - x^*|$, is given by $f(x^*) \approx f(x) + f'(x)(x^* - x)$. Hence $f(x) - f(x^*) \approx f'(x)(x - x^*)$; dividing by $f(x)$, and multiplying and dividing the right side by $x$, immediately yields the result.

This leads us to the following definition.

DEFINITION 1.5 The condition number of a function $f(x)$ is
\[ \kappa_f(x) = \left| \frac{x f'(x)}{f(x)} \right|. \]
The condition number describes how large the relative error in function evaluation is with respect to the relative error in the machine representation of $x$. In other words, $\kappa_f(x)$ is a measure of the degree of sensitivity of the function at $x$.
Example 1.18
Let $f(x) = \sqrt{x}$. The condition number of $f(x)$ about $x$ is
\[ \kappa_f(x) = \left| \frac{x \, \frac{1}{2\sqrt{x}}}{\sqrt{x}} \right| = \frac{1}{2}. \]
This suggests that $f(x)$ is well-conditioned.

Example 1.19
Let $f(x) = \sqrt{x - 2}$. The condition number of $f(x)$ about $x$ is
\[ \kappa_f(x) = \left| \frac{x}{2(x - 2)} \right|. \]
This is not defined at $x^* = 2$. Hence, the function $f(x)$ is numerically unstable and ill-conditioned for values of $x$ close to 2.
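The condition numbers in these two examples can be evaluated numerically and compared with the amplification actually observed when $x$ is perturbed. The sketch below is in Python; `condition_number` and `observed_amplification` are illustrative helper names, and the test point $x = 2.001$ and the relative perturbation $10^{-6}$ are arbitrary choices made for this sketch.

```python
import math

def condition_number(f, dfdx, x):
    """kappa_f(x) = |x f'(x) / f(x)|, with the derivative supplied by the caller."""
    return abs(x * dfdx(x) / f(x))

def observed_amplification(f, x, rel_perturbation=1e-6):
    """Ratio of the relative change in f to the relative change in x."""
    x_star = x * (1.0 + rel_perturbation)
    return abs((f(x_star) - f(x)) / f(x)) / rel_perturbation

def g(x):
    return math.sqrt(x - 2.0)

def dg(x):
    return 0.5 / math.sqrt(x - 2.0)

kappa_sqrt = condition_number(math.sqrt, lambda x: 0.5 / math.sqrt(x), 9.0)
kappa_near_2 = condition_number(g, dg, 2.001)
amplification_near_2 = observed_amplification(g, 2.001)

print("kappa for sqrt(x) at x = 9        :", kappa_sqrt)
print("kappa for sqrt(x - 2) at x = 2.001:", kappa_near_2)
print("observed amplification at 2.001   :", amplification_near_2)
```

Near $x = 2$ a relative perturbation of $10^{-6}$ in the input is magnified by a factor of about 1000 in the output, in line with $\kappa_f(2.001) = 2.001/0.002$.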
REMARK 1.2 If $x = f(x) = 0$, then the condition number is simply $|f'(x)|$. If $x = 0$, $f(x) \neq 0$ (or $f(x) = 0$, $x \neq 0$), then it is more useful to consider the relation between absolute errors than relative errors. The condition number then becomes $|f'(x)/f(x)|$.
REMARK 1.3 Generally, if a numerical approximation $\tilde{z}$ to a quantity $z$ is computed, the relative error is related to the number of digits after the decimal point that are correct. For example, if $z = 0.0000123453$ and $\tilde{z} = 0.00001234543$, we say that $\tilde{z}$ is correct to 5 significant digits. Expressing $z$ as $0.123453 \times 10^{-4}$ and $\tilde{z}$ as $0.123454 \times 10^{-4}$, we see that if we round $\tilde{z}$ to the nearest number with five digits in its mantissa, all of those digits are correct, whereas, if we do the same with six digits, the sixth digit is not correct. Significant digits is the more logical way to talk about accuracy in a floating point computation where we are interested in relative error, rather than "number of digits after the decimal point," which can have a different meaning. (Here, one might say that $\tilde{z}$ is correct to 9 digits after the decimal point.)
1.2.2 Practicalities and the IEEE Floating Point Standard
Prior to 1985, different machines used different word lengths and different bases, and different machines rounded, chopped, or did something else to form the internal representation $\mathrm{fl}(x)$ for real numbers $x$. For example, IBM mainframes generally used hexadecimal arithmetic ($\beta = 16$), with 8 hexadecimal digits total (for the base, sign, and exponent) in "single precision" numbers and 16 hexadecimal digits total in "double precision" numbers. Machines such as the Univac 1108 and Honeywell Multics systems used base $\beta = 2$ and 36 binary digits (or "bits") total in single precision numbers and 72 binary digits total in double precision numbers. An unusual machine designed at Moscow State University from 1955 to 1965, the "Setun," even used base-3 ($\beta = 3$, or "ternary") numbers. Some computers had 32 bits total in single precision numbers and 64 bits total in double precision numbers, while some "supercomputers" (such as the Cray-1) had 64 bits total in single precision numbers and 128 bits total in double precision numbers.
Some hand-held calculators in existence today (such as some Texas Instruments calculators) can be viewed as implementing decimal (base 10, $\beta = 10$) arithmetic, say, with $L = -999$ and $U = 999$, and $t = 14$ digits in the mantissa.
Except for the Setun (the value of whose ternary digits corresponded to "positive," "negative," and "neutral" in circuit elements or switches), digital computers are mostly based on binary switches or circuit elements (that is, "on" or "off"), so the base $\beta$ is usually 2 or a power of 2. For example, the IBM hexadecimal digit could be viewed as a group of 4 binary digits.²
Older floating point implementations did not even always fit exactly into the model we have previously described. For example, if $x$ is a number in the system, then $-x$ may not have been a number in the system, or, if $x$ were a number in the system, then $1/x$ may have been too large to be representable in the system.
To promote predictability, portability, reliability, and rigorous error bounding in floating point computations, the Institute of Electrical and Electronics Engineers (IEEE) and American National Standards Institute (ANSI) published a standard for binary floating point arithmetic in 1985: IEEE/ANSI 754-1985: Standard for Binary Floating Point Arithmetic, often referenced as "IEEE-754," or simply "the IEEE standard.³" Almost all computers in existence today, including personal computers and workstations based on Intel, AMD, Motorola, etc. chips, implement most of the IEEE standard.
In this standard, $\beta = 2$, 32 bits total are used in a single precision number (an "IEEE single"), and 64 bits total are used for a double precision number ("IEEE double"). In a single precision number, 1 bit is used for the sign, 8 bits are used for the exponent, and $t = 23$ bits are used for the mantissa. In double precision numbers, 1 bit is used for the sign, 11 bits are used for the exponent, and 52 bits are used for the mantissa. Thus, for single precision numbers, the exponent is between 0 and $(11111111)_2 = 255$, and 127 is subtracted from this, to get an exponent between $-127$ and 128. In IEEE numbers, the minimum and maximum exponent are used to denote special symbols (such as infinity and "unnormalized" numbers), so the exponent in single precision represents magnitudes between $2^{-126} \approx 10^{-38}$ and $2^{127} \approx 10^{38}$. The mantissa for single precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{23} 2^{-i} = 2(1 - 2^{-24}) \approx 2$. Similarly, the exponent for double precision numbers is effectively between $2^{-1022} \approx 10^{-308}$ and $2^{1023} \approx 10^{308}$, while the mantissa for double precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{52} 2^{-i} \approx 2$.
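The sign, exponent, and mantissa fields of an IEEE number can be inspected directly. The following is a small sketch in Python using the standard `struct` module to obtain the bit pattern; the function names `decode_double` and `decode_single` are illustrative. It returns the stored (biased) exponent, so that $1.0 = 1.0 \times 2^0$ shows a stored exponent of 1023 in double precision and 127 in single precision.

```python
import struct

def decode_double(x):
    """Split an IEEE double into (sign bit, stored exponent, 52 stored mantissa bits)."""
    bits = int.from_bytes(struct.pack(">d", x), "big")
    sign = bits >> 63
    stored_exponent = (bits >> 52) & 0x7FF        # 11 bits
    stored_mantissa = bits & ((1 << 52) - 1)      # 52 bits, leading 1 not stored
    return sign, stored_exponent, stored_mantissa

def decode_single(x):
    """Split an IEEE single into (sign bit, stored exponent, 23 stored mantissa bits)."""
    bits = int.from_bytes(struct.pack(">f", x), "big")
    return bits >> 31, (bits >> 23) & 0xFF, bits & ((1 << 23) - 1)

for value in (1.0, -2.0, 0.75):
    print(value, decode_double(value), decode_single(value))
```

For example, $0.75 = 1.5 \times 2^{-1}$ has a stored single precision exponent of 126 and a stored mantissa whose only nonzero bit is the leading fraction bit (the ".5" in 1.5).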
Summarizing, the parameters for IEEE arithmetic appear in Table 1.1.

TABLE 1.1: Parameters for IEEE arithmetic
precision   β   L       U      t
single      2   −126    127    24
double      2   −1022   1023   53

In many numerical computations, such as solving the large linear systems arising from partial differential equation models, more digits or a larger exponent range is required than is available with IEEE single precision. For this reason, many numerical analysts at present have adopted IEEE double precision as the default precision. For example, underlying computations in the popular computational environment matlab are done in IEEE double precision.
² An exception is in some systems for business calculations, where base 10 is implemented.
³ An update to the 1985 standard was made in 2008. This update gives clarifications of certain ambiguous points, provides certain extensions, and specifies a standard for decimal arithmetic.
IEEE arithmetic provides four ways of defining $\mathrm{fl}(x)$, that is, four "rounding modes": "round down," "round up," "round to nearest," and "round to zero." These are specified as follows. The four elementary operations $+$, $-$, $\times$, and $/$ must be such that $\mathrm{fl}(x \odot y)$ is implemented for all four rounding modes, for $\odot \in \{-, +, \times, /, \sqrt{\cdot}\}$.
The default mode (if the rounding mode is not explicitly set) is normally "round to nearest," to give an approximation after a long string of computations that is hopefully near the exact value. If the mode is set to "round down" and a string of computations is done, then the result is less than or equal to the exact result. Similarly, if the mode is set to "round up," then the result of a string of computations is greater than or equal to the exact result. In this way, mathematically rigorous bounds on an exact result can be obtained. (This technique must be used astutely, since naive use could result in bounds that are too large to be meaningful.)
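The idea of bracketing an exact result by running the same computation under "round down" and "round up" can be illustrated in a toy decimal system. The sketch below is in Python, where the `decimal` module offers `ROUND_FLOOR` and `ROUND_CEILING` contexts; it is an emulation with 4 decimal digits, not the binary hardware rounding modes of IEEE arithmetic, and `sum_of_thirds` is an illustrative name. All quantities are positive, which is what makes running the whole string of operations in one mode valid here.

```python
from decimal import Context, Decimal, ROUND_CEILING, ROUND_FLOOR
from fractions import Fraction

down = Context(prec=4, rounding=ROUND_FLOOR)     # "round down"
up = Context(prec=4, rounding=ROUND_CEILING)     # "round up"

def sum_of_thirds(ctx, n):
    """Compute 1/3 + 1/3 + ... (n terms), rounding every operation with ctx."""
    third = ctx.divide(Decimal(1), Decimal(3))
    total = Decimal(0)
    for _ in range(n):
        total = ctx.add(total, third)
    return total

lower = sum_of_thirds(down, 10)
upper = sum_of_thirds(up, 10)
exact = Fraction(10, 3)

print("lower bound:", lower)
print("exact      :", float(exact))
print("upper bound:", upper)
```

The gap between the two bounds grows with the number of operations, which is the sense in which naive use of this technique can give bounds too wide to be meaningful.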
Several parameters more directly related to numerical computations than $L$, $U$, and $t$ are associated with any floating point number system. These are
HUGE: the largest representable number in the floating point system
TINY: the smallest positive representable number in the floating point system
$\epsilon_m$: the machine epsilon, the smallest positive number which, when added to 1, gives something other than 1 when using the rounding mode "round to the nearest."
These so-called "machine constants" appear in Table 1.2 for the IEEE single and IEEE double precision number systems.

TABLE 1.2: Machine constants for IEEE arithmetic
Precision   HUGE                                              TINY                                     $\epsilon_m$
single      $(2 - 2^{-23})\,2^{127} \approx 3.40 \cdot 10^{38}$      $2^{-126} \approx 1.18 \cdot 10^{-38}$     $2^{-24} + 2^{-45} \approx 5.96 \cdot 10^{-8}$
double      $(2 - 2^{-52})\,2^{1023} \approx 1.79 \cdot 10^{308}$    $2^{-1022} \approx 2.23 \cdot 10^{-308}$   $2^{-53} + 2^{-105} \approx 1.11 \cdot 10^{-16}$

For IEEE arithmetic, 1/TINY < HUGE, but 1/HUGE < TINY. This brings up the question of what happens when the result of a computation has absolute value less than the smallest number representable in the system, or has absolute value greater than the largest number representable in the system. In the first case, an underflow occurs, while, in the second case, an overflow occurs. In floating point computations, it is usually (but not always) reasonable to replace the result of an underflow by 0, but it is usually more problematical when an overflow occurs. Many systems prior to the IEEE standard replaced an underflow by 0 but stopped when an overflow occurred.
The IEEE standard specifies representations for special numbers $\infty$, $-\infty$, $+0$, $-0$, and NaN, where the latter represents "not a number." The standard specifies that computations do not stop when an overflow or underflow occurs, or when quantities such as $\sqrt{-1}$, $1/0$, $-1/0$, etc. are encountered (although many programming languages by default or optionally do stop). For example, the result of an overflow is set to $\infty$, whereas the result of $\sqrt{-1}$ is set to NaN, and computation continues. The standard also specifies "gradual underflow," that is, setting the result to a "denormalized" number, or a number in the floating point format whose first digit in the mantissa is equal to 0. Computation rules for these special numbers, such as NaN $\times$ any number = NaN, $\infty \times$ any positive normalized number = $\infty$, allow such "nonstop" arithmetic.
Although the IEEE nonstop arithmetic is useful in many contexts, the numerical analyst should be aware of it and be cautious in interpreting results. In particular, algorithms may not behave as expected if many intermediate results contain $\infty$ or NaN, and the accuracy is less than expected when denormalized numbers are used. In fact, many programming languages, by default or with a controllable option, stop if $\infty$ or NaN occurs, but implement IEEE nonstop arithmetic with an option.
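Both behaviors described here, nonstop arithmetic and a language that stops, can be seen side by side in Python, which is used for this sketch as one example of such a language: its float multiplication and division by nonzero numbers follow the nonstop rules, while `math.sqrt(-1.0)` and division by zero raise exceptions. The helper name `stops` is illustrative.

```python
import math
import sys

inf = float("inf")
nan = float("nan")

overflowed = sys.float_info.max * 2.0    # overflow: result is set to inf
nan_product = nan * 3.0                  # NaN times any number is NaN
inf_product = inf * 2.5                  # inf times a positive number is inf
subnormal = sys.float_info.min / 4.0     # gradual underflow: nonzero, below TINY

def stops(operation):
    """Return True if Python raises an exception instead of returning a special value."""
    try:
        operation()
    except (ValueError, ZeroDivisionError):
        return True
    return False

sqrt_stops = stops(lambda: math.sqrt(-1.0))
division_stops = stops(lambda: 1.0 / 0.0)

print(overflowed, nan_product, inf_product, subnormal)
print("sqrt(-1) stops:", sqrt_stops, " 1/0 stops:", division_stops)
```

Because NaN compares unequal to everything, including itself, a test such as `math.isnan` is needed to detect it in intermediate results.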
Example 1.20
IEEE double precision floating point arithmetic underlies most computations in matlab. (This is true even if only four decimal digits are displayed.) One obtains the machine epsilon with the function eps, one obtains TINY with the function realmin, and one obtains HUGE with the function realmax. Observe the following matlab dialog:
>> epsm = eps(1d0)
epsm = 2.2204e-016
>> TINY = realmin
TINY = 2.2251e-308
>> HUGE = realmax
HUGE = 1.7977e+308
>> 1/TINY
ans = 4.4942e+307
>> 1/HUGE
ans = 5.5627e-309
>> HUGE^2
ans = Inf
>> TINY^2
ans = 0
>> new_val = 1+epsm
new_val = 1.0000
>> new_val - 1
ans = 2.2204e-016
>> too_small = epsm/2
too_small = 1.1102e-016
>> not_new = 1+too_small
not_new = 1
>> not_new - 1
ans = 0
>>
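The same experiment can be repeated in any environment built on IEEE double precision. As a sketch, here it is in Python, where `sys.float_info` supplies the constants; the variable names mirror the matlab dialog, and the products `HUGE * HUGE` and `TINY * TINY` are used in place of squaring with a power operator because Python's `**` raises an exception on overflow.

```python
import sys

epsm = sys.float_info.epsilon   # gap between 1 and the next double, 2**-52
TINY = sys.float_info.min       # smallest positive normalized double
HUGE = sys.float_info.max       # largest finite double

new_val = 1.0 + epsm            # distinguishable from 1
too_small = epsm / 2.0
not_new = 1.0 + too_small       # rounds back to exactly 1

print("epsm =", epsm)
print("TINY =", TINY, " HUGE =", HUGE)
print("1/TINY =", 1.0 / TINY, " 1/HUGE =", 1.0 / HUGE)
print("HUGE*HUGE =", HUGE * HUGE, " TINY*TINY =", TINY * TINY)
print("new_val - 1 =", new_val - 1.0, " not_new - 1 =", not_new - 1.0)
```

Note that 1/HUGE is below TINY yet is not reported as zero: it is stored as a denormalized number.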
Example 1.21
(Illustration of underflow and overflow) Suppose, for the purposes of illustration, we have a system with $\beta = 10$, $t = 2$, and one digit in the exponent, so that the positive numbers in the system range from $0.10 \times 10^{-9}$ to $0.99 \times 10^9$, and suppose we wish to compute $N = \sqrt{x_1^2 + x_2^2}$, where $x_1 = x_2 = 10^6$. Then both $x_1$ and $x_2$ are exactly represented in the system, and the nearest floating point number in the system to $N$ is $0.14 \times 10^7$, well within range. However, $x_1^2 = 10^{12}$, larger than the maximum floating point number in the system. In older systems, an overflow usually would result in stopping the computation, while in IEEE arithmetic, the result would be assigned the symbol "Infinity." The result of adding "Infinity" to "Infinity," then taking the square root, would be "Infinity," so that $N$ would be assigned "Infinity." Similarly, if $x_1 = x_2 = 10^{-6}$, then $x_1^2 = 10^{-12}$, smaller than the smallest representable machine number, causing an "underflow." On older systems, the result is usually set to 0. On IEEE systems, if "gradual underflow" is switched on, the result either becomes a denormalized number, with less than full accuracy, or is set to 0; without gradual underflow on IEEE systems, the result is set to 0. When the result is set to 0, a value of 0 is stored in $N$, whereas the closest floating point number in the system is $0.14 \times 10^{-5}$, well within range. To avoid this type of catastrophic underflow and overflow in the computation of $N$, we may use the following scheme.
1. $s \leftarrow \max\{|x_1|, |x_2|\}$.
2. $\eta_1 \leftarrow x_1/s$; $\eta_2 \leftarrow x_2/s$.
3. $N \leftarrow s\sqrt{\eta_1^2 + \eta_2^2}$.
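The three-step scheme carries over directly to IEEE double precision, where the same overflow and underflow occur for much larger and smaller inputs. The sketch below is in Python; `naive_norm` and `scaled_norm` are illustrative names, the inputs $10^{200}$ and $10^{-200}$ are chosen so that their squares fall outside the double precision range, and the guard for $s = 0$ is an addition not present in the scheme as stated.

```python
import math

def naive_norm(x1, x2):
    """sqrt(x1^2 + x2^2) computed directly; the squares may overflow or underflow."""
    return math.sqrt(x1 * x1 + x2 * x2)

def scaled_norm(x1, x2):
    """The same quantity, computed after scaling by the larger magnitude."""
    s = max(abs(x1), abs(x2))            # step 1
    if s == 0.0:                         # both inputs are zero; avoid dividing by 0
        return 0.0
    eta1, eta2 = x1 / s, x2 / s          # step 2: both ratios are at most 1 in magnitude
    return s * math.sqrt(eta1 * eta1 + eta2 * eta2)   # step 3

for x in (1e200, 1e-200):
    print(f"x1 = x2 = {x:g}: naive = {naive_norm(x, x)}, scaled = {scaled_norm(x, x)}")
```

After scaling, the quantity under the square root lies between 1 and 2, so neither overflow nor underflow can occur there.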
1.2.2.1 Input and Output
For examining the output to large numerical computations arising from mathematical models, plots, graphs, and movies comprised of such plots and graphs are often preferred over tables of values. However, to develop such models and study numerical algorithms, it is necessary to examine individual numbers. Because humans are trained to comprehend decimal numbers more easily than binary numbers, the binary format used in the machine is usually converted to a decimal format for display or printing. In many programming languages and environments (such as all versions of Fortran, C, C++, and in matlab), the format is of a form similar to $\pm d_1.d_2 d_3 \ldots d_m \mathrm{e}{\pm}\delta_1\delta_2\delta_3$, or $\pm d_1.d_2 d_3 \ldots d_m \mathrm{E}{\pm}\delta_1\delta_2\delta_3$, where the "e" or "E" denotes the "exponent" of 10. For example, -1.00e+003 denotes $-1 \times 10^3 = -1000$. Numbers are usually also input either in a standard decimal form (such as 0.001), or in this exponential format (such as 1.0e-3). (This notation originates from the earliest computers, where the only output was a printer, and the printer could only print numerical digits and the 26 upper case letters in the Roman alphabet.)
Thus, for input, a decimal fraction needs to be converted to a binary floating point number, while, for output, a binary floating point number needs to be converted to a decimal fraction. This conversion necessarily is inexact. For example, the exact decimal fraction 0.1 converts to the infinitely repeating binary expansion $(0.0\overline{0011})_2$, which needs to be rounded into the binary floating point system. The IEEE 754 standard specifies that the result of a decimal to binary conversion, within a specified range of input formats, be the nearest floating point number to the exact result, over a specified range, and that, within a specified range of formats, a binary to decimal conversion be the nearest number in the specified format (which depends on the number $m$ of decimal digits requested to be printed).
Thus, the number that one sees as output is usually not exactly the number that is represented in the computer. Furthermore, while the floating point operations on binary numbers are usually implemented in hardware or "firmware" independently of the software system, the decimal to binary and binary to decimal conversions are usually implemented separately as part of the programming language (such as Fortran, C, C++, Java, etc.) or software system (such as matlab). The individual standards for these languages, if there are any, may not specify accuracy for such conversions, and the languages sometimes do not conform to the IEEE standard. That is, the number that one sees printed may not even be the closest number in that format to the actual number.
This inexactness in conversion usually does not cause a problem, but may cause much confusion in certain instances. In those instances (such as in "debugging," or finding programming blunders), one may need to examine the binary numbers directly. One way of doing this is in an "octal," or base-8 format, in which each digit (between 0 and 7) is interpreted as a group of three binary digits, or in hexadecimal format (where the digits are 0-9, A, B, C, D, E, F), in which each digit corresponds to a group of four binary digits.
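The gap between the decimal number typed in and the binary number actually stored can be made visible. The sketch below uses Python, whose standard library can print the exact decimal value of a stored double (`decimal.Decimal`) and its hexadecimal form (`float.hex`); the variable names are illustrative.

```python
from decimal import Decimal

stored = Decimal(0.1)          # exact decimal expansion of the double nearest to 0.1
hex_digits = (0.1).hex()       # the same stored number, in hexadecimal
round_trip = float.fromhex(hex_digits)

print("typed   : 0.1")
print("stored  :", stored)
print("hex     :", hex_digits)
print("printed : %.3e" % -1000.0)        # exponential output format
print("0.1 + 0.2 == 0.3 ?", 0.1 + 0.2 == 0.3)
```

In the hexadecimal form, the repeating digit 9 is the group $(1001)_2$, which reflects the repeating pattern in the binary expansion of 0.1; the final digit differs because the expansion has been rounded to fit the mantissa.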
1.2.2.2 Standard Functions
To enable accurate computation of elementary functions such as sin, cos, and exp, IEEE 754 specifies that a "long" 80-bit register (with "guard digits") be available for intermediate computations. Furthermore, IEEE 754-2008, an official update to IEEE 754-1985, provides a list of functions it recommends be implemented, and specifies accuracy requirements (in terms of correct rounding), for those functions a programming language elects to implement.

REMARK 1.4 Alternative number systems, such as variable precision arithmetic, multiple precision arithmetic, rational arithmetic, and combinations of approximate and symbolic arithmetic have been investigated and implemented. These have various advantages over the traditional floating point arithmetic we have been discussing, but also have disadvantages, and usually require more time, more circuitry, or both. Eventually, with the advance of computer hardware and better understanding of these alternative systems, their use may become more ubiquitous. However, for the foreseeable future, traditional floating point number systems will be the primary tool in numerical computations.
1.3 Interval Computations
Interval computations are useful for two main purposes:
• to use floating point computations to compute mathematically rigorous bounds on an exact result (and hence to rigorously bound roundoff error);
• to use floating point computations to compute mathematically rigorous bounds on the ranges of functions over boxes.
In complicated traditional floating point algorithms, naive arrangement of interval computations usually gives bounds that are too wide to be of practical use. For this reason, interval computations have been ignored by many. However, used cleverly and where appropriate, interval computations are powerful, and provide rigor and validation when other techniques cannot.
Interval computations are based on interval arithmetic.
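1.3.1 Interval Arithmetic
In interval arithmetic, we define operations on intervals, which can be considered as ordered pairs of real numbers. We can think of each interval as representing the range of possible values of a quantity. The result of an operation is then an interval that represents the range of possible results of the operation, as the first argument ranges over all points in the first interval and the second argument ranges over all points in the second interval. To state this symbolically, let $\boldsymbol{x} = [\underline{x}, \overline{x}]$ and $\boldsymbol{y} = [\underline{y}, \overline{y}]$, and define the four elementary operations by

\[ \boldsymbol{x} \odot \boldsymbol{y} = \{\, x \odot y \mid x \in \boldsymbol{x} \text{ and } y \in \boldsymbol{y} \,\} \quad \text{for } \odot \in \{+, -, \times, \div\}. \tag{1.1} \]

Interval arithmetic's usefulness derives from the fact that the mathematical characterization in Equation (1.1) is equivalent to the following operational definitions:

\[
\begin{aligned}
\boldsymbol{x} + \boldsymbol{y} &= [\underline{x} + \underline{y},\ \overline{x} + \overline{y}], \\
\boldsymbol{x} - \boldsymbol{y} &= [\underline{x} - \overline{y},\ \overline{x} - \underline{y}], \\
\boldsymbol{x} \times \boldsymbol{y} &= [\min\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\},\ \max\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\}], \\
\frac{1}{\boldsymbol{x}} &= \left[\frac{1}{\overline{x}},\ \frac{1}{\underline{x}}\right] \quad \text{if } \underline{x} > 0 \text{ or } \overline{x} < 0, \\
\boldsymbol{x} \div \boldsymbol{y} &= \boldsymbol{x} \times \frac{1}{\boldsymbol{y}}.
\end{aligned}
\tag{1.2}
\]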
The ranges of the four elementary interval arithmetic operations are exactly the ranges of the corresponding real operations, but, if such operations are composed, bounds on the ranges of real functions can be obtained. For example, if

\[ f(x) = (x + 1)(x - 1), \tag{1.3} \]

then

\[ f([-2, 2]) = \bigl([-2, 2] + 1\bigr)\bigl([-2, 2] - 1\bigr) = [-1, 3]\,[-3, 1] = [-9, 3], \]

which contains the exact range $[-1, 3]$.
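The operational definitions of interval addition, subtraction, multiplication, reciprocal, and division can be sketched in a few lines. The following Python class is an illustrative assumption of this sketch (the names `Interval` and `_as_interval` are hypothetical), and it deliberately ignores rounding, so it is not a rigorous implementation. It reproduces the evaluation of $(x+1)(x-1)$ over $[-2, 2]$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        other = _as_interval(other)
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        other = _as_interval(other)
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        other = _as_interval(other)
        products = (self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi)
        return Interval(min(products), max(products))

    def reciprocal(self):
        # Only defined here for intervals that do not contain zero.
        if not (self.lo > 0 or self.hi < 0):
            raise ZeroDivisionError("interval contains zero")
        return Interval(1 / self.hi, 1 / self.lo)

    def __truediv__(self, other):
        return self * _as_interval(other).reciprocal()


def _as_interval(value):
    # Treat a plain number c as the degenerate interval [c, c].
    return value if isinstance(value, Interval) else Interval(value, value)


x = Interval(-2.0, 2.0)
f_of_x = (x + 1) * (x - 1)
print(f_of_x)  # Interval(lo=-9.0, hi=3.0)
```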
REMARK 1.5 In some definitions of interval arithmetic, division by intervals containing 0 is defined, consistent with (1.1). For example,

\[ \frac{[1, 2]}{[-3, 4]} = \left[-\infty, -\tfrac{1}{3}\right] \cup \left[\tfrac{1}{4}, \infty\right] = \mathbb{R}^{*} \setminus \left(-\tfrac{1}{3}, \tfrac{1}{4}\right), \]

where $\mathbb{R}^{*}$ is the extended real number system,⁴ consisting of the real numbers with the two additional numbers $-\infty$ and $\infty$. This extended interval arithmetic⁵ was originally invented by William Kahan⁶ for computations with continued fractions, but has wider use than that. Although a closed system can be defined for the sets arising from this extended arithmetic, typically, the complements of intervals (i.e., the unions of two semi-infinite intervals) are immediately intersected with intervals, to obtain zero, one, or two intervals. Interval arithmetic can then proceed using (1.2).

⁴ Also known as the two-point compactification of the real numbers.
The power of interval arithmetic lies in its implementation on computers. In particular, outwardly rounded interval arithmetic allows rigorous enclosures for the ranges of operations and functions. This makes a qualitative difference in scientific computations, since the results are now intervals in which the exact result must lie. It also enables use of floating point computations for automated theorem proving.
Outward rounding can be implemented on any machine that has downward rounding and upward rounding, such as any machine that complies with the IEEE 754 standard. For example, take $\boldsymbol{x} + \boldsymbol{y} = [\underline{x} + \underline{y},\ \overline{x} + \overline{y}]$. If $\underline{x} + \underline{y}$ is computed with downward rounding, and $\overline{x} + \overline{y}$ is computed with upward rounding, then the resulting interval $\boldsymbol{z} = [\underline{z}, \overline{z}]$ that is represented in the machine must contain the exact range of $x + y$ for $x \in \boldsymbol{x}$ and $y \in \boldsymbol{y}$. We call the expansion of the interval from rounding the lower end point down and the upper end point up roundout error.
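A rough sketch of outwardly rounded addition follows. Standard Python does not expose the hardware rounding modes, so this sketch assumes a substitute: it computes each end point with round-to-nearest and then steps one machine number outward with `math.nextafter` (Python 3.9 or later). The result is still a valid enclosure, only slightly wider than true directed rounding would give. The helper name `add_outward` is hypothetical.

```python
import math
from fractions import Fraction


def add_outward(x, y):
    """Enclose {a + b : a in x, b in y}; x and y are (lo, hi) tuples of floats."""
    # Round-to-nearest is within half a unit in the last place of the exact
    # sum, so stepping one machine number outward guarantees containment.
    lo = math.nextafter(x[0] + y[0], -math.inf)  # stands in for downward rounding
    hi = math.nextafter(x[1] + y[1], math.inf)   # stands in for upward rounding
    return (lo, hi)


x = (0.1, 0.1)
y = (0.2, 0.2)
z = add_outward(x, y)

# Exact rational sum of the two stored binary numbers, for comparison.
exact = Fraction(0.1) + Fraction(0.2)
print(Fraction(z[0]) <= exact <= Fraction(z[1]))  # True
```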
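Interval arithmetic is only subdistributive. That is, if $\boldsymbol{x}$, $\boldsymbol{y}$, and $\boldsymbol{z}$ are intervals, then

\[ \boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) \subseteq \boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z}, \quad \text{but} \quad \boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) \neq \boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z} \text{ in general.} \tag{1.4} \]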
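As a small worked instance of strict inclusion (the particular intervals are chosen here only for illustration), take $\boldsymbol{x} = [-1, 1]$, $\boldsymbol{y} = [1, 2]$, and $\boldsymbol{z} = [-2, -1]$:

```latex
\begin{aligned}
\boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) &= [-1, 1]\,\bigl([1, 2] + [-2, -1]\bigr) = [-1, 1]\,[-1, 1] = [-1, 1], \\
\boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z} &= [-1, 1]\,[1, 2] + [-1, 1]\,[-2, -1] = [-2, 2] + [-2, 2] = [-4, 4].
\end{aligned}
```

Here $[-1, 1] \subseteq [-4, 4]$, but the two results are not equal.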
As a result, algebraic expressions that would be equivalent if real values are substituted for the variables are not equivalent if interval values are used. For example, instead of writing $(x - 1)(x + 1)$ for $f(x)$ in (1.3), suppose we write

\[ f(x) = x^2 - 1, \tag{1.5} \]

and suppose we provide a routine that computes an enclosure for the range of $x^2$ that is the exact range to within roundoff error. Such a routine could be as follows:
ALGORITHM 1.1
(Computing an interval whose end points are machine numbers and which encloses the range of $x^2$.)
⁵ There are small differences in current definitions of extended interval arithmetic. For example, in some systems, $-\infty$ and $\infty$ are not considered numbers, but just descriptive symbols. In those systems, $[1, 2]/[-3, 4] = (-\infty, -1/3] \cup [1/4, \infty) = \mathbb{R} \setminus (-1/3, 1/4)$. See [31] for a theoretical analysis of extended arithmetic.
⁶ Who also was a major contributor to the IEEE 754 standard.
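INPUT: $\boldsymbol{x} = [\underline{x}, \overline{x}]$.
OUTPUT: a machine-representable interval that contains the range of $x^2$ over $\boldsymbol{x}$.
IF $\underline{x} \ge 0$ THEN
    RETURN $[\underline{x}^2, \overline{x}^2]$, where $\underline{x}^2$ is computed with downward rounding and $\overline{x}^2$ is computed with upward rounding.
ELSE IF $\overline{x} \le 0$ THEN
    RETURN $[\overline{x}^2, \underline{x}^2]$, where $\overline{x}^2$ is computed with downward rounding and $\underline{x}^2$ is computed with upward rounding.
ELSE
    1. Compute $\underline{x}^2$ and $\overline{x}^2$ with both downward and upward rounding; that is, compute $\underline{x}^2_l$ and $\underline{x}^2_u$ such that $\underline{x}^2_l$ and $\underline{x}^2_u$ are machine representable numbers and $\underline{x}^2 \in [\underline{x}^2_l, \underline{x}^2_u]$, and compute $\overline{x}^2_l$ and $\overline{x}^2_u$ such that $\overline{x}^2_l$ and $\overline{x}^2_u$ are machine representable numbers and $\overline{x}^2 \in [\overline{x}^2_l, \overline{x}^2_u]$.
    2. RETURN $[0, \max\{\underline{x}^2_u, \overline{x}^2_u\}]$.
END IF
END ALGORITHM 1.1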
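Algorithm 1.1 can be sketched in Python as follows. As an assumption of this sketch, directed rounding is emulated by stepping one machine number outward with `math.nextafter` (Python 3.9 or later), which gives a valid but slightly wider enclosure than true downward and upward rounding. The function names are hypothetical.

```python
import math


def down(value):
    # One machine number below, but never below 0, since a square is nonnegative.
    return max(0.0, math.nextafter(value, -math.inf))


def up(value):
    # One machine number above.
    return math.nextafter(value, math.inf)


def square(x):
    """Enclose the range of t**2 for t in x = (lo, hi)."""
    lo, hi = x
    if lo >= 0:                      # interval lies to the right of zero
        return (down(lo * lo), up(hi * hi))
    if hi <= 0:                      # interval lies to the left of zero
        return (down(hi * hi), up(lo * lo))
    # Zero is inside the interval, so the minimum of t**2 is exactly 0.
    return (0.0, max(up(lo * lo), up(hi * hi)))


enclosure = square((-2.0, 2.0))
print(enclosure[0] == 0.0, enclosure[1] >= 4.0)  # True True
```

The third branch skips the lower bounds that step 1 of the algorithm computes, since only the upper bounds enter the returned interval.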
With Algorithm 1.1 and rewriting $f(x)$ from (1.3) as in (1.5), we obtain

\[ f([-2, 2]) = [-2, 2]^2 - 1 = [0, 4] - 1 = [-1, 3], \]

which, in this case, is equal to the exact range of $f$ over $[-2, 2]$.
In fact, this illustrates a general principle: If each variable in the expression occurs only once, then interval arithmetic gives the exact range, to within roundout error. We state this formally as
THEOREM 1.8
(Fundamental theorem of interval arithmetic.) Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with interval arithmetic. The algebraic expression or computer program can contain the four elementary operations and operations such as $x^n$, $\sin(x)$, $\exp(x)$, and $\log(x)$, etc., as long as the interval values of these functions contain their range over the input intervals. Then
1. The interval value $f(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ contains the range of $f$ over the interval vector (or box) $(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$.
2. If the single functions (the elementary operations and functions $x^n$, etc.) have interval values that represent their exact ranges, and if each variable $x_i$, $1 \le i \le n$, occurs only once in the expression for $f$, then the values of $f$ obtained by interval arithmetic represent the exact ranges of $f$ over the input intervals.
If the expression for $f$ contains one or more variables more than once, then overestimation of the range can occur due to interval dependency. For example, when we evaluate our example function $f([-2, 2])$ according to (1.3), the first factor, $[-1, 3]$, is the exact range of $x + 1$ for $x \in [-2, 2]$, while the second factor, $[-3, 1]$, is the exact range of $x - 1$ for $x \in [-2, 2]$. Thus, $[-9, 3]$ is the exact range of $f(x_1, x_2) = (x_1 + 1)(x_2 - 1)$ for $x_1$ and $x_2$ independent, $x_1 \in [-2, 2]$, $x_2 \in [-2, 2]$.
We now present some definitions and theorems to clarify the practical consequences of interval dependency.
DEFINITION 1.6 An expression for $f(x_1, \ldots, x_n)$ which is written so that each variable occurs only once is called a single use expression, or SUE.
Fortunately, we do not need to transform every expression into a single use expression for interval computations to be of value. In particular, the interval dependency becomes less as the widths of the input intervals become smaller. The following formal definition will help us to describe this precisely.
DEFINITION 1.7 Suppose an interval evaluation $f(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ gives $[a, b]$ as a result interval, but the exact range $\{ f(x_1, \ldots, x_n) : x_i \in \boldsymbol{x}_i,\ 1 \le i \le n \}$ is $[c, d] \subseteq [a, b]$. We define the excess width $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ in the interval evaluation $f(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ by $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = (c - a) + (b - d)$.
For example, the excess width in evaluating $f(x)$ represented as $(x + 1)(x - 1)$ over $\boldsymbol{x} = [-2, 2]$ is $(-1 - (-9)) + (3 - 3) = 8$. In general, we have
THEOREM 1.9
Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with interval arithmetic, as in Theorem 1.8, to obtain an interval enclosure $f(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ to the range of $f$ for $x_i \in \boldsymbol{x}_i$, $1 \le i \le n$. Then, if $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ is as in Definition 1.7, we have

\[ E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = O\Bigl( \max_{1 \le i \le n} w(\boldsymbol{x}_i) \Bigr), \]

where $w(\boldsymbol{x})$ denotes the width of the interval $\boldsymbol{x}$.
That is, the overestimation becomes less as the uncertainty in the arguments to the function becomes smaller.
Interval evaluations as in Theorem 1.9 are termed first-order interval extensions. It is not difficult to obtain second-order extensions, where required. (See Exercise below.)
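The first-order behavior can be observed numerically. The following Python sketch (the helper names are hypothetical, and rounding is ignored) computes the excess width of $(x+1)(x-1)$ over the symmetric intervals $[-w/2, w/2]$, using the fact that the exact range of $x^2 - 1$ over such an interval is $[-1, w^2/4 - 1]$.

```python
def mul(x, y):
    """Interval product of (lo, hi) tuples, without outward rounding."""
    products = (x[0] * y[0], x[0] * y[1], x[1] * y[0], x[1] * y[1])
    return (min(products), max(products))


def excess_width(w):
    """Excess width of the evaluation (x + 1)(x - 1) over x = [-w/2, w/2]."""
    lo, hi = -w / 2, w / 2
    a, b = mul((lo + 1, hi + 1), (lo - 1, hi - 1))  # interval evaluation [a, b]
    c, d = -1.0, hi * hi - 1.0                      # exact range of x**2 - 1
    return (c - a) + (b - d)


print(excess_width(4.0))  # 8.0, the value found for x = [-2, 2]
for w in (1.0, 0.5, 0.25, 0.125):
    # The ratio excess/width stays bounded as w shrinks.
    print(w, excess_width(w), excess_width(w) / w)
```

Halving the width roughly halves the excess width, which is what a bound proportional to the width predicts.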
1.3.2 Application of Interval Arithmetic: Examples
We give one such example here.
Example 1.22
Using 4-digit decimal floating point arithmetic, compute an interval enclosure for the first two digits of $e$, and prove that these two digits are correct.
Solution: The fifth degree Taylor polynomial representation for $e$ is

\[ e = 1 + 1 + \frac{1}{2!} + \frac{1}{3!} + \frac{1}{4!} + \frac{1}{5!} + \frac{1}{6!} e^{\xi} \]

for some $\xi \in [0, 1]$. If we assume we know $e < 3$ and we assume we know $e^x$ is an increasing function of $x$, then the error term is bounded by

\[ \left| \frac{1}{6!} e^{\xi} \right| \le \frac{3}{6!} < 0.005, \]

so this fifth-degree polynomial representation should be adequate. We will evaluate each term with interval arithmetic, and we will replace $e^{\xi}$ with $[1, 3]$. We obtain the following computation:
[1.000, 1.000] + [1.000, 1.000] → [2.000, 2.000]
[1.000, 1.000]/[2.000, 2.000] → [0.5000, 0.5000]
[2.000, 2.000] + [0.5000, 0.5000] → [2.500, 2.500]
[1.000, 1.000]/[6.000, 6.000] → [0.1666, 0.1667]
[2.500, 2.500] + [0.1666, 0.1667] → [2.666, 2.667]
[1.000, 1.000]/[24.00, 24.00] → [0.04166, 0.04167]
[2.666, 2.667] + [0.04166, 0.04167] → [2.707, 2.709]
[1.000, 1.000]/[120.0, 120.0] → [0.008333, 0.008334]
[2.707, 2.709] + [0.008333, 0.008334] → [2.715, 2.718]
[1.000, 1.000]/[720.0, 720.0] → [0.001388, 0.001389]
[0.001388, 0.001389] × [1, 3] → [0.001388, 0.004167]
[2.715, 2.718] + [0.001388, 0.004167] → [2.716, 2.723]
Since we used outward rounding in these computations, this constitutes a mathematical proof that $e \in [2.716, 2.723]$.
Note:
1. These computations can be done automatically on a computer, as simply as evaluating the function in floating point arithmetic. We will explain some programming techniques for this in Chapter 6, Section 6.2.
2. The solution is illustrative. More sophisticated methods, such as argument reduction, would be used in practice to bound values of $e^x$ more accurately and with fewer operations.
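The 4-digit computation above can be replayed with Python's standard `decimal` module, which supports a fixed number of significant decimal digits together with directed rounding modes (`ROUND_FLOOR` and `ROUND_CEILING`). This is a sketch under the assumption that all intervals involved are nonnegative, which holds here; the helper names are hypothetical.

```python
from decimal import Decimal, Context, ROUND_FLOOR, ROUND_CEILING

DOWN = Context(prec=4, rounding=ROUND_FLOOR)    # 4 digits, rounded downward
UP = Context(prec=4, rounding=ROUND_CEILING)    # 4 digits, rounded upward


def add(x, y):
    return (DOWN.add(x[0], y[0]), UP.add(x[1], y[1]))


def mul_nonneg(x, y):
    # Valid only when both intervals are nonnegative.
    return (DOWN.multiply(x[0], y[0]), UP.multiply(x[1], y[1]))


def recip_point(n):
    # Enclosure of 1/n for an exactly representable n > 0.
    return (DOWN.divide(Decimal(1), n), UP.divide(Decimal(1), n))


total = (Decimal(1), Decimal(1))
factorial = Decimal(1)
for k in range(1, 6):                 # terms 1/1!, 1/2!, ..., 1/5!
    factorial *= k
    total = add(total, recip_point(factorial))

factorial *= 6                        # 6! = 720
remainder = mul_nonneg(recip_point(factorial), (Decimal(1), Decimal(3)))
enclosure = add(total, remainder)
print(enclosure)  # (Decimal('2.716'), Decimal('2.723'))
```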
Proofs of the theorems, as well as greater detail, appear in various texts on interval arithmetic. A good book on interval arithmetic is R. E. Moore's classic text [27], although numerous more recent monographs and reviews are available. A World Wide Web search on the term "interval computations" will lead to some of these.
A general introduction to interval computations is [26]. That work not only gives a complete introduction, with numerous examples and explanation of pitfalls, but also provides examples with intlab, a free matlab toolbox for interval computations, and reference material for intlab. If you have matlab available, we recommend intlab for the exercises in this book involving interval computations.
1.4 Programming Environments
Modern scientific computing (with floating point numbers) is usually done with high-level "imperative" (as opposed to "functional") programming languages. Common programming environments in use for general scientific computing today are Fortran (or FORmula TRANslation), C/C++, and matlab. Fortran is the original such language, with its origins in the late 1950's. There is a large body of high-quality publicly available software in Fortran for common computations in numerical analysis and scientific computing. Such software can be found, for example, on NETLIB at
http://www.netlib.org/
Fortran has evolved over the years, becoming a modern, multi-faceted language with the Fortran 2003 standard. Throughout, the emphasis by both the standardization committee and suppliers of compilers for Fortran has been on features that simplify programming of solutions to large problems in numerical analysis and scientific computing, and features that enable high performance, especially on computers that can process vectors and matrices efficiently.
The "C" language, originally developed in conjunction with the Unix operating system, was originally meant to be a higher-level language for designing and accessing the operating system, but has become more ubiquitous since then. C++, appearing in the late 1980's, was the first widely available language⁷ to allow the object-oriented programming paradigm. In recent years, computer science departments have favored teaching C++ over teaching Fortran, and Fortran has fallen out of favor in relative terms. However, Fortran is still favored in certain large-scale applications such as fluid dynamics (e.g., in weather prediction and similar simulations), and some courses are still offered in it in engineering schools. Even so, some people still think of Fortran as the now somewhat rudimentary language known as FORTRAN 77.
Reasonably high-quality compilers for both Fortran and C/C++ are available free of charge with Linux operating systems. Fortran 2003, largely implemented in these compilers, has a standardized interface to C, so functions written in C can be called from Fortran programs, and vice versa. These compilers include interactive graphical-user-interface-oriented debuggers, such as "insight," available with the Linux operating system. Commercially available compilation and debugging systems are also available under Windows.
The matlab system has become increasingly popular over the last two decades or so. matlab (or MATrix LABoratory) began in the early 1980's as a National Science Foundation project, written by Cleve Moler in FORTRAN 66, to provide an interactive environment for computing with matrices and vectors, but has since evolved to be both an interactive environment and a full-featured programming language. matlab is highly favored in courses such as this, because of the ease of programming, debugging, and general use (such as graphing), and because of the numerous toolboxes, supplied by both Mathworks (Cleve Moler's company) and others, for many basic computing tasks and applications. The main drawback to use of matlab in all scientific computations is that the language is interpreted; that is, matlab translates each line of a program to machine language each time that it executes the line. This makes complicated programs that involve nested iterations much slower (a factor of 60 or more) than comparable programs written in a compiled language such as C or Fortran. However, functions compiled from Fortran or C/C++ can be called from matlab. A common strategy has been to initially develop algorithms in matlab, then translate all or part of the program to a compilable language, as necessary for efficiency.
One perceived disadvantage of matlab is that it is proprietary. Undesirable possible consequences are that it is not free, and there is no official guarantee that it will be available forever, unchanged. However, its use has become so widespread in recent years that these concerns do not seem to be major. Several projects, including "Octave" and "Scilab," have produced free products that partially support the matlab programming language. The most widely distributed of these, "Octave," is integrated into Linux systems. However, the object-oriented features of Octave are rudimentary compared to those of matlab, and some toolboxes, such as intlab (which we will mention later), will not function with Octave.
⁷ With others, including Fortran, to follow.
Alternative systems sometimes used for scientific computing are computer algebra systems. Perhaps the most common of these are Mathematica and Maple, while a free such system under development is "SAGE." These systems admit a different way of thinking about programming, termed functional programming, in which rules are defined and available at all times to automatically simplify expressions that are presented to the system. (In contrast, in imperative programming, a sequence of commands is executed one after the other.) Although these systems have become comprehensive, they are based in computations of a different character, rather than in the floating point computations and linear algebra involved in numerical analysis and scientific computing.
We will use matlab in this book to illustrate the concepts, techniques, and applications. With newer versions of matlab, a student can study how to use the system and make programs largely by using the matlab help system. The first place to turn will be the "Getting started" demos, which, in newer versions, are presented as videos. There are also many books devoted to use of matlab. Furthermore, we will be giving examples throughout this book.
matlab programs can be written as matlab scripts and matlab functions.
Example 1.23
The matlab script we used to produce the table following Example 1.7 (on page 8) is:
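a = 2
x = 2
xold = x
err_old = 1
for k = 0:10
    k
    x
    err = x - sqrt(2)        % error in the current iterate
    err
    ratio = err/err_old^2    % roughly constant for quadratic convergence
    err_old = err
    x = x/2 + 1/x            % Newton step for x^2 - 2 = 0
end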
Example 1.24
The matlab script we used to produce the table in Example 1.8 (on page 9) is:
format long
a = 2
x = 2
xold = x
err_old = 1
for k = 0:25
    k
    x
    err = x - sqrt(2)            % error in the current iterate
    err
    ratio = err/err_old          % roughly constant for linear convergence
    err_old = err
    x = x - x^2/3.5 + 2/3.5      % fixed point iteration step
end
An excellent alternative text book that focuses on matlab functions is Cleve Moler's Numerical Computing with Matlab [25]. An on-line version, along with ".m" files, etc., is currently available at http://www.mathworks.com/moler/chapters.html.
1.5 Applications
The purpose of the methods and techniques in this book ultimately is to provide both accurate predictions and insight into practical problems. This includes understanding and predicting, and managing or controlling, the evolution of ecological systems and epidemics; designing and constructing durable but inexpensive bridges, buildings, roads, and water control structures; understanding chemical and physical processes; designing chemical plants and electronic components and systems; minimizing costs or maximizing delivery of products or services within companies and governments; etc. To achieve these goals, the numerical methods are a small part of the overall modeling process, which can be viewed as consisting of the following steps.
Identify the problem: This is the first step in translating an often vague situation into a mathematical problem to be solved. What questions must be answered, and how can they be quantified?
Assumptions: Which factors are to be ignored, and which are important? The real world is usually significantly more complicated than mathematical models of it, and simplifications must be made, because some factors are poorly understood, because there isn't enough data to determine some minor factors, or because it is not practical to accurately solve the resulting equations unless the model is simplified. For example, the theory of relativity and variations in the acceleration of gravity, due to the fact that the earth is not exactly round and due to the fact that the density differs from point to point on the surface of the earth, in principle will affect the trajectory of a baseball as it leaves the bat. However, such effects can be ignored when we write down a model of the trajectory of the baseball. On the other hand, we need to include such effects if we are measuring the change in distance between two satellites in a tandem orbit to detect the location of mineral deposits on the surface of the earth.
Construction: In this step, we actually translate the problem into mathematical language.
Analysis: We solve the mathematical problem. Here is where the numerical techniques in this book come into play. With more complicated models, there is an interplay between the previous three steps and this solution process: We may need to simplify the process to enable practical solution. Also, presentation of the result is important here, to maximize the usefulness of the results. In the early days of scientific computing, printouts of numbers were used, but, increasingly, results are presented as two- and three-dimensional graphs and movies.
Interpretation: The numerical solution is compared to the original problem. If it does not make sense, go back and reformulate the assumptions.
Validation: Compare the model to real data. For example, in climate models, the model might be used to predict climate changes in past years, before it is used to predict future climate changes.
Note that there is an intrinsic error introduced in the modeling process (such as when certain phenomena are ignored) that is outside the scope of our study of numerical methods. Such error can only be measured indirectly, through the interpretation and validation steps. In the model solution process (the "analysis" step), errors are also introduced, due to roundoff error and the approximation process. We have seen that such error consists of approximation error and roundoff error. In a study of numerical methods and numerical analysis, we quantify and find bounds on such errors. Although this may not be the major source of error, it is important to know. Consequences of this type of error might be that a good model is rejected, that incorrect conclusions are deduced about the process being modeled, etc. The authors of this book have personal experience with these events.
Errors in the modeling process can sometimes be quantified in the solution process. If the model depends on parameters that are not known precisely, but bounds on those parameters are known, knowledge of these bounds can sometimes be incorporated into the mathematical equations, and the set of possible solutions can sometimes be computed or bounded. One tool that sometimes works is interval arithmetic. Other tools, less mathematically definite but applicable in different situations, are statistical methods and computing solutions to the model for many different values of the parameters.
Throughout this book, we introduce applications from many areas.
Example 1.25
The formula for the net capacitance when two capacitors of values $x$ and $y$ are connected in series is

\[ z = \frac{xy}{x + y}. \]

Suppose the measured values of $x$ and $y$ are $x = 1$ and $y = 2$, respectively. Estimate the range of possible values of $z$, given that the true values of $x$ and $y$ are known to be within ±10% of the measured value.
In this example, the identification, assumptions, and construction have already been done. (It is well known how capacitances in a linear electrical circuit behave.) We are asked to analyze the error in the output of the computation, due to errors in the data. We may proceed using interval arithmetic, relying on the accuracy assumptions for the measured values. In particular, these assumptions imply that $x \in [0.9, 1.1]$ and $y \in [1.8, 2.2]$. We will plug these intervals into the expression for $z$, but we first use Theorem 1.8, part (2) as a guide to rewrite the expression for $z$ so $x$ and $y$ only occur once. (We do this so we obtain sharp bounds on the range, without overestimation.) Dividing the numerator and denominator for $z$ by $xy$, we obtain

\[ z = \frac{1}{\dfrac{1}{x} + \dfrac{1}{y}}. \]
We use the intlab toolbox⁸ for matlab to evaluate $z$. We have the following dialog in matlab's command window.
>> intvalinit('DisplayInfsup')
===> Default display of intervals by infimum/supremum
>> x = intval('[0.9,1.1]')
intval x =
[ 0.8999, 1.1001]
>> y = intval('[1.8,2.2]')
intval y =
[ 1.7999, 2.2001]
>> z = 1/(1/x + 1/y)
intval z =
[ 0.5999, 0.7334]
>> format long
>> z
intval z =
[ 0.59999999999999, 0.73333333333334]
>>
Thus, the capacitance must lie between 0.5999 and 0.7334.
Note that x and y are input as strings. This is to assure that roundoff errors in converting the decimal expressions 0.9, 1.1, 1.8, and 2.2 into internal binary format are taken into account. See [26] for more examples of the use of intlab.

⁸ If one has matlab, intlab is available free of charge for non-commercial use from http://www.ti3.tu-harburg.de/~rump/intlab/
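To see why rewriting the formula matters, the following Python sketch evaluates both forms of $z$ over the same input intervals. It is not intlab: as an assumption of this sketch, exact rational arithmetic (`fractions.Fraction`) is used in place of outwardly rounded floating point, and the helper names are hypothetical.

```python
from fractions import Fraction as F


def add(x, y):
    return (x[0] + y[0], x[1] + y[1])


def mul(x, y):
    products = (x[0] * y[0], x[0] * y[1], x[1] * y[0], x[1] * y[1])
    return (min(products), max(products))


def recip(x):
    # Only defined for intervals that do not contain zero.
    assert x[0] > 0 or x[1] < 0
    return (1 / x[1], 1 / x[0])


x = (F(9, 10), F(11, 10))     # [0.9, 1.1]
y = (F(18, 10), F(22, 10))    # [1.8, 2.2]

# x and y each occur twice: interval dependency widens the result.
naive = mul(mul(x, y), recip(add(x, y)))

# Single use expression: each variable occurs once, so the range is exact.
single_use = recip(add(recip(x), recip(y)))

print([float(v) for v in naive])       # wider enclosure
print([float(v) for v in single_use])  # end points 3/5 and 11/15
```

Both results contain the true range of $z$, but only the single use form gives the sharp interval whose end points are 3/5 and 11/15.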
1.6 Exercises
1. Write down a polynomial $p(x)$ such that $|S(x) - p(x)| \le 10^{-10}$ for $-0.2 \le x \le 0.2$, where

\[ S(x) = \begin{cases} \dfrac{\sin(x)}{x} & \text{if } x \ne 0, \\ 1 & \text{if } x = 0. \end{cases} \]

Note: $\mathrm{sinc}(x) = S(\pi x) = \sin(\pi x)/(\pi x)$ is the "sinc" function (well known in signal processing, etc.).
(a) Show that your polynomial $p$ satisfies the condition $|\mathrm{sinc}(x) - p(x)| \le 10^{-10}$ for $x \in [-0.2, 0.2]$.
Hint: You can obtain polynomial approximations with error terms for $\mathrm{sinc}(x)$ by writing down Taylor polynomials and corresponding error terms for $\sin(x)$, then dividing these by $x$. This can be easier than trying to differentiate $\mathrm{sinc}(x)$. For the proof part, you can use, for example, the Taylor polynomial remainder formula or the alternating series test, and you can use interval arithmetic to obtain bounds.
(b) Plot your polynomial approximation and $\mathrm{sinc}(x)$ on the same graph,
(i) over the interval $[-0.2, 0.2]$,
(ii) over the interval $[-3, 3]$,
(iii) over the interval $[-10, 10]$.
2. Suppose $f$ has a continuous third derivative. Show that

\[ \left| \frac{f(x + h) - f(x - h)}{2h} - f'(x) \right| = O(h^2). \]

3. Suppose $f$ has a continuous fourth derivative. Show that

\[ \left| \frac{f(x + h) - 2f(x) + f(x - h)}{h^2} - f''(x) \right| = O(h^2). \]

4. Let $a = 0.41$, $b = 0.36$, and $c = 0.7$. Assuming a 2-digit decimal computer arithmetic with rounding, show that $\dfrac{a - b}{c} \ne \dfrac{a}{c} - \dfrac{b}{c}$ when using this arithmetic.
5. Write down a formula relating the unit roundoff $\delta$ of Definition 1.3 (page 14) and the machine epsilon $\epsilon_m$ defined on page 20.
6. Store and run the following matlab script. What are your results? What does the script compute? Can you say anything about the computer arithmetic underlying matlab?
eps = 1
x = 1+eps
while(x~=1)
    eps = eps/2
    x = 1+eps
end
eps = eps+(2*eps)^2
y = 1+eps
y-1
7. Suppose, for illustration, we have a system with base $\beta = 10$, $t = 3$ decimal digits in the mantissa, and $L = -9$, $U = 9$ for the exponent. For example, $0.123 \times 10^4$, that is, 1230, is a machine number in this system. Suppose also that "round to nearest" is used in this system.
(a) What is HUGE for this system?
(b) What is TINY for this system?
(c) What is the machine epsilon $\epsilon_m$ for this system?
(d) Let $f(x) = \sin(x) + 1$.
i. Write down $fl(f(0))$ and $fl(f(0.0008))$ in normalized format for this toy system.
ii. Compute $fl\bigl(fl(f(0.0008)) - fl(f(0))\bigr)$. On the other hand, what is the nearest machine number to the exact value of $f(0.0008) - f(0)$?
iii. Compute $fl\bigl(fl(f(0.0008)) - fl(f(0))\bigr)/fl(0.0008)$. Compare this to the nearest machine number to the exact value of $(f(0.0008) - f(0))/0.0008$ and to $f'(0)$.
8. Let $f(x) = \dfrac{\ln(x + 1) - \ln(x)}{2}$.
(a) Use four-digit decimal arithmetic with rounding to evaluate $f(100\,000)$.
(b) Use the Mean Value Theorem to approximate $f(x)$ in a form that avoids the loss of significant digits. Use this form to evaluate $f(x)$ for $x = 100\,000$ once again.
(c) Compare the relative errors for the answers obtained in (a) and (b).
9. Compute the condition number of $f(x) = e^{\sqrt{x^2 - 1}}$, $x > 1$, and discuss any possible ill-conditioning.
10. Let $f(x) = (\sin(x))^2 + x^2$. Use interval arithmetic to prove that there are no solutions to $f(x) = 0$ for $x \in [-1, -0.8]$.
Following the "rounding to nearest" definition of $fl$, we emulate the operations a machine would do, as follows:

\[ a \leftarrow 0.410 \times 10^{0}, \quad b \leftarrow 0.135 \times 10^{-3}, \quad c \leftarrow 0.431 \times 10^{-3}, \]

and

\[
\begin{aligned}
fl(b + c) &= fl(0.135 \times 10^{-3} + 0.431 \times 10^{-3}) \\
&= fl(0.566 \times 10^{-3}) \\
&= 0.566 \times 10^{-3},
\end{aligned}
\]

so

\[
\begin{aligned}
fl(a + 0.566 \times 10^{-3}) &= fl(0.410 \times 10^{0} + 0.566 \times 10^{-3}) \\
&= fl(0.410 \times 10^{0} + 0.000566 \times 10^{0}) \\
&= fl(0.410566 \times 10^{0}) \\
&= 0.411 \times 10^{0}.
\end{aligned}
\]

On the other hand,

\[
\begin{aligned}
fl(a + b) &= fl(0.410 \times 10^{0} + 0.135 \times 10^{-3}) \\
&= fl(0.410000 \times 10^{0} + 0.000135 \times 10^{0}) \\
&= fl(0.410135 \times 10^{0}) \\
&= 0.410 \times 10^{0},
\end{aligned}
\]

so

\[
\begin{aligned}
fl(0.410 \times 10^{0} + c) &= fl(0.410 \times 10^{0} + 0.431 \times 10^{-3}) \\
&= fl(0.410 \times 10^{0} + 0.000431 \times 10^{0}) \\
&= fl(0.410431 \times 10^{0}) \\
&= 0.410 \times 10^{0} \ne 0.411 \times 10^{0}.
\end{aligned}
\]

Thus, the associative law of addition does not hold for floating point arithmetic with "round to nearest": $fl(a + fl(b + c)) \ne fl(fl(a + b) + c)$. Furthermore, this illustrates that accuracy is improved if numbers of like magnitude are added first in a sum.
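The two orders of summation can be checked with Python's standard `decimal` module, set to three significant decimal digits. This is a sketch; the choice of `ROUND_HALF_UP` as the "round to nearest" rule is an assumption here, and no tie occurs in this example, so the tie-breaking rule does not affect the result.

```python
from decimal import Decimal, Context, ROUND_HALF_UP

FL = Context(prec=3, rounding=ROUND_HALF_UP)  # 3-digit decimal arithmetic

a = Decimal("0.410")
b = Decimal("0.135E-3")
c = Decimal("0.431E-3")

small_first = FL.add(a, FL.add(b, c))   # fl(a + fl(b + c))
large_first = FL.add(FL.add(a, b), c)   # fl(fl(a + b) + c)

print(small_first, large_first)  # 0.411 0.410
```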
The following error bound is useful in some analyses.
THEOREM 1.5

\[ |x - fl(x)| \le \frac{1}{2} |x| \beta^{1-t} p, \]

where $p = 1$ for rounding and $p = 2$ for chopping.
DEFINITION 1.3 $\delta = \dfrac{p}{2} \beta^{1-t}$ is called the unit roundoff error.
Let $\epsilon = \dfrac{fl(x) - x}{x}$. Then $fl(x) = (1 + \epsilon)x$, where $|\epsilon| \le \delta$. With this, we have the following.
THEOREM 1.6
Let $\odot$ denote the operation $+$, $-$, $\times$, or $\div$, and let $x$ and $y$ be machine numbers. Then

\[ fl(x \odot y) = (x \odot y)(1 + \epsilon), \quad \text{where } |\epsilon| \le \delta = \frac{p}{2} \beta^{1-t}. \]
Roundoff error that accumulates as a result of a sequence of arithmetic operations can be analyzed using this theorem. Such an analysis is called forward error analysis.
Because of the properties of floating point arithmetic, it is unreasonable to demand strict tolerances when the exact result is too large.
Example 1.13
Suppose $\beta = 10$ and $t = 3$ (3-digit decimal arithmetic), and suppose we wish to compute $10^4\pi$ with a computed value $x$ such that $|10^4\pi - x| < 10^{-2}$. The closest floating point number in our system to $10^4\pi$ is $x = 0.314 \times 10^5 = 31400$. However, $|10^4\pi - x| = 15.926\ldots$. Hence, it is impossible to find a number $x$ in the system with $|10^4\pi - x| < 10^{-2}$.
The error $|10^4\pi - x|$ in this example is called the absolute error in approximating $10^4\pi$. We see that absolute error is not an appropriate measure of error when using floating point arithmetic. For this reason, we use relative error:
DEFINITION 1.4 Let $x^*$ be an approximation to $x$. Then $|x - x^*|$ is called the absolute error, and $\left| \dfrac{x - x^*}{x} \right|$ is called the relative error.
For example,

\[ \left| \frac{x - fl(x)}{x} \right| \le \delta = \frac{p}{2} \beta^{1-t} \quad \text{(unit roundoff error)}. \]
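The distinction between the two error measures can be checked numerically for the $10^4\pi$ example. In this Python sketch, rounding to three significant decimal digits is emulated with string formatting (`.3g`), which is an assumption of the sketch rather than a model of any particular machine.

```python
import math

beta, t, p = 10, 3, 1                 # base, digits, p = 1 for rounding
delta = p / 2 * beta ** (1 - t)       # unit roundoff error, 0.005 here

exact = 1e4 * math.pi
approx = float(f"{exact:.3g}")        # nearest 3-digit decimal number: 31400.0

abs_err = abs(exact - approx)         # about 15.93, far above 10**-2
rel_err = abs_err / abs(exact)        # about 5.1e-4, below delta

print(approx, abs_err, rel_err, rel_err <= delta)
```

The absolute error is large only because the number itself is large; the relative error stays within the unit roundoff error.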
1211 Expression Evaluation and Condition Numbers
We now examine some common situations in which roundoff error can become large and explain how to avoid many of these situations
Example 1.14
$\beta = 10$, $t = 4$, $p = 1$. (Thus, $\delta = \frac{1}{2}10^{-3} = 0.0005$.) Let $x = 0.5795 \times 10^5$, $y = 0.6399 \times 10^5$. Then
$$fl(x + y) = 0.1219 \times 10^6 = (x + y)(1 + \epsilon_1), \quad \epsilon_1 \approx -3.28 \times 10^{-4}, \quad |\epsilon_1| < \delta,$$
and
$$fl(xy) = 0.3708 \times 10^{10} = (xy)(1 + \epsilon_2), \quad \epsilon_2 \approx -5.95 \times 10^{-5}, \quad |\epsilon_2| < \delta.$$
(Note: $x + y = 0.12194 \times 10^6$, $xy = 0.37082205 \times 10^{10}$.)
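The numbers in Example 1.14 can be checked mechanically. The following is a minimal sketch in Python (used here for illustration rather than matlab, since its standard `decimal` module can round every operation to $t$ decimal digits). The names `CTX`, `DELTA`, and `rel_error` are illustrative assumptions, not notation from the text.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

T = 4                                             # digits in the mantissa
CTX = Context(prec=T, rounding=ROUND_HALF_UP)     # beta = 10, rounding (p = 1)
DELTA = Decimal(1) / 2 * Decimal(10) ** (1 - T)   # unit roundoff (p/2) * beta**(1 - t)


def rel_error(computed, exact):
    """Return epsilon such that computed = exact * (1 + epsilon)."""
    return (computed - exact) / exact


x, y = Decimal("57950"), Decimal("63990")         # 0.5795e5 and 0.6399e5

fl_sum = CTX.add(x, y)                            # sum rounded to 4 digits
fl_prod = CTX.multiply(x, y)                      # product rounded to 4 digits

# x + y and x * y below use Python's default 28-digit context, so they are exact here.
eps1 = rel_error(fl_sum, x + y)
eps2 = rel_error(fl_prod, x * y)

print(fl_sum, float(eps1), abs(eps1) < DELTA)
print(fl_prod, float(eps2), abs(eps2) < DELTA)
```

Both relative errors stay below $\delta = 0.0005$, as Theorem 1.6 predicts for a single operation on machine numbers.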
Example 1.15
Suppose $\beta = 10$ and $t = 4$ (4-digit arithmetic), and suppose $x_1 = 10000$ and $x_2 = x_3 = \cdots = x_{1001} = 1$. Then
$$fl(x_1 + x_2) = 10000, \quad fl(x_1 + x_2 + x_3) = 10000, \quad \ldots, \quad fl\left(\sum_{i=1}^{1001} x_i\right) = 10000$$
when we sum forward from $x_1$. But, going backwards,
$$fl(x_{1001} + x_{1000}) = 2, \quad fl(x_{1001} + x_{1000} + x_{999}) = 3, \quad \ldots, \quad fl\left(\sum_{i=1001}^{1} x_i\right) = 11000,$$
which is the correct sum.
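The effect of summation order in Example 1.15 can be reproduced with a short sketch in Python's `decimal` module. Round-to-nearest is assumed here (the example does not state the rounding mode), and the helper name `fl_sum` is an illustrative assumption.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

CTX = Context(prec=4, rounding=ROUND_HALF_EVEN)   # beta = 10, t = 4


def fl_sum(values):
    """Add the values in the given order, rounding to 4 digits after every addition."""
    total = Decimal(0)
    for v in values:
        total = CTX.add(total, v)
    return total


xs = [Decimal(10000)] + [Decimal(1)] * 1000       # x_1 = 10000, x_2 = ... = x_1001 = 1

forward = fl_sum(xs)                              # starts with the large number
backward = fl_sum(reversed(xs))                   # adds the 1000 ones first

print(forward, backward)
```

Summing forward, each added 1 is lost because 10001 is not representable with four digits; summing backward, the ones accumulate exactly to 1000 before the large term is added.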
This example illustrates the point that large relative errors occur when a large number of small numbers is added to a large number, or when a very large number of small, almost-equal numbers is added. To avoid such large relative errors, one can sum from the smallest number to the largest number. However, this will not work if the numbers are all approximately equal. In such cases, one possibility is to group the numbers into sets of two adjacent numbers, summing two almost equal numbers together. One then groups those results into sets of two and sums these together, continuing until the total sum is reached. In this scheme, two almost equal numbers are always being summed, and the large relative error from repeatedly summing a small number to a large number is avoided.
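The grouping scheme just described (often called pairwise summation) can be sketched as follows, again in simulated 4-digit decimal arithmetic with round-to-nearest. The function names and the test data (16384 copies of 1) are assumptions chosen for illustration.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

CTX = Context(prec=4, rounding=ROUND_HALF_EVEN)   # beta = 10, t = 4, round to nearest


def sequential_sum(values):
    """Add the values one after another, rounding after every addition."""
    total = Decimal(0)
    for v in values:
        total = CTX.add(total, v)
    return total


def pairwise_sum(values):
    """Sum adjacent pairs, then pairs of those results, until one value remains."""
    values = list(values)
    if not values:
        return Decimal(0)
    while len(values) > 1:
        paired = [CTX.add(values[i], values[i + 1])
                  for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:            # an unpaired last value is carried forward
            paired.append(values[-1])
        values = paired
    return values[0]


ones = [Decimal(1)] * 16384            # all values equal; the exact sum is 16384

print(sequential_sum(ones))            # stalls once the running total reaches 10000
print(pairwise_sum(ones))              # only the final addition needs rounding
```

With these data the sequential sum stalls at 10000, while the pairwise sum returns 16380, the 4-digit number nearest to 16384.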
Example 1.16
$x_1 = 15.314768$, $x_2 = 15.314899$, $\beta = 10$, $t = 6$ (6-digit decimal accuracy). Then $x_2 - x_1 \approx fl(x_2) - fl(x_1) = 15.3149 - 15.3148 = 0.0001$. Thus,
$$\left|\frac{x_2 - x_1 - (fl(x_2) - fl(x_1))}{x_2 - x_1}\right| = \frac{0.000131 - 0.0001}{0.000131} = 0.237,$$
that is, a relative error of 23.7%.
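A minimal sketch of this cancellation in Python's `decimal` module follows; it assumes round-to-nearest for the 6-digit system, and the variable names are illustrative.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

CTX = Context(prec=6, rounding=ROUND_HALF_EVEN)   # beta = 10, t = 6

x1, x2 = Decimal("15.314768"), Decimal("15.314899")

fl_x1, fl_x2 = CTX.plus(x1), CTX.plus(x2)         # round the inputs to 6 digits
computed = CTX.subtract(fl_x2, fl_x1)             # difference of the rounded inputs
exact = x2 - x1                                   # exact in the default 28-digit context

input_error = abs((x1 - fl_x1) / x1)              # relative error of fl(x1)
result_error = abs((exact - computed) / exact)    # relative error of the difference

print(fl_x1, fl_x2, computed)
print(float(input_error), float(result_error))
```

The inputs are each accurate to about six digits, yet the computed difference has a relative error of roughly 24 percent, because the leading digits cancel and only the rounded trailing digits remain.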
This example illustrates that large relative errors can occur when two nearly equal numbers are subtracted on a computer. Sometimes, an algorithm can be modified to reduce rounding error occurring from this source, as the following example illustrates.
Example 1.17
Consider finding the roots of $ax^2 + bx + c = 0$, where $b^2$ is large compared with $4ac$. The most common formula for the roots is
$$x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.$$
Consider $x^2 + 100x + 1 = 0$, $\beta = 10$, $t = 4$, $p = 2$, and 4-digit chopped arithmetic. Then
$$x_1 = \frac{-100 + \sqrt{9996}}{2}, \quad x_2 = \frac{-100 - \sqrt{9996}}{2},$$
but $\sqrt{9996} \approx 99.97$ (4-digit arithmetic, chopped). Thus,
$$x_1 \approx \frac{-100 + 99.97}{2}, \quad x_2 \approx \frac{-100 - 99.97}{2}.$$
Hence, $x_1 \approx -0.015$, $x_2 \approx -99.98$, but $x_1 = -0.010001\ldots$ and $x_2 = -99.989999\ldots$, so the relative errors in $x_1$ and $x_2$ are 50% and 0.01%, respectively.
Let's change the algorithm. Assume $b \ge 0$ (we can always make $b \ge 0$). Then
$$x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \cdot \frac{-b - \sqrt{b^2 - 4ac}}{-b - \sqrt{b^2 - 4ac}} = \frac{4ac}{2a\left(-b - \sqrt{b^2 - 4ac}\right)} = \frac{-2c}{b + \sqrt{b^2 - 4ac}},$$
and
$$x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a} \quad \text{(the same as before)}.$$
Then, for the above values,
$$x_1 = \frac{-2(1)}{100 + \sqrt{9996}} \approx \frac{-2}{100 + 99.97} = -0.0100.$$
Now, the relative error in $x_1$ is also 0.01%.
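The two formulas for $x_1$ can be compared in simulated 4-digit chopped arithmetic. The sketch below uses Python's `decimal` module and chops after every operation, which is slightly stricter than the hand computation above but gives the same values of $x_1$. The names `CHOP`, `WIDE`, `chopped_sqrt`, and `smaller_root` are illustrative assumptions.

```python
from decimal import Context, Decimal, ROUND_DOWN

CHOP = Context(prec=4, rounding=ROUND_DOWN)   # beta = 10, t = 4, chopping (p = 2)
WIDE = Context(prec=30)                       # extra digits, used for reference values


def chopped_sqrt(v):
    # Decimal.sqrt rounds to nearest whatever the context's rounding mode is,
    # so the root is formed with extra digits and then chopped to 4 digits.
    return CHOP.plus(v.sqrt(WIDE))


def smaller_root(a, b, c):
    """Return x1 from the standard formula and from the rearranged one (assumes b >= 0)."""
    four_ac = CHOP.multiply(CHOP.multiply(Decimal(4), a), c)
    s = chopped_sqrt(CHOP.subtract(CHOP.multiply(b, b), four_ac))
    standard = CHOP.divide(CHOP.add(-b, s), CHOP.multiply(Decimal(2), a))
    rearranged = CHOP.divide(CHOP.multiply(Decimal(-2), c), CHOP.add(b, s))
    return standard, rearranged


a, b, c = Decimal(1), Decimal(100), Decimal(1)
standard, rearranged = smaller_root(a, b, c)

# Reference value of x1, computed with 30 digits.
exact = WIDE.divide(WIDE.add(-b, WIDE.sqrt(Decimal(9996))), Decimal(2))

rel_standard = abs((standard - exact) / exact)
rel_rearranged = abs((rearranged - exact) / exact)

print(standard, float(rel_standard))
print(rearranged, float(rel_rearranged))
```

The standard formula subtracts two nearly equal numbers and loses about half of the value of $x_1$, whereas the rearranged formula only adds numbers of the same sign.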
Let us now consider error in function evaluation. Consider a single-valued function $f(x)$, and let $x^* = fl(x)$ be the floating point approximation of $x$. Therefore, the machine evaluates $f(x^*) = f(fl(x))$, which is an approximate value of $f(x)$ at $x = x^*$. Then, the perturbation in $f(x)$ for small perturbations in $x$ can be computed via Taylor's formula. This is illustrated in the next theorem.
THEOREM 1.7 The relative error in functional evaluation is
$$\left|\frac{f(x) - f(x^*)}{f(x)}\right| \approx \left|\frac{x f'(x)}{f(x)}\right| \left|\frac{x - x^*}{x}\right|.$$
PROOF The linear Taylor approximation of $f(x^*)$ about $x$ for small values of $|x - x^*|$ is given by $f(x^*) \approx f(x) + f'(x)(x^* - x)$. Rearranging the terms (subtract $f(x)$, divide by $f(x)$, and multiply and divide the right side by $x$) immediately yields the result.
This leads us to the following definition.
DEFINITION 1.5 The condition number of a function $f(x)$ is
$$\kappa_f(x) = \left|\frac{x f'(x)}{f(x)}\right|.$$
The condition number describes how large the relative error in function evaluation is with respect to the relative error in the machine representation of $x$. In other words, $\kappa_f(x)$ is a measure of the degree of sensitivity of the function at $x$.
Example 1.18
Let $f(x) = \sqrt{x}$. The condition number of $f(x)$ about $x$ is
$$\kappa_f(x) = \left|\frac{x \cdot \frac{1}{2\sqrt{x}}}{\sqrt{x}}\right| = \frac{1}{2}.$$
This suggests that $f(x)$ is well-conditioned.
Example 1.19
Let $f(x) = \sqrt{x - 2}$. The condition number of $f(x)$ about $x$ is
$$\kappa_f(x) = \left|\frac{x}{2(x - 2)}\right|.$$
This is not defined at $x^* = 2$. Hence, the function $f(x)$ is numerically unstable and ill-conditioned for values of $x$ close to 2.
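Examples 1.18 and 1.19 can be checked numerically. The sketch below, in ordinary double precision Python, evaluates Definition 1.5 with a derivative supplied by the caller and compares the predicted error amplification from Theorem 1.7 with an actual perturbation. The function names, the sample points, and the perturbation size $10^{-8}$ are assumptions made for illustration.

```python
import math


def condition_number(f, df, x):
    """kappa_f(x) = |x f'(x) / f(x)|, with the derivative df supplied by the caller."""
    return abs(x * df(x) / f(x))


def sqrt_shifted(x):               # f(x) = sqrt(x - 2), as in Example 1.19
    return math.sqrt(x - 2.0)


def d_sqrt_shifted(x):             # f'(x) = 1 / (2 sqrt(x - 2))
    return 0.5 / math.sqrt(x - 2.0)


kappa_sqrt = condition_number(math.sqrt, lambda x: 0.5 / math.sqrt(x), 9.0)
kappa_near_2 = condition_number(sqrt_shifted, d_sqrt_shifted, 2.0001)

# Perturb x by a relative amount of 1e-8 and compare with the prediction kappa * 1e-8.
x = 2.0001
x_star = x * (1.0 + 1e-8)
observed = abs((sqrt_shifted(x) - sqrt_shifted(x_star)) / sqrt_shifted(x))
predicted = kappa_near_2 * 1e-8

print(kappa_sqrt, kappa_near_2)
print(observed, predicted)
```

For $\sqrt{x}$ the condition number is 1/2 regardless of $x$, while for $\sqrt{x - 2}$ near $x = 2$ a relative perturbation of $10^{-8}$ in $x$ is amplified by a factor of about $10^4$.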
REMARK 1.2 If $x = f(x) = 0$, then the condition number is simply $|f'(x)|$. If $x = 0$, $f(x) \ne 0$ (or $f(x) = 0$, $x \ne 0$), then it is more useful to consider the relation between absolute errors than relative errors. The condition number then becomes $|f'(x)/f(x)|$.
REMARK 1.3 Generally, if a numerical approximation $\tilde{z}$ to a quantity $z$ is computed, the relative error is related to the number of digits after the decimal point that are correct. For example, if $z = 0.0000123453$ and $\tilde{z} = 0.00001234543$, we say that $\tilde{z}$ is correct to 5 significant digits. Expressing $z$ as $0.123453 \times 10^{-4}$ and $\tilde{z}$ as $0.123454 \times 10^{-4}$, we see that if we round $\tilde{z}$ to the nearest number with five digits in its mantissa, all of those digits are correct, whereas, if we do the same with six digits, the sixth digit is not correct. Significant digits is the more logical way to talk about accuracy in a floating point computation where we are interested in relative error, rather than "number of digits after the decimal point," which can have a different meaning. (Here, one might say that $\tilde{z}$ is correct to 9 digits after the decimal point.)
1.2.2 Practicalities and the IEEE Floating Point Standard
Prior to 1985, different machines used different word lengths and different bases, and different machines rounded, chopped, or did something else to form the internal representation $fl(x)$ for real numbers $x$. For example, IBM mainframes generally used hexadecimal arithmetic ($\beta = 16$), with 8 hexadecimal digits total (for the base, sign, and exponent) in "single precision" numbers and 16 hexadecimal digits total in "double precision" numbers. Machines such as the Univac 1108 and Honeywell Multics systems used base $\beta = 2$ and 36 binary digits (or "bits") total in single precision numbers and 72 binary digits total in double precision numbers. An unusual machine designed at Moscow State University from 1955-1965, the "Setun," even used base-3 ($\beta = 3$, or "ternary") numbers. Some computers had 32 bits total in single precision numbers and 64 bits total in double precision numbers, while some "supercomputers" (such as the Cray-1) had 64 bits total in single precision numbers and 128 bits total in double precision numbers.
Some handheld calculators in existence today (such as some Texas Instruments calculators) can be viewed as implementing decimal (base 10, $\beta = 10$)
arithmetic, say, with $L = -999$ and $U = 999$, and $t = 14$ digits in the mantissa.
Except for the Setun (the value of whose ternary digits corresponded to "positive," "negative," and "neutral" in circuit elements or switches), digital computers are mostly based on binary switches or circuit elements (that is, "on" or "off"), so the base $\beta$ is usually 2 or a power of 2. For example, the IBM hexadecimal digit could be viewed as a group of 4 binary digits.²
Older floating point implementations did not even always fit exactly into the model we have previously described. For example, if $x$ is a number in the system, then $-x$ may not have been a number in the system, or, if $x$ were a number in the system, then $1/x$ may have been too large to be representable in the system.
To promote predictability, portability, reliability, and rigorous error bounding in floating point computations, the Institute of Electrical and Electronics Engineers (IEEE) and American National Standards Institute (ANSI) published a standard for binary floating point arithmetic in 1985: IEEE/ANSI 754-1985: Standard for Binary Floating Point Arithmetic, often referenced as "IEEE-754," or simply "the IEEE standard."³ Almost all computers in existence today, including personal computers and workstations based on Intel, AMD, Motorola, etc. chips, implement most of the IEEE standard.
In this standard, $\beta = 2$, 32 bits total are used in a single precision number (an "IEEE single"), and 64 bits total are used for a double precision number ("IEEE double"). In a single precision number, 1 bit is used for the sign, 8 bits are used for the exponent, and $t = 23$ bits are used for the mantissa. In double precision numbers, 1 bit is used for the sign, 11 bits are used for the exponent, and 52 bits are used for the mantissa. Thus, for single precision numbers, the exponent is between 0 and $(11111111)_2 = 255$, and 127 is subtracted from this, to get an exponent between $-127$ and $128$. In IEEE numbers, the minimum and maximum exponent are used to denote special symbols (such as infinity and "unnormalized" numbers), so the exponent in single precision represents magnitudes between $2^{-126} \approx 10^{-38}$ and $2^{127} \approx 10^{38}$. The mantissa for single precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{23} 2^{-i} = 2(1 - 2^{-24}) \approx 2$. Similarly, the exponent for double precision numbers is effectively between $2^{-1022} \approx 10^{-308}$ and $2^{1023} \approx 10^{308}$, while the mantissa for double precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{52} 2^{-i} \approx 2$.
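The bit layout just described can be inspected directly. The following Python sketch uses the standard `struct` module to split a number into its sign, biased exponent, and stored mantissa bits; the function names `decode_single` and `decode_double` are illustrative assumptions.

```python
import struct


def decode_single(x):
    """Split an IEEE single into (sign, biased exponent, 23 stored mantissa bits)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def decode_double(x):
    """Split an IEEE double into (sign, biased exponent, 52 stored mantissa bits)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)


def single_value(sign, biased_exponent, fraction):
    """Rebuild a normalized single: (-1)^sign * (1 + fraction / 2^23) * 2^(e - 127)."""
    return (-1.0) ** sign * (1.0 + fraction / 2.0 ** 23) * 2.0 ** (biased_exponent - 127)


print(decode_single(1.0))        # stored exponent field equals the bias
print(decode_single(0.15625))    # 0.15625 = 1.25 * 2**(-3)
print(decode_double(1.0))
```

Subtracting 127 (single) or 1023 (double) from the stored exponent field gives the actual power of 2, and the leading 1 of a normalized mantissa is implicit rather than stored.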
Summarizing, the parameters for IEEE arithmetic appear in Table 1.1.
TABLE 1.1: Parameters for IEEE arithmetic
precision   β    L        U       t
single      2    -126     127     24
double      2    -1022    1023    53
In many numerical computations, such as solving the large linear systems arising from partial differential equation models, more digits or a larger exponent range is required than is available with IEEE single precision. For this reason, many numerical analysts at present have adopted IEEE double precision as the default precision. For example, underlying computations in the popular computational environment matlab are done in IEEE double precision.
² An exception is in some systems for business calculations, where base 10 is implemented.
³ An update to the 1985 standard was made in 2008. This update gives clarifications of certain ambiguous points, provides certain extensions, and specifies a standard for decimal arithmetic.
IEEE arithmetic provides four ways of defining $fl(x)$, that is, four "rounding modes," namely, "round down," "round up," "round to nearest," and "round to zero," which are specified as follows: the four elementary operations $+$, $-$, $\times$, and $/$ must be such that $fl(x \odot y)$ is implemented for all four rounding modes, for $\odot \in \{-, +, \times, /, \sqrt{\cdot}\}$.
The default mode (if the rounding mode is not explicitly set) is normally "round to nearest," to give an approximation after a long string of computations that is hopefully near the exact value. If the mode is set to "round down" and a string of computations is done, then the result is less than or equal to the exact result. Similarly, if the mode is set to "round up," then the result of a string of computations is greater than or equal to the exact result. In this way, mathematically rigorous bounds on an exact result can be obtained. (This technique must be used astutely, since naive use could result in bounds that are too large to be meaningful.)
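The bounding technique can be illustrated with a sketch that simulates "round down" and "round up" in 4-digit decimal arithmetic through Python's `decimal` module, since Python does not expose the hardware rounding mode of binary floating point. The sum $\sum_{k=1}^{10} 1/k$ and the name `harmonic_sum` are assumptions chosen for illustration. All operands are positive, so rounding every operation down (up) yields a lower (upper) bound.

```python
from decimal import Context, Decimal, ROUND_CEILING, ROUND_FLOOR
from fractions import Fraction

DOWN = Context(prec=4, rounding=ROUND_FLOOR)     # "round down"
UP = Context(prec=4, rounding=ROUND_CEILING)     # "round up"


def harmonic_sum(n, ctx):
    """Compute 1/1 + 1/2 + ... + 1/n, rounding every operation with ctx."""
    total = Decimal(0)
    for k in range(1, n + 1):
        total = ctx.add(total, ctx.divide(Decimal(1), Decimal(k)))
    return total


lower = harmonic_sum(10, DOWN)
upper = harmonic_sum(10, UP)
exact = sum(Fraction(1, k) for k in range(1, 11))   # exact rational value

print(lower, float(exact), upper)
```

The exact value is guaranteed to lie between the two computed results, and the gap between them measures how much rounding error could have accumulated.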
Several parameters more directly related to numerical computations than $L$, $U$, and $t$ are associated with any floating point number system. These are
HUGE: the largest representable number in the floating point system,
TINY: the smallest positive representable number in the floating point system,
$\epsilon_m$: the machine epsilon, the smallest positive number which, when added to 1, gives something other than 1 when using the rounding mode "round to the nearest."
These so-called "machine constants" appear in Table 1.2 for the IEEE single and IEEE double precision number systems.
TABLE 1.2: Machine constants for IEEE arithmetic
Precision   HUGE                                            TINY                                           $\epsilon_m$
single      $\approx 2^{128} \approx 3.40 \cdot 10^{38}$     $2^{-126} \approx 1.18 \cdot 10^{-38}$          $2^{-24} + 2^{-47} \approx 5.96 \cdot 10^{-8}$
double      $\approx 2^{1024} \approx 1.79 \cdot 10^{308}$   $2^{-1022} \approx 2.23 \cdot 10^{-308}$        $2^{-53} + 2^{-105} \approx 1.11 \cdot 10^{-16}$
For IEEE arithmetic, 1/TINY < HUGE, but 1/HUGE < TINY. This brings up the question of what happens when the result of a computation has absolute value less than the smallest number representable in the system, or has absolute value greater than the largest number representable in the system. In the first case, an underflow occurs, while, in the second case, an overflow occurs. In floating point computations, it is usually (but not always) reasonable to replace the result of an underflow by 0, but it is usually more problematical when an overflow occurs. Many systems prior to the IEEE standard replaced an underflow by 0, but stopped when an overflow occurred.
The IEEE standard specifies representations for special numbers $\infty$, $-\infty$, $+0$, $-0$, and NaN, where the latter represents "not a number." The standard specifies that computations do not stop when an overflow or underflow occurs, or when quantities such as $\sqrt{-1}$, $1/0$, $-1/0$, etc. are encountered (although many programming languages by default or optionally do stop). For example, the result of an overflow is set to $\infty$, whereas the result of $\sqrt{-1}$ is set to NaN, and computation continues. The standard also specifies "gradual underflow," that is, setting the result to a "denormalized" number, or a number in the floating point format whose first digit in the mantissa is equal to 0. Computation rules for these special numbers, such as NaN $\times$ any number = NaN, $\infty \times$ any positive normalized number = $\infty$, allow such "nonstop" arithmetic.
Although the IEEE nonstop arithmetic is useful in many contexts, the numerical analyst should be aware of it and be cautious in interpreting results. In particular, algorithms may not behave as expected if many intermediate results contain $\infty$ or NaN, and the accuracy is less than expected when denormalized numbers are used. In fact, many programming languages, by default or with a controllable option, stop if $\infty$ or NaN occurs, but implement IEEE nonstop arithmetic with an option.
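Python is a convenient illustration of this mixed behavior: its floating point multiplication, addition, and subtraction follow the nonstop rules, while division by zero and `math.sqrt` of a negative number stop with an exception. The following sketch shows both; the variable names are illustrative.

```python
import math
import sys

inf, nan = math.inf, math.nan
big = sys.float_info.max           # HUGE for IEEE double precision
tiny = sys.float_info.min          # TINY, the smallest positive normalized double

overflowed = big * 2.0             # overflow: the result is set to infinity
invalid = inf - inf                # no meaningful value: the result is NaN
denormal = 1.0 / big               # below TINY but not 0: gradual underflow

try:
    math.sqrt(-1.0)
    sqrt_stopped = False
except ValueError:                 # Python stops here instead of returning NaN
    sqrt_stopped = True

try:
    1.0 / 0.0
    division_stopped = False
except ZeroDivisionError:          # Python stops here instead of returning infinity
    division_stopped = True

print(overflowed, invalid, nan * 3.0, inf * 2.0)
print(denormal, denormal < tiny)
print(sqrt_stopped, division_stopped)
```

The value `1.0 / big` is nonzero but smaller than TINY, which is the gradual underflow behind the relation 1/HUGE < TINY noted above.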
Example 1.20
IEEE double precision floating point arithmetic underlies most computations in matlab. (This is true even if only four decimal digits are displayed.) One obtains the machine epsilon with the function eps, one obtains TINY with the function realmin, and one obtains HUGE with the function realmax. Observe the following matlab dialog:
>> epsm = eps(1d0)
epsm = 2.2204e-016
>> TINY = realmin
TINY = 2.2251e-308
>> HUGE = realmax
HUGE = 1.7977e+308
>> 1/TINY
ans = 4.4942e+307
>> 1/HUGE
ans = 5.5627e-309
>> HUGE^2
ans = Inf
>> TINY^2
ans = 0
>> new_val = 1+epsm
new_val = 1.0000
>> new_val - 1
ans = 2.2204e-016
>> too_small = epsm/2
too_small = 1.1102e-016
>> not_new = 1+too_small
not_new = 1
>> not_new - 1
ans = 0
>>
Example 1.21
(Illustration of underflow and overflow) Suppose, for the purposes of illustration, we have a system with $\beta = 10$, $t = 2$, and one digit in the exponent, so that the positive numbers in the system range from $0.10 \times 10^{-9}$ to $0.99 \times 10^{9}$, and suppose we wish to compute $N = \sqrt{x_1^2 + x_2^2}$, where $x_1 = x_2 = 10^6$. Then both $x_1$ and $x_2$ are exactly represented in the system, and the nearest floating point number in the system to $N$ is $0.14 \times 10^7$, well within range. However, $x_1^2 = 10^{12}$, larger than the maximum floating point number in the system.
In older systems, an overflow usually would result in stopping the computation, while, in IEEE arithmetic, the result would be assigned the symbol "Infinity." The result of adding "Infinity" to "Infinity," then taking the square root, would be "Infinity," so that $N$ would be assigned "Infinity." Similarly, if $x_1 = x_2 = 10^{-6}$, then $x_1^2 = 10^{-12}$, smaller than the smallest representable machine number, causing an "underflow." On older systems, the result is usually set to 0. On IEEE systems, if "gradual underflow" is switched on, the result either becomes a denormalized number, with less than full accuracy, or is set to 0; without gradual underflow on IEEE systems, the result is set to 0. When the result is set to 0, a value of 0 is stored in $N$, whereas the closest floating point number in the system is $0.14 \times 10^{-5}$, well within range. To avoid this type of catastrophic underflow and overflow in the computation of $N$, we may use the following scheme.
1. $s \leftarrow \max\{|x_1|, |x_2|\}$.
2. $\eta_1 \leftarrow x_1/s$; $\eta_2 \leftarrow x_2/s$.
3. $N \leftarrow s\sqrt{\eta_1^2 + \eta_2^2}$.
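The three steps above can be sketched in double precision Python as follows. The inputs $10^{200}$ and $10^{-200}$ play the role that $10^{6}$ and $10^{-6}$ play in the toy system of the example; the function names and the guard for $s = 0$ are additions made for illustration.

```python
import math


def naive_norm(x1, x2):
    """sqrt(x1^2 + x2^2) computed directly; the squares may overflow or underflow."""
    return math.sqrt(x1 * x1 + x2 * x2)


def scaled_norm(x1, x2):
    """The same quantity, computed with the scaling scheme of Example 1.21."""
    s = max(abs(x1), abs(x2))
    if s == 0.0:
        return 0.0                      # both inputs are zero; avoid dividing by zero
    eta1, eta2 = x1 / s, x2 / s         # both scaled values are at most 1 in magnitude
    return s * math.sqrt(eta1 * eta1 + eta2 * eta2)


print(naive_norm(1e200, 1e200), scaled_norm(1e200, 1e200))       # overflow avoided
print(naive_norm(1e-200, 1e-200), scaled_norm(1e-200, 1e-200))   # underflow avoided
```

Python's standard library function `math.hypot` computes the same quantity and can serve as a cross-check.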
1.2.2.1 Input and Output
For examining the output of large numerical computations arising from mathematical models, plots, graphs, and movies composed of such plots and graphs are often preferred over tables of values. However, to develop such models and study numerical algorithms, it is necessary to examine individual numbers. Because humans are trained to comprehend decimal numbers more easily than binary numbers, the binary format used in the machine is usually converted to a decimal format for display or printing. In many programming languages and environments (such as all versions of Fortran, C, C++, and in matlab), the format is of a form similar to $\pm d_1.d_2d_3\ldots d_m\mathrm{e}{\pm}\delta_1\delta_2\delta_3$, or $\pm d_1.d_2d_3\ldots d_m\mathrm{E}{\pm}\delta_1\delta_2\delta_3$, where the "e" or "E" denotes the "exponent" of 10. For example, -1.00e+003 denotes $-1 \times 10^3 = -1000$. Numbers are usually also input either in a standard decimal form (such as 0.001), or in this exponential format (such as 1.0e-3). (This notation originates from the earliest computers, where the only output was a printer, and the printer could only print numerical digits and the 26 upper case letters in the Roman alphabet.)
Thus, for input, a decimal fraction needs to be converted to a binary floating point number, while, for output, a binary floating point number needs to be converted to a decimal fraction. This conversion necessarily is inexact. For example, the exact decimal fraction 0.1 converts to the infinitely repeating binary expansion $(0.0\overline{0011})_2$, which needs to be rounded into the binary floating point system. The IEEE 754 standard specifies that the result of a decimal to binary conversion, within a specified range of input formats, be the nearest floating point number to the exact result, over a specified range, and that, within a specified range of formats, a binary to decimal conversion be the nearest number in the specified format (which depends on the number $m$ of decimal digits requested to be printed).
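The inexact conversion of 0.1 can be observed directly. The following Python sketch prints the exact value of the double nearest to 0.1, its hexadecimal form, and the consequence for a simple sum; the variable names are illustrative.

```python
from decimal import Decimal
from fractions import Fraction

stored = Decimal(0.1)             # exact decimal expansion of the double nearest to 0.1
as_ratio = Fraction(0.1)          # the same value as an exact ratio of integers

print(stored)                     # not exactly 0.1
print(as_ratio)                   # the denominator is a power of 2
print((0.1).hex())                # the repeating pattern 1001... appears as hex digits 9
print(f"{0.1:.20f}")              # 20 decimal digits of the stored value
print(0.1 + 0.2 == 0.3)           # the conversion errors do not cancel
```

The repeating hexadecimal digit 9 is $(1001)_2$, the same repeating block as in $(0.0\overline{0011})_2$, and the final hexadecimal digit shows where the expansion was rounded.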
Thus, the number that one sees as output is usually not exactly the number that is represented in the computer. Furthermore, while the floating point operations on binary numbers are usually implemented in hardware or "firmware" independently of the software system, the decimal to binary and binary to decimal conversions are usually implemented separately as part of the programming language (such as Fortran, C, C++, Java, etc.) or software system (such as matlab). The individual standards for these languages, if there are any, may not specify accuracy for such conversions, and the languages sometimes do not conform to the IEEE standard. That is, the number that one sees printed may not even be the closest number in that format to the actual number.
This inexactness in conversion usually does not cause a problem, but may cause much confusion in certain instances. In those instances (such as in "debugging," or finding programming blunders), one may need to examine the binary numbers directly. One way of doing this is in an "octal," or base-8 format, in which each digit (between 0 and 7) is interpreted as a group of
three binary digits, or in hexadecimal format (where the digits are 0-9, A, B, C, D, E, F), in which each digit corresponds to a group of four binary digits.
1.2.2.2 Standard Functions
To enable accurate computation of elementary functions such as sin, cos, and exp, IEEE 754 specifies that a "long" 80-bit register (with "guard digits") be available for intermediate computations. Furthermore, IEEE 754-2008, an official update to IEEE 754-1985, provides a list of functions it recommends be implemented, and specifies accuracy requirements (in terms of correct rounding) for those functions a programming language elects to implement.
REMARK 1.4 Alternative number systems, such as variable precision arithmetic, multiple precision arithmetic, rational arithmetic, and combinations of approximate and symbolic arithmetic have been investigated and implemented. These have various advantages over the traditional floating point arithmetic we have been discussing, but also have disadvantages, and usually require more time, more circuitry, or both. Eventually, with the advance of computer hardware and better understanding of these alternative systems, their use may become more ubiquitous. However, for the foreseeable future, traditional floating point number systems will be the primary tool in numerical computations.
1.3 Interval Computations
Interval computations are useful for two main purposes:
• to use floating point computations to compute mathematically rigorous bounds on an exact result (and hence to rigorously bound roundoff error);
• to use floating point computations to compute mathematically rigorous bounds on the ranges of functions over boxes.
In complicated traditional floating point algorithms, naive arrangement of interval computations usually gives bounds that are too wide to be of practical use. For this reason, interval computations have been ignored by many. However, used cleverly and where appropriate, interval computations are powerful, and provide rigor and validation when other techniques cannot.
Interval computations are based on interval arithmetic.
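1.3.1 Interval Arithmetic
In interval arithmetic, we define operations on intervals, which can be considered as ordered pairs of real numbers. We can think of each interval as representing the range of possible values of a quantity. The result of an operation is then an interval that represents the range of possible results of the operation as the first argument ranges over all points in the first interval and the second argument ranges over all values in the second interval. To state this symbolically, let $\boldsymbol{x} = [\underline{x}, \overline{x}]$ and $\boldsymbol{y} = [\underline{y}, \overline{y}]$, and define the four elementary operations by
$$\boldsymbol{x} \odot \boldsymbol{y} = \{x \odot y \mid x \in \boldsymbol{x} \text{ and } y \in \boldsymbol{y}\} \quad \text{for } \odot \in \{+, -, \times, \div\}. \tag{1.1}$$
Interval arithmetic's usefulness derives from the fact that the mathematical characterization in Equation (1.1) is equivalent to the following operational definitions:
$$\begin{aligned}
\boldsymbol{x} + \boldsymbol{y} &= [\underline{x} + \underline{y},\ \overline{x} + \overline{y}],\\
\boldsymbol{x} - \boldsymbol{y} &= [\underline{x} - \overline{y},\ \overline{x} - \underline{y}],\\
\boldsymbol{x} \times \boldsymbol{y} &= [\min\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\},\ \max\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\}],\\
\frac{1}{\boldsymbol{x}} &= \left[\frac{1}{\overline{x}},\ \frac{1}{\underline{x}}\right] \quad \text{if } \underline{x} > 0 \text{ or } \overline{x} < 0,\\
\boldsymbol{x} \div \boldsymbol{y} &= \boldsymbol{x} \times \frac{1}{\boldsymbol{y}}.
\end{aligned} \tag{1.2}$$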
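The operational definitions (1.2) translate almost line by line into code. The following Python sketch uses exact rational endpoints (`fractions.Fraction`), so no rounding is involved yet; the class name `Interval` and the helper `make` are illustrative assumptions.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Interval:
    lo: Fraction
    hi: Fraction

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

    def reciprocal(self):
        if not (self.lo > 0 or self.hi < 0):
            raise ZeroDivisionError("interval contains 0")
        return Interval(1 / self.hi, 1 / self.lo)

    def __truediv__(self, other):
        return self * other.reciprocal()


def make(lo, hi):
    """Build an interval with exact rational endpoints."""
    return Interval(Fraction(lo), Fraction(hi))


print(make(1, 2) + make(3, 5))
print(make(1, 2) - make(3, 5))
print(make(-1, 3) * make(-3, 1))
print(make(1, 2) / make(4, 5))
```

Because the endpoints are exact rationals, each result here is exactly the range in (1.1); with floating point endpoints, each lower bound would additionally have to be rounded down and each upper bound rounded up.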
The ranges of the four elementary interval arithmetic operations are exactly the ranges of the corresponding real operations, but, if such operations are composed, bounds on the ranges of real functions can be obtained. For example, if
$$f(x) = (x + 1)(x - 1), \tag{1.3}$$
then
$$f([-2, 2]) = \big([-2, 2] + 1\big)\big([-2, 2] - 1\big) = [-1, 3][-3, 1] = [-9, 3],$$
which contains the exact range $[-1, 3]$.
REMARK 1.5 In some definitions of interval arithmetic, division by intervals containing 0 is defined, consistent with (1.1). For example,
$$\frac{[1, 2]}{[-3, 4]} = \left[-\infty, -\frac{1}{3}\right] \cup \left[\frac{1}{4}, \infty\right] = \mathbb{R}^* \setminus \left(-\frac{1}{3}, \frac{1}{4}\right),$$
where $\mathbb{R}^*$ is the extended real number system,⁴ consisting of the real numbers with the two additional numbers $-\infty$ and $\infty$. This extended interval arithmetic⁵ was originally invented by William Kahan⁶ for computations with continued fractions, but has wider use than that. Although a closed system can be defined for the sets arising from this extended arithmetic, typically, the complements of intervals (i.e., the unions of two semi-infinite intervals) are immediately intersected with intervals, to obtain zero, one, or two intervals. Interval arithmetic can then proceed using (1.2).
⁴ also known as the two-point compactification of the real numbers
The power of interval arithmetic lies in its implementation on computers. In particular, outwardly rounded interval arithmetic allows rigorous enclosures for the ranges of operations and functions. This makes a qualitative difference in scientific computations, since the results are now intervals in which the exact result must lie. It also enables use of floating point computations for automated theorem proving.
Outward rounding can be implemented on any machine that has downward rounding and upward rounding, such as any machine that complies with the IEEE 754 standard. For example, take $\boldsymbol{x} + \boldsymbol{y} = [\underline{x} + \underline{y}, \overline{x} + \overline{y}]$. If $\underline{x} + \underline{y}$ is computed with downward rounding, and $\overline{x} + \overline{y}$ is computed with upward rounding, then the resulting interval $\boldsymbol{z} = [\underline{z}, \overline{z}]$ that is represented in the machine must contain the exact range of $x + y$ for $x \in \boldsymbol{x}$ and $y \in \boldsymbol{y}$. We call the expansion of the interval from rounding the lower end point down and the upper end point up roundout error.
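Interval arithmetic is only subdistributive. That is, if $\boldsymbol{x}$, $\boldsymbol{y}$, and $\boldsymbol{z}$ are intervals, then
$$\boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) \subseteq \boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z}, \quad \text{but} \quad \boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) \ne \boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z} \text{ in general.} \tag{1.4}$$
As a result, algebraic expressions that would be equivalent if real values are substituted for the variables are not equivalent if interval values are used. For example, instead of writing $(x - 1)(x + 1)$ for $f(x)$ in (1.3), suppose we write
$$f(x) = x^2 - 1, \tag{1.5}$$
and suppose we provide a routine that computes an enclosure for the range of $x^2$ that is the exact range to within roundoff error. Such a routine could be as follows: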
ALGORITHM 1.1
(Computing an interval whose end points are machine numbers and which encloses the range of $x^2$.)
INPUT: $\boldsymbol{x} = [\underline{x}, \overline{x}]$.
OUTPUT: a machine-representable interval that contains the range of $x^2$ over $\boldsymbol{x}$.
IF $\underline{x} \ge 0$ THEN
RETURN $[\underline{x}^2, \overline{x}^2]$, where $\underline{x}^2$ is computed with downward rounding and $\overline{x}^2$ is computed with upward rounding.
ELSE IF $\overline{x} \le 0$ THEN
RETURN $[\overline{x}^2, \underline{x}^2]$, where $\overline{x}^2$ is computed with downward rounding and $\underline{x}^2$ is computed with upward rounding.
ELSE
1. Compute $\underline{x}^2$ and $\overline{x}^2$ with both downward and upward rounding; that is, compute $\underline{x}^2_l$ and $\underline{x}^2_u$ such that $\underline{x}^2_l$ and $\underline{x}^2_u$ are machine representable numbers and $\underline{x}^2 \in [\underline{x}^2_l, \underline{x}^2_u]$, and compute $\overline{x}^2_l$ and $\overline{x}^2_u$ such that $\overline{x}^2_l$ and $\overline{x}^2_u$ are machine representable numbers and $\overline{x}^2 \in [\overline{x}^2_l, \overline{x}^2_u]$.
2. RETURN $[0, \max\{\underline{x}^2_u, \overline{x}^2_u\}]$.
END IF
END ALGORITHM 1.1
⁵ There are small differences in current definitions of extended interval arithmetic. For example, in some systems, $-\infty$ and $\infty$ are not considered numbers, but just descriptive symbols. In those systems, $[1, 2]/[-3, 4] = (-\infty, -1/3] \cup [1/4, \infty) = \mathbb{R} \setminus (-1/3, 1/4)$. See [31] for a theoretical analysis of extended arithmetic.
⁶ who also was a major contributor to the IEEE 754 standard
With Algorithm 1.1 and rewriting $f(x)$ from (1.3) as in (1.5), we obtain
$$f([-2, 2]) = [-2, 2]^2 - 1 = [0, 4] - 1 = [-1, 3],$$
which, in this case, is equal to the exact range of $f$ over $[-2, 2]$.
In fact, this illustrates a general principle: If each variable in the expression occurs only once, then interval arithmetic gives the exact range, to within roundout error. We state this formally as
THEOREM 1.8
(Fundamental theorem of interval arithmetic.) Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with interval arithmetic. The algebraic expression or computer program can contain the four elementary operations and operations such as $x^n$, $\sin(x)$, $\exp(x)$, and $\log(x)$, etc., as long as the interval values of these functions contain their range over the input intervals. Then
1. The interval value $f(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ contains the range of $f$ over the interval vector (or box) $(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$.
2. If the single functions (the elementary operations and functions $x^n$, etc.) have interval values that represent their exact ranges, and if each variable $x_i$, $1 \le i \le n$, occurs only once in the expression for $f$, then the values of $f$ obtained by interval arithmetic represent the exact ranges of $f$ over the input intervals.
If the expression for $f$ contains one or more variables more than once, then overestimation of the range can occur due to interval dependency. For example, when we evaluate our example function $f([-2, 2])$ according to (1.3), the first factor, $[-1, 3]$, is the exact range of $x + 1$ for $x \in [-2, 2]$, while the second factor, $[-3, 1]$, is the exact range of $x - 1$ for $x \in [-2, 2]$. Thus, $[-9, 3]$ is the exact range of $\tilde{f}(x_1, x_2) = (x_1 + 1)(x_2 - 1)$ for $x_1$ and $x_2$ independent, $x_1 \in [-2, 2]$, $x_2 \in [-2, 2]$.
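Algorithm 1.1 and the two evaluations of $f$ discussed above can be sketched as follows. Outward rounding is simulated with 4-digit decimal arithmetic through Python's `decimal` module (rounding toward $-\infty$ for lower end points and toward $+\infty$ for upper end points); intervals are plain `(lo, hi)` tuples, and all function names are illustrative assumptions.

```python
from decimal import Context, Decimal, ROUND_CEILING, ROUND_FLOOR

DOWN = Context(prec=4, rounding=ROUND_FLOOR)     # downward rounding, 4 decimal digits
UP = Context(prec=4, rounding=ROUND_CEILING)     # upward rounding, 4 decimal digits


def interval(lo, hi):
    return (Decimal(lo), Decimal(hi))


def add(x, y):
    return (DOWN.add(x[0], y[0]), UP.add(x[1], y[1]))


def sub(x, y):
    return (DOWN.subtract(x[0], y[1]), UP.subtract(x[1], y[0]))


def mul(x, y):
    lows = [DOWN.multiply(a, b) for a in x for b in y]
    highs = [UP.multiply(a, b) for a in x for b in y]
    return (min(lows), max(highs))


def square(x):
    """Algorithm 1.1: an enclosure of the range of x**2 over the interval x."""
    lo, hi = x
    if lo >= 0:
        return (DOWN.multiply(lo, lo), UP.multiply(hi, hi))
    if hi <= 0:
        return (DOWN.multiply(hi, hi), UP.multiply(lo, lo))
    return (Decimal(0), max(UP.multiply(lo, lo), UP.multiply(hi, hi)))


ONE = interval(1, 1)


def f_product(x):          # (x + 1)(x - 1) as in (1.3): x occurs twice
    return mul(add(x, ONE), sub(x, ONE))


def f_square(x):           # x**2 - 1 as in (1.5): x occurs once
    return sub(square(x), ONE)


x = interval(-2, 2)
print(f_product(x))        # overestimates the range because of interval dependency
print(f_square(x))         # the exact range
print(square(interval("1.234", "1.235")))   # end points rounded outward
```

The last call shows roundout error: the exact squares 1.522756 and 1.525225 are not 4-digit numbers, so the lower end point is rounded down and the upper end point is rounded up.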
We now present some definitions and theorems to clarify the practical consequences of interval dependency.
DEFINITION 1.6 An expression for $f(x_1, \ldots, x_n)$ which is written so that each variable occurs only once is called a single use expression, or SUE.
Fortunately, we do not need to transform every expression into a single use expression for interval computations to be of value. In particular, the interval dependency becomes less as the widths of the input intervals become smaller. The following formal definition will help us to describe this precisely.
DEFINITION 1.7 Suppose an interval evaluation $f(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ gives $[a, b]$ as a result interval, but the exact range $\{f(x_1, \ldots, x_n),\ x_i \in \boldsymbol{x}_i,\ 1 \le i \le n\}$ is $[c, d] \subseteq [a, b]$. We define the excess width $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ in the interval evaluation $f(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ by $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = (c - a) + (b - d)$.
For example, the excess width in evaluating $f(x)$ represented as $(x + 1)(x - 1)$ over $\boldsymbol{x} = [-2, 2]$ is $(-1 - (-9)) + (3 - 3) = 8$. In general, we have
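THEOREM 1.9
Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with interval arithmetic, as in Theorem 1.8, to obtain an interval enclosure $f(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ to the range of $f$ for $x_i \in \boldsymbol{x}_i$, $1 \le i \le n$. Then, if $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ is as in Definition 1.7, we have
$$E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = O\left(\max_{1 \le i \le n} w(\boldsymbol{x}_i)\right),$$
where $w(\boldsymbol{x})$ denotes the width of the interval $\boldsymbol{x}$.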
That is, the overestimation becomes less as the uncertainty in the arguments to the function becomes smaller.
Interval evaluations as in Theorem 1.9 are termed first-order interval extensions. It is not difficult to obtain second-order extensions, where required. (See Exercise below.)
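The first-order behavior in Theorem 1.9 can be observed on the running example. The sketch below evaluates $(x + 1)(x - 1)$ over the intervals $[-h, h]$ with exact rational arithmetic and computes the excess width of Definition 1.7; the function names and the choice of intervals centered at 0 are assumptions made for illustration.

```python
from fractions import Fraction


def mul(x, y):
    products = [a * b for a in x for b in y]
    return (min(products), max(products))


def natural_extension(x):
    """Interval value of (x + 1)(x - 1); the two occurrences of x are treated independently."""
    return mul((x[0] + 1, x[1] + 1), (x[0] - 1, x[1] - 1))


def exact_range(x):
    """Exact range of x**2 - 1 over the interval x = (lo, hi)."""
    lo, hi = x
    top = max(lo * lo, hi * hi) - 1
    bottom = -1 if lo <= 0 <= hi else min(lo * lo, hi * hi) - 1
    return (bottom, top)


def excess_width(x):
    a, b = natural_extension(x)
    c, d = exact_range(x)
    return (c - a) + (b - d)


for h in [Fraction(2), Fraction(1, 2), Fraction(1, 4), Fraction(1, 8)]:
    x = (-h, h)
    print(float(2 * h), float(excess_width(x)))   # interval width, excess width
```

For $h < 1$, a short calculation gives an excess width of $4h - h^2$, which is at most twice the interval width $2h$ and shrinks in proportion to it.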
1.3.2 Application of Interval Arithmetic: Examples
We give one such example here.
Example 1.22
Using 4-digit decimal floating point arithmetic, compute an interval enclosure for the first two digits of $e$, and prove that these two digits are correct.
Solution: The fifth degree Taylor polynomial representation for $e$ is
$$e = 1 + 1 + \frac{1}{2!} + \frac{1}{3!} + \frac{1}{4!} + \frac{1}{5!} + \frac{1}{6!}e^{\xi},$$
for some $\xi \in [0, 1]$. If we assume we know $e < 3$ and we assume we know $e^x$ is an increasing function of $x$, then the error term is bounded by
$$\left|\frac{1}{6!}e^{\xi}\right| \le \frac{3}{6!} < 0.005,$$
so this fifth-degree polynomial representation should be adequate. We will evaluate each term with interval arithmetic, and we will replace $e^{\xi}$ with $[1, 3]$. We obtain the following computation:
[1.000, 1.000] + [1.000, 1.000] → [2.000, 2.000]
[1.000, 1.000] / [2.000, 2.000] → [0.5000, 0.5000]
[2.000, 2.000] + [0.5000, 0.5000] → [2.500, 2.500]
[1.000, 1.000] / [6.000, 6.000] → [0.1666, 0.1667]
[2.500, 2.500] + [0.1666, 0.1667] → [2.666, 2.667]
[1.000, 1.000] / [24.00, 24.00] → [0.04166, 0.04167]
[2.666, 2.667] + [0.04166, 0.04167] → [2.707, 2.709]
[1.000, 1.000] / [120.0, 120.0] → [0.008333, 0.008334]
[2.707, 2.709] + [0.008333, 0.008334] → [2.715, 2.718]
[1.000, 1.000] / [720.0, 720.0] → [0.001388, 0.001389]
[0.001388, 0.001389] × [1, 3] → [0.001388, 0.004167]
[2.715, 2.718] + [0.001388, 0.004167] → [2.716, 2.723]

Since we used outward rounding in these computations, this constitutes a mathematical proof that $e \in [2.716, 2.723]$.

Note:
1. These computations can be done automatically on a computer, as simply as evaluating the function in floating point arithmetic. We will explain some programming techniques for this in Chapter 6, Section 6.2.

2. The solution is illustrative. More sophisticated methods, such as argument reduction, would be used in practice to bound values of $e^x$ more accurately and with fewer operations.
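As a rough check of the hand computation in Example 1.22, the outward-rounded 4-digit arithmetic can be emulated with Python's standard `decimal` module. This is only a sketch under our own assumptions: two contexts with 4 significant digits, one rounding toward $-\infty$ for lower endpoints and one toward $+\infty$ for upper endpoints, and helper names (`iadd`, `irecip`, `imul_pos`) that are ours rather than part of any interval package.

```python
from decimal import Decimal, Context, ROUND_FLOOR, ROUND_CEILING

DOWN = Context(prec=4, rounding=ROUND_FLOOR)    # lower endpoints
UP = Context(prec=4, rounding=ROUND_CEILING)    # upper endpoints

def iadd(x, y):
    """Outward-rounded interval sum."""
    return (DOWN.add(x[0], y[0]), UP.add(x[1], y[1]))

def irecip(n):
    """Outward-rounded enclosure of 1/n for a positive integer n."""
    d = Decimal(n)
    return (DOWN.divide(Decimal(1), d), UP.divide(Decimal(1), d))

def imul_pos(x, y):
    """Outward-rounded product of two intervals with positive endpoints."""
    return (DOWN.multiply(x[0], y[0]), UP.multiply(x[1], y[1]))

# Accumulate 1 + 1 + 1/2! + 1/3! + 1/4! + 1/5! term by term.
total = (Decimal(1), Decimal(1))
factorial = 1
for k in range(1, 6):
    factorial *= k
    total = iadd(total, irecip(factorial))

# Remainder term (1/6!) * e^xi, with e^xi replaced by [1, 3].
remainder = imul_pos(irecip(720), (Decimal(1), Decimal(3)))
enclosure = iadd(total, remainder)
print(enclosure)
```

Under these assumptions the sketch follows the same sequence of operations as the table above and arrives at the enclosure [2.716, 2.723].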
Proofs of the theorems, as well as greater detail, appear in various texts on interval arithmetic. A good book on interval arithmetic is R. E. Moore's classic text [27], although numerous more recent monographs and reviews are available. A World Wide Web search on the term "interval computations" will lead to some of these.

A general introduction to interval computations is [26]. That work gives not only a complete introduction, with numerous examples and explanation of pitfalls, but also provides examples with intlab, a free matlab toolbox for interval computations, and reference material for intlab. If you have matlab available, we recommend intlab for the exercises in this book involving interval computations.
1.4 Programming Environments

Modern scientific computing (with floating point numbers) is usually done with high-level "imperative" (as opposed to "functional") programming languages. Common programming environments in use for general scientific computing today are Fortran (or FORmula TRANslation), C/C++, and matlab. Fortran is the original such language, with its origins in the late 1950's. There is a large body of high-quality publicly available software in Fortran for common computations in numerical analysis and scientific computing. Such software can be found, for example, on NETLIB at

http://www.netlib.org/

Fortran has evolved over the years, becoming a modern, multi-faceted language with the Fortran 2003 standard. Throughout, the emphasis by both the standardization committee and suppliers of compilers for Fortran has been features that simplify programming of solutions to large problems in numerical analysis and scientific computing, and features that enable high performance, especially on computers that can process vectors and matrices efficiently.

The "C" language, originally developed in conjunction with the Unix operating system, was originally meant to be a higher-level language for designing and accessing the operating system, but has become more ubiquitous since then. C++, appearing in the late 1980's, was the first widely available language⁷ to allow the object-oriented programming paradigm. In recent years, computer science departments have favored teaching C++ over teaching Fortran, and Fortran has fallen out of favor in relative terms. However, Fortran is still favored in certain large-scale applications such as fluid dynamics (e.g. in weather prediction and similar simulations), and some courses are still offered in it in engineering schools. Nonetheless, some people still think of Fortran as the now somewhat rudimentary language known as FORTRAN 77.
Reasonably high-quality compilers for both Fortran and C/C++ are available free of charge with Linux operating systems. Fortran 2003, largely implemented in these compilers, has a standardized interface to C, so functions written in C can be called from Fortran programs, and vice versa. These compilers include interactive graphical-user-interface-oriented debuggers, such as "insight," available with the Linux operating system. Commercially available compilation and debugging systems are also available under Windows.
The matlab system has become increasingly popular over the last two decades or so. matlab (or MATrix LABoratory) began in the early 1980's as a National Science Foundation project, written by Cleve Moler in FORTRAN 66, to provide an interactive environment for computing with matrices and vectors, but has since evolved to be both an interactive environment and a full-featured programming language. matlab is highly favored in courses such as this because of the ease of programming, debugging, and general use (such as graphing), and because of the numerous toolboxes, supplied by both Mathworks (Cleve Moler's company) and others, for many basic computing tasks and applications. The main drawback to use of matlab in all scientific computations is that the language is interpretive, that is, matlab translates each line of a program to machine language each time that it executes the line. This makes complicated programs that involve nested iterations much slower (a factor of 60 or more) than comparable programs written in a compiled language such as C or Fortran. However, functions compiled from Fortran or C/C++ can be called from matlab. A common strategy has been to initially develop algorithms in matlab, then translate all or part of the program to a compilable language, as necessary for efficiency.

One perceived disadvantage of matlab is that it is proprietary. Undesirable possible consequences are that it is not free, and there is no official guarantee that it will be available forever, unchanged. However, its use has become so widespread in recent years that these concerns do not seem to be major. Several projects, including "Octave" and "Scilab," have produced free products that partially support the matlab programming language. The most widely distributed of these, "Octave," is integrated into Linux systems. However, the object-oriented features of Octave are rudimentary compared to those of matlab, and some toolboxes, such as intlab (which we will mention later), will not function with Octave.
⁷ With others, including Fortran, to follow.
Alternative systems sometimes used for scientific computing are computer algebra systems. Perhaps the most common of these are Mathematica and Maple, while a free such system under development is "SAGE." These systems admit a different way of thinking about programming, termed functional programming, in which rules are defined and available at all times to automatically simplify expressions that are presented to the system. (In contrast, in imperative programming, a sequence of commands is executed one after the other.) Although these systems have become comprehensive, they are based in computations of a different character, rather than in the floating point computations and linear algebra involved in numerical analysis and scientific computing.

We will use matlab in this book to illustrate the concepts, techniques, and applications. With newer versions of matlab, a student can study how to use the system and make programs largely by using the matlab help system. The first place to turn will be the "Getting started" demos, which in newer versions are presented as videos. There are also many books devoted to use of matlab. Furthermore, we will be giving examples throughout this book.

matlab programs can be written as matlab scripts and matlab functions.
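Example 1.23
The matlab script we used to produce the table following Example 1.7 (on page 8) is

a = 2;
x=2;
xold=x;
err_old = 1;
for k=0:10
  k
  x
  err = x - sqrt(2);        % error in the current iterate
  err
  ratio = err/err_old^2     % ratio of successive errors (quadratic test)
  err_old = err;
  x = x/2 + 1/x;            % next iterate
end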
Example 1.24
The matlab script we used to produce the table in Example 1.8 (on page 9) is

format long
a = 2;
x=2;
xold=x;
err_old = 1;
for k=0:25
  k
  x
  err = x - sqrt(2);        % error in the current iterate
  err
  ratio = err/err_old       % ratio of successive errors (linear test)
  err_old = err;
  x = x - x^2/3.5 + 2/3.5;  % next iterate
end
An excellent alternative text book that focuses on matlab functions is Cleve Moler's Numerical Computing with Matlab [25]. An on-line version, along with ".m" files, etc., is currently available at http://www.mathworks.com/moler/chapters.html.
1.5 Applications

The purpose of the methods and techniques in this book ultimately is to provide both accurate predictions and insight into practical problems. This includes understanding and predicting, and managing or controlling the evolution of ecological systems and epidemics; designing and constructing durable but inexpensive bridges, buildings, roads, and water control structures; understanding chemical and physical processes; designing chemical plants and electronic components and systems; minimizing costs or maximizing delivery of products or services within companies and governments, etc. To achieve these goals, the numerical methods are a small part of the overall modeling process, which can be viewed as consisting of the following steps.
Identify the problem: This is the first step in translating an often vague situation into a mathematical problem to be solved. What questions must be answered, and how can they be quantified?

Assumptions: Which factors are to be ignored and which are important? The real world is usually significantly more complicated than mathematical models of it, and simplifications must be made, because some factors are poorly understood, because there isn't enough data to determine some minor factors, or because it is not practical to accurately solve the resulting equations unless the model is simplified. For example, the theory of relativity and variations in the acceleration of gravity, due to the fact that the earth is not exactly round and due to the fact that the density differs from point to point on the surface of the earth, in principle will affect the trajectory of a baseball as it leaves the bat. However, such effects can be ignored when we write down a model of the trajectory of the baseball. On the other hand, we need to include such effects if we are measuring the change in distance between two satellites in a tandem orbit to detect the location of mineral deposits on the surface of the earth.

Construction: In this step, we actually translate the problem into mathematical language.

Analysis: We solve the mathematical problem. Here is where the numerical techniques in this book come into play. With more complicated models, there is an interplay between the previous three steps and this solution process: We may need to simplify the process to enable practical solution. Also, presentation of the result is important here, to maximize the usefulness of the results. In the early days of scientific computing, printouts of numbers were used, but, increasingly, results are presented as two- and three-dimensional graphs and movies.

Interpretation: The numerical solution is compared to the original problem. If it does not make sense, go back and reformulate the assumptions.

Validation: Compare the model to real data. For example, in climate models, the model might be used to predict climate changes in past years, before it is used to predict future climate changes.
Note that there is an intrinsic error introduced in the modeling process (such as when certain phenomena are ignored) that is outside the scope of our study of numerical methods. Such error can only be measured indirectly, through the interpretation and validation steps. In the model solution process (the "analysis" step), errors are also introduced, due to roundoff error and the approximation process. We have seen that such error consists of approximation error and roundoff error. In a study of numerical methods and numerical analysis, we quantify and find bounds on such errors. Although this may not be the major source of error, it is important to know. Consequences of this type of error might be that a good model is rejected, that incorrect conclusions are deduced about the process being modeled, etc. The authors of this book have personal experience with these events.
Errors in the modeling process can sometimes be quantified in the solution process. If the model depends on parameters that are not known precisely, but bounds on those parameters are known, knowledge of these bounds can sometimes be incorporated into the mathematical equations, and the set of possible solutions can sometimes be computed or bounded. One tool that sometimes works is interval arithmetic. Other tools, less mathematically definite but applicable in different situations, are statistical methods and computing solutions to the model for many different values of the parameter.

Throughout this book, we introduce applications from many areas.
Example 1.25
The formula for the net capacitance when two capacitors of values $x$ and $y$ are connected in series is

$$z = \frac{xy}{x + y}.$$

Suppose the measured values of $x$ and $y$ are $x = 1$ and $y = 2$, respectively. Estimate the range of possible values of $z$, given that the true values of $x$ and $y$ are known to be within ±10% of the measured value.

In this example, the identification, assumptions, and construction have already been done. (It is well known how capacitances in a linear electrical circuit behave.) We are asked to analyze the error in the output of the computation, due to errors in the data. We may proceed using interval arithmetic, relying on the accuracy assumptions for the measured values. In particular, these assumptions imply that $x \in [0.9, 1.1]$ and $y \in [1.8, 2.2]$. We will plug these intervals into the expression for $z$, but we first use Theorem 1.8, part (2) as a guide to rewrite the expression for $z$ so $x$ and $y$ only occur once. (We do this so we obtain sharp bounds on the range, without overestimation.) Dividing the numerator and denominator for $z$ by $xy$, we obtain

$$z = \frac{1}{\frac{1}{x} + \frac{1}{y}}.$$
We use the intlab toolbox⁸ for matlab to evaluate $z$. We have the following dialog in matlab's command window.

>> intvalinit('DisplayInfsup')
===> Default display of intervals by infimum/supremum
>> x = intval('[0.9,1.1]')
intval x =
[ 0.8999, 1.1001]
>> y = intval('[1.8,2.2]')
intval y =
[ 1.7999, 2.2001]
>> z = 1/(1/x + 1/y)
intval z =
[ 0.5999, 0.7334]
>> format long
>> z
intval z =
[ 0.59999999999999, 0.73333333333334]
>>

Thus, the capacitance must lie between 0.5999 and 0.7334.

Note that $x$ and $y$ are input as strings. This is to assure that roundoff errors in converting the decimal expressions 0.9, 1.1, 1.8, and 2.2 into internal binary format are taken into account. See [26] for more examples of the use of intlab.

⁸ If one has matlab, intlab is available free of charge for non-commercial use from http://www.ti3.tu-harburg.de/~rump/intlab/
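To see why rewriting $z$ as a single use expression matters, the two forms can be compared with exact rational interval arithmetic. The sketch below is in Python rather than intlab and ignores roundoff entirely (endpoints are `Fraction` values); the helper names `iadd`, `imul` and `irecip` are ours, for illustration only.

```python
from fractions import Fraction

def iadd(x, y):
    return (x[0] + y[0], x[1] + y[1])

def imul(x, y):
    products = [a * b for a in x for b in y]
    return (min(products), max(products))

def irecip(x):
    """Reciprocal of an interval that does not contain zero."""
    return (1 / x[1], 1 / x[0])

x = (Fraction("0.9"), Fraction("1.1"))
y = (Fraction("1.8"), Fraction("2.2"))

# x and y each occur twice: interval dependency widens the result.
naive = imul(imul(x, y), irecip(iadd(x, y)))

# Single use expression: each variable occurs once, so the bounds are sharp.
sue = irecip(iadd(irecip(x), irecip(y)))

print([float(v) for v in naive])
print([float(v) for v in sue])
```

The single use form gives exactly $[3/5, 11/15]$, which the intlab result above encloses after outward rounding, while the form $xy/(x+y)$ gives the noticeably wider interval $[27/55, 121/135] \approx [0.491, 0.896]$.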
1.6 Exercises

1. Write down a polynomial $p(x)$ such that $|S(x) - p(x)| \le 10^{-10}$ for $-0.2 \le x \le 0.2$, where

$$S(x) = \begin{cases} \dfrac{\sin(x)}{x} & \text{if } x \ne 0, \\ 1 & \text{if } x = 0. \end{cases}$$

Note: $\mathrm{sinc}(x) = S(\pi x) = \sin(\pi x)/(\pi x)$ is the "sinc" function (well-known in signal processing, etc.).

(a) Show that your polynomial $p$ satisfies the condition $|\mathrm{sinc}(x) - p(x)| \le 10^{-10}$ for $x \in [-0.2, 0.2]$.
Hint: You can obtain polynomial approximations with error terms for $\mathrm{sinc}(x)$ by writing down Taylor polynomials and corresponding error terms for $\sin(x)$, then dividing these by $x$. This can be easier than trying to differentiate $\mathrm{sinc}(x)$. For the proof part, you can use, for example, the Taylor polynomial remainder formula or the alternating series test, and you can use interval arithmetic to obtain bounds.

(b) Plot your polynomial approximation and $\mathrm{sinc}(x)$ on the same graph,
(i) over the interval $[-0.2, 0.2]$,
(ii) over the interval $[-3, 3]$,
(iii) over the interval $[-10, 10]$.
2. Suppose $f$ has a continuous third derivative. Show that

$$\left| \frac{f(x+h) - f(x-h)}{2h} - f'(x) \right| = O(h^2).$$

3. Suppose $f$ has a continuous fourth derivative. Show that

$$\left| \frac{f(x+h) - 2f(x) + f(x-h)}{h^2} - f''(x) \right| = O(h^2).$$
4. Let $a = 0.41$, $b = 0.36$, and $c = 0.7$. Assuming a 2-digit decimal computer arithmetic with rounding, show that $\dfrac{a - b}{c} \ne \dfrac{a}{c} - \dfrac{b}{c}$ when using this arithmetic.

5. Write down a formula relating the unit roundoff $\delta$ of Definition 1.3 (page 14) and the machine epsilon $\epsilon_m$ defined on page 20.
6. Store and run the following matlab script. What are your results? What does the script compute? Can you say anything about the computer arithmetic underlying matlab?

eps = 1;
x = 1+eps;
while(x~=1)
  eps = eps/2;
  x = 1+eps;
end
eps = eps+(2*eps)^2
y = 1+eps;
y-1
7. Suppose, for illustration, we have a system with base $\beta = 10$, $t = 3$ decimal digits in the mantissa, and $L = -9$, $U = 9$ for the exponent. For example, $0.123 \times 10^4$, that is, 1230, is a machine number in this system. Suppose also that "round to nearest" is used in this system.

(a) What is HUGE for this system?
(b) What is TINY for this system?
(c) What is the machine epsilon $\epsilon_m$ for this system?
(d) Let $f(x) = \sin(x) + 1$.
i. Write down $fl(f(0))$ and $fl(f(0.0008))$ in normalized format for this toy system.
ii. Compute $fl(fl(f(0.0008)) - fl(f(0)))$. On the other hand, what is the nearest machine number to the exact value of $f(0.0008) - f(0)$?
iii. Compute $fl(fl(f(0.0008)) - fl(f(0)))/fl(0.0008)$. Compare this to the nearest machine number to the exact value of $(f(0.0008) - f(0))/0.0008$ and to $f'(0)$.

8. Let $f(x) = \dfrac{\ln(x+1) - \ln(x)}{2}$.

(a) Use four-digit decimal arithmetic with rounding to evaluate $f(100{,}000)$.
(b) Use the Mean Value Theorem to approximate $f(x)$ in a form that avoids the loss of significant digits. Use this form to evaluate $f(x)$ for $x = 100{,}000$ once again.
(c) Compare the relative errors for the answers obtained in (a) and (b).

9. Compute the condition number of $f(x) = e^{\sqrt{x^2 - 1}}$, $x > 1$, and discuss any possible ill-conditioning.

10. Let $f(x) = (\sin(x))^2 + x^2$. Use interval arithmetic to prove that there are no solutions to $f(x) = 0$ for $x \in [-1, -0.8]$.
DEFINITION 1.3 $\delta = \dfrac{p}{2} \beta^{1-t}$ is called the unit roundoff error.

Let $\epsilon = \dfrac{fl(x) - x}{x}$. Then $fl(x) = (1 + \epsilon)x$, where $|\epsilon| \le \delta$. With this, we have the following.

THEOREM 1.6
Let $\odot$ denote the operation $+$, $-$, $\times$, or $\div$, and let $x$ and $y$ be machine numbers. Then

$$fl(x \odot y) = (x \odot y)(1 + \epsilon), \quad \text{where } |\epsilon| \le \delta = \frac{p}{2} \beta^{1-t}.$$

Roundoff error that accumulates as a result of a sequence of arithmetic operations can be analyzed using this theorem. Such an analysis is called forward error analysis.

Because of the properties of floating point arithmetic, it is unreasonable to demand strict tolerances when the exact result is too large.
Example 1.13
Suppose $\beta = 10$ and $t = 3$ (3-digit decimal arithmetic), and suppose we wish to compute $10^4 \pi$ with a computed value $x$ such that $|10^4 \pi - x| < 10^{-2}$. The closest floating point number in our system to $10^4 \pi$ is $x = 0.314 \times 10^5 = 31400$. However, $|10^4 \pi - x| \approx 15.926$. Hence, it is impossible to find a number $x$ in the system with $|10^4 \pi - x| < 10^{-2}$.

The error $|10^4 \pi - x|$ in this example is called the absolute error in approximating $10^4 \pi$. We see that absolute error is not an appropriate measure of error when using floating point arithmetic. For this reason, we use relative error:
DEFINITION 1.4 Let $x^*$ be an approximation to $x$. Then $|x - x^*|$ is called the absolute error, and $\left| \dfrac{x - x^*}{x} \right|$ is called the relative error.

For example, $\left| \dfrac{x - fl(x)}{x} \right| \le \delta = \dfrac{p}{2} \beta^{1-t}$ (unit roundoff error).
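The contrast between absolute and relative error in Example 1.13 can be reproduced with a small sketch. It is written in Python with the standard `decimal` module; the choice of round-to-nearest (so that $p = 1$) and the 17-digit value used for $10^4\pi$ are our assumptions.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

beta, t = 10, 3
ctx = Context(prec=t, rounding=ROUND_HALF_EVEN)    # 3-digit decimal system

exact = Decimal("31415.926535897932")              # 10^4 * pi, truncated
x = ctx.plus(exact)                                # nearest machine number

abs_err = abs(exact - x)
rel_err = abs_err / exact
delta = Decimal("0.5") * Decimal(beta) ** (1 - t)  # unit roundoff with p = 1

print(x, abs_err, rel_err, delta)
```

The absolute error is far larger than the requested tolerance $10^{-2}$, while the relative error stays below the unit roundoff $\delta = 0.005$, which is the best any 3-digit system can promise.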
1.2.1.1 Expression Evaluation and Condition Numbers

We now examine some common situations in which roundoff error can become large, and explain how to avoid many of these situations.
Example 1.14
$\beta = 10$, $t = 4$, $p = 1$. (Thus, $\delta = \frac{1}{2} 10^{-3} = 0.0005$.) Let $x = 0.5795 \times 10^5$, $y = 0.6399 \times 10^5$. Then

$$fl(x + y) = 0.1219 \times 10^6 = (x + y)(1 + \epsilon_1), \quad \epsilon_1 \approx -3.28 \times 10^{-4}, \quad |\epsilon_1| < \delta,$$

and

$$fl(xy) = 0.3708 \times 10^{10} = (xy)(1 + \epsilon_2), \quad \epsilon_2 \approx -5.95 \times 10^{-5}, \quad |\epsilon_2| < \delta.$$

(Note: $x + y = 0.12194 \times 10^6$, $xy = 0.37082205 \times 10^{10}$.)
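The numbers in Example 1.14 can be checked with a few lines of Python. This is a sketch assuming round-to-nearest 4-digit decimal arithmetic, emulated with a `decimal` context; the variable names are ours.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

ctx = Context(prec=4, rounding=ROUND_HALF_EVEN)   # beta = 10, t = 4, p = 1
delta = Decimal("0.0005")                          # unit roundoff

x = Decimal("57950")    # 0.5795 x 10^5
y = Decimal("63990")    # 0.6399 x 10^5

fl_sum = ctx.add(x, y)          # rounded to 4 significant digits
fl_prod = ctx.multiply(x, y)

# Exact results use Python's default 28-digit context, ample for these inputs.
eps_sum = (fl_sum - (x + y)) / (x + y)
eps_prod = (fl_prod - (x * y)) / (x * y)

print(fl_sum, eps_sum)
print(fl_prod, eps_prod)
```

Both relative errors come out negative and smaller than $\delta$ in magnitude, in agreement with Theorem 1.6.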
Example 1.15
Suppose $\beta = 10$ and $t = 4$ (4 digit arithmetic), and suppose $x_1 = 10000$ and $x_2 = x_3 = \cdots = x_{1001} = 1$. Then

$$fl(x_1 + x_2) = 10000,$$
$$fl(x_1 + x_2 + x_3) = 10000,$$
$$\vdots$$
$$fl\left( \sum_{i=1}^{1001} x_i \right) = 10000,$$

when we sum forward from $x_1$. But going backwards,

$$fl(x_{1001} + x_{1000}) = 2,$$
$$fl(x_{1001} + x_{1000} + x_{999}) = 3,$$
$$\vdots$$
$$fl\left( \sum_{i=1001}^{1} x_i \right) = 11000,$$

which is the correct sum.

This example illustrates the point that large relative errors occur when a large number of small numbers is added to a large number, or when a very large number of small, almost-equal numbers is added. To avoid such large relative errors, one can sum from the smallest number to the largest number. However, this will not work if the numbers are all approximately equal. In such cases, one possibility is to group the numbers into sets of two adjacent numbers, summing two almost equal numbers together. One then groups those results into sets of two and sums these together, continuing until the total sum is reached. In this scheme, two almost equal numbers are always being summed, and the large relative error from repeatedly summing a small number to a large number is avoided.
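Both summation orders of Example 1.15, and the pairing scheme just described, can be tried out in emulated 4-digit arithmetic. The following Python sketch assumes round-to-nearest; the function names `fl_sum` and `pairwise_sum` and the test data of 20000 ones are ours, chosen for illustration.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

ctx = Context(prec=4, rounding=ROUND_HALF_EVEN)   # beta = 10, t = 4

def fl_sum(values):
    """Left-to-right summation, rounding after every addition."""
    total = Decimal(0)
    for v in values:
        total = ctx.add(total, v)
    return total

def pairwise_sum(values):
    """Sum adjacent pairs repeatedly until one number remains."""
    values = list(values)
    while len(values) > 1:
        paired = [ctx.add(values[i], values[i + 1])
                  for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:           # carry an unpaired last element along
            paired.append(values[-1])
        values = paired
    return values[0]

xs = [Decimal(10000)] + [Decimal(1)] * 1000
forward = fl_sum(xs)                  # large number first
backward = fl_sum(reversed(xs))       # small numbers first

ones = [Decimal(1)] * 20000           # many equal small numbers
naive_ones = fl_sum(ones)
paired_ones = pairwise_sum(ones)

print(forward, backward, naive_ones, paired_ones)
```

Summing forward stalls at 10000 and summing backward gives 11000, as in the example. For the 20000 ones, the left-to-right sum stalls once it reaches 10000, whereas the pairwise result stays within a small relative error of the true sum.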
Example 1.16
$x_1 = 15.314768$, $x_2 = 15.314899$, $\beta = 10$, $t = 6$ (6-digit decimal accuracy). Then $x_2 - x_1 \approx fl(x_2) - fl(x_1) = 15.3149 - 15.3148 = 0.0001$. Thus,

$$\left| \frac{x_2 - x_1 - (fl(x_2) - fl(x_1))}{x_2 - x_1} \right| = \frac{0.000131 - 0.0001}{0.000131} = 0.237 = 23.7\% \text{ relative accuracy.}$$

This example illustrates that large relative errors can occur when two nearly equal numbers are subtracted on a computer. Sometimes, an algorithm can be modified to reduce rounding error occurring from this source, as the following example illustrates.
Example 1.17
Consider finding the roots of $ax^2 + bx + c = 0$, where $b^2$ is large compared with $|4ac|$. The most common formula for the roots is

$$x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.$$

Consider $x^2 + 100x + 1 = 0$, $\beta = 10$, $t = 4$, $p = 2$, and 4-digit chopped arithmetic. Then

$$x_1 = \frac{-100 + \sqrt{9996}}{2}, \quad x_2 = \frac{-100 - \sqrt{9996}}{2},$$

but $\sqrt{9996} \approx 99.97$ (4 digit arithmetic, chopped). Thus,

$$x_1 \approx \frac{-100 + 99.97}{2}, \quad x_2 \approx \frac{-100 - 99.97}{2}.$$

Hence, $x_1 \approx -0.015$, $x_2 \approx -99.98$, but $x_1 = -0.010001$ and $x_2 = -99.989999$, so the relative errors in $x_1$ and $x_2$ are 50% and 0.01%, respectively.

Let's change the algorithm. Assume $b \ge 0$ (can always make $b \ge 0$). Then

$$x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \cdot \frac{-b - \sqrt{b^2 - 4ac}}{-b - \sqrt{b^2 - 4ac}} = \frac{4ac}{2a(-b - \sqrt{b^2 - 4ac})} = \frac{-2c}{b + \sqrt{b^2 - 4ac}},$$

and

$$x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a} \quad \text{(the same as before).}$$
Then, for the above values,

$$x_1 = \frac{-2(1)}{100 + \sqrt{9996}} \approx \frac{-2}{100 + 99.97} = -0.0100.$$

Now, the relative error in $x_1$ is also 0.01%.
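The two formulas for $x_1$ can be compared in emulated 4-digit chopped arithmetic. The Python sketch below makes some assumptions of its own: chopping is modeled by a `decimal` context with `ROUND_DOWN`, every intermediate result is chopped, and the square root is computed to 30 digits and then chopped, because `decimal`'s own `sqrt` always rounds to nearest.

```python
from decimal import Decimal, Context, ROUND_DOWN

CHOP = Context(prec=4, rounding=ROUND_DOWN)   # 4-digit chopped arithmetic
HIGH = Context(prec=30)                       # reference precision

a, b, c = Decimal(1), Decimal(100), Decimal(1)

disc = CHOP.subtract(CHOP.multiply(b, b),
                     CHOP.multiply(CHOP.multiply(Decimal(4), a), c))
root = CHOP.plus(disc.sqrt(HIGH))             # sqrt(9996) chopped to 4 digits

# Standard formula: subtracts two nearly equal numbers.
x1_naive = CHOP.divide(CHOP.add(-b, root), CHOP.multiply(Decimal(2), a))

# Rearranged formula: no cancellation.
x1_stable = CHOP.divide(CHOP.multiply(Decimal(-2), c), CHOP.add(b, root))

x1_true = (-b + disc.sqrt(HIGH)) / (2 * a)    # accurate reference value
rel_naive = abs((x1_naive - x1_true) / x1_true)
rel_stable = abs((x1_stable - x1_true) / x1_true)

print(x1_naive, rel_naive)
print(x1_stable, rel_stable)
```

With these assumptions the standard formula gives $-0.015$ (relative error about 50%), while the rearranged formula gives $-0.0100$ (relative error about 0.01%).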
Let us now consider error in function evaluation. Consider a single valued function $f(x)$ and let $x^* = fl(x)$ be the floating point approximation of $x$. Therefore the machine evaluates $f(x^*) = f(fl(x))$, which is an approximate value of $f(x)$ at $x = x^*$. Then the perturbation in $f(x)$ for small perturbations in $x$ can be computed via Taylor's formula. This is illustrated in the next theorem.

THEOREM 1.7 The relative error in functional evaluation is

$$\left| \frac{f(x) - f(x^*)}{f(x)} \right| \approx \left| \frac{x f'(x)}{f(x)} \right| \left| \frac{x - x^*}{x} \right|.$$

PROOF The linear Taylor approximation of $f(x^*)$ about $x$ for small values of $|x - x^*|$ is given by $f(x^*) \approx f(x) + f'(x)(x^* - x)$. Rearranging the terms immediately yields the result.

This leads us to the following definition.

DEFINITION 1.5 The condition number of a function $f(x)$ is

$$\kappa_f(x) = \left| \frac{x f'(x)}{f(x)} \right|.$$

The condition number describes how large the relative error in function evaluation is with respect to the relative error in the machine representation of $x$. In other words, $\kappa_f(x)$ is a measure of the degree of sensitivity of the function at $x$.
Example 1.18
Let $f(x) = \sqrt{x}$. The condition number of $f(x)$ about $x$ is

$$\kappa_f(x) = \left| \frac{x \cdot \frac{1}{2\sqrt{x}}}{\sqrt{x}} \right| = \frac{1}{2}.$$

This suggests that $f(x)$ is well-conditioned.
Example 1.19
Let $f(x) = \sqrt{x - 2}$. The condition number of $f(x)$ about $x$ is

$$\kappa_f(x) = \left| \frac{x}{2(x - 2)} \right|.$$

This is not defined at $x^* = 2$. Hence the function $f(x)$ is numerically unstable and ill-conditioned for values of $x$ close to 2.
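The condition numbers in Examples 1.18 and 1.19 are easy to evaluate numerically. The Python sketch below uses ordinary double precision; the helper `condition_number`, the sample points, and the size of the perturbation are our own choices for illustration.

```python
import math

def condition_number(f, df, x):
    """kappa_f(x) = |x f'(x) / f(x)|, with the derivative df supplied."""
    return abs(x * df(x) / f(x))

# Example 1.18: f(x) = sqrt(x) has condition number 1/2 everywhere.
kappa_sqrt = condition_number(math.sqrt, lambda x: 0.5 / math.sqrt(x), 9.0)

# Example 1.19: f(x) = sqrt(x - 2) near x = 2.
f = lambda x: math.sqrt(x - 2.0)
df = lambda x: 0.5 / math.sqrt(x - 2.0)
x = 2.0001
kappa_near_2 = condition_number(f, df, x)

# Check Theorem 1.7: perturb x by a relative amount of about 1e-8.
x_star = x * (1.0 + 1e-8)
rel_in = abs(x - x_star) / abs(x)
rel_out = abs(f(x) - f(x_star)) / abs(f(x))

print(kappa_sqrt, kappa_near_2, rel_out / rel_in)
```

At $x = 2.0001$ the condition number is about $10^4$, and the observed amplification `rel_out / rel_in` is close to that value, so a relative perturbation of $10^{-8}$ in $x$ costs roughly four digits in $f(x)$.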
REMARK 1.2 If $x = f(x) = 0$, then the condition number is simply $|f'(x)|$. If $x = 0$, $f(x) \ne 0$ (or $f(x) = 0$, $x \ne 0$), then it is more useful to consider the relation between absolute errors than relative errors. The condition number then becomes $|f'(x)/f(x)|$.

REMARK 1.3 Generally, if a numerical approximation $\tilde{z}$ to a quantity $z$ is computed, the relative error is related to the number of digits after the decimal point that are correct. For example, if $z = 0.0000123453$ and $\tilde{z} = 0.00001234543$, we say that $\tilde{z}$ is correct to 5 significant digits. Expressing $z$ as $0.123453 \times 10^{-4}$ and $\tilde{z}$ as $0.123454 \times 10^{-4}$, we see that if we round $\tilde{z}$ to the nearest number with five digits in its mantissa, all of those digits are correct, whereas, if we do the same with six digits, the sixth digit is not correct. Significant digits is the more logical way to talk about accuracy in a floating point computation where we are interested in relative error, rather than "number of digits after the decimal point," which can have a different meaning. (Here, one might say that $\tilde{z}$ is correct to 9 digits after the decimal point.)
1.2.2 Practicalities and the IEEE Floating Point Standard

Prior to 1985, different machines used different word lengths and different bases, and different machines rounded, chopped, or did something else to form the internal representation $fl(x)$ for real numbers $x$. For example, IBM mainframes generally used hexadecimal arithmetic ($\beta = 16$), with 8 hexadecimal digits total (for the base, sign, and exponent) in "single precision" numbers and 16 hexadecimal digits total in "double precision" numbers. Machines such as the Univac 1108 and Honeywell Multics systems used base $\beta = 2$ and 36 binary digits (or "bits") total in single precision numbers and 72 binary digits total in double precision numbers. An unusual machine designed at Moscow State University from 1955 to 1965, the "Setun," even used base-3 ($\beta = 3$, or "ternary") numbers. Some computers had 32 bits total in single precision numbers and 64 bits total in double precision numbers, while some "supercomputers" (such as the Cray-1) had 64 bits total in single precision numbers and 128 bits total in double precision numbers.

Some hand-held calculators in existence today (such as some Texas Instruments calculators) can be viewed as implementing decimal (base 10, $\beta = 10$) arithmetic, say, with $L = -999$ and $U = 999$, and $t = 14$ digits in the mantissa.

Except for the Setun (the value of whose ternary digits corresponded to "positive," "negative," and "neutral" in circuit elements or switches), digital computers are mostly based on binary switches or circuit elements (that is, "on" or "off"), so the base $\beta$ is usually 2 or a power of 2. For example, the IBM hexadecimal digit could be viewed as a group of 4 binary digits.²
Older floating point implementations did not even always fit exactly into the model we have previously described. For example, if $x$ is a number in the system, then $-x$ may not have been a number in the system, or, if $x$ were a number in the system, then $1/x$ may have been too large to be representable in the system.

To promote predictability, portability, reliability, and rigorous error bounding in floating point computations, the Institute of Electrical and Electronics Engineers (IEEE) and American National Standards Institute (ANSI) published a standard for binary floating point arithmetic in 1985: IEEE/ANSI 754-1985: Standard for Binary Floating Point Arithmetic, often referenced as "IEEE-754," or simply "the IEEE standard."³ Almost all computers in existence today, including personal computers and workstations based on Intel, AMD, Motorola, etc. chips, implement most of the IEEE standard.

In this standard, $\beta = 2$, 32 bits total are used in a single precision number (an "IEEE single"), and 64 bits total are used for a double precision number ("IEEE double"). In a single precision number, 1 bit is used for the sign, 8 bits are used for the exponent, and $t = 23$ bits are used for the mantissa. In double precision numbers, 1 bit is used for the sign, 11 bits are used for the exponent, and 52 bits are used for the mantissa. Thus, for single precision numbers, the exponent is between 0 and $(11111111)_2 = 255$, and 127 is subtracted from this, to get an exponent between $-127$ and 128. In IEEE numbers, the minimum and maximum exponent are used to denote special symbols (such as infinity and "unnormalized" numbers), so the exponent in single precision represents magnitudes between $2^{-126} \approx 10^{-38}$ and $2^{127} \approx 10^{38}$. The mantissa for single precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{23} 2^{-i} = 2(1 - 2^{-24}) \approx 2$. Similarly, the exponent for double precision numbers is effectively between $2^{-1022} \approx 10^{-308}$ and $2^{1023} \approx 10^{308}$, while the mantissa for double precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{52} 2^{-i} \approx 2$.
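The bit layout of an IEEE single can be inspected directly. The sketch below uses Python's standard `struct` module to reinterpret the 32 bits of a single precision number; the function name `decode_single` is ours, and the sketch assumes the exponent bias of 127 used by IEEE 754 single precision and handles only normalized numbers.

```python
import struct

def decode_single(value):
    """Split an IEEE single into (sign, biased exponent, 23-bit fraction)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                     # 1 bit
    biased_exponent = (bits >> 23) & 0xFF # 8 bits
    fraction = bits & 0x7FFFFF            # 23 stored mantissa bits
    return sign, biased_exponent, fraction

def rebuild(sign, biased_exponent, fraction):
    """Value of a normalized single: (-1)^s * (1 + f/2^23) * 2^(e - 127)."""
    return (-1.0) ** sign * (1.0 + fraction / 2.0**23) * 2.0 ** (biased_exponent - 127)

for v in (1.0, -6.5, 0.15625):
    fields = decode_single(v)
    print(v, fields, rebuild(*fields))
```

For example, $-6.5 = -1.625 \times 2^2$ is stored with sign bit 1, biased exponent $129 = 2 + 127$, and fraction bits encoding $0.625$; the leading 1 of the mantissa is implicit and not stored.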
Summarizing, the parameters for IEEE arithmetic appear in Table 1.1.

TABLE 1.1: Parameters for IEEE arithmetic

precision   β    L       U      t
single      2    -126    127    24
double      2    -1022   1023   53

In many numerical computations, such as solving the large linear systems arising from partial differential equation models, more digits or a larger exponent range is required than is available with IEEE single precision. For this reason, many numerical analysts at present have adopted IEEE double precision as the default precision. For example, underlying computations in the popular computational environment matlab are done in IEEE double precision.

² An exception is in some systems for business calculations, where base 10 is implemented.
³ An update to the 1985 standard was made in 2008. This update gives clarifications of certain ambiguous points, provides certain extensions, and specifies a standard for decimal arithmetic.
IEEE arithmetic provides four ways of defining $fl(x)$, that is, four “rounding modes,” namely “round down,” “round up,” “round to nearest,” and “round to zero.” The four elementary operations $+$, $-$, $\times$, and $/$ must be such that $fl(x \odot y)$ is implemented for all four rounding modes, for $\odot \in \{-, +, \times, /, \sqrt{\cdot}\}$.
The default mode (if the rounding mode is not explicitly set) is normally “round to nearest,” to give an approximation after a long string of computations that is hopefully near the exact value. If the mode is set to “round down” and a string of computations is done, then the result is less than or equal to the exact result. Similarly, if the mode is set to “round up,” then the result of a string of computations is greater than or equal to the exact result. In this way, mathematically rigorous bounds on an exact result can be obtained. (This technique must be used astutely, since naive use could result in bounds that are too large to be meaningful.)
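The bounding effect of “round down” and “round up” can be imitated with a small experiment. The sketch below is in Python, not matlab, and it does not switch the hardware rounding mode; instead it assumes that 4-digit decimal arithmetic from the standard `decimal` module is an acceptable stand-in. The function name `sum_of_thirds` is hypothetical.

```python
from decimal import Decimal, localcontext, ROUND_FLOOR, ROUND_CEILING

def sum_of_thirds(n, rounding):
    """Add 1/3 to itself n times with 4 significant digits and the given rounding."""
    with localcontext() as ctx:
        ctx.prec = 4
        ctx.rounding = rounding
        third = Decimal(1) / Decimal(3)   # rounded according to ctx.rounding
        total = Decimal(0)
        for _ in range(n):
            total = total + third         # each addition is rounded as well
        return total

# The exact value of 30 * (1/3) is 10.
lower = sum_of_thirds(30, ROUND_FLOOR)    # every operation rounded down
upper = sum_of_thirds(30, ROUND_CEILING)  # every operation rounded up
print(lower, upper)
```

Because all quantities here are positive and only additions are used, the two results enclose the exact value 10. With subtractions or longer computations, the enclosure needs more care, which is the subject of interval arithmetic.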
Several parameters more directly related to numerical computations than $L$, $U$, and $t$ are associated with any floating point number system. These are
HUGE: the largest representable number in the floating point system,
TINY: the smallest positive normalized number representable in the floating point system,
$\epsilon_m$: the machine epsilon, the smallest positive number which, when added to 1, gives something other than 1 when using the rounding mode “round to the nearest.”
These so-called “machine constants” appear in Table 1.2 for the IEEE single and IEEE double precision number systems.
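The machine constants for IEEE double precision can be checked directly. The following is a minimal sketch in Python (not the matlab used in this book), assuming the interpreter's `float` is an IEEE double, as it is on common platforms. Note that `sys.float_info.epsilon` is the spacing $2^{-52}$ between 1 and the next larger double, which differs from $\epsilon_m$ as defined above.

```python
import sys

HUGE = sys.float_info.max      # largest finite double
TINY = sys.float_info.min      # smallest positive normalized double
# Candidate for the machine epsilon as defined above (round to nearest):
eps_m = 2.0**-53 + 2.0**-105

print(HUGE, TINY, eps_m)
print(1.0 + eps_m != 1.0)      # adding eps_m changes 1
print(1.0 + 2.0**-53 == 1.0)   # the next smaller double, 2**-53, does not
```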
TABLE 1.2: Machine constants for IEEE arithmetic
Precision   HUGE   TINY   $\epsilon_m$
single   $(2 - 2^{-23}) \cdot 2^{127} \approx 3.40 \cdot 10^{38}$   $2^{-126} \approx 1.18 \cdot 10^{-38}$   $2^{-24} + 2^{-47} \approx 5.96 \cdot 10^{-8}$
double   $(2 - 2^{-52}) \cdot 2^{1023} \approx 1.79 \cdot 10^{308}$   $2^{-1022} \approx 2.23 \cdot 10^{-308}$   $2^{-53} + 2^{-105} \approx 1.11 \cdot 10^{-16}$
For IEEE arithmetic, 1/TINY < HUGE, but 1/HUGE < TINY. This brings up the question of what happens when the result of a computation has absolute value less than the smallest number representable in the system, or has absolute value greater than the largest number representable in the system. In the first case, an underflow occurs, while in the second case, an overflow occurs. In floating point computations, it is usually (but not always) reasonable to replace the result of an underflow by 0, but it is usually more problematical when an overflow occurs. Many systems prior to the IEEE standard replaced an underflow by 0, but stopped when an overflow occurred.
The IEEE standard specifies representations for special numbers $\infty$, $-\infty$, $+0$, $-0$, and NaN, where the latter represents “not a number.” The standard specifies that computations do not stop when an overflow or underflow occurs, or when quantities such as $\sqrt{-1}$, $1/0$, $-1/0$, etc. are encountered (although many programming languages by default or optionally do stop). For example, the result of an overflow is set to $\infty$, whereas the result of $\sqrt{-1}$ is set to NaN, and computation continues. The standard also specifies “gradual underflow,” that is, setting the result to a “denormalized” number, or a number in the floating point format whose first digit in the mantissa is equal to 0. Computation rules for these special numbers, such as NaN $\times$ any number = NaN, $\infty$ $\times$ any positive normalized number = $\infty$, allow such “nonstop” arithmetic.
Although the IEEE nonstop arithmetic is useful in many contexts, the numerical analyst should be aware of it and be cautious in interpreting results. In particular, algorithms may not behave as expected if many intermediate results contain $\infty$ or NaN, and the accuracy is less than expected when denormalized numbers are used. In fact, many programming languages, by default or with a controllable option, stop if $\infty$ or NaN occurs, but implement IEEE nonstop arithmetic with an option.
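Both behaviors, nonstop arithmetic and stopping, can be seen in a single language. The sketch below uses Python (not matlab) as one example of a language that continues with $\infty$ or NaN for some operations but raises an exception for others. The particular operations chosen are only illustrations.

```python
import math

inf = float("inf")
nan = float("nan")

print(1e308 * 10)            # overflow in a product is set to inf
print(inf + inf, inf * 2.0)  # inf propagates through further arithmetic
print(nan * 3.0)             # NaN times any number is NaN
print(nan == nan)            # NaN compares unequal even to itself
print(2.0**-1022 / 4.0)      # gradual underflow: a denormalized number, not 0

# Python itself "stops" (raises an exception) in some cases:
for name, thunk in [("sqrt(-1)", lambda: math.sqrt(-1.0)),
                    ("1/0", lambda: 1.0 / 0.0)]:
    try:
        thunk()
    except (ValueError, ZeroDivisionError) as err:
        print(name, "raised", type(err).__name__)
```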
Example 1.20
IEEE double precision floating point arithmetic underlies most computations in matlab. (This is true even if only four decimal digits are displayed.) One obtains the machine epsilon with the function eps, one obtains TINY with the function realmin, and one obtains HUGE with the function realmax. Observe the following matlab dialog:
>> epsm = eps(1d0)
epsm = 2.2204e-016
>> TINY = realmin
TINY = 2.2251e-308
>> HUGE = realmax
HUGE = 1.7977e+308
>> 1/TINY
ans = 4.4942e+307
>> 1/HUGE
ans = 5.5627e-309
>> HUGE^2
ans = Inf
>> TINY^2
ans = 0
>> new_val = 1+epsm
new_val = 1.0000
>> new_val - 1
ans = 2.2204e-016
>> too_small = epsm/2
too_small = 1.1102e-016
>> not_new = 1+too_small
not_new = 1
>> not_new - 1
ans = 0
>>
Example 1.21
(Illustration of underflow and overflow) Suppose, for the purposes of illustration, we have a system with $\beta = 10$, $t = 2$, and one digit in the exponent, so that the positive numbers in the system range from $0.10 \times 10^{-9}$ to $0.99 \times 10^{9}$, and suppose we wish to compute $N = \sqrt{x_1^2 + x_2^2}$, where $x_1 = x_2 = 10^6$. Then both $x_1$ and $x_2$ are exactly represented in the system, and the nearest floating point number in the system to $N$ is $0.14 \times 10^7$, well within range. However, $x_1^2 = 10^{12}$, larger than the maximum floating point number in the system.
In older systems, an overflow usually would result in stopping the computation, while in IEEE arithmetic, the result would be assigned the symbol “Infinity.” The result of adding “Infinity” to “Infinity,” then taking the square root, would be “Infinity,” so that $N$ would be assigned “Infinity.” Similarly, if $x_1 = x_2 = 10^{-6}$, then $x_1^2 = 10^{-12}$, smaller than the smallest representable machine number, causing an “underflow.” On older systems, the result is usually set to 0. On IEEE systems, if “gradual underflow” is switched on, the result either becomes a denormalized number, with less than full accuracy, or is set to 0; without gradual underflow on IEEE systems, the result is set to 0. When the result is set to 0, a value of 0 is stored in $N$, whereas the closest floating point number in the system is $0.14 \times 10^{-5}$, well within range. To avoid this type of catastrophic underflow and overflow in the computation of $N$, we may use the following scheme.
1. $s \leftarrow \max\{|x_1|, |x_2|\}$.
2. $\eta_1 \leftarrow x_1/s$; $\eta_2 \leftarrow x_2/s$.
3. $N \leftarrow s\sqrt{\eta_1^2 + \eta_2^2}$.
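The three-step scheme above can be sketched as follows. The sketch is in Python rather than matlab and works in IEEE double precision instead of the two-digit decimal system of the example, so the magnitudes $10^{200}$ and $10^{-200}$ are chosen to provoke overflow and underflow there. The function names `scaled_norm` and `naive_norm` are hypothetical, and the guard for $s = 0$ is an addition not present in the scheme.

```python
import math

def scaled_norm(x1, x2):
    """Compute sqrt(x1**2 + x2**2) with the scaling scheme above."""
    s = max(abs(x1), abs(x2))          # step 1
    if s == 0.0:                       # added guard: avoid dividing by 0
        return 0.0
    eta1, eta2 = x1 / s, x2 / s        # step 2: both ratios lie in [-1, 1]
    return s * math.sqrt(eta1**2 + eta2**2)   # step 3

def naive_norm(x1, x2):
    """Direct evaluation, which can overflow or underflow in the squares."""
    return math.sqrt(x1 * x1 + x2 * x2)

big, small = 1e200, 1e-200
print(naive_norm(big, big), scaled_norm(big, big))
print(naive_norm(small, small), scaled_norm(small, small))
```

In practice one would likely call a library routine such as Python's `math.hypot`, which is designed for the same purpose.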
1.2.2.1 Input and Output
For examining the output to large numerical computations arising from mathematical models, plots, graphs, and movies comprised of such plots and graphs are often preferred over tables of values. However, to develop such models and study numerical algorithms, it is necessary to examine individual numbers. Because humans are trained to comprehend decimal numbers more easily than binary numbers, the binary format used in the machine is usually converted to a decimal format for display or printing. In many programming languages and environments (such as all versions of Fortran, C, C++, and in matlab), the format is of a form similar to $\pm d_1.d_2 d_3 \ldots d_m\,\mathrm{e}\pm\delta_1\delta_2\delta_3$, or $\pm d_1.d_2 d_3 \ldots d_m\,\mathrm{E}\pm\delta_1\delta_2\delta_3$, where the “e” or “E” denotes the “exponent” of 10. For example, -1.00e+003 denotes $-1 \times 10^3 = -1000$. Numbers are usually also input either in a standard decimal form (such as 0.001), or in this exponential format (such as 1.0e-3). (This notation originates from the earliest computers, where the only output was a printer, and the printer could only print numerical digits and the 26 upper case letters in the Roman alphabet.)
Thus, for input, a decimal fraction needs to be converted to a binary floating point number, while for output, a binary floating point number needs to be converted to a decimal fraction. This conversion necessarily is inexact. For example, the exact decimal fraction 0.1 converts to the infinitely repeating binary expansion $(0.0\overline{0011})_2$, which needs to be rounded into the binary floating point system. The IEEE 754 standard specifies that the result of a decimal to binary conversion, within a specified range of input formats, be the nearest floating point number to the exact result, over a specified range, and that, within a specified range of formats, a binary to decimal conversion be the nearest number in the specified format (which depends on the number $m$ of decimal digits requested to be printed).
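The inexact conversion of 0.1 can be observed directly. The following is a minimal sketch in Python (not matlab), assuming that Python's `float` is an IEEE double. `Decimal` and `Fraction` are used only to display the stored binary value exactly.

```python
from decimal import Decimal
from fractions import Fraction

x = 0.1                        # decimal literal, rounded to the nearest double
print(Decimal(x))              # exact decimal expansion of the stored double
print(Fraction(x) == Fraction(1, 10))   # the stored value is not exactly 1/10
print(f"{x:.3e}", f"{x:.20e}")          # the same double printed with 4 and 21 digits
print(0.1 + 0.2 == 0.3)        # a familiar consequence of the inexact conversion
```

The two printed forms of `x` show how the displayed number depends on the number of decimal digits requested.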
Thus, the number that one sees as output is usually not exactly the number that is represented in the computer. Furthermore, while the floating point operations on binary numbers are usually implemented in hardware or “firmware” independently of the software system, the decimal to binary and binary to decimal conversions are usually implemented separately as part of the programming language (such as Fortran, C, C++, Java, etc.) or software system (such as matlab). The individual standards for these languages, if there are any, may not specify accuracy for such conversions, and the languages sometimes do not conform to the IEEE standard. That is, the number that one sees printed may not even be the closest number in that format to the actual number.
This inexactness in conversion usually does not cause a problem, but may cause much confusion in certain instances. In those instances (such as in “debugging,” or finding programming blunders), one may need to examine the binary numbers directly. One way of doing this is in an “octal,” or base-8 format, in which each digit (between 0 and 7) is interpreted as a group of three binary digits, or in hexadecimal format (where the digits are 0-9, A, B, C, D, E, F), in which each digit corresponds to a group of four binary digits.
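As an illustration of examining the binary representation directly, the sketch below prints the 64-bit pattern of a double in hexadecimal and in binary. It is in Python rather than matlab, and the helper names `bits_hex` and `bits_binary` are hypothetical.

```python
import struct

def bits_hex(x):
    """Hexadecimal digits of the 64-bit pattern of an IEEE double."""
    return struct.pack(">d", x).hex()

def bits_binary(x):
    """The same bit pattern as a string of 64 binary digits."""
    return format(int(bits_hex(x), 16), "064b")

print(bits_hex(0.1))      # each hex digit stands for four binary digits
print(bits_binary(0.1))
print((0.1).hex())        # Python's built-in hexadecimal floating point notation
```

The repeating hexadecimal digit 9 in the output for 0.1 corresponds to the repeating binary pattern 1001, that is, to the repeating expansion of 0.1 in base 2.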
1.2.2.2 Standard Functions
To enable accurate computation of elementary functions such as sin, cos, and exp, IEEE 754 specifies that a “long” 80-bit register (with “guard digits”) be available for intermediate computations. Furthermore, IEEE 754-2008, an official update to IEEE 754-1985, provides a list of functions it recommends be implemented, and specifies accuracy requirements (in terms of correct rounding), for those functions a programming language elects to implement.
REMARK 1.4 Alternative number systems, such as variable precision arithmetic, multiple precision arithmetic, rational arithmetic, and combinations of approximate and symbolic arithmetic, have been investigated and implemented. These have various advantages over the traditional floating point arithmetic we have been discussing, but also have disadvantages, and usually require more time, more circuitry, or both. Eventually, with the advance of computer hardware and better understanding of these alternative systems, their use may become more ubiquitous. However, for the foreseeable future, traditional floating point number systems will be the primary tool in numerical computations.
1.3 Interval Computations
Interval computations are useful for two main purposes:
• to use floating point computations to compute mathematically rigorous bounds on an exact result (and hence to rigorously bound roundoff error);
• to use floating point computations to compute mathematically rigorous bounds on the ranges of functions over boxes.
In complicated traditional floating point algorithms, naive arrangement of interval computations usually gives bounds that are too wide to be of practical use. For this reason, interval computations have been ignored by many. However, used cleverly and where appropriate, interval computations are powerful, and provide rigor and validation when other techniques cannot.
Interval computations are based on interval arithmetic.
1.3.1 Interval Arithmetic
In interval arithmetic, we define operations on intervals, which can be considered as ordered pairs of real numbers. We can think of each interval as representing the range of possible values of a quantity. The result of an operation is then an interval that represents the range of possible results of the operation as the range of all possible values, as the first argument ranges over all points in the first interval and the second argument ranges over all values in the second interval. To state this symbolically, let $\boldsymbol{x} = [\underline{x}, \overline{x}]$ and $\boldsymbol{y} = [\underline{y}, \overline{y}]$, and define the four elementary operations by
$$\boldsymbol{x} \odot \boldsymbol{y} = \{\, x \odot y \mid x \in \boldsymbol{x} \text{ and } y \in \boldsymbol{y} \,\} \quad \text{for } \odot \in \{+, -, \times, \div\}. \tag{1.1}$$
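A rough sketch of the set definition (1.1) in terms of interval endpoints follows. It is written in Python (version 3.9 or later, for `math.nextafter`), not matlab, and represents an interval as a pair `(lo, hi)`. True interval arithmetic would use the “round down” and “round up” modes; as an assumption made for this sketch, each computed endpoint is instead moved outward by one floating point number, which gives slightly wider but still enclosing bounds. All function names are hypothetical.

```python
import math
from itertools import product

def outward(lo, hi):
    """Widen [lo, hi] by one floating point number on each side."""
    return (math.nextafter(lo, -math.inf), math.nextafter(hi, math.inf))

def iadd(x, y):
    return outward(x[0] + y[0], x[1] + y[1])

def isub(x, y):
    return outward(x[0] - y[1], x[1] - y[0])

def imul(x, y):
    p = [a * b for a, b in product(x, y)]   # the four endpoint products
    return outward(min(p), max(p))

def idiv(x, y):
    if y[0] <= 0.0 <= y[1]:
        raise ZeroDivisionError("divisor interval contains 0")
    q = [a / b for a, b in product(x, y)]   # the four endpoint quotients
    return outward(min(q), max(q))

x, y = (1.0, 2.0), (-3.0, 4.0)
print(iadd(x, y), isub(x, y), imul(x, y))
print(idiv(x, (2.0, 4.0)))
```

For the intervals shown, the exact ranges are $[-2, 6]$, $[-3, 5]$, $[-6, 8]$, and $[0.25, 1]$, and each computed interval contains the corresponding exact range.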
Interval arithmetic’s usefu