
8132019 num chap 1

Chapter 1

Mathematical Review and Computer Arithmetic

1.1 Mathematical Review

The tools of scientific, engineering, and operations research computing are firmly based in the calculus. In particular, formulating and solving mathematical models in these areas involves approximation of quantities, such as integrals, derivatives, solutions to differential equations, and solutions to systems of equations, first seen in a calculus course. Indeed, techniques from such a course are the basis of much of scientific computation. We review these techniques here, with particular emphasis on how we will use them.

In addition to basic calculus techniques, scientific computing involves approximation of the real number system by decimal numbers with a fixed number of digits in their representation. Except for certain research-oriented systems, computer number systems today for this purpose are floating point systems, and almost all such floating point systems in use today adhere to the IEEE 754-2008 floating point standard. We describe floating point numbers and the floating point standard in this chapter, paying particular attention to consequences and pitfalls of its use.

Third, programming and software tools are used in scientific computing. Considering how commonly it is used, its ease of programming and debugging, its documentation, and the packages accessible from it, we have elected to use matlab throughout this book. We introduce the basics of matlab in this chapter.

1.1.1 Intermediate Value Theorem, Mean Value Theorems, and Taylor's Theorem

Throughout, C^n[a, b] will denote the set of real-valued functions f defined on the interval [a, b] such that f and its derivatives, up to and including its n-th derivative f^(n), are continuous on [a, b].

THEOREM 1.1

(Intermediate value theorem) If f ∈ C[a, b] and k is any number between m = min_{a≤x≤b} f(x) and M = max_{a≤x≤b} f(x), then there exists a number c in [a, b]


2 Applied Numerical Methods

for which f(c) = k (Figure 1.1).

[Graph of y = f(x) over [a, b], with the values m, k, and M marked on the y-axis and a, c, and b marked on the x-axis.]

FIGURE 1.1: Illustration of the Intermediate Value Theorem.

Example 1.1

Consider f(x) = e^x − x − 2. Using a computational device (such as a calculator) on which we trust the approximation of e^x to be accurate, we compute f(0) = −1 and f(2) ≈ 3.3891. We know f is continuous, since it is a sum of continuous functions. Since 0 is between f(0) and f(2), the Intermediate Value Theorem tells us there is a point c ∈ [0, 2] such that f(c) = 0. At such a c, e^c = c + 2.

THEOREM 1.2

(Mean value theorem for integrals) Let f be continuous and w be Riemann integrable¹ on [a, b], and suppose that w(x) ≥ 0 for x ∈ [a, b]. Then there exists a point c in [a, b] such that

∫_a^b w(x) f(x) dx = f(c) ∫_a^b w(x) dx.

Example 1.2

Suppose we want bounds on

∫_0^1 x² e^(−x²) dx.

¹This means that the limit of the Riemann sums exists. For example, w may be continuous, or w may have a finite number of breaks.

8132019 num chap 1

Mathematical Review and Computer Arithmetic 3

With w(x) = x² and f(x) = e^(−x²), the Mean Value Theorem for integrals tells us that

∫_0^1 x² e^(−x²) dx = e^(−c²) ∫_0^1 x² dx

for some c ∈ [0, 1], so

1/(3e) ≤ ∫_0^1 x² e^(−x²) dx = e^(−c²) ∫_0^1 x² dx ≤ 1/3.
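The bounds in Example 1.2 can be checked numerically. The following Python sketch (an illustration, not from the text; the step count n is an arbitrary choice) approximates the integral with the composite midpoint rule and compares it to 1/(3e) and 1/3:

```python
import math

# Composite midpoint rule for the integral of x^2 e^(-x^2) over [0, 1].
n = 10_000
h = 1.0 / n
integral = h * sum(
    ((i + 0.5) * h) ** 2 * math.exp(-((i + 0.5) * h) ** 2)
    for i in range(n)
)

lower, upper = 1.0 / (3.0 * math.e), 1.0 / 3.0
print(lower, integral, upper)  # the integral lies between the two bounds
```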

The following is extremely important in scientific and engineering computing.

THEOREM 1.3

(Taylor's theorem) Suppose f ∈ C^(n+1)[a, b]. Let x₀ ∈ [a, b]. Then, for any x ∈ [a, b],

f(x) = P_n(x) + R_n(x), where

P_n(x) = f(x₀) + f′(x₀)(x − x₀) + ··· + (f^(n)(x₀)/n!)(x − x₀)^n
       = Σ_{k=0}^{n} (1/k!) f^(k)(x₀)(x − x₀)^k, and

R_n(x) = (1/n!) ∫_{x₀}^{x} f^(n+1)(t)(x − t)^n dt (integral form of remainder).

Furthermore, there is a ξ = ξ(x) between x₀ and x with

R_n(x) = f^(n+1)(ξ(x))(x − x₀)^(n+1) / (n + 1)! (Lagrange form of remainder).

PROOF Recall the integration by parts formula:

∫ u dv = u v − ∫ v du.


Thus,

f(x) − f(x₀) = ∫_{x₀}^{x} f′(t) dt  (let u = f′(t), v = t − x, dv = dt)

= f′(x₀)(x − x₀) + ∫_{x₀}^{x} (x − t) f″(t) dt  (let u = f″(t), dv = (x − t) dt)

= f′(x₀)(x − x₀) − [((x − t)²/2) f″(t)]_{t=x₀}^{x} + ∫_{x₀}^{x} ((x − t)²/2) f‴(t) dt

= f′(x₀)(x − x₀) + ((x − x₀)²/2) f″(x₀) + ∫_{x₀}^{x} ((x − t)²/2) f‴(t) dt.

Continuing this procedure,

f(x) = f(x₀) + f′(x₀)(x − x₀) + ((x − x₀)²/2) f″(x₀) + ··· + ((x − x₀)^n/n!) f^(n)(x₀) + ∫_{x₀}^{x} ((x − t)^n/n!) f^(n+1)(t) dt

= P_n(x) + R_n(x).

Now consider R_n(x) = ∫_{x₀}^{x} ((x − t)^n/n!) f^(n+1)(t) dt, and assume x₀ < x (the same argument works if x₀ > x). Then, by Theorem 1.2,

R_n(x) = f^(n+1)(ξ(x)) ∫_{x₀}^{x} ((x − t)^n/n!) dt = f^(n+1)(ξ(x)) (x − x₀)^(n+1)/(n + 1)!,

where ξ is between x₀ and x, and thus ξ = ξ(x).

Example 1.3

Approximate sin(x) by a polynomial p(x) such that

|sin(x) − p(x)| ≤ 10⁻¹⁶

for −0.1 ≤ x ≤ 0.1.

For Example 1.3, Taylor polynomials about x₀ = 0 are appropriate, since that is the center of the interval about which we wish to approximate. We observe that the terms of even degree in such a polynomial are absent, so, for

8132019 num chap 1

Mathematical Review and Computer Arithmetic 5

n even, Taylor's theorem gives:

n   P_n(x)                    R_n(x)
2   x                         −(x³/3!) cos(c₂)
4   x − x³/3!                 (x⁵/5!) cos(c₄)
6   x − x³/3! + x⁵/5!         −(x⁷/7!) cos(c₆)
n   —                         (−1)^(n/2) (x^(n+1)/(n + 1)!) cos(c_n)

Observing that |cos(c_n)| ≤ 1, we see that

|R_n(x)| ≤ |x|^(n+1) / (n + 1)!.

We may thus form the following table:

n    bound on error R_n
2    1.67 × 10⁻⁴
4    8.33 × 10⁻⁸
6    1.98 × 10⁻¹¹
8    2.76 × 10⁻¹⁵
10   2.51 × 10⁻¹⁹

Thus, a polynomial with the required accuracy for x ∈ [−0.1, 0.1] is

p(x) = x − x³/3! + x⁵/5! − x⁷/7! + x⁹/9!.
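This polynomial can be checked directly. In the Python sketch below (an illustration, not from the text), the analytic bound on the remainder is about 2.5 × 10⁻¹⁹, so the observed error in double precision is dominated by rounding rather than truncation:

```python
import math

def p(x):
    # Degree-9 Taylor polynomial for sin about x0 = 0 (Example 1.3).
    return (x - x**3 / math.factorial(3) + x**5 / math.factorial(5)
              - x**7 / math.factorial(7) + x**9 / math.factorial(9))

# Sample the interval [-0.1, 0.1] on a grid and record the worst error.
worst = max(abs(math.sin(x) - p(x))
            for x in (i / 1000.0 for i in range(-100, 101)))
print(worst)  # tiny: truncation error plus a few units of double-precision rounding
```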

An important special case of Taylor's theorem is obtained with n = 0 (that is, directly from the Fundamental Theorem of Calculus):

THEOREM 1.4

(Mean value theorem) Suppose f ∈ C¹[a, b], x ∈ [a, b], and y ∈ [a, b] (and, without loss of generality, x ≤ y). Then there is a c ∈ [x, y] ⊆ [a, b] such that

f(y) − f(x) = f′(c)(y − x).

Example 1.4

Suppose f(1) = 1 and |f′(x)| ≤ 2 for x ∈ [1, 2]. What are an upper bound and a lower bound on f(2)?

8132019 num chap 1

6 Applied Numerical Methods

The mean value theorem tells us that

f(2) = f(1) + f′(c)(2 − 1) = f(1) + f′(c)

for some c ∈ (1, 2). Furthermore, the fact |f′(x)| ≤ 2 is equivalent to −2 ≤ f′(x) ≤ 2. Combining these facts gives

1 − 2 = −1 ≤ f(2) ≤ 1 + 2 = 3.

1.1.2 Big "O" Notation

We study "rates of growth" and "rates of decrease" of errors. For example, if we approximate e^h by a first-degree Taylor polynomial about x = 0, we get

e^h − (1 + h) = (1/2) h² e^ξ,

where ξ is some unknown quantity between 0 and h. Although we don't know exactly what e^ξ is, we know that it is nearly constant (in this case, approximately 1) for h near 0, so the error e^h − (1 + h) is roughly proportional to h² for h small. This approximate proportionality is often more important to know than the slowly varying constant e^ξ. The big "O" and little "o" notations are used to describe and keep track of this approximate proportionality.

DEFINITION 1.1 Let E(h) be an expression that depends on a small quantity h. We say that E(h) = O(h^k) if there are an ε and a C such that

|E(h)| ≤ C h^k

for all |h| ≤ ε.

The "O" denotes "order." For example, if f(h) = O(h²), we say that "f exhibits order 2 convergence to 0 as h tends to 0."

Example 1.5

Let E(h) = e^h − h − 1. Then E(h) = O(h²).

PROOF By Taylor's theorem,

e^h = e⁰ + e⁰(h − 0) + (h²/2) e^ξ

for some ξ between 0 and h. Thus,

E(h) = e^h − 1 − h ≤ (h²/2) e¹

and E(h) ≥ 0

8132019 num chap 1

Mathematical Review and Computer Arithmetic 7

for h ≤ 1; that is, ε = 1 and C = e/2 work.
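The claim E(h) = O(h²) can be made concrete by watching the ratio E(h)/h², which should settle near the constant e⁰/2 = 1/2 as h shrinks. A small Python sketch (an illustration, not from the text):

```python
import math

# E(h) = e^h - 1 - h = O(h^2): the ratio E(h)/h^2 approaches 1/2.
for h in (1e-1, 1e-2, 1e-3):
    E = math.exp(h) - 1.0 - h
    print(h, E / h**2)  # ratios drift toward 0.5 as h decreases
```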

Example 1.6

Show that

|(f(x + h) − f(x))/h − f′(x)| = O(h)

for x, x + h ∈ [a, b], assuming that f has two continuous derivatives at each point in [a, b].

PROOF

|(f(x + h) − f(x))/h − f′(x)|

= |(f(x) + f′(x)h + ∫_x^{x+h} (x + h − t) f″(t) dt − f(x))/h − f′(x)|

= |(1/h) ∫_x^{x+h} (x + h − t) f″(t) dt|

≤ max_{a≤t≤b} |f″(t)| · h/2 = c h.

1.1.3 Convergence Rates

DEFINITION 1.2 Let {x_k} be a sequence with limit x*. If there are constants C and α and an integer N such that |x_{k+1} − x*| ≤ C|x_k − x*|^α for k ≥ N, we say that the rate of convergence is of order at least α. If α = 1 (with C < 1), the rate is said to be linear. If α = 2, the rate is said to be quadratic.

Example 1.7

A sequence sometimes learned in elementary classes for computing the square root of a number a is

x_{k+1} = x_k/2 + a/(2 x_k).

8132019 num chap 1

8 Applied Numerical Methods

We have

x_{k+1} − √a = x_k/2 + a/(2 x_k) − √a
             = (x_k² − 2√a x_k + a)/(2 x_k)
             = (1/(2 x_k)) (x_k − √a)²
             ≈ (1/(2√a)) (x_k − √a)²,

thus showing that the convergence rate is quadratic.

Quadratic convergence is very fast. We can think of quadratic convergence with C ≈ 1 as doubling the number of significant figures on each iteration. (In contrast, linear convergence with C = 0.1 adds one decimal digit of accuracy to the approximation on each iteration.) For example, if we use the square root computation from Example 1.7 with a = 2 and starting with x₀ = 2, we obtain the following table:

k   x_k                 |x_k − √2|        |x_k − √2| / |x_{k−1} − √2|²
0   2                   0.5858 × 10⁰      —
1   1.5                 0.8579 × 10⁻¹     0.2500
2   1.416666666666667   0.2453 × 10⁻²     0.3333
3   1.414215686274510   0.2123 × 10⁻⁵     0.3529
4   1.414213562374690   0.1594 × 10⁻¹¹    0.3535
5   1.414213562373095   0.2204 × 10⁻¹⁷    —

In this table, the correct digits are underlined. This table illustrates that the total number of correct digits more than doubles on each iteration. In fact, the multiplying factor C for the quadratic convergence appears to be approaching 0.3535. (The last error ratio is not meaningful in this sense, because only roughly 16 digits were carried in the computation.) Based on our analysis, the limiting value of C should be about 1/(2√2) ≈ 0.3536. (We explain how we computed the table at the end of this chapter.)
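The table's behavior is easy to reproduce. The following Python sketch (an illustration in double precision, not the book's own computation) runs the iteration of Example 1.7 and prints the error ratios, which approach 1/(2√2):

```python
import math

# The iteration x_{k+1} = x_k/2 + a/(2 x_k) from Example 1.7, with a = 2, x0 = 2.
a, x = 2.0, 2.0
errors = [abs(x - math.sqrt(2.0))]
for _ in range(5):
    x = x / 2.0 + a / (2.0 * x)
    errors.append(abs(x - math.sqrt(2.0)))

# Ratios |x_k - sqrt(2)| / |x_{k-1} - sqrt(2)|^2 approach 1/(2 sqrt(2)) ≈ 0.3536.
ratios = [errors[k] / errors[k - 1] ** 2 for k in range(1, 5)]
print(x, ratios)
```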

8132019 num chap 1

Mathematical Review and Computer Arithmetic 9

Example 1.8

As an example of linear convergence, consider the iteration

x_{k+1} = x_k − x_k²/3.5 + 2/3.5,

which also converges to √2. We obtain the following table:

k    x_k                 |x_k − √2|        |x_k − √2| / |x_{k−1} − √2|
0    2                   0.5858 × 10⁰      —
1    1.428571428571429   0.1436 × 10⁻¹     0.2451 × 10⁻¹
2    1.416909620991254   0.2696 × 10⁻²     0.1878
3    1.414728799831946   0.5152 × 10⁻³     0.1911
4    1.414312349239392   0.9879 × 10⁻⁴     0.1917
5    1.414232514607664   0.1895 × 10⁻⁴     0.1918
6    1.414217198786659   0.3636 × 10⁻⁵     0.1919
7    1.414214260116949   0.6955 × 10⁻⁶     0.1919
8    1.414213696254626   0.1339 × 10⁻⁶     0.1919
⋮    ⋮                   ⋮                 ⋮
19   1.414213562373097   0.1554 × 10⁻¹⁴    —

Here, the constant C in the linear convergence, to four significant digits, appears to be 0.1919 ≈ 1/5. That is, the error is reduced by approximately a factor of 5 on each iteration. We can think of this as obtaining one more correct base-5 digit on each iteration.
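The linear rate can also be reproduced directly. In the Python sketch below (an illustration, not from the text), the successive error ratios approach |g′(√2)| = |1 − 2√2/3.5| ≈ 0.1919, matching the table:

```python
import math

# Example 1.8: x_{k+1} = x_k - x_k^2/3.5 + 2/3.5, converging linearly to sqrt(2).
x = 2.0
errors = [abs(x - math.sqrt(2.0))]
for _ in range(8):
    x = x - x * x / 3.5 + 2.0 / 3.5
    errors.append(abs(x - math.sqrt(2.0)))

# Error ratios |x_k - sqrt(2)| / |x_{k-1} - sqrt(2)| approach C ≈ 0.1919.
ratios = [errors[k] / errors[k - 1] for k in range(1, 9)]
print(ratios)
```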

1.2 Computer Arithmetic

In numerical solution of mathematical problems, two common types of error are:

1. Method (algorithm or truncation) error. This is the error due to approximations made in the numerical method.

2. Rounding error. This is the error made due to the finite number of digits available on a computer.

8132019 num chap 1

10 Applied Numerical Methods

Example 1.9

By the mean value theorem for integrals (Theorem 1.2, as in Example 1.6 on page 7), if f ∈ C²[a, b], then

f′(x) = (f(x + h) − f(x))/h − (1/h) ∫_x^{x+h} f″(t)(x + h − t) dt,

and

|(1/h) ∫_x^{x+h} f″(t)(x + h − t) dt| ≤ c h.

Thus, f′(x) ≈ (f(x + h) − f(x))/h, and the error is O(h). We will call this the method error or truncation error, as opposed to roundoff errors due to using machine approximations.

Now consider f(x) = ln x, and approximate f′(3) ≈ (ln(3 + h) − ln 3)/h for h small, using a calculator having 11 digits. The following results were obtained:

h        (ln(3 + h) − ln(3))/h     Error = 1/3 − (ln(3 + h) − ln(3))/h = O(h)
10⁻¹     0.3278982                 5.44 × 10⁻³
10⁻²     0.332779                  5.54 × 10⁻⁴
10⁻³     0.3332778                 5.55 × 10⁻⁵
10⁻⁴     0.333328                  5.33 × 10⁻⁶
10⁻⁵     0.333330                  3.33 × 10⁻⁶
10⁻⁶     0.333300                  3.33 × 10⁻⁵
10⁻⁷     0.333                     3.33 × 10⁻⁴
10⁻⁸     0.33                      3.33 × 10⁻³
10⁻⁹     0.3                       3.33 × 10⁻²
10⁻¹⁰    0.0                       3.33 × 10⁻¹

One sees that, in the first four steps, the error decreases by a factor of 10 as h is decreased by a factor of 10. (That is, the method error dominates.) However, starting with h = 0.00001, the error increases. (The error due to a finite number of digits, i.e., roundoff error, dominates.)
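The same competition between truncation error and roundoff error appears in IEEE double precision, just at smaller values of h than on an 11-digit calculator. A Python sketch (an illustration, not the calculator computation from the text):

```python
import math

# Forward-difference approximation to f'(3) = 1/3 for f = ln, in double precision.
# For larger h the O(h) truncation error dominates; for tiny h roundoff takes over.
def fwd(h):
    return (math.log(3.0 + h) - math.log(3.0)) / h

err = {h: abs(1.0 / 3.0 - fwd(h)) for h in (1e-1, 1e-2, 1e-4, 1e-12)}
print(err)  # error shrinks with h at first, then grows again for very small h
```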

There are two possible ways to reduce the rounding error:

1. The method error can be reduced by using a more accurate method. This allows a larger h to be used, thus avoiding roundoff error. Consider

f′(x) = (f(x + h) − f(x − h))/(2h) + error, where the error is O(h²).

h        (ln(3 + h) − ln(3 − h))/(2h)    error
0.1      0.3334568                       1.24 × 10⁻⁴
0.01     0.3333345                       1.23 × 10⁻⁶
0.001    0.3333333                       1.91 × 10⁻⁸

8132019 num chap 1

Mathematical Review and Computer Arithmetic 11

The error decreases by a factor of 100 as h is decreased by a factor of 10.
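The O(h²) behavior of the central difference can be seen in a short Python sketch (an illustration, not from the text); the error drops by roughly 100 each time h drops by 10, while staying well above roundoff:

```python
import math

# Central difference (f(x+h) - f(x-h)) / (2h) has O(h^2) method error.
def cen(h):
    return (math.log(3.0 + h) - math.log(3.0 - h)) / (2.0 * h)

for h in (1e-1, 1e-2, 1e-3):
    print(h, abs(1.0 / 3.0 - cen(h)))  # errors fall by about 100 per step
```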

2. Rounding error can be reduced by using more digits of accuracy, such as using double precision (or multiple precision) arithmetic.

To fully understand and avoid roundoff error, we should study some details of how computers and calculators represent and work with approximate numbers.

1.2.1 Floating Point Arithmetic and Rounding Error

Let β be a positive integer, the base of the computer system. (Usually, β = 2 (binary) or β = 16 (hexadecimal).) Suppose a number x has the exact base-β representation

x = (±0.α₁α₂α₃ ··· α_t α_{t+1} ···) β^m = ±q · β^m,

where q is the mantissa, β is the base, m is the exponent, 1 ≤ α₁ ≤ β − 1, and 0 ≤ αᵢ ≤ β − 1 for i > 1.

On a computer, we are restricted to a finite set of floating point numbers F = F(β, t, L, U) of the form x* = (±0.a₁a₂ ··· a_t) β^m, where 1 ≤ a₁ ≤ β − 1, 0 ≤ aᵢ ≤ β − 1 for 2 ≤ i ≤ t, L ≤ m ≤ U, and t is the number of digits. (In most floating point systems, L is about −64 to −1000, and U is about 64 to 1000.)

Example 1.10

(binary) β = 2:

x* = (0.1011)₂ × 2³ = (1 × 1/2 + 0 × 1/4 + 1 × 1/8 + 1 × 1/16) × 8 = 11/2 = 5.5 (decimal).

REMARK 1.1 Most numbers cannot be exactly represented on a computer. Consider x = 10.1 = 1010.0001 1001 1001 ... (β = 2). If L = −127, U = 127, t = 24, and β = 2, then x ≈ x* = (0.1010 0001 1001 1001 1001 1001) × 2⁴.

Question: Given a real number x, how do we define a floating point number fl(x) in F such that fl(x) is close to x? On modern machines, one of the following four ways is used to approximate a real number x by a machine-representable number fl(x).

round down: fl(x) = x↓, the nearest machine representable number to the real number x that is less than or equal to x;

8132019 num chap 1

12 Applied Numerical Methods

round up: fl(x) = x↑, the nearest machine number to the real number x that is greater than or equal to x;

round to nearest: fl(x) is the nearest machine number to the real number x;

round to zero, or "chopping": fl(x) is the nearest machine number to the real number x that is closer to 0 than x. The term "chopping" is because we simply "chop" the expansion of the real number; that is, we simply ignore the digits in the expansion of x beyond the t-th one.

The default on modern systems is usually round to nearest, although chopping is faster or requires less circuitry. Round down and round up may be used, with care, to produce results from a string of computations that are guaranteed to be less than or greater than the exact result.

Example 1.11

β = 10, t = 5, x = 0.12345666··· × 10⁷. Then

fl(x) = 0.12345 × 10⁷ (chopping),
fl(x) = 0.12346 × 10⁷ (round to nearest).

(In this case, round down corresponds to chopping, and round up corresponds to round to nearest.)

See Figure 1.2 for an example with β = 10 and t = 1. In that figure, the exhibited floating point numbers are (0.1) × 10¹, (0.2) × 10¹, ..., (0.9) × 10¹, 0.1 × 10².

[Number line showing successive floating point numbers from β^(m−1) = 1 to β^m = 10¹, spaced β^(m−t) = 10⁰ = 1 apart.]

FIGURE 1.2: An example floating point system: β = 10, t = 1, and m = 1.

Example 1.12

Let a = 0.410, b = 0.000135, and c = 0.000431. Assuming 3-digit decimal computer arithmetic with rounding to nearest, does a + (b + c) = (a + b) + c when using this arithmetic?

8132019 num chap 1

Mathematical Review and Computer Arithmetic 13

Following the "round to nearest" definition of fl, we emulate the operations a machine would do, as follows:

a ← 0.410 × 10⁰, b ← 0.135 × 10⁻³, c ← 0.431 × 10⁻³,

and

fl(b + c) = fl(0.135 × 10⁻³ + 0.431 × 10⁻³) = fl(0.566 × 10⁻³) = 0.566 × 10⁻³,

so

fl(a + 0.566 × 10⁻³) = fl(0.410 × 10⁰ + 0.566 × 10⁻³)
                     = fl(0.410 × 10⁰ + 0.000566 × 10⁰)
                     = fl(0.410566 × 10⁰) = 0.411 × 10⁰.

On the other hand,

fl(a + b) = fl(0.410 × 10⁰ + 0.135 × 10⁻³)
          = fl(0.410000 × 10⁰ + 0.000135 × 10⁰)
          = fl(0.410135 × 10⁰) = 0.410 × 10⁰,

so

fl(0.410 × 10⁰ + c) = fl(0.410 × 10⁰ + 0.431 × 10⁻³)
                    = fl(0.410 × 10⁰ + 0.000431 × 10⁰)
                    = fl(0.410431 × 10⁰) = 0.410 × 10⁰ ≠ 0.411 × 10⁰.

Thus, the associative law of addition does not hold for floating point arithmetic with "round to nearest." Furthermore, this illustrates that accuracy is improved if numbers of like magnitude are added first in a sum.
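Example 1.12 can be replayed with Python's decimal module (an illustration, not from the text), whose context precision counts significant digits, matching t = 3 here, with round-to-nearest as the default:

```python
from decimal import Decimal, getcontext

# Emulate 3-digit decimal arithmetic with round to nearest (Example 1.12).
getcontext().prec = 3
a, b, c = Decimal("0.410"), Decimal("0.000135"), Decimal("0.000431")

left = a + (b + c)    # b + c = 0.000566 exactly; then fl(0.410566) = 0.411
right = (a + b) + c   # fl(0.410135) = 0.410; then fl(0.410431) = 0.410
print(left, right)    # the two groupings disagree
```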

The following error bound is useful in some analyses.

THEOREM 1.5

|x − fl(x)| ≤ (p/2) |x| β^(1−t),

where p = 1 for rounding and p = 2 for chopping.

8132019 num chap 1

14 Applied Numerical Methods

DEFINITION 1.3 δ = (p/2) β^(1−t) is called the unit roundoff error.

Let ε = (fl(x) − x)/x. Then fl(x) = (1 + ε)x, where |ε| ≤ δ. With this, we have the following:

THEOREM 1.6

Let ⊙ denote the operation +, −, ×, or ÷, and let x and y be machine numbers. Then

fl(x ⊙ y) = (x ⊙ y)(1 + ε), where |ε| ≤ δ = (p/2) β^(1−t).

Roundoff error that accumulates as a result of a sequence of arithmetic operations can be analyzed using this theorem. Such an analysis is called forward error analysis.

Because of the properties of floating point arithmetic, it is unreasonable to demand strict absolute error tolerances when the exact result is large.

Example 1.13

Suppose β = 10 and t = 3 (3-digit decimal arithmetic), and suppose we wish to compute 10⁴π with a computed value x such that |10⁴π − x| < 10⁻². The closest floating point number in our system to 10⁴π is x = 0.314 × 10⁵ = 31400. However, |10⁴π − x| = 15.926···. Hence, it is impossible to find a number x in the system with

|10⁴π − x| < 10⁻².

The error |10⁴π − x| in this example is called the absolute error in approximating 10⁴π. We see that absolute error is not an appropriate measure of error when using floating point arithmetic. For this reason, we use relative error.

DEFINITION 1.4 Let x* be an approximation to x. Then |x − x*| is called the absolute error, and |(x − x*)/x| is called the relative error.

For example,

|(x − fl(x))/x| ≤ δ = (p/2) β^(1−t) (unit roundoff error).

1.2.1.1 Expression Evaluation and Condition Numbers

We now examine some common situations in which roundoff error can become large, and explain how to avoid many of these situations.

8132019 num chap 1

Mathematical Review and Computer Arithmetic 15

Example 1.14

β = 10, t = 4, p = 1 (thus, δ = (1/2) 10⁻³ = 0.0005). Let x = 0.5795 × 10⁵ and y = 0.6399 × 10⁵. Then

fl(x + y) = 0.1219 × 10⁶ = (x + y)(1 + ε₁), ε₁ ≈ −3.28 × 10⁻⁴, |ε₁| < δ, and
fl(xy) = 0.3708 × 10¹⁰ = (xy)(1 + ε₂), ε₂ ≈ −5.95 × 10⁻⁵, |ε₂| < δ.

(Note: x + y = 0.12194 × 10⁶ and xy = 0.37082205 × 10¹⁰.)

Example 1.15

Suppose β = 10 and t = 4 (4-digit arithmetic), and suppose x₁ = 10000 and x₂ = x₃ = ··· = x₁₀₀₁ = 1. Then

fl(x₁ + x₂) = 10000,
fl(x₁ + x₂ + x₃) = 10000,
...
fl(Σ_{i=1}^{1001} xᵢ) = 10000

when we sum forward from x₁. But, going backwards,

fl(x₁₀₀₁ + x₁₀₀₀) = 2,
fl(x₁₀₀₁ + x₁₀₀₀ + x₉₉₉) = 3,
...
fl(Σ_{i=1001}^{1} xᵢ) = 11000,

which is the correct sum.
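Both summation orders can be emulated with 4-significant-digit decimal arithmetic in Python (an illustration, not from the text):

```python
from decimal import Decimal, getcontext

# Emulate 4-digit decimal arithmetic (Example 1.15): x1 = 10000, x2..x1001 = 1.
getcontext().prec = 4
xs = [Decimal(10000)] + [Decimal(1)] * 1000

forward = Decimal(0)
for v in xs:
    forward += v      # once the running total is 10000, adding 1 has no effect

backward = Decimal(0)
for v in reversed(xs):
    backward += v     # small terms accumulate first: 1, 2, ..., 1000, then 11000

print(forward, backward)
```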

This example illustrates the point that large relative errors can occur when many small numbers are added to a large number, or when a very large number of small, almost-equal numbers is added together. To avoid such large relative errors, one can sum from the smallest number to the largest number. However, this will not work if the numbers are all approximately equal. In such cases, one possibility is to group the numbers into sets of two adjacent numbers, summing two almost-equal numbers together. One then groups those results into sets of two and sums these together, continuing until the total sum is reached. In this scheme, two almost-equal numbers are always being summed, and the large relative error from repeatedly summing a small number to a large number is avoided.

8132019 num chap 1

16 Applied Numerical Methods

Example 1.16

Let x₁ = 15.314768 and x₂ = 15.314899, with β = 10 and t = 6 (6-digit decimal accuracy). Then x₂ − x₁ ≈ fl(x₂) − fl(x₁) = 15.3149 − 15.3148 = 0.0001. Thus,

|(x₂ − x₁ − (fl(x₂) − fl(x₁))) / (x₂ − x₁)| = (0.000131 − 0.0001)/0.000131 ≈ 0.237,

a 23.7% relative error.

This example illustrates that large relative errors can occur when two nearly equal numbers are subtracted on a computer. Sometimes an algorithm can be modified to reduce rounding error occurring from this source, as the following example illustrates.

Example 1.17

Consider finding the roots of ax² + bx + c = 0, where b² is large compared with |4ac|. The most common formula for the roots is

x = (−b ± √(b² − 4ac)) / (2a).

Consider x² + 100x + 1 = 0, with β = 10, t = 4, and p = 2 (4-digit chopped arithmetic). Then

x₁ = (−100 + √(100² − 4))/2, x₂ = (−100 − √(100² − 4))/2,

and √9996 ≈ 99.97 (4-digit arithmetic, chopped). Thus,

x₁ ≈ (−100 + 99.97)/2, x₂ ≈ (−100 − 99.97)/2.

Hence, x₁ ≈ −0.015 and x₂ ≈ −99.98, but x₁ = −0.010001··· and x₂ = −99.989999···, so the relative errors in x₁ and x₂ are 50% and 0.01%, respectively.

Let's change the algorithm. Assume b ≥ 0 (we can always arrange b ≥ 0). Then

x₁ = (−b + √(b² − 4ac)) / (2a)
   = ((−b + √(b² − 4ac)) (−b − √(b² − 4ac))) / (2a (−b − √(b² − 4ac)))
   = −2c / (b + √(b² − 4ac)),

and

x₂ = (−b − √(b² − 4ac)) / (2a) (the same as before).

8132019 num chap 1

Mathematical Review and Computer Arithmetic 17

Then, for the above values,

x₁ = −2(1) / (100 + √9996) ≈ −2 / (100 + 99.97) = −0.01000.

Now the relative error in x₁ is also 0.01%.
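The same cancellation appears in double precision when b² is large enough, and the rationalized formula repairs it. A Python sketch (an illustration with coefficients chosen for the demonstration, not from the text):

```python
import math

# Cancellation in the quadratic formula when b^2 >> |4ac| (double precision).
# For x^2 + 1e9 x + 1 = 0, the small root is approximately -1e-9.
a, b, c = 1.0, 1e9, 1.0
s = math.sqrt(b * b - 4.0 * a * c)  # b*b - 4 rounds to exactly 1e18, so s = 1e9

naive = (-b + s) / (2.0 * a)        # catastrophic cancellation: gives 0.0
stable = -2.0 * c / (b + s)         # rationalized form keeps full accuracy
print(naive, stable)
```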

Let us now consider error in function evaluation. Consider a single-valued function f(x), and let x* = fl(x) be the floating point approximation of x. Therefore, the machine evaluates f(x*) = f(fl(x)), which is an approximate value of f(x) at x = x*. The perturbation in f(x) for small perturbations in x can then be computed via Taylor's formula. This is illustrated in the next theorem.

THEOREM 1.7 The relative error in functional evaluation is

|(f(x) − f(x*)) / f(x)| ≈ |x f′(x) / f(x)| · |(x − x*) / x|.

PROOF The linear Taylor approximation of f(x*) about f(x), for small values of |x − x*|, is given by f(x*) ≈ f(x) + f′(x)(x* − x). Rearranging the terms immediately yields the result.

This leads us to the following definition.

DEFINITION 1.5 The condition number of a function f(x) is

κ_f(x) = |x f′(x) / f(x)|.

The condition number describes how large the relative error in function evaluation is with respect to the relative error in the machine representation of x. In other words, κ_f(x) is a measure of the degree of sensitivity of the function at x.

Example 1.18

Let f(x) = √x. The condition number of f(x) about x is

κ_f(x) = |x (1/(2√x)) / √x| = 1/2.

This suggests that f(x) is well-conditioned.

8132019 num chap 1

18 Applied Numerical Methods

Example 1.19

Let f(x) = √(x − 2). The condition number of f(x) about x is

κ_f(x) = |x / (2(x − 2))|.

This is not defined at x* = 2. Hence, the function f(x) is numerically unstable and ill-conditioned for values of x close to 2.
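The two condition numbers from Examples 1.18 and 1.19 can be evaluated from their analytic formulas, making the contrast concrete (a Python illustration, not from the text):

```python
import math

# Condition number kappa_f(x) = |x f'(x) / f(x)| from Definition 1.5.
def kappa_sqrt(x):
    # f(x) = sqrt(x): f'(x) = 1/(2 sqrt(x)), so kappa = 1/2 for every x > 0.
    return abs(x * (1.0 / (2.0 * math.sqrt(x))) / math.sqrt(x))

def kappa_shifted(x):
    # f(x) = sqrt(x - 2): kappa = |x / (2 (x - 2))|, blowing up as x -> 2.
    return abs(x / (2.0 * (x - 2.0)))

print(kappa_sqrt(9.0))       # about 0.5: well-conditioned
print(kappa_shifted(2.001))  # large: ill-conditioned near x = 2
```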

REMARK 1.2 If x = f(x) = 0, then the condition number is simply |f′(x)|. If x = 0, f(x) ≠ 0 (or f(x) = 0, x ≠ 0), then it is more useful to consider the relation between absolute errors than relative errors. The condition number then becomes |f′(x)/f(x)|.

REMARK 1.3 Generally, if a numerical approximation z̃ to a quantity z is computed, the relative error is related to the number of significant digits that are correct. For example, if z = 0.0000123453 and z̃ = 0.00001234543, we say that z̃ is correct to 5 significant digits. Expressing z as 0.123453 × 10⁻⁴ and z̃ as 0.1234543 × 10⁻⁴, we see that, if we round z̃ to the nearest number with five digits in its mantissa, all of those digits are correct, whereas, if we do the same with six digits, the sixth digit is not correct. Significant digits is the more logical way to talk about accuracy in a floating point computation, where we are interested in relative error, rather than "number of digits after the decimal point," which can have a different meaning. (Here, one might say that z̃ is correct to 9 digits after the decimal point.)

1.2.2 Practicalities and the IEEE Floating Point Standard

Prior to 1985, different machines used different word lengths and different bases, and different machines rounded, chopped, or did something else to form the internal representation fl(x) for real numbers x. For example, IBM mainframes generally used hexadecimal arithmetic (β = 16), with 8 hexadecimal digits total (for the base, sign, and exponent) in "single precision" numbers and 16 hexadecimal digits total in "double precision" numbers. Machines such as the Univac 1108 and Honeywell Multics systems used base β = 2, with 36 binary digits (or "bits") total in single precision numbers and 72 binary digits total in double precision numbers. An unusual machine designed at Moscow State University from 1955 to 1965, the "Setun," even used base-3 (β = 3, or "ternary") numbers. Some computers had 32 bits total in single precision numbers and 64 bits total in double precision numbers, while some "supercomputers" (such as the Cray-1) had 64 bits total in single precision numbers and 128 bits total in double precision numbers.

Some hand-held calculators in existence today (such as some Texas Instruments calculators) can be viewed as implementing decimal (base-10, β = 10)

8132019 num chap 1

Mathematical Review and Computer Arithmetic 19

arithmetic, say, with L = −999, U = 999, and t = 14 digits in the mantissa.

Except for the Setun (the value of whose ternary digits corresponded to "positive," "negative," and "neutral" in circuit elements or switches), digital computers are mostly based on binary switches or circuit elements (that is, "on" or "off"), so the base β is usually 2 or a power of 2. For example, the IBM hexadecimal digit can be viewed as a group of 4 binary digits.²

Older floating point implementations did not even always fit exactly into the model we have previously described. For example, if x is a number in the system, then −x may not have been a number in the system, or, if x were a number in the system, then 1/x may have been too large to be representable in the system.

To promote predictability, portability, reliability, and rigorous error bounding in floating point computations, the Institute of Electrical and Electronics Engineers (IEEE) and American National Standards Institute (ANSI) published a standard for binary floating point arithmetic in 1985: IEEE/ANSI 754-1985, Standard for Binary Floating Point Arithmetic, often referenced as "IEEE-754," or simply "the IEEE standard.³" Almost all computers in existence today, including personal computers and workstations based on Intel, AMD, Motorola, etc. chips, implement most of the IEEE standard.

In this standard, β = 2; 32 bits total are used in a single precision number (an "IEEE single"), and 64 bits total are used for a double precision number (an "IEEE double"). In a single precision number, 1 bit is used for the sign, 8 bits are used for the exponent, and t = 23 bits are used for the mantissa. In double precision numbers, 1 bit is used for the sign, 11 bits are used for the exponent, and 52 bits are used for the mantissa. Thus, for single precision numbers, the stored exponent is between 0 and (11111111)₂ = 255, and 127 is subtracted from it, to get an exponent between −127 and 128. In IEEE numbers, the minimum and maximum exponents are used to denote special symbols (such as infinity and "unnormalized" numbers), so the exponent in single precision represents magnitudes between 2⁻¹²⁶ ≈ 10⁻³⁸ and 2¹²⁷ ≈ 10³⁸. The mantissa for single precision numbers represents numbers between 2⁰ = 1 and Σ_{i=0}^{23} 2⁻ⁱ = 2(1 − 2⁻²⁴) ≈ 2. Similarly, the exponent for double precision numbers is effectively between 2⁻¹⁰²² ≈ 10⁻³⁰⁸ and 2¹⁰²³ ≈ 10³⁰⁸, while the mantissa for double precision numbers represents numbers between 2⁰ = 1 and Σ_{i=0}^{52} 2⁻ⁱ ≈ 2.

Summarizing, the parameters for IEEE arithmetic appear in Table 1.1.

In many numerical computations, such as solving the large linear systems arising from partial differential equation models, more digits or a larger exponent range is required than is available with IEEE single precision. For

²An exception is in some systems for business calculations, where base 10 is implemented.
³An update to the 1985 standard was made in 2008. This update gives clarifications of certain ambiguous points, provides certain extensions, and specifies a standard for decimal arithmetic.


20 Applied Numerical Methods

TABLE 1.1: Parameters for IEEE arithmetic

  precision   β    L      U     t
  single      2   -126    127   24
  double      2   -1022   1023  53

this reason, many numerical analysts at present have adopted IEEE double precision as the default precision. For example, underlying computations in the popular computational environment matlab are done in IEEE double precision.

IEEE arithmetic provides four ways of defining fl(x), that is, four "rounding modes": "round down," "round up," "round to nearest," and "round to zero." The four elementary operations +, -, ×, and ÷ must be such that fl(x ⊙ y) is implemented for all four rounding modes, for ⊙ ∈ {+, -, ×, ÷}, and similarly for √·.

The default mode (if the rounding mode is not explicitly set) is normally "round to nearest," to give an approximation after a long string of computations that is hopefully near the exact value. If the mode is set to "round down" and a string of computations is done, then the result is less than or equal to the exact result. Similarly, if the mode is set to "round up," then the result of a string of computations is greater than or equal to the exact result. In this way, mathematically rigorous bounds on an exact result can be obtained. (This technique must be used astutely, since naive use could result in bounds that are too large to be meaningful.)

Several parameters more directly related to numerical computations than L, U, and t are associated with any floating point number system. These are:

HUGE: the largest representable number in the floating point system;

TINY: the smallest positive representable number in the floating point system;

ε_m: the machine epsilon, the smallest positive number which, when added to 1, gives something other than 1 when the rounding mode is "round to nearest."

These so-called "machine constants" appear in Table 1.2 for the IEEE single and IEEE double precision number systems.

For IEEE arithmetic, 1/TINY < HUGE, but 1/HUGE < TINY. This brings up the question of what happens when the result of a computation has absolute value less than the smallest number representable in the system, or has absolute value greater than the largest number representable in the system. In the first case, an underflow occurs, while in the second case an overflow occurs. In floating point computations, it is usually (but not always) reasonable to replace the result of an underflow by 0, but it is usually more problematic


TABLE 1.2: Machine constants for IEEE arithmetic

  Precision   HUGE                              TINY                      ε_m
  single      2^128(1 - 2^-24) ≈ 3.40·10^38     2^-126 ≈ 1.18·10^-38      2^-24 + 2^-47 ≈ 5.96·10^-8
  double      2^1024(1 - 2^-53) ≈ 1.79·10^308   2^-1022 ≈ 2.23·10^-308    2^-53 + 2^-105 ≈ 1.11·10^-16

when an overflow occurs. Many systems prior to the IEEE standard replaced an underflow by 0, but stopped when an overflow occurred.

The IEEE standard specifies representations for the special numbers ∞, -∞, +0, -0, and NaN, where the latter represents "not a number." The standard specifies that computations do not stop when an overflow or underflow occurs, or when quantities such as 1/0, -1/0, etc. are encountered (although many programming languages, by default or optionally, do stop). For example, the result of an overflow is set to ∞, whereas the result of √(-1) is set to NaN, and computation continues. The standard also specifies "gradual underflow," that is, setting the result to a "denormalized" number, or a number in the floating point format whose first digit in the mantissa is equal to 0. Computation rules for these special numbers, such as NaN × (any number) = NaN and ∞ × (any positive normalized number) = ∞, allow such "nonstop" arithmetic.

Although IEEE nonstop arithmetic is useful in many contexts, the numerical analyst should be aware of it and be cautious in interpreting results. In particular, algorithms may not behave as expected if many intermediate results contain ∞ or NaN, and the accuracy is less than expected when denormalized numbers are used. In fact, many programming languages, by default or with a controllable option, stop if ∞ or NaN occurs, but implement IEEE nonstop arithmetic as an option.
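Nonstop arithmetic can be observed from any language that exposes IEEE doubles. The following is an illustrative sketch in Python (the book's examples use matlab); the variable names are our own:

```python
import math

inf = float('inf')
nan = float('nan')

# Overflow produces Inf rather than stopping the computation:
overflowed = 1e308 * 10.0          # exceeds HUGE for IEEE double
assert math.isinf(overflowed)

# Inf and NaN propagate by fixed rules, e.g. Inf - Inf = NaN:
assert math.isinf(inf + inf)
assert math.isnan(inf - inf)

# NaN compares unequal to everything, including itself:
assert nan != nan

# Underflow is gradual: halving TINY yields a denormalized
# number with reduced accuracy, not an immediate 0:
tiny = 2.0 ** -1022
assert 0.0 < tiny / 2 < tiny
```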

Example 1.20

IEEE double precision floating point arithmetic underlies most computations in matlab. (This is true even if only four decimal digits are displayed.) One obtains the machine epsilon with the function eps, one obtains TINY with the function realmin, and one obtains HUGE with the function realmax. Observe the following matlab dialog:

>> epsm = eps(1d0)
epsm = 2.2204e-016
>> TINY = realmin
TINY = 2.2251e-308
>> HUGE = realmax
HUGE = 1.7977e+308
>> 1/TINY
ans = 4.4942e+307
>> 1/HUGE
ans = 5.5627e-309
>> HUGE^2
ans = Inf
>> TINY^2
ans = 0
>> new_val = 1+epsm
new_val = 1.0000
>> new_val - 1
ans = 2.2204e-016
>> too_small = epsm/2
too_small = 1.1102e-016
>> not_new = 1+too_small
not_new = 1
>> not_new - 1
ans = 0
>>
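The same machine constants can be probed from other languages whose floating point type is an IEEE double. A sketch in Python (a hypothetical session, not from the book; note that sys.float_info.epsilon, like matlab's eps, is the spacing 2^-52 of floats near 1):

```python
import sys

eps_m = sys.float_info.epsilon     # 2^-52 ~ 2.2204e-16, spacing of floats near 1
TINY = sys.float_info.min          # 2^-1022 ~ 2.2251e-308
HUGE = sys.float_info.max          # ~ 1.7977e+308

# 1/TINY < HUGE, but 1/HUGE < TINY (the latter quotient is denormalized):
assert 1.0 / TINY < HUGE
assert 1.0 / HUGE < TINY

# Adding eps_m to 1 is detectable; adding eps_m/2 is rounded away
# under "round to nearest" (ties to even):
assert (1.0 + eps_m) - 1.0 == eps_m
assert (1.0 + eps_m / 2) - 1.0 == 0.0
```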

Example 1.21

(Illustration of underflow and overflow) Suppose, for the purposes of illustration, we have a system with β = 10, t = 2, and one digit in the exponent, so that the positive numbers in the system range from 0.10 × 10^-9 to 0.99 × 10^9, and suppose we wish to compute N = √(x1^2 + x2^2), where x1 = x2 = 10^6. Then both x1 and x2 are exactly represented in the system, and the nearest floating point number in the system to N is 0.14 × 10^7, well within range. However, x1^2 = 10^12, larger than the maximum floating point number in the system.

In older systems, an overflow usually would result in stopping the computation, while in IEEE arithmetic, the result would be assigned the symbol "Infinity." The result of adding "Infinity" to "Infinity," then taking the square root, would be "Infinity," so that N would be assigned "Infinity." Similarly, if x1 = x2 = 10^-6, then x1^2 = 10^-12, smaller than the smallest representable machine number, causing an "underflow." On older systems, the result is usually set to 0. On IEEE systems, if "gradual underflow" is switched on, the result either becomes a denormalized number, with less than full accuracy, or is set to 0; without gradual underflow on IEEE systems, the result is set to 0. When the result is set to 0, a value of 0 is stored in N, whereas the closest floating point number in the system is 0.14 × 10^-5, well within range. To avoid this type of catastrophic underflow and overflow in the computation of N, we may use the following scheme:

1. s ← max{|x1|, |x2|};
2. η1 ← x1/s, η2 ← x2/s;
3. N ← s √(η1^2 + η2^2).
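The scaling scheme above can be sketched in Python (the book works in matlab; this is an illustrative translation, and the function name is our own). With x1 = x2 = 1e200, the naive formula overflows to Inf in IEEE double precision, while the scaled version succeeds:

```python
import math

def scaled_norm(x1, x2):
    """Compute sqrt(x1^2 + x2^2), scaling to avoid underflow/overflow."""
    s = max(abs(x1), abs(x2))        # step 1: s <- max{|x1|, |x2|}
    if s == 0.0:
        return 0.0                   # both arguments are zero
    eta1, eta2 = x1 / s, x2 / s      # step 2: scaled arguments, |eta_i| <= 1
    return s * math.sqrt(eta1**2 + eta2**2)   # step 3

x1 = x2 = 1e200
naive = math.sqrt(x1 * x1 + x2 * x2)   # x1*x1 overflows to Inf, so naive == inf
print(naive)                           # inf
print(scaled_norm(x1, x2))             # ~1.4142e+200, the correct magnitude
```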


1.2.2.1 Input and Output

For examining the output of large numerical computations arising from mathematical models, plots, graphs, and movies comprised of such plots and graphs are often preferred over tables of values. However, to develop such models and study numerical algorithms, it is necessary to examine individual numbers. Because humans are trained to comprehend decimal numbers more easily than binary numbers, the binary format used in the machine is usually converted to a decimal format for display or printing. In many programming languages and environments (such as all versions of Fortran, C, C++, and matlab), the format is of a form similar to ±d1.d2d3···dm e±δ1δ2δ3 or ±d1.d2d3···dm E±δ1δ2δ3, where the "e" or "E" denotes the "exponent" of 10. For example, -1.00e+003 denotes -1 × 10^3 = -1000. Numbers are usually also input either in a standard decimal form (such as 0.001) or in this exponential format (such as 1.0e-3). (This notation originates from the earliest computers, where the only output was a printer, and the printer could only print numerical digits and the 26 upper case letters in the Roman alphabet.)

Thus, for input, a decimal fraction needs to be converted to a binary floating point number, while for output, a binary floating point number needs to be converted to a decimal fraction. This conversion necessarily is inexact. For example, the exact decimal fraction 0.1 converts to the infinitely repeating binary expansion (0.000110011001100...)_2, which needs to be rounded into the binary floating point system. The IEEE 754 standard specifies that the result of a decimal to binary conversion, within a specified range of input formats, be the nearest floating point number to the exact result, and that, within a specified range of formats, a binary to decimal conversion be the nearest number in the specified format (which depends on the number m of decimal digits requested to be printed).

Thus, the number that one sees as output is usually not exactly the number that is represented in the computer. Furthermore, while the floating point operations on binary numbers are usually implemented in hardware or "firmware," independently of the software system, the decimal to binary and binary to decimal conversions are usually implemented separately, as part of the programming language (such as Fortran, C, C++, Java, etc.) or software system (such as matlab). The individual standards for these languages, if there are any, may not specify accuracy for such conversions, and the languages sometimes do not conform to the IEEE standard. That is, the number that one sees printed may not even be the closest number in that format to the actual number.

This inexactness in conversion usually does not cause a problem, but may cause much confusion in certain instances. In those instances (such as in "debugging," or finding programming blunders), one may need to examine the binary numbers directly. One way of doing this is in an "octal," or base-8, format, in which each digit (between 0 and 7) is interpreted as a group of three binary digits, or in hexadecimal format (where the digits are 0-9, A, B, C, D, E, F), in which each digit corresponds to a group of four binary digits.
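Most languages now offer direct ways to examine the underlying binary representation. A hypothetical Python sketch (not from the book), using the hexadecimal format:

```python
import struct
from fractions import Fraction

# The decimal fraction 0.1 is rounded to the nearest IEEE double,
# so the stored value is close to, but not exactly, 1/10:
x = 0.1
print(x.hex())   # 0x1.999999999999ap-4 : the hex digits 9,9,...,a encode
                 # the rounded repeating binary expansion 0.000110011...

# The raw 64 bits (1 sign, 11 exponent, 52 mantissa bits), in hexadecimal:
bits, = struct.unpack('>Q', struct.pack('>d', x))
print(f'{bits:016x}')   # 3fb999999999999a

# Exact rational arithmetic confirms the stored number is not 1/10:
assert Fraction(x) != Fraction(1, 10)
```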

1.2.2.2 Standard Functions

To enable accurate computation of elementary functions such as sin, cos, and exp, IEEE 754 specifies that a "long" 80-bit register (with "guard digits") be available for intermediate computations. Furthermore, IEEE 754-2008, an official update to IEEE 754-1985, provides a list of functions it recommends be implemented, and specifies accuracy requirements (in terms of correct rounding) for those functions a programming language elects to implement.

REMARK 1.4 Alternative number systems, such as variable precision arithmetic, multiple precision arithmetic, rational arithmetic, and combinations of approximate and symbolic arithmetic, have been investigated and implemented. These have various advantages over the traditional floating point arithmetic we have been discussing, but also have disadvantages, and usually require more time, more circuitry, or both. Eventually, with the advance of computer hardware and better understanding of these alternative systems, their use may become more ubiquitous. However, for the foreseeable future, traditional floating point number systems will be the primary tool in numerical computations.

1.3 Interval Computations

Interval computations are useful for two main purposes:

• to use floating point computations to compute mathematically rigorous bounds on an exact result (and hence to rigorously bound roundoff error);

• to use floating point computations to compute mathematically rigorous bounds on the ranges of functions over boxes.

In complicated traditional floating point algorithms, naive arrangement of interval computations usually gives bounds that are too wide to be of practical use. For this reason, interval computations have been ignored by many. However, used cleverly and where appropriate, interval computations are powerful, and provide rigor and validation when other techniques cannot.

Interval computations are based on interval arithmetic.


1.3.1 Interval Arithmetic

In interval arithmetic, we define operations on intervals, which can be considered as ordered pairs of real numbers. We can think of each interval as representing the range of possible values of a quantity. The result of an operation is then an interval that represents the range of possible results of the operation, as the first argument ranges over all points in the first interval and the second argument ranges over all values in the second interval. To state this symbolically, let x = [x̲, x̄] and y = [y̲, ȳ], and define the four elementary operations by

  x ⊙ y = {x ⊙ y | x ∈ x and y ∈ y} for ⊙ ∈ {+, -, ×, ÷}.    (1.1)

Interval arithmetic's usefulness derives from the fact that the mathematical characterization in Equation (1.1) is equivalent to the following operational definitions:

  x + y = [x̲ + y̲, x̄ + ȳ],
  x - y = [x̲ - ȳ, x̄ - y̲],
  x × y = [min{x̲y̲, x̲ȳ, x̄y̲, x̄ȳ}, max{x̲y̲, x̲ȳ, x̄y̲, x̄ȳ}],
  1/x = [1/x̄, 1/x̲] if x̲ > 0 or x̄ < 0,
  x ÷ y = x × (1/y).    (1.2)

The ranges of the four elementary interval arithmetic operations are exactly the ranges of the corresponding real operations, but, if such operations are composed, bounds on the ranges of real functions can be obtained. For example, if

  f(x) = (x + 1)(x - 1),    (1.3)

then

  f([-2, 2]) = ([-2, 2] + 1)([-2, 2] - 1) = [-1, 3] × [-3, 1] = [-9, 3],

which contains the exact range [-1, 3].
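The operational definitions (1.2) are easy to prototype. The following is a minimal Python sketch of our own (not from the book), without the outward rounding a real implementation such as intlab would apply to the end points:

```python
class Interval:
    """Closed interval [lo, hi]; arithmetic per the operational rules (1.2)."""
    def __init__(self, lo, hi=None):
        self.lo, self.hi = lo, hi if hi is not None else lo

    def __add__(self, y):
        y = y if isinstance(y, Interval) else Interval(y)
        return Interval(self.lo + y.lo, self.hi + y.hi)

    def __sub__(self, y):
        y = y if isinstance(y, Interval) else Interval(y)
        return Interval(self.lo - y.hi, self.hi - y.lo)

    def __mul__(self, y):
        y = y if isinstance(y, Interval) else Interval(y)
        p = [self.lo * y.lo, self.lo * y.hi, self.hi * y.lo, self.hi * y.hi]
        return Interval(min(p), max(p))

    def __repr__(self):
        return f'[{self.lo}, {self.hi}]'

x = Interval(-2, 2)
print((x + 1) * (x - 1))   # [-9, 3], an enclosure of the exact range [-1, 3]
```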

REMARK 1.5 In some definitions of interval arithmetic, division by intervals containing 0 is defined, consistent with (1.1). For example,

  [1, 2] / [-3, 4] = (-∞, -1/3] ∪ [1/4, ∞) = R* \ (-1/3, 1/4),

where R* is the extended real number system,⁴ consisting of the real numbers with the two additional numbers -∞ and ∞. This extended interval arithmetic⁵ was originally invented by William Kahan⁶ for computations with continued fractions, but has wider use than that. Although a closed system can be defined for the sets arising from this extended arithmetic, typically the complements of intervals (i.e., the unions of two semi-infinite intervals) are immediately intersected with intervals to obtain zero, one, or two intervals. Interval arithmetic can then proceed using (1.2).

⁴also known as the two-point compactification of the real numbers

The power of interval arithmetic lies in its implementation on computers. In particular, outwardly rounded interval arithmetic allows rigorous enclosures for the ranges of operations and functions. This makes a qualitative difference in scientific computations, since the results are now intervals in which the exact result must lie. It also enables use of floating point computations for automated theorem proving.

Outward rounding can be implemented on any machine that has downward rounding and upward rounding, such as any machine that complies with the IEEE 754 standard. For example, take x + y = [x̲ + y̲, x̄ + ȳ]. If x̲ + y̲ is computed with downward rounding, and x̄ + ȳ is computed with upward rounding, then the resulting interval z = [z̲, z̄] that is represented in the machine must contain the exact range of x + y for x ∈ x and y ∈ y. We call the expansion of the interval from rounding the lower end point down and the upper end point up roundout error.
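Where the rounding mode is not directly accessible, outward rounding can be simulated conservatively by widening each computed end point by one unit in the last place. A Python sketch of our own using math.nextafter (Python 3.9+); a true implementation would instead set the IEEE rounding mode:

```python
import math
from fractions import Fraction

def add_outward(x, y):
    """Enclosure of {a + b : a in x, b in y}; x and y are (lo, hi) tuples."""
    lo = math.nextafter(x[0] + y[0], -math.inf)   # push lower end point down
    hi = math.nextafter(x[1] + y[1], math.inf)    # push upper end point up
    return (lo, hi)

# Enclose the exact sum of the machine numbers nearest 0.1 and 0.2:
z = add_outward((0.1, 0.1), (0.2, 0.2))
exact = Fraction(0.1) + Fraction(0.2)   # exact real sum of the two floats
assert Fraction(z[0]) <= exact <= Fraction(z[1])
```

Here the Fraction check verifies rigorously, in exact rational arithmetic, that the widened machine interval really contains the exact sum.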

Interval arithmetic is only subdistributive. That is, if x, y, and z are intervals, then

  x(y + z) ⊆ xy + xz, but x(y + z) ≠ xy + xz in general.    (1.4)

As a result, algebraic expressions that would be equivalent if real values were substituted for the variables are not equivalent if interval values are used. For example, instead of writing (x - 1)(x + 1) for f(x) in (1.3), suppose we write

  f(x) = x^2 - 1,    (1.5)

and suppose we provide a routine that computes an enclosure for the range of x^2 that is the exact range to within roundoff error. Such a routine could be as follows.

⁵There are small differences in current definitions of extended interval arithmetic. For example, in some systems, -∞ and ∞ are not considered numbers, but just descriptive symbols. In those systems, [1, 2]/[-3, 4] = (-∞, -1/3] ∪ [1/4, ∞) = R \ (-1/3, 1/4). See [31] for a theoretical analysis of extended arithmetic.
⁶who also was a major contributor to the IEEE 754 standard

ALGORITHM 1.1

(Computing an interval whose end points are machine numbers and which encloses the range of x^2)

INPUT: x = [x̲, x̄].

OUTPUT: a machine-representable interval that contains the range of x^2 over x.

IF x̲ ≥ 0 THEN
  RETURN [x̲^2, x̄^2], where x̲^2 is computed with downward rounding and x̄^2 is computed with upward rounding.
ELSE IF x̄ ≤ 0 THEN
  RETURN [x̄^2, x̲^2], where x̄^2 is computed with downward rounding and x̲^2 is computed with upward rounding.
ELSE
  1. Compute x̲^2 and x̄^2 with both downward and upward rounding; that is, compute machine-representable numbers (x̲^2)_l, (x̲^2)_u, (x̄^2)_l, and (x̄^2)_u such that x̲^2 ∈ [(x̲^2)_l, (x̲^2)_u] and x̄^2 ∈ [(x̄^2)_l, (x̄^2)_u].
  2. RETURN [0, max{(x̲^2)_u, (x̄^2)_u}].
END IF

END ALGORITHM 1.1

With Algorithm 1.1, and rewriting f(x) from (1.3) as in (1.5), we obtain

  f([-2, 2]) = [-2, 2]^2 - 1 = [0, 4] - 1 = [-1, 3],

which in this case is equal to the exact range of f over [-2, 2].

In fact, this illustrates a general principle: if each variable in the expression occurs only once, then interval arithmetic gives the exact range, to within roundout error. We state this formally as follows.

THEOREM 1.8

(Fundamental theorem of interval arithmetic) Suppose f(x1, x2, ..., xn) is an algebraic expression in the variables x1 through xn (or a computer program with inputs x1 through xn), and suppose that this expression is evaluated with interval arithmetic. The algebraic expression or computer program can contain the four elementary operations and operations such as x^n, sin(x), exp(x), and log(x), etc., as long as the interval values of these functions contain their ranges over the input intervals. Then:

1. The interval value f(x1, ..., xn) contains the range of f over the interval vector (or box) (x1, ..., xn).

2. If the single functions (the elementary operations and functions x^n, etc.) have interval values that represent their exact ranges, and if each variable xi, 1 ≤ i ≤ n, occurs only once in the expression for f, then the values of f obtained by interval arithmetic represent the exact ranges of f over the input intervals.
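Algorithm 1.1 and the single-use principle can be sketched in Python (our own illustration, again simulating the directed roundings by one-ulp outward widening with math.nextafter, Python 3.9+, rather than true IEEE rounding modes):

```python
import math

def isqr(lo, hi):
    """Machine-representable enclosure of the range of x^2 over [lo, hi]
    (Algorithm 1.1, with simulated directed rounding)."""
    down = lambda v: math.nextafter(v, -math.inf)   # simulate downward rounding
    up = lambda v: math.nextafter(v, math.inf)      # simulate upward rounding
    if lo >= 0:                      # interval entirely nonnegative
        return (down(lo * lo), up(hi * hi))
    elif hi <= 0:                    # interval entirely nonpositive
        return (down(hi * hi), up(lo * lo))
    else:                            # 0 is interior: range is [0, max of squares]
        return (0.0, max(up(lo * lo), up(hi * hi)))

# Evaluating f(x) = x^2 - 1 over [-2, 2] via the single use expression (1.5):
sq = isqr(-2.0, 2.0)
enclosure = (sq[0] - 1.0, sq[1] - 1.0)   # subtracting the point interval [1, 1]
print(enclosure)   # close to (-1.0, 3.0), versus [-9, 3] from (x+1)(x-1)
```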

If the expression for f contains one or more variables more than once, then overestimation of the range can occur, due to interval dependency. For example, when we evaluate our example function f([-2, 2]) according to (1.3), the first factor, [-1, 3], is the exact range of x + 1 for x ∈ [-2, 2], while the second factor, [-3, 1], is the exact range of x - 1 for x ∈ [-2, 2]. Thus, [-9, 3] is the exact range of f(x1, x2) = (x1 + 1)(x2 - 1) for x1 and x2 independent, x1 ∈ [-2, 2], x2 ∈ [-2, 2].

We now present some definitions and theorems to clarify the practical consequences of interval dependency.

DEFINITION 1.6 An expression for f(x1, ..., xn) which is written so that each variable occurs only once is called a single use expression, or SUE.

Fortunately, we do not need to transform every expression into a single use expression for interval computations to be of value. In particular, the interval dependency becomes less as the widths of the input intervals become smaller. The following formal definition will help us to describe this precisely.

DEFINITION 1.7 Suppose an interval evaluation f(x1, ..., xn) gives [a, b] as a result interval, but the exact range {f(x1, ..., xn) : xi ∈ xi, 1 ≤ i ≤ n} is [c, d] ⊆ [a, b]. We define the excess width E(f; x1, ..., xn) in the interval evaluation f(x1, ..., xn) by E(f; x1, ..., xn) = (c - a) + (b - d).

For example, the excess width in evaluating f(x), represented as (x + 1)(x - 1), over x = [-2, 2] is (-1 - (-9)) + (3 - 3) = 8. In general, we have the following.

THEOREM 1.9

Suppose f(x1, x2, ..., xn) is an algebraic expression in the variables x1 through xn (or a computer program with inputs x1 through xn), and suppose that this expression is evaluated with interval arithmetic, as in Theorem 1.8, to obtain an interval enclosure f(x1, ..., xn) of the range of f for xi ∈ xi, 1 ≤ i ≤ n. Then, if E(f; x1, ..., xn) is as in Definition 1.7, we have

  E(f; x1, ..., xn) = O( max_{1≤i≤n} w(xi) ),

where w(x) denotes the width of the interval x.


That is, the overestimation becomes less as the uncertainty in the arguments to the function becomes smaller.

Interval evaluations as in Theorem 1.9 are termed first-order interval extensions. It is not difficult to obtain second-order extensions, where required. (See the exercises below.)
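This first-order behavior can be observed numerically. The following sketch (the helper function is our own illustration, not from the book) uses exact rational arithmetic to evaluate the dependent expression (x + 1)(x - 1) over x = [-h, h]; the excess width shrinks roughly in proportion to the input width:

```python
from fractions import Fraction

def excess_width(h):
    """Excess width of the interval evaluation of (x+1)(x-1) over x = [-h, h],
    computed exactly in rational arithmetic."""
    lo, hi = -h, h
    a_lo, a_hi = lo + 1, hi + 1                 # x + 1
    b_lo, b_hi = lo - 1, hi - 1                 # x - 1
    p = [a_lo * b_lo, a_lo * b_hi, a_hi * b_lo, a_hi * b_hi]
    enc_lo, enc_hi = min(p), max(p)             # interval product, per (1.2)
    exact_lo, exact_hi = -1, h * h - 1          # true range of x^2 - 1 over [-h, h]
    return (exact_lo - enc_lo) + (enc_hi - exact_hi)

e1 = excess_width(Fraction(1, 10))    # input width 2/10
e2 = excess_width(Fraction(1, 100))   # input width 2/100
print(e1, e2, float(e1 / e2))         # ratio near 10: first-order shrinkage
```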

1.3.2 Application of Interval Arithmetic: Examples

We give one such example here.

Example 1.22

Using 4-digit decimal floating point arithmetic, compute an interval enclosure for the first two digits of e, and prove that these two digits are correct.

Solution: The fifth degree Taylor polynomial representation for e is

  e = 1 + 1 + 1/2! + 1/3! + 1/4! + 1/5! + (1/6!) e^ξ

for some ξ ∈ [0, 1]. If we assume we know e < 3, and we assume we know e^x is an increasing function of x, then the error term is bounded by

  |(1/6!) e^ξ| ≤ 3/6! < 0.005,

so this fifth-degree polynomial representation should be adequate. We will evaluate each term with interval arithmetic, and we will replace e^ξ with [1, 3]. We obtain the following computation:

  [1.000, 1.000] + [1.000, 1.000] → [2.000, 2.000]
  [1.000, 1.000] / [2.000, 2.000] → [0.5000, 0.5000]
  [2.000, 2.000] + [0.5000, 0.5000] → [2.500, 2.500]
  [1.000, 1.000] / [6.000, 6.000] → [0.1666, 0.1667]
  [2.500, 2.500] + [0.1666, 0.1667] → [2.666, 2.667]
  [1.000, 1.000] / [24.00, 24.00] → [0.04166, 0.04167]
  [2.666, 2.667] + [0.04166, 0.04167] → [2.707, 2.709]
  [1.000, 1.000] / [120.0, 120.0] → [0.008333, 0.008334]
  [2.707, 2.709] + [0.008333, 0.008334] → [2.715, 2.718]
  [1.000, 1.000] / [720.0, 720.0] → [0.001388, 0.001389]
  [0.001388, 0.001389] × [1, 3] → [0.001388, 0.004167]
  [2.715, 2.718] + [0.001388, 0.004167] → [2.716, 2.723]

Since we used outward rounding in these computations, this constitutes a mathematical proof that e ∈ [2.716, 2.723].

Note:


1. These computations can be done automatically on a computer, as simply as evaluating the function in floating point arithmetic. We will explain some programming techniques for this in Chapter 6, Section 6.2.

2. The solution is illustrative. More sophisticated methods, such as argument reduction, would be used in practice to bound values of e^x more accurately and with fewer operations.
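The enclosure of Example 1.22 can also be double-checked with exact rational arithmetic: summing the Taylor terms exactly and bounding e^ξ by [1, 3] gives an interval that lies inside the 4-digit enclosure [2.716, 2.723]. A Python sketch:

```python
from fractions import Fraction
from math import factorial, e

# Exact partial sum 1 + 1 + 1/2! + ... + 1/5!:
partial = sum(Fraction(1, factorial(k)) for k in range(6))

# Error term (1/6!) e^xi, with e^xi enclosed by [1, 3]:
lo = partial + Fraction(1, factorial(6)) * 1
hi = partial + Fraction(1, factorial(6)) * 3

print(float(lo), float(hi))   # roughly 2.71806 and 2.72083

assert lo <= Fraction(e) <= hi                            # really contains e
assert Fraction(2716, 1000) <= lo and hi <= Fraction(2723, 1000)
```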

Proofs of the theorems, as well as greater detail, appear in various texts on interval arithmetic. A good book on interval arithmetic is R. E. Moore's classic text [27], although numerous more recent monographs and reviews are available. A World Wide Web search on the term "interval computations" will lead to some of these.

A general introduction to interval computations is [26]. That work gives not only a complete introduction, with numerous examples and explanation of pitfalls, but also provides examples with intlab, a free matlab toolbox for interval computations, and reference material for intlab. If you have matlab available, we recommend intlab for the exercises in this book involving interval computations.

1.4 Programming Environments

Modern scientific computing (with floating point numbers) is usually done with high-level "imperative" (as opposed to "functional") programming languages. Common programming environments in use for general scientific computing today are Fortran (for FORmula TRANslation), C/C++, and matlab. Fortran is the original such language, with its origins in the late 1950's. There is a large body of high-quality, publicly available software in Fortran for common computations in numerical analysis and scientific computing. Such software can be found, for example, on NETLIB, at

  http://www.netlib.org

Fortran has evolved over the years, becoming a modern, multi-faceted language with the Fortran 2003 standard. Throughout, the emphasis by both the standardization committee and suppliers of compilers for Fortran has been on features that simplify programming of solutions to large problems in numerical analysis and scientific computing, and on features that enable high performance,

especially on computers that can process vectors and matrices efficiently.

The "C" language, originally developed in conjunction with the Unix operating system, was originally meant to be a higher-level language for designing and accessing the operating system, but has become more ubiquitous since then. C++, appearing in the late 1980's, was the first widely available language⁷ to allow the object-oriented programming paradigm. In recent years, computer science departments have favored teaching C++ over teaching Fortran, and Fortran has fallen out of favor in relative terms. However, Fortran is still favored in certain large-scale applications, such as fluid dynamics (e.g., in weather prediction and similar simulations), and some courses are still offered in it in engineering schools. Some people still think of Fortran as the now somewhat rudimentary language known as FORTRAN 77.

Reasonably high-quality compilers for both Fortran and C/C++ are available free of charge with Linux operating systems. Fortran 2003, largely implemented in these compilers, has a standardized interface to C, so functions written in C can be called from Fortran programs, and vice versa. These compilers include interactive, graphical-user-interface-oriented debuggers, such as "insight," available with the Linux operating system. Commercially available compilation and debugging systems are also available under Windows.

The matlab system has become increasingly popular over the last two decades or so. matlab (for MATrix LABoratory) began in the early 1980's as a National Science Foundation project, written by Cleve Moler in FORTRAN 66, to provide an interactive environment for computing with matrices and vectors, but has since evolved to be both an interactive environment and a full-featured programming language. matlab is highly favored in courses such as this, because of the ease of programming, debugging, and general use (such as graphing), and because of the numerous toolboxes, supplied by both Mathworks (Cleve Moler's company) and others, for many basic computing tasks and applications. The main drawback to use of matlab in all scientific computations is that the language is interpretive; that is, matlab translates each line of a program to machine language each time that it executes the line. This makes complicated programs that involve nested iterations much slower (a factor of 60 or more) than comparable programs written in a compiled language such as C or Fortran. However, functions compiled from Fortran or C/C++ can be called from matlab. A common strategy has been to initially develop algorithms in matlab, then translate all or part of the program to a compilable language, as necessary for efficiency.

One perceived disadvantage of matlab is that it is proprietary. Undesirable possible consequences are that it is not free, and that there is no official guarantee that it will be available forever, unchanged. However, its use has become so widespread in recent years that these concerns do not seem to be major. Several projects, including "Octave" and "Scilab," have produced free products that partially support the matlab programming language. The most widely distributed of these, "Octave," is integrated into Linux systems. However, the object-oriented features of Octave are rudimentary compared to those of matlab, and some toolboxes, such as intlab (which we will mention later), will not function with Octave.

⁷with others, including Fortran, to follow


Alternative systems sometimes used for scientific computing are computer algebra systems. Perhaps the most common of these are Mathematica and Maple, while a free such system under development is "SAGE." These systems admit a different way of thinking about programming, termed functional programming, in which rules are defined and available at all times to automatically simplify expressions that are presented to the system. (In contrast, in imperative programming, a sequence of commands is executed one after the other.) Although these systems have become comprehensive, they are based in computations of a different character, rather than in the floating point computations and linear algebra involved in numerical analysis and scientific computing.

We will use matlab in this book to illustrate the concepts, techniques, and applications. With newer versions of matlab, a student can study how to use the system and make programs largely by using the matlab help system. The first place to turn will be the "Getting Started" demos, which in newer versions are presented as videos. There are also many books devoted to the use of matlab. Furthermore, we will be giving examples throughout this book.

matlab programs can be written as matlab scripts and matlab functions.

Example 1.23

The matlab script we used to produce the table following Example 1.7 (on page 8) is:

a = 2
x = 2
xold = x
err_old = 1
for k = 0:10
    k
    x
    err = x - sqrt(2)
    err
    ratio = err / err_old^2
    err_old = err
    x = x/2 + 1/x
end

Example 1.24

The matlab script we used to produce the table in Example 1.8 (on page 9) is:

format long
a = 2
x = 2
xold = x
err_old = 1
for k = 0:25
    k
    x
    err = x - sqrt(2)
    err
    ratio = err / err_old
    err_old = err
    x = x - x^2/3.5 + 2/3.5
end

An excellent alternative textbook that focuses on matlab functions is Cleve Moler's Numerical Computing with Matlab [25]. An on-line version, along with ".m" files, etc., is currently available at http://www.mathworks.com/moler/chapters.html

1.5 Applications

The purpose of the methods and techniques in this book, ultimately, is to provide both accurate predictions and insight into practical problems. This includes understanding, predicting, and managing or controlling the evolution of ecological systems and epidemics; designing and constructing durable but inexpensive bridges, buildings, roads, and water control structures; understanding chemical and physical processes; designing chemical plants and electronic components and systems; minimizing costs or maximizing delivery of products or services within companies and governments; etc. To achieve these goals, the numerical methods are a small part of the overall modeling process, which can be viewed as consisting of the following steps:

Identify the problem: This is the first step in translating an often vague situation into a mathematical problem to be solved. What questions must be answered, and how can they be quantified?

Assumptions: Which factors are to be ignored, and which are important? The real world is usually significantly more complicated than mathematical models of it, and simplifications must be made, because some factors are poorly understood, because there isn't enough data to determine some minor factors, or because it is not practical to accurately solve the resulting equations unless the model is simplified. For example, the theory of relativity, and variations in the acceleration of gravity due to the fact that the earth is not exactly round and due to the fact that the density differs from point to point on the surface of the earth, in principle will affect the trajectory of a baseball as it leaves the bat. However, such effects can be ignored when we write down a model of the trajectory of the baseball. On the other hand, we need to include such effects if we are measuring the change in distance between two satellites in a tandem orbit to detect the location of mineral deposits on the surface of the earth.

Construction: In this step, we actually translate the problem into mathematical language.

Analysis: We solve the mathematical problem. Here is where the numerical techniques in this book come into play. With more complicated models, there is an interplay between the previous three steps and this solution process: we may need to simplify the process to enable practical solution. Presentation of the result is also important here, to maximize the usefulness of the results. In the early days of scientific computing, printouts of numbers were used, but increasingly, results are presented as two- and three-dimensional graphs and movies.

Interpretation: The numerical solution is compared to the original problem. If it does not make sense, go back and reformulate the assumptions.

Validation: Compare the model to real data. For example, in climate models, the model might be used to predict climate changes in past years, before it is used to predict future climate changes.

Note that there is an intrinsic error introduced in the modeling process (such as when certain phenomena are ignored) that is outside the scope of our study of numerical methods. Such error can only be measured indirectly, through the interpretation and validation steps. In the model solution process (the "analysis" step), errors are also introduced due to roundoff error and the approximation process. We have seen that such error consists of approximation error and roundoff error. In a study of numerical methods and numerical analysis, we quantify and find bounds on such errors. Although this may not be the major source of error, it is important to know. Consequences of this type of error might be that a good model is rejected, that incorrect conclusions are deduced about the process being modeled, etc. The authors of this book have personal experience with these events.

Errors in the modeling process can sometimes be quantified in the solution process. If the model depends on parameters that are not known precisely, but bounds on those parameters are known, knowledge of these bounds can sometimes be incorporated into the mathematical equations, and the set of possible solutions can sometimes be computed or bounded. One tool that sometimes works is interval arithmetic. Other tools, less mathematically definite but applicable in different situations, are statistical methods and computing solutions to the model for many different values of the parameter.

Throughout this book, we introduce applications from many areas.

Example 1.25

The formula for the net capacitance when two capacitors of values x and y are connected in series is

    z = xy/(x + y).

Suppose the measured values of x and y are x = 1 and y = 2, respectively. Estimate the range of possible values of z, given that the true values of x and y are known to be within ±10% of the measured values.

In this example, the identification, assumptions, and construction have already been done. (It is well known how capacitances in a linear electrical circuit behave.) We are asked to analyze the error in the output of the computation, due to errors in the data. We may proceed using interval arithmetic, relying on the accuracy assumptions for the measured values. In particular, these assumptions imply that x ∈ [0.9, 1.1] and y ∈ [1.8, 2.2]. We will plug these intervals into the expression for z, but we first use Theorem 1.8, part (2), as a guide to rewrite the expression for z so that x and y only occur once. (We do this so we obtain sharp bounds on the range, without overestimation.) Dividing the numerator and denominator for z by xy, we obtain

    z = 1/(1/x + 1/y).

We use the intlab toolbox⁸ for matlab to evaluate z. We have the following dialog in matlab's command window:

>> intvalinit('DisplayInfsup')
===> Default display of intervals by infimum/supremum
>> x = intval('[0.9,1.1]')
intval x =
[    0.8999,    1.1001]
>> y = intval('[1.8,2.2]')
intval y =
[    1.7999,    2.2001]
>> z = 1/(1/x + 1/y)
intval z =
[    0.5999,    0.7334]
>> format long
>> z
intval z =
[   0.59999999999999,   0.73333333333334]
>>

⁸If one has matlab, intlab is available free of charge for non-commercial use from http://www.ti3.tu-harburg.de/~rump/intlab/.

Thus, the capacitance must lie between 0.5999 and 0.7334. Note that x and y are input as strings. This is to assure that roundoff errors in converting the decimal expressions 0.9, 1.1, 1.8, and 2.2 into internal binary format are taken into account. See [26] for more examples of the use of intlab.
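The effect of rewriting z so that x and y each occur once can be mimicked outside of intlab. The following Python sketch uses a hypothetical minimal interval class of our own (it is NOT intlab, and it omits the outward directed rounding a real interval package uses, so its bounds are only approximately rigorous); it illustrates why z = 1/(1/x + 1/y) gives sharper bounds than the naive z = xy/(x + y):

```python
# Hypothetical minimal interval class (NOT intlab): outward directed
# rounding is omitted, so the bounds are only approximately rigorous.
class Interval:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        p = [self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi]
        return Interval(min(p), max(p))

    def recip(self):
        # 1/[lo, hi]; assumes the interval does not contain 0
        return Interval(1.0 / self.hi, 1.0 / self.lo)

x = Interval(0.9, 1.1)
y = Interval(1.8, 2.2)

# Sharp form: x and y each occur exactly once.
z_sharp = (x.recip() + y.recip()).recip()

# Naive form x*y/(x + y): x and y each occur twice, so the enclosure
# overestimates the true range.
z_naive = (x * y) * (x + y).recip()
```

Here z_sharp encloses approximately [0.6, 0.7333], matching the intlab output above, while z_naive is strictly wider.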

1.6 Exercises

1. Write down a polynomial p(x) such that |S(x) − p(x)| ≤ 10⁻¹⁰ for −0.2 ≤ x ≤ 0.2, where

       S(x) =  sin(x)/x  if x ≠ 0,
               1         if x = 0.

   Note: sinc(x) = S(πx) = sin(πx)/(πx) is the "sinc" function (well known in signal processing, etc.).

   (a) Show that your polynomial p satisfies the condition |S(x) − p(x)| ≤ 10⁻¹⁰ for x ∈ [−0.2, 0.2].
       Hint: You can obtain polynomial approximations with error terms for S(x) by writing down Taylor polynomials and corresponding error terms for sin(x), then dividing these by x. This can be easier than trying to differentiate S(x). For the proof part, you can use, for example, the Taylor polynomial remainder formula or the alternating series test, and you can use interval arithmetic to obtain bounds.

   (b) Plot your polynomial approximation and S(x) on the same graph:
       (i) over the interval [−0.2, 0.2];
       (ii) over the interval [−3, 3];
       (iii) over the interval [−10, 10].

2. Suppose f has a continuous third derivative. Show that

       |(f(x + h) − f(x − h))/(2h) − f′(x)| = O(h²).

3. Suppose f has a continuous fourth derivative. Show that

       |(f(x + h) − 2f(x) + f(x − h))/h² − f″(x)| = O(h²).


4. Let a = 0.41, b = 0.36, and c = 0.7. Assuming a 2-digit decimal computer arithmetic with rounding, show that (a − b)/c ≠ a/c − b/c when using this arithmetic.

5. Write down a formula relating the unit roundoff δ of Definition 1.3 (page 14) and the machine epsilon εₘ defined on page 20.

6. Store and run the following matlab script. What are your results? What does the script compute? Can you say anything about the computer arithmetic underlying matlab?

       eps = 1;
       x = 1+eps;
       while(x~=1)
          eps = eps/2;
          x = 1+eps;
       end
       eps = eps+(2*eps)^2;
       y = 1+eps;
       y-1

7. Suppose, for illustration, we have a system with base β = 10, t = 3 decimal digits in the mantissa, and L = −9, U = 9 for the exponent. For example, 0.123 × 10⁴, that is, 1230, is a machine number in this system. Suppose also that "round to nearest" is used in this system.

   (a) What is HUGE for this system?
   (b) What is TINY for this system?
   (c) What is the machine epsilon εₘ for this system?

   (d) Let f(x) = sin(x) + 1.
       i. Write down fl(f(0)) and fl(f(0.0008)) in normalized format for this toy system.
       ii. Compute fl(fl(f(0.0008)) − fl(f(0))). On the other hand, what is the nearest machine number to the exact value of f(0.0008) − f(0)?
       iii. Compute fl(fl(f(0.0008)) − fl(f(0)))/fl(0.0008). Compare this to the nearest machine number to the exact value of (f(0.0008) − f(0))/0.0008, and to f′(0).

8. Let f(x) = ln(x + 1) − ln(x).

   (a) Use four-digit decimal arithmetic with rounding to evaluate f(100,000).


   (b) Use the Mean Value Theorem to approximate f(x) in a form that avoids the loss of significant digits. Use this form to evaluate f(x) for x = 100,000 once again.
   (c) Compare the relative errors for the answers obtained in (a) and (b).

9. Compute the condition number of f(x) = e^√(x²−1), x > 1, and discuss any possible ill-conditioning.

10. Let f(x) = (sin(x))² + x². Use interval arithmetic to prove that there are no solutions to f(x) = 0 for x ∈ [−1, −0.8].


for which f(c) = k (Figure 1.1).

FIGURE 1.1: Illustration of the Intermediate Value Theorem: the graph of y = f(x) over [a, b], with a value k between m and M attained at some c between a and b.

Example 1.1

Consider f(x) = e^x − x − 2. Using a computational device (such as a calculator) on which we trust the approximation of e^x to be accurate, we compute f(0) = −1 and f(2) ≈ 3.3891. We know f is continuous, since it is a sum of continuous functions. Since 0 is between f(0) and f(2), the Intermediate Value Theorem tells us there is a point c ∈ [0, 2] such that f(c) = 0. At such a c, e^c = c + 2.

THEOREM 1.2

(Mean value theorem for integrals) Let f be continuous and w be Riemann integrable¹ on [a, b], and suppose that w(x) ≥ 0 for x ∈ [a, b]. Then there exists a point c in [a, b] such that

    ∫_a^b w(x) f(x) dx = f(c) ∫_a^b w(x) dx.

Example 1.2

Suppose we want bounds on

    ∫_0^1 x² e^(−x²) dx.

¹This means that the limit of the Riemann sums exists. For example, w may be continuous, or w may have a finite number of breaks.


With w(x) = x² and f(x) = e^(−x²), the Mean Value Theorem for integrals tells us that

    ∫_0^1 x² e^(−x²) dx = e^(−c²) ∫_0^1 x² dx

for some c ∈ [0, 1], so

    1/(3e) ≤ ∫_0^1 x² e^(−x²) dx = e^(−c²) ∫_0^1 x² dx ≤ 1/3.

The following is extremely important in scientific and engineering computing.

THEOREM 1.3

(Taylor's theorem) Suppose that f ∈ C^(n+1)[a, b]. Let x_0 ∈ [a, b]. Then, for any x ∈ [a, b],

    f(x) = P_n(x) + R_n(x), where

    P_n(x) = f(x_0) + f′(x_0)(x − x_0) + ··· + f^(n)(x_0)(x − x_0)^n / n!
           = Σ_{k=0}^{n} (1/k!) f^(k)(x_0)(x − x_0)^k, and

    R_n(x) = (1/n!) ∫_{x_0}^{x} f^(n+1)(t)(x − t)^n dt   (integral form of remainder).

Furthermore, there is a ξ = ξ(x) between x_0 and x with

    R_n(x) = f^(n+1)(ξ(x))(x − x_0)^(n+1) / (n + 1)!   (Lagrange form of remainder).

PROOF   Recall the integration by parts formula

    ∫ u dv = uv − ∫ v du.


Thus,

    f(x) − f(x_0) = ∫_{x_0}^{x} f′(t) dt   (let u = f′(t), v = t − x, dv = dt)

    = f′(x_0)(x − x_0) + ∫_{x_0}^{x} (x − t) f″(t) dt   (let u = f″(t), dv = (x − t) dt)

    = f′(x_0)(x − x_0) − [(x − t)²/2] f″(t) |_{x_0}^{x} + ∫_{x_0}^{x} ((x − t)²/2) f‴(t) dt

    = f′(x_0)(x − x_0) + ((x − x_0)²/2) f″(x_0) + ∫_{x_0}^{x} ((x − t)²/2) f‴(t) dt.

Continuing this procedure,

    f(x) = f(x_0) + f′(x_0)(x − x_0) + ((x − x_0)²/2) f″(x_0)
           + ··· + ((x − x_0)^n / n!) f^(n)(x_0) + ∫_{x_0}^{x} ((x − t)^n / n!) f^(n+1)(t) dt

    = P_n(x) + R_n(x).

Now consider R_n(x) = ∫_{x_0}^{x} ((x − t)^n / n!) f^(n+1)(t) dt, and assume that x_0 < x (the same argument works if x_0 > x). Then, by Theorem 1.2,

    R_n(x) = f^(n+1)(ξ(x)) ∫_{x_0}^{x} ((x − t)^n / n!) dt = f^(n+1)(ξ(x)) (x − x_0)^(n+1) / (n + 1)!,

where ξ is between x_0 and x, and thus ξ = ξ(x).

Example 1.3

Approximate sin(x) by a polynomial p(x) such that

    |sin(x) − p(x)| ≤ 10⁻¹⁶ for −0.1 ≤ x ≤ 0.1.

For Example 1.3, Taylor polynomials about x_0 = 0 are appropriate, since that is the center of the interval over which we wish to approximate. We observe that the terms of even degree in such a polynomial are absent, so for


n even, Taylor's theorem gives:

    n    P_n(x)                     R_n(x)
    2    x                          −(x³/3!) cos(c_2)
    4    x − x³/3!                  (x⁵/5!) cos(c_4)
    6    x − x³/3! + x⁵/5!          −(x⁷/7!) cos(c_6)
    n    —                          (−1)^(n/2) (x^(n+1)/(n+1)!) cos(c_n)

Observing that |cos(c_n)| ≤ 1, we see that

    |R_n(x)| ≤ |x|^(n+1) / (n + 1)!.

We may thus form the following table:

    n     bound on error R_n
    2     1.67 × 10⁻⁴
    4     8.33 × 10⁻⁸
    6     1.98 × 10⁻¹¹
    8     2.76 × 10⁻¹⁵
    10    2.51 × 10⁻¹⁹

Thus, a polynomial with the required accuracy for x ∈ [−0.1, 0.1] is

    p(x) = x − x³/3! + x⁵/5! − x⁷/7! + x⁹/9!.
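The error-bound column above is simply |x|^(n+1)/(n+1)! evaluated at |x| = 0.1, so it is easy to reproduce. A few lines of Python (used here for illustration, though the book's own examples are in matlab) compute the same bounds:

```python
from math import factorial

def sin_remainder_bound(n, r=0.1):
    # |R_n(x)| <= |x|**(n+1)/(n+1)! for |x| <= r, since |cos(c_n)| <= 1
    return r ** (n + 1) / factorial(n + 1)

# Reproduce the table: n = 2 gives ~1.67e-4, down to n = 10 ~2.51e-19.
bounds = {n: sin_remainder_bound(n) for n in (2, 4, 6, 8, 10)}
```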

An important special case of Taylor's theorem is obtained with n = 0 (that is, directly from the Fundamental Theorem of Calculus).

THEOREM 1.4

(Mean value theorem) Suppose f ∈ C¹[a, b], x ∈ [a, b], and y ∈ [a, b] (and, without loss of generality, x ≤ y). Then there is a c ∈ [x, y] ⊆ [a, b] such that

    f(y) − f(x) = f′(c)(y − x).

Example 1.4

Suppose f(1) = 1 and |f′(x)| ≤ 2 for x ∈ [1, 2]. What are an upper bound and a lower bound on f(2)?


The mean value theorem tells us that

    f(2) = f(1) + f′(c)(2 − 1) = f(1) + f′(c)

for some c ∈ (1, 2). Furthermore, the fact |f′(x)| ≤ 2 is equivalent to −2 ≤ f′(x) ≤ 2. Combining these facts gives

    1 − 2 = −1 ≤ f(2) ≤ 1 + 2 = 3.

1.1.2 Big "O" Notation

We study "rates of growth" and "rates of decrease" of errors. For example, if we approximate e^h by a first degree Taylor polynomial about x = 0, we get

    e^h − (1 + h) = (1/2) h² e^ξ,

where ξ is some unknown quantity between 0 and h. Although we don't know exactly what e^ξ is, we know that it is nearly constant (in this case, approximately 1) for h near 0, so the error e^h − (1 + h) is roughly proportional to h² for h small. This approximate proportionality is often more important to know than the slowly varying constant e^ξ. The big "O" and little "o" notation are used to describe and keep track of this approximate proportionality.

DEFINITION 1.1   Let E(h) be an expression that depends on a small quantity h. We say that E(h) = O(h^k) if there are an ε and a C such that

    |E(h)| ≤ C h^k

for all |h| ≤ ε.

The "O" denotes "order". For example, if f(h) = O(h²), we say that "f exhibits order 2 convergence to 0 as h tends to 0".

Example 1.5

Let E(h) = e^h − h − 1. Then E(h) = O(h²).

PROOF   By Taylor's theorem,

    e^h = e⁰ + e⁰(h − 0) + (h²/2) e^ξ

for some ξ between 0 and h. Thus,

    E(h) = e^h − 1 − h ≤ (h²/2) e¹ and E(h) ≥ 0

for |h| ≤ 1; that is, ε = 1 and C = e/2 work.

Example 1.6

Show that

    |(f(x + h) − f(x))/h − f′(x)| = O(h)

for x, x + h ∈ [a, b], assuming that f has two continuous derivatives at each point in [a, b].

PROOF

    |(f(x + h) − f(x))/h − f′(x)|

    = |( f(x) + f′(x)h + ∫_x^{x+h} (x + h − t) f″(t) dt − f(x) )/h − f′(x)|

    = (1/h) |∫_x^{x+h} (x + h − t) f″(t) dt|

    ≤ max_{a≤t≤b} |f″(t)| (h/2) = ch.
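The O(h) behavior can also be observed numerically: halving h should roughly halve the error of the forward difference quotient. A small Python check (an illustrative sketch; it uses f = exp at x = 0, where f′(0) = 1, an example of our own choosing):

```python
from math import exp

def forward_diff(f, x, h):
    # forward difference quotient (f(x+h) - f(x))/h
    return (f(x + h) - f(x)) / h

# For f = exp at x = 0, the error of the quotient is ~ h/2, so each
# halving of h should roughly halve the error (first-order behavior).
errors = [abs(forward_diff(exp, 0.0, h) - 1.0)
          for h in (1e-2, 5e-3, 2.5e-3)]
ratios = [errors[i] / errors[i + 1] for i in range(len(errors) - 1)]
```

Each ratio comes out close to 2, consistent with error = O(h).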

1.1.3 Convergence Rates

DEFINITION 1.2   Let {x_k} be a sequence with limit x*. If there are constants C and α and an integer N such that |x_{k+1} − x*| ≤ C|x_k − x*|^α for k ≥ N, we say that the rate of convergence is of order at least α. If α = 1 (with C < 1), the rate is said to be linear. If α = 2, the rate is said to be quadratic.

Example 1.7

A sequence sometimes learned in elementary classes for computing the square root of a number a is

    x_{k+1} = x_k/2 + a/(2 x_k).


We have

    x_{k+1} − √a = x_k/2 + a/(2 x_k) − √a
                 = (x_k² − 2√a x_k + a)/(2 x_k)
                 = (1/(2 x_k)) (x_k − √a)²
                 ≈ (1/(2√a)) (x_k − √a)²,

thus showing that the convergence rate is quadratic.

Quadratic convergence is very fast. We can think of quadratic convergence with C ≈ 1 as doubling the number of significant figures on each iteration. (In contrast, linear convergence with C = 0.1 adds one decimal digit of accuracy to the approximation on each iteration.) For example, if we use the square root computation from Example 1.7 with a = 2, starting with x_0 = 2, we obtain the following table:

    k    x_k                   x_k − √2          (x_k − √2)/(x_{k−1} − √2)²
    0    2                     0.5858 × 10⁰      —
    1    1.5                   0.8579 × 10⁻¹     0.2500
    2    1.416666666666667     0.2453 × 10⁻²     0.3333
    3    1.414215686274510     0.2123 × 10⁻⁵     0.3529
    4    1.414213562374690     0.1594 × 10⁻¹¹    0.3535
    5    1.414213562373095     0.2204 × 10⁻¹⁵    —

In this table, the correct digits are underlined. This table illustrates that the total number of correct digits more than doubles on each iteration. In fact, the multiplying factor C for the quadratic convergence appears to be approaching 0.3535. (The last error ratio is not meaningful in this sense, because only roughly 16 digits were carried in the computation.) Based on our analysis, the limiting value of C should be about 1/(2√2) ≈ 0.3536. (We explain how we computed the table at the end of this chapter.)
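The quadratic error ratios in the table are easy to reproduce. This Python sketch (analogous in spirit to the matlab script at the end of the chapter) runs the iteration x ← x/2 + a/(2x) for a = 2 and forms the ratios e_{k+1}/e_k², which should approach C = 1/(2√2) ≈ 0.3536:

```python
from math import sqrt

def sqrt_iterates(a, x0, steps):
    # The iteration of Example 1.7: x <- x/2 + a/(2x)
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] / 2.0 + a / (2.0 * xs[-1]))
    return xs

xs = sqrt_iterates(2.0, 2.0, 5)
errs = [x - sqrt(2.0) for x in xs]
# For quadratic convergence, e_{k+1}/e_k**2 approaches C = 1/(2*sqrt(2)).
ratios = [errs[k + 1] / errs[k] ** 2 for k in range(4)]
```

The first ratio is exactly 0.25, and the later ones approach 0.3536, just as in the table.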


Example 1.8

As an example of linear convergence, consider the iteration

    x_{k+1} = x_k − x_k²/3.5 + 2/3.5,

starting with x_0 = 2. We obtain the following table:

    k     x_k                   x_k − √2          (x_k − √2)/(x_{k−1} − √2)
    0     2                     0.5858 × 10⁰      —
    1     1.428571428571429     0.1436 × 10⁻¹     0.2451 × 10⁻¹
    2     1.416909620991254     0.2696 × 10⁻²     0.1878
    3     1.414728799831946     0.5152 × 10⁻³     0.1911
    4     1.414312349239392     0.9879 × 10⁻⁴     0.1917
    5     1.414232514607664     0.1895 × 10⁻⁴     0.1918
    6     1.414217198786659     0.3636 × 10⁻⁵     0.1919
    7     1.414214260116949     0.6977 × 10⁻⁶     0.1919
    8     1.414213696254626     0.1339 × 10⁻⁶     0.1919
    19    1.414213562373097     0.1554 × 10⁻¹⁴    —

Here, the constant C in the linear convergence, to four significant digits, appears to be 0.1919 ≈ 1/5. That is, the error is reduced by approximately a factor of 5 on each iteration. We can think of this as obtaining one more correct base-5 digit on each iteration.

1.2 Computer Arithmetic

In numerical solution of mathematical problems, two common types of error are:

1. Method (algorithm or truncation) error. This is the error due to approximations made in the numerical method.

2. Rounding error. This is the error made due to the finite number of digits available on a computer.


Example 1.9

By the mean value theorem for integrals (Theorem 1.2, as in Example 1.6 on page 7), if f ∈ C²[a, b], then

    f′(x) = (f(x + h) − f(x))/h − (1/h) ∫_x^{x+h} f″(t)(x + h − t) dt

and

    |(1/h) ∫_x^{x+h} f″(t)(x + h − t) dt| ≤ ch.

Thus, f′(x) ≈ (f(x + h) − f(x))/h, and the error is O(h). We will call this the method error or truncation error, as opposed to roundoff errors due to using machine approximations.

Now consider f(x) = ln x, and approximate f′(3) ≈ (ln(3 + h) − ln 3)/h for h small, using a calculator having 11 digits. The following results were obtained:

    h        (ln(3 + h) − ln(3))/h    Error = 1/3 − (ln(3 + h) − ln(3))/h
    10⁻¹     0.3278982                5.44 × 10⁻³
    10⁻²     0.332779                 5.54 × 10⁻⁴
    10⁻³     0.3332778                5.55 × 10⁻⁵
    10⁻⁴     0.333328                 5.33 × 10⁻⁶
    10⁻⁵     0.333330                 3.33 × 10⁻⁶
    10⁻⁶     0.333300                 3.33 × 10⁻⁵
    10⁻⁷     0.333                    3.33 × 10⁻⁴
    10⁻⁸     0.33                     3.33 × 10⁻³
    10⁻⁹     0.3                      3.33 × 10⁻²
    10⁻¹⁰    0.0                      3.33 × 10⁻¹

One sees that, in the first four steps, the error decreases by a factor of 10 as h is decreased by a factor of 10. (That is, the method error dominates.) However, starting with h = 0.00001, the error increases. (The error due to a finite number of digits, i.e., roundoff error, dominates.)

There are two possible ways to reduce rounding error:

1. The method error can be reduced by using a more accurate method. This allows a larger h to be used, thus avoiding roundoff error. Consider

       f′(x) = (f(x + h) − f(x − h))/(2h) + error, where the error is O(h²).

       h        (ln(3 + h) − ln(3 − h))/(2h)    error
       0.1      0.3334568                       1.24 × 10⁻⁴
       0.01     0.3333345                       1.23 × 10⁻⁶
       0.001    0.3333333                       1.91 × 10⁻⁸


   The error decreases by a factor of 100 as h is decreased by a factor of 10.
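The trade-off between the two difference quotients is easy to see numerically. This Python sketch (illustrative only; the text's tables were produced on an 11-digit calculator, whereas standard double precision is used here) compares the forward and centered quotients for ln at x = 3 with the same h:

```python
from math import log

X = 3.0
EXACT = 1.0 / 3.0          # f'(3) for f(x) = ln x

def forward(h):
    return (log(X + h) - log(X)) / h

def centered(h):
    return (log(X + h) - log(X - h)) / (2.0 * h)

# Method error: forward is O(h), centered is O(h**2), so with the same
# step h = 1e-3 the centered quotient is several orders of magnitude
# more accurate before roundoff error becomes relevant.
err_fwd = abs(EXACT - forward(1e-3))
err_ctr = abs(EXACT - centered(1e-3))
```

Here err_fwd is on the order of 5.6 × 10⁻⁵ while err_ctr is on the order of 1.2 × 10⁻⁸.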

2. Rounding error can be reduced by using more digits of accuracy, such as using double precision (or multiple precision) arithmetic.

To fully understand and avoid roundoff error, we should study some details of how computers and calculators represent and work with approximate numbers.

1.2.1 Floating Point Arithmetic and Rounding Error

Let β be a positive integer, the base of the computer system. (Usually β = 2 (binary) or β = 16 (hexadecimal).) Suppose a number x has the exact base representation

    x = (±0.α_1 α_2 α_3 ··· α_t α_{t+1} ···) β^m = ± q β^m,

where q is the mantissa, β is the base, m is the exponent, 1 ≤ α_1 ≤ β − 1, and 0 ≤ α_i ≤ β − 1 for i > 1.

On a computer, we are restricted to a finite set of floating point numbers F = F(β, t, L, U) of the form

    x* = (±0.a_1 a_2 ··· a_t) β^m,

where 1 ≤ a_1 ≤ β − 1, 0 ≤ a_i ≤ β − 1 for 2 ≤ i ≤ t, L ≤ m ≤ U, and t is the number of digits. (In most floating point systems, L is about −64 to −1000, and U is about 64 to 1000.)

Example 1.10

(binary) β = 2:

    x* = (0.1011)_2 · 2³ = (1 × 1/2 + 0 × 1/4 + 1 × 1/8 + 1 × 1/16) × 8 = 11/2 = 5.5 (decimal).

REMARK 1.1   Most numbers cannot be exactly represented on a computer. Consider x = 10.1 = (1010.0001 1001 1001 ···)_2 (β = 2). If L = −127, U = 127, t = 24, and β = 2, then x ≈ x* = (0.1010 0001 1001 1001 1001 1001)_2 · 2⁴.
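The same phenomenon occurs in IEEE double precision. Python's fractions module (used here purely for illustration) can recover the exact rational value actually stored for the literal 10.1, which differs slightly from 101/10:

```python
from fractions import Fraction

# The literal 10.1 is stored as the nearest binary double, which is not
# exactly 101/10; Fraction(10.1) recovers the exact stored rational.
stored = Fraction(10.1)
exact = Fraction(101, 10)
representation_error = abs(stored - exact)
```

The representation error is nonzero but below one part in 10¹⁴ of an absolute unit, consistent with a 53-bit binary mantissa.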

Question: Given a real number x, how do we define a floating point number fl(x) in F such that fl(x) is close to x?

On modern machines, one of the following four ways is used to approximate a real number x by a machine-representable number fl(x):

round down: fl(x) = x↓, the nearest machine-representable number to the real number x that is less than or equal to x;


round up: fl(x) = x↑, the nearest machine number to the real number x that is greater than or equal to x;

round to nearest: fl(x) is the nearest machine number to the real number x;

round to zero, or "chopping": fl(x) is the nearest machine number to the real number x that is closer to 0 than x. The term "chopping" is because we simply "chop" the expansion of the real number; that is, we simply ignore the digits in the expansion of x beyond the t-th one.

The default on modern systems is usually round to nearest, although chopping is faster or requires less circuitry. Round down and round up may be used, but with care, to produce results from a string of computations that are guaranteed to be less than or greater than the exact result.

Example 1.11

β = 10, t = 5, x = 0.12345666··· × 10⁷. Then

    fl(x) = 0.12345 × 10⁷ (chopping),
    fl(x) = 0.12346 × 10⁷ (round to nearest).

(In this case, round down corresponds to chopping, and round up corresponds to round to nearest.)
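A toy decimal fl can be emulated with Python's decimal module (an illustrative sketch; the function name fl and its defaults are our own, not a standard API, and a real system would also restrict the exponent range L ≤ m ≤ U). ROUND_HALF_UP plays the role of round to nearest (ties are rounded away from zero here), and ROUND_DOWN, which rounds toward zero, plays the role of chopping:

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

def fl(x, t=5, mode=ROUND_HALF_UP):
    # Keep t significant decimal digits of x (toy beta = 10 system).
    # mode=ROUND_HALF_UP emulates round to nearest (ties away from 0);
    # mode=ROUND_DOWN (toward zero) emulates chopping.
    d = Decimal(repr(x))
    # quantum = 10**(position of leading digit - t + 1)
    quantum = Decimal(1).scaleb(d.adjusted() - t + 1)
    return d.quantize(quantum, rounding=mode)

x = 1234566.6                      # 0.12345666... x 10**7
nearest = fl(x)                    # 0.12346 x 10**7
chopped = fl(x, mode=ROUND_DOWN)   # 0.12345 x 10**7
```

This reproduces both results of Example 1.11.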

See Figure 1.2 for an example with β = 10 and t = 1. In that figure, the exhibited floating point numbers are (0.1) × 10¹, (0.2) × 10¹, ..., (0.9) × 10¹, 0.1 × 10².

FIGURE 1.2: An example floating point system: β = 10, t = 1, and m = 1. (The figure marks the successive floating point numbers from β^(m−1) = 1 up to β^m = 10, spaced β^(m−t) = 10⁰ = 1 apart.)

Example 1.12

Let a = 0.410, b = 0.000135, and c = 0.000431. Assuming 3-digit decimal computer arithmetic with rounding to nearest, does a + (b + c) = (a + b) + c when using this arithmetic?


Following the "rounding to nearest" definition of fl, we emulate the operations a machine would do, as follows:

    a ← 0.410 × 10⁰,  b ← 0.135 × 10⁻³,  c ← 0.431 × 10⁻³,

and

    fl(b + c) = fl(0.135 × 10⁻³ + 0.431 × 10⁻³)
              = fl(0.566 × 10⁻³)
              = 0.566 × 10⁻³,

so

    fl(a + 0.566 × 10⁻³) = fl(0.410 × 10⁰ + 0.566 × 10⁻³)
                         = fl(0.410 × 10⁰ + 0.000566 × 10⁰)
                         = fl(0.410566 × 10⁰)
                         = 0.411 × 10⁰.

On the other hand,

    fl(a + b) = fl(0.410 × 10⁰ + 0.135 × 10⁻³)
              = fl(0.410000 × 10⁰ + 0.000135 × 10⁰)
              = fl(0.410135 × 10⁰)
              = 0.410 × 10⁰,

so

    fl(0.410 × 10⁰ + c) = fl(0.410 × 10⁰ + 0.431 × 10⁻³)
                        = fl(0.410 × 10⁰ + 0.000431 × 10⁰)
                        = fl(0.410431 × 10⁰)
                        = 0.410 × 10⁰ ≠ 0.411 × 10⁰.

Thus, the associative law does not hold for floating point arithmetic with "round to nearest". Furthermore, this illustrates that accuracy is improved if numbers of like magnitude are added first in a sum.
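The same failure of associativity occurs in binary double precision, as a quick Python check shows (here with a = 0.1, b = 0.2, c = 0.3, none of which is exactly representable in base 2):

```python
# None of 0.1, 0.2, 0.3 is exactly representable in base 2, and with
# round-to-nearest the grouping of the additions changes the result.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c    # 0.6000000000000001
right = a + (b + c)   # 0.6
```

The two results differ by one unit in the last place, exactly the kind of discrepancy exhibited in Example 1.12.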

The following error bound is useful in some analyses.

THEOREM 1.5

    |x − fl(x)| ≤ (p/2) |x| β^(1−t),

where p = 1 for rounding and p = 2 for chopping.


DEFINITION 1.3   δ = (p/2) β^(1−t) is called the unit roundoff error.

Let ε = (fl(x) − x)/x. Then fl(x) = (1 + ε)x, where |ε| ≤ δ. With this, we have the following.

THEOREM 1.6

Let ⊙ denote the operation +, −, ×, or ÷, and let x and y be machine numbers. Then

    fl(x ⊙ y) = (x ⊙ y)(1 + ε), where |ε| ≤ δ = (p/2) β^(1−t).

Roundoff error that accumulates as a result of a sequence of arithmetic operations can be analyzed using this theorem. Such an analysis is called forward error analysis.

Because of the properties of floating point arithmetic, it is unreasonable to demand strict tolerances on the absolute error when the exact result is large.

Example 1.13

Suppose β = 10 and t = 3 (3-digit decimal arithmetic), and suppose we wish to compute 10⁴π with a computed value x such that |10⁴π − x| < 10⁻². The closest floating point number in our system to 10⁴π is x = 0.314 × 10⁵ = 31400. However, |10⁴π − x| = 15.926···. Hence, it is impossible to find a number x in the system with

    |10⁴π − x| < 10⁻².

The error |10⁴π − x| in this example is called the absolute error in approximating 10⁴π. We see that absolute error is not an appropriate measure of error when using floating point arithmetic. For this reason, we use relative error.

DEFINITION 1.4   Let x* be an approximation to x. Then |x − x*| is called the absolute error, and |(x − x*)/x| is called the relative error.

For example,

    |(x − fl(x))/x| ≤ δ = (p/2) β^(1−t)   (unit roundoff error).

1.2.1.1 Expression Evaluation and Condition Numbers

We now examine some common situations in which roundoff error can become large, and explain how to avoid many of these situations.


Example 1.14

Let β = 10, t = 4, p = 1. (Thus, δ = (1/2) 10⁻³ = 0.0005.) Let x = 0.5795 × 10⁵ and y = 0.6399 × 10⁵. Then

    fl(x + y) = 0.1219 × 10⁶ = (x + y)(1 + ε_1),  ε_1 ≈ −3.28 × 10⁻⁴,  |ε_1| < δ, and
    fl(xy) = 0.3708 × 10¹⁰ = (xy)(1 + ε_2),  ε_2 ≈ −5.95 × 10⁻⁵,  |ε_2| < δ.

(Note: x + y = 0.12194 × 10⁶ and xy = 0.37082205 × 10¹⁰.)

Example 1.15

Suppose β = 10 and t = 4 (4-digit arithmetic), and suppose x_1 = 10000 and x_2 = x_3 = ··· = x_1001 = 1. Then

    fl(x_1 + x_2) = 10000,
    fl(x_1 + x_2 + x_3) = 10000,
    ...
    fl(Σ_{i=1}^{1001} x_i) = 10000,

when we sum forward from x_1. But, going backwards,

    fl(x_1001 + x_1000) = 2,
    fl(x_1001 + x_1000 + x_999) = 3,
    ...
    fl(Σ_{i=1001}^{1} x_i) = 11000,

which is the correct sum.

This example illustrates the point that large relative errors occur when many small numbers are added to a large number, or when a very large number of small, almost-equal numbers is added. To avoid such large relative errors, one can sum from the smallest number to the largest number. However, this will not work if the numbers are all approximately equal. In such cases, one possibility is to group the numbers into sets of two adjacent numbers, summing two almost-equal numbers together. One then groups those results into sets of two and sums these together, continuing until the total sum is reached. In this scheme, two almost-equal numbers are always being summed, and the large relative error from repeatedly summing a small number to a large number is avoided.
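The grouping scheme just described is often called pairwise summation. A minimal Python sketch (illustrative only, not tuned for performance) makes the idea concrete:

```python
def pairwise_sum(values):
    # Repeatedly add adjacent pairs, so that numbers of like magnitude
    # are combined at every stage of the summation.
    xs = list(values)
    if not xs:
        return 0.0
    while len(xs) > 1:
        paired = [xs[i] + xs[i + 1] for i in range(0, len(xs) - 1, 2)]
        if len(xs) % 2 == 1:
            paired.append(xs[-1])   # carry the unpaired last element
        xs = paired
    return xs[0]
```

For example, pairwise_sum([0.1] * 1000) is typically closer to 100 than the straightforward left-to-right sum, because each addition combines quantities of comparable size.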


Example 1.16

Let x_1 = 15.314768 and x_2 = 15.314899, with β = 10 and t = 6 (6-digit decimal accuracy). Then x_2 − x_1 ≈ fl(x_2) − fl(x_1) = 15.3149 − 15.3148 = 0.0001. Thus,

    |(x_2 − x_1 − (fl(x_2) − fl(x_1)))/(x_2 − x_1)| = (0.000131 − 0.0001)/0.000131 ≈ 0.237
    = 23.7% relative accuracy.

This example illustrates that large relative errors can occur when two nearly equal numbers are subtracted on a computer. Sometimes, an algorithm can be modified to reduce the rounding error occurring from this source, as the following example illustrates.

Example 1.17

Consider finding the roots of ax² + bx + c = 0, where b² is large compared with |4ac|. The most common formula for the roots is

    x_{1,2} = (−b ± √(b² − 4ac)) / (2a).

Consider x² + 100x + 1 = 0, with β = 10, t = 4, p = 2, and 4-digit chopped arithmetic. Then

    √(b² − 4ac) = √9996 ≈ 99.97   (4-digit arithmetic, chopped),

so

    x_1 ≈ (−100 + 99.97)/2,   x_2 ≈ (−100 − 99.97)/2.

Hence, x_1 ≈ −0.015 and x_2 ≈ −99.98, but x_1 = −0.010001 and x_2 = −99.989999, so the relative errors in x_1 and x_2 are 50% and 0.01%, respectively.

Let's change the algorithm. Assume b ≥ 0 (we can always arrange b ≥ 0). Then

    x_1 = (−b + √(b² − 4ac)) / (2a)
        = ((−b + √(b² − 4ac))(−b − √(b² − 4ac))) / (2a(−b − √(b² − 4ac)))
        = −2c / (b + √(b² − 4ac)),

and

    x_2 = (−b − √(b² − 4ac)) / (2a)   (the same as before).


Then, for the above values,

    x_1 = −2(1) / (100 + √9996) ≈ −2 / (100 + 99.97) = −0.0100.

Now the relative error in x_1 is also 0.01%.
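The rearranged formula is easy to package. The following Python sketch (a hypothetical helper of our own; it assumes b ≥ 0 and a real discriminant) computes the small-magnitude root from the rationalized form and the large-magnitude root from the original one, so no subtraction of nearly equal numbers occurs:

```python
from math import sqrt

def quadratic_roots_stable(a, b, c):
    # Assumes b >= 0 and b*b - 4*a*c >= 0 (illustrative sketch).
    d = sqrt(b * b - 4.0 * a * c)
    x2 = (-b - d) / (2.0 * a)   # no cancellation: both terms negative
    x1 = -2.0 * c / (b + d)     # rationalized form for the other root
    return x1, x2

x1, x2 = quadratic_roots_stable(1.0, 100.0, 1.0)
```

As a sanity check, the computed roots satisfy the standard identities x_1 x_2 = c/a and x_1 + x_2 = −b/a to near machine accuracy.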

Let us now consider error in function evaluation. Consider a single-valued function f(x), and let x* = fl(x) be the floating point approximation of x. Therefore, the machine evaluates f(x*) = f(fl(x)), which is an approximate value of f(x) at x = x*. The perturbation in f(x) for small perturbations in x can then be computed via Taylor's formula. This is illustrated in the next theorem.

THEOREM 1.7   The relative error in functional evaluation satisfies

    |(f(x) − f(x*))/f(x)| ≈ |x f′(x)/f(x)| · |(x − x*)/x|.

PROOF   The linear Taylor approximation of f(x*) about f(x), for small values of |x − x*|, is given by f(x*) ≈ f(x) + f′(x)(x* − x). Rearranging the terms immediately yields the result.

This leads us to the following definition.

DEFINITION 1.5   The condition number of a function f(x) is

    κ_f(x) = |x f′(x)/f(x)|.

The condition number describes how large the relative error in function evaluation is with respect to the relative error in the machine representation of x. In other words, κ_f(x) is a measure of the degree of sensitivity of the function at x.

Example 1.18

Let f(x) = √x. The condition number of f(x) about x is

    κ_f(x) = |x · (1/(2√x)) / √x| = 1/2.

This suggests that f(x) is well-conditioned.


Example 1.19

Let f(x) = √(x − 2). The condition number of f(x) about x is

    κ_f(x) = |x / (2(x − 2))|.

This is not defined at x* = 2. Hence, the function f(x) is numerically unstable and ill-conditioned for values of x close to 2.

REMARK 1.2   If x = f(x) = 0, then the condition number is simply |f′(x)|. If x = 0, f(x) ≠ 0 (or f(x) = 0, x ≠ 0), then it is more useful to consider the relation between absolute errors than relative errors. The condition number then becomes |f′(x)/f(x)|.

REMARK 1.3   Generally, if a numerical approximation z* to a quantity z is computed, the relative error is related to the number of digits that are correct. For example, if z = 0.0000123453 and z* = 0.0000123454, we say that z* is correct to 5 significant digits. Expressing z as 0.123453 × 10⁻⁴ and z* as 0.123454 × 10⁻⁴, we see that, if we round z* to the nearest number with five digits in its mantissa, all of those digits are correct, whereas, if we do the same with six digits, the sixth digit is not correct. Significant digits is the more logical way to talk about accuracy in a floating point computation, where we are interested in relative error, rather than "number of digits after the decimal point", which can have a different meaning. (Here, one might say that z* is correct to 9 digits after the decimal point.)

1.2.2 Practicalities and the IEEE Floating Point Standard

Prior to 1985, different machines used different word lengths and different bases, and different machines rounded, chopped, or did something else to form the internal representation fl(x) for real numbers x. For example, IBM mainframes generally used hexadecimal arithmetic (β = 16), with 8 hexadecimal digits total (for the sign, exponent, and mantissa) in "single precision" numbers and 16 hexadecimal digits total in "double precision" numbers. Machines such as the Univac 1108 and Honeywell Multics systems used base β = 2, with 36 binary digits (or "bits") total in single precision numbers and 72 binary digits total in double precision numbers. An unusual machine designed at Moscow State University from 1955 to 1965, the "Setun," even used base-3 (β = 3, or "ternary") numbers. Some computers had 32 bits total in single precision numbers and 64 bits total in double precision numbers, while some "supercomputers" (such as the Cray-1) had 64 bits total in single precision numbers and 128 bits total in double precision numbers.

Some hand-held calculators in existence today (such as some Texas Instruments calculators) can be viewed as implementing decimal (base 10, β = 10) arithmetic, say, with L = −999, U = 999, and t = 14 digits in the mantissa.

Except for the Setun (the value of whose ternary digits corresponded to "positive," "negative," and "neutral" in circuit elements or switches), digital computers are mostly based on binary switches or circuit elements (that is, "on" or "off"), so the base β is usually 2 or a power of 2. For example, the IBM hexadecimal digit could be viewed as a group of 4 binary digits.²

Older floating point implementations did not even always fit exactly into the model we have previously described. For example, if x was a number in the system, then −x may not have been a number in the system, or, if x was a number in the system, then 1/x may have been too large to be representable in the system.

To promote predictability, portability, reliability, and rigorous error bounding in floating point computations, the Institute of Electrical and Electronics Engineers (IEEE) and American National Standards Institute (ANSI) published a standard for binary floating point arithmetic in 1985: IEEE/ANSI 754-1985, Standard for Binary Floating Point Arithmetic, often referenced as "IEEE-754," or simply "the IEEE standard."³ Almost all computers in existence today, including personal computers and workstations based on Intel, AMD, Motorola, etc., chips, implement most of the IEEE standard.

In this standard, β = 2; 32 bits total are used in a single precision number (an "IEEE single"), and 64 bits total are used for a double precision number ("IEEE double"). In a single precision number, 1 bit is used for the sign, 8 bits are used for the exponent, and 23 bits are used for the mantissa (with an implicit leading bit, giving t = 24 significant bits; see Table 1.1). In double precision numbers, 1 bit is used for the sign, 11 bits are used for the exponent, and 52 bits are used for the mantissa (t = 53 with the implicit bit). Thus, for single precision numbers, the exponent field holds an integer between 0 and (11111111)₂ = 255, and 127 is subtracted from this to get an exponent between −127 and 128. In IEEE numbers, the minimum and maximum exponents are used to denote special symbols (such as infinity and "denormalized" numbers), so the exponent in single precision represents magnitudes between 2^-126 ≈ 10^-38 and 2^127 ≈ 10^38. The mantissa for single precision numbers represents numbers between 2^0 = 1 and sum_{i=0}^{23} 2^-i = 2(1 − 2^-24) ≈ 2. Similarly, the exponent for double precision numbers effectively represents magnitudes between 2^-1022 ≈ 10^-308 and 2^1023 ≈ 10^308, while the mantissa for double precision numbers represents numbers between 2^0 = 1 and sum_{i=0}^{52} 2^-i ≈ 2.

Summarizing, the parameters for IEEE arithmetic appear in Table 1.1.

In many numerical computations, such as solving the large linear systems arising from partial differential equation models, more digits or a larger exponent range is required than is available with IEEE single precision. For

²An exception is in some systems for business calculations, where base 10 is implemented.
³An update to the 1985 standard was made in 2008. This update gives clarifications of certain ambiguous points, provides certain extensions, and specifies a standard for decimal arithmetic.


TABLE 1.1: Parameters for IEEE arithmetic

    precision   β     L      U     t
    single      2   -126    127   24
    double      2  -1022   1023   53

this reason, many numerical analysts at present have adopted IEEE double precision as the default precision. For example, underlying computations in the popular computational environment matlab are done in IEEE double precision.

IEEE arithmetic provides four ways of defining fl(x), that is, four "rounding modes," namely "round down," "round up," "round to nearest," and "round to zero." The four elementary operations +, −, ×, and ÷ must be such that fl(x ⊙ y) is implemented for all four rounding modes, for ⊙ ∈ {+, −, ×, ÷}; the same is required of √·.

The default mode (if the rounding mode is not explicitly set) is normally "round to nearest," to give an approximation after a long string of computations that is hopefully near the exact value. If the mode is set to "round down" and a string of computations is done, then the result is less than or equal to the exact result. Similarly, if the mode is set to "round up," then the result of a string of computations is greater than or equal to the exact result. In this way, mathematically rigorous bounds on an exact result can be obtained. (This technique must be used astutely, since naive use could result in bounds that are too large to be meaningful.)

Several parameters more directly related to numerical computations than L, U, and t are associated with any floating point number system. These are:

HUGE, the largest representable number in the floating point system;

TINY, the smallest positive representable number in the floating point system;

ε_m, the machine epsilon, the smallest positive number which, when added to 1, gives something other than 1 when using the rounding mode round-to-nearest.

These so-called "machine constants" appear in Table 1.2 for the IEEE single and IEEE double precision number systems.
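As a concrete illustration (a Python sketch of our own; Python floats are IEEE doubles), the machine epsilon can be located by repeated halving. Note that the loop below actually finds 2^-52, the distance from 1 to the next larger double (which is what matlab's eps returns); the slightly smaller value 2^-53 + 2^-105 in Table 1.2 is the smallest x for which fl(1 + x) ≠ 1 under round-to-nearest.

```python
# Locate the machine epsilon of IEEE double precision by halving until
# adding eps/2 to 1 is no longer distinguishable from 1.
eps = 1.0
while 1.0 + eps / 2.0 != 1.0:
    eps /= 2.0

print(eps)                                   # 2.220446049250313e-16, i.e. 2**-52
print(1.0 + 2.0**-53 == 1.0)                 # True: exactly half an ulp rounds back to 1
print(1.0 + (2.0**-53 + 2.0**-105) == 1.0)   # False: slightly more than half an ulp does not
```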

For IEEE arithmetic, 1/TINY < HUGE, but 1/HUGE < TINY. This brings up the question of what happens when the result of a computation has absolute value less than the smallest number representable in the system, or has absolute value greater than the largest number representable in the system. In the first case, an underflow occurs, while in the second case, an overflow occurs. In floating point computations, it is usually (but not always) reasonable to replace the result of an underflow by 0, but it is usually more problematical


TABLE 1.2: Machine constants for IEEE arithmetic

    Precision   HUGE                      TINY                       ε_m
    single      2^127 ≈ 3.40 · 10^38      2^-126 ≈ 1.18 · 10^-38     2^-24 + 2^-45 ≈ 5.96 · 10^-8
    double      2^1023 ≈ 1.79 · 10^308    2^-1022 ≈ 2.23 · 10^-308   2^-53 + 2^-105 ≈ 1.11 · 10^-16

when an overflow occurs. Many systems prior to the IEEE standard replaced an underflow by 0, but stopped when an overflow occurred.

The IEEE standard specifies representations for special numbers ∞, −∞, +0, −0, and NaN, where the latter represents "not a number." The standard specifies that computations do not stop when an overflow or underflow occurs, or when quantities such as 1/0, √−1, etc., are encountered (although many programming languages, by default or optionally, do stop). For example, the result of an overflow is set to ∞, whereas the result of √−1 is set to NaN, and computation continues. The standard also specifies "gradual underflow," that is, setting the result to a "denormalized" number, or a number in the floating point format whose first digit in the mantissa is equal to 0. Computation rules for these special numbers, such as NaN × any number = NaN and ∞ × any positive normalized number = ∞, allow such "nonstop" arithmetic.

Although the IEEE nonstop arithmetic is useful in many contexts, the numerical analyst should be aware of it and be cautious in interpreting results. In particular, algorithms may not behave as expected if many intermediate results contain ∞ or NaN, and the accuracy is less than expected when denormalized numbers are used. In fact, many programming languages, by default or with a controllable option, stop if ∞ or NaN occurs, but implement IEEE nonstop arithmetic with an option.
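Since Python floats are IEEE doubles, the nonstop rules can be observed directly (a sketch of our own; note that Python raises an exception for some overflowing operations, such as float powers, rather than returning ∞):

```python
import math

inf = float("inf")

print(1e308 * 10)                    # inf: multiplication overflow does not stop execution
print(inf - inf)                     # nan: an indeterminate form yields NaN
print(float("nan") == float("nan"))  # False: NaN compares unequal even to itself
print(2.0**-1074)                    # 5e-324: the smallest positive denormalized double
print(2.0**-1074 / 2.0)              # 0.0: gradual underflow eventually reaches zero
```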

Example 1.20
IEEE double precision floating point arithmetic underlies most computations in matlab. (This is true even if only four decimal digits are displayed.) One obtains the machine epsilon with the function eps, one obtains TINY with the function realmin, and one obtains HUGE with the function realmax. Observe the following matlab dialog:

>> epsm = eps(1d0)
epsm = 2.2204e-016
>> TINY = realmin
TINY = 2.2251e-308
>> HUGE = realmax
HUGE = 1.7977e+308
>> 1/TINY
ans = 4.4942e+307
>> 1/HUGE


ans = 5.5627e-309
>> HUGE^2
ans = Inf
>> TINY^2
ans = 0
>> new_val = 1+epsm
new_val = 1.0000
>> new_val - 1
ans = 2.2204e-016
>> too_small = epsm/2
too_small = 1.1102e-016
>> not_new = 1+too_small
not_new = 1
>> not_new - 1
ans = 0
>>
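Readers without matlab can reproduce essentially the same dialog in Python, whose floats are IEEE doubles (a sketch; `sys.float_info` plays the role of eps, realmin, and realmax):

```python
import sys

epsm = sys.float_info.epsilon   # analogue of matlab's eps(1d0)
TINY = sys.float_info.min       # analogue of realmin
HUGE = sys.float_info.max       # analogue of realmax

print(epsm)                      # 2.220446049250313e-16
print(1.0 / TINY)                # about 4.4942e+307: finite, since 1/TINY < HUGE
print(HUGE * HUGE)               # inf (overflow)
print(TINY * TINY)               # 0.0 (underflow)
print((1.0 + epsm) - 1.0)        # 2.220446049250313e-16
print((1.0 + epsm / 2.0) - 1.0)  # 0.0
```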

Example 1.21
(Illustration of underflow and overflow) Suppose, for the purposes of illustration, we have a system with β = 10, t = 2, and one digit in the exponent, so that the positive numbers in the system range from 0.10 × 10^-9 to 0.99 × 10^9, and suppose we wish to compute N = √(x_1^2 + x_2^2), where x_1 = x_2 = 10^6. Then both x_1 and x_2 are exactly represented in the system, and the nearest floating point number in the system to N is 0.14 × 10^7, well within range. However, x_1^2 = 10^12, larger than the maximum floating point number in the system.

In older systems, an overflow usually would result in stopping the computation, while in IEEE arithmetic, the result would be assigned the symbol "Infinity." The result of adding "Infinity" to "Infinity," then taking the square root, would be "Infinity," so N would be assigned "Infinity." Similarly, if x_1 = x_2 = 10^-6, then x_1^2 = 10^-12, smaller than the smallest representable machine number, causing an "underflow." On older systems, the result would usually be set to 0. On IEEE systems, if "gradual underflow" is switched on, the result either becomes a denormalized number with less than full accuracy, or is set to 0; without gradual underflow, on IEEE systems, the result is set to 0. When the result is set to 0, a value of 0 is stored in N, whereas the closest floating point number in the system to the true result is 0.14 × 10^-5, well within range. To avoid this type of catastrophic underflow and overflow in the computation of N, we may use the following scheme:

1. s ← max{|x_1|, |x_2|};

2. η_1 ← x_1/s; η_2 ← x_2/s;

3. N ← s √(η_1^2 + η_2^2).
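The same rescaling idea carries over directly to IEEE double precision. The following Python sketch (the function name `scaled_norm` is ours) shows that with x_1 = x_2 = 10^200 the naive formula overflows to ∞, and with x_1 = x_2 = 10^-200 it underflows to 0, while the scaled version returns the correct result in both cases:

```python
import math

def scaled_norm(x1, x2):
    # N = sqrt(x1^2 + x2^2), computed so the intermediate squares can
    # neither overflow nor underflow: scale by s = max{|x1|, |x2|}.
    s = max(abs(x1), abs(x2))
    if s == 0.0:
        return 0.0
    e1, e2 = x1 / s, x2 / s
    return s * math.sqrt(e1 * e1 + e2 * e2)

print(math.sqrt(1e200 * 1e200 + 1e200 * 1e200))      # inf: naive formula overflows
print(scaled_norm(1e200, 1e200))                     # about 1.4142e+200
print(math.sqrt(1e-200 * 1e-200 + 1e-200 * 1e-200))  # 0.0: naive formula underflows
print(scaled_norm(1e-200, 1e-200))                   # about 1.4142e-200
```

Library routines such as `math.hypot` perform this kind of scaling internally.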


1.2.2.1 Input and Output

For examining the output of large numerical computations arising from mathematical models, plots, graphs, and movies comprised of such plots and graphs are often preferred over tables of values. However, to develop such models and study numerical algorithms, it is necessary to examine individual numbers. Because humans are trained to comprehend decimal numbers more easily than binary numbers, the binary format used in the machine is usually converted to a decimal format for display or printing. In many programming languages and environments (such as all versions of Fortran, C, C++, and matlab), the format is of a form similar to ±d_1.d_2d_3…d_m e ±δ_1δ_2δ_3 or ±d_1.d_2d_3…d_m E ±δ_1δ_2δ_3, where the "e" or "E" denotes the "exponent" of 10. For example, -1.00e+003 denotes −1 × 10^3 = −1000. Numbers are usually also input either in a standard decimal form (such as 0.001) or in this exponential format (such as 1.0e-3). (This notation originates from the earliest computers, where the only output was a printer, and the printer could only print numerical digits and the 26 upper case letters in the Roman alphabet.)

Thus, for input, a decimal fraction needs to be converted to a binary floating point number, while, for output, a binary floating point number needs to be converted to a decimal fraction. This conversion necessarily is inexact. For example, the exact decimal fraction 0.1 converts to the infinitely repeating binary expansion (0.000110011001100…)₂, which needs to be rounded into the binary floating point system. The IEEE 754 standard specifies that the result of a decimal to binary conversion, within a specified range of input formats, be the nearest floating point number to the exact result, and that, within a specified range of formats, a binary to decimal conversion be the nearest number in the specified format (which depends on the number m of decimal digits requested to be printed).

Thus, the number that one sees as output is usually not exactly the number that is represented in the computer. Furthermore, while the floating point operations on binary numbers are usually implemented in hardware or "firmware" independently of the software system, the decimal to binary and binary to decimal conversions are usually implemented separately, as part of the programming language (such as Fortran, C, C++, Java, etc.) or software system (such as matlab). The individual standards for these languages, if there are any, may not specify accuracy for such conversions, and the languages sometimes do not conform to the IEEE standard. That is, the number that one sees printed may not even be the closest number in that format to the actual number.

This inexactness in conversion usually does not cause a problem, but may cause much confusion in certain instances. In those instances (such as in "debugging," or finding programming blunders), one may need to examine the binary numbers directly. One way of doing this is in an "octal," or base-8, format, in which each digit (between 0 and 7) is interpreted as a group of three binary digits, or in hexadecimal format (where the digits are 0-9, A, B, C, D, E, F), in which each digit corresponds to a group of four binary digits.
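In Python, one can observe both directions of this conversion (a sketch of our own): `float.hex` shows the stored binary number in hexadecimal, and `decimal.Decimal` shows the exact decimal value of that binary number.

```python
from decimal import Decimal

# The decimal fraction 0.1 cannot be represented exactly in binary:
print((0.1).hex())       # 0x1.999999999999ap-4, the rounded binary value
print(Decimal(0.1))      # the exact decimal expansion of that binary number:
                         # 0.1000000000000000055511151231257827021181583404541015625
print(0.1 + 0.2 == 0.3)  # False: each constant is separately rounded to binary
```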

1.2.2.2 Standard Functions

To enable accurate computation of elementary functions such as sin, cos, and exp, IEEE 754 specifies that a "long" 80-bit register (with "guard digits") be available for intermediate computations. Furthermore, IEEE 754-2008, an official update to IEEE 754-1985, provides a list of functions it recommends be implemented, and specifies accuracy requirements (in terms of correct rounding) for those functions a programming language elects to implement.

REMARK 1.4 Alternative number systems, such as variable precision arithmetic, multiple precision arithmetic, rational arithmetic, and combinations of approximate and symbolic arithmetic, have been investigated and implemented. These have various advantages over the traditional floating point arithmetic we have been discussing, but also have disadvantages, and usually require more time, more circuitry, or both. Eventually, with the advance of computer hardware and better understanding of these alternative systems, their use may become more ubiquitous. However, for the foreseeable future, traditional floating point number systems will be the primary tool in numerical computations.

1.3 Interval Computations

Interval computations are useful for two main purposes:

• to use floating point computations to compute mathematically rigorous bounds on an exact result (and hence to rigorously bound roundoff error), and

• to use floating point computations to compute mathematically rigorous bounds on the ranges of functions over boxes.

In complicated traditional floating point algorithms, naive arrangement of interval computations usually gives bounds that are too wide to be of practical use. For this reason, interval computations have been ignored by many. However, used cleverly and where appropriate, interval computations are powerful, and provide rigor and validation when other techniques cannot.

Interval computations are based on interval arithmetic.


1.3.1 Interval Arithmetic

In interval arithmetic, we define operations on intervals, which can be considered as ordered pairs of real numbers. We can think of each interval as representing the range of possible values of a quantity. The result of an operation is then an interval that represents the range of all possible results, as the first argument ranges over all points in the first interval and the second argument ranges over all points in the second interval. To state this symbolically, let x = [x̲, x̄] and y = [y̲, ȳ], and define the four elementary operations by

    x ⊙ y = { x ⊙ y | x ∈ x and y ∈ y }   for ⊙ ∈ {+, −, ×, ÷}.    (1.1)

Interval arithmetic's usefulness derives from the fact that the mathematical characterization in Equation (1.1) is equivalent to the following operational definitions:

    x + y = [x̲ + y̲, x̄ + ȳ],
    x − y = [x̲ − ȳ, x̄ − y̲],
    x × y = [min{x̲y̲, x̲ȳ, x̄y̲, x̄ȳ}, max{x̲y̲, x̲ȳ, x̄y̲, x̄ȳ}],
    1/x = [1/x̄, 1/x̲] if x̲ > 0 or x̄ < 0,
    x ÷ y = x × (1/y).    (1.2)

The ranges of the four elementary interval arithmetic operations are exactly the ranges of the corresponding real operations, but, if such operations are composed, bounds on the ranges of real functions can be obtained. For example, if

    f(x) = (x + 1)(x − 1),    (1.3)

then

    f([−2, 2]) = ([−2, 2] + 1)([−2, 2] − 1) = [−1, 3] × [−3, 1] = [−9, 3],

which contains the exact range [−1, 3].
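The operational definitions (1.2) are easy to prototype. The following minimal Python class (our own sketch, without the outward rounding needed for rigor; the intlab toolbox mentioned later provides that) reproduces the computation above:

```python
class Interval:
    """Minimal interval arithmetic following (1.2); no outward rounding."""

    def __init__(self, lo, hi=None):
        self.lo, self.hi = lo, hi if hi is not None else lo

    def __add__(self, other):
        other = other if isinstance(other, Interval) else Interval(other)
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        other = other if isinstance(other, Interval) else Interval(other)
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        other = other if isinstance(other, Interval) else Interval(other)
        p = [self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi]
        return Interval(min(p), max(p))

    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"

x = Interval(-2, 2)
print((x + 1) * (x - 1))  # [-9, 3]: encloses the exact range [-1, 3]
print(x - x)              # [-4, 4], not [0, 0]: interval dependency at work
```

The last line already hints at the dependency phenomenon discussed later in this section.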

REMARK 1.5 In some definitions of interval arithmetic, division by intervals containing 0 is defined, consistent with (1.1). For example,

    [1, 2] / [−3, 4] = (−∞, −1/3] ∪ [1/4, ∞) = ℝ* ∖ (−1/3, 1/4),

where ℝ* is the extended real number system,⁴ consisting of the real numbers with the two additional numbers −∞ and ∞. This extended interval arithmetic⁵ was originally invented by William Kahan⁶ for computations with continued fractions, but has wider use than that. Although a closed system can be defined for the sets arising from this extended arithmetic, typically the complements of intervals (i.e., the unions of two semi-infinite intervals) are immediately intersected with intervals to obtain zero, one, or two intervals. Interval arithmetic can then proceed using (1.2).

⁴also known as the two-point compactification of the real numbers

The power of interval arithmetic lies in its implementation on computers. In particular, outwardly rounded interval arithmetic allows rigorous enclosures for the ranges of operations and functions. This makes a qualitative difference in scientific computations, since the results are now intervals in which the exact result must lie. It also enables the use of floating point computations for automated theorem proving.

Outward rounding can be implemented on any machine that has downward rounding and upward rounding, such as any machine that complies with the IEEE 754 standard. For example, take x + y = [x̲ + y̲, x̄ + ȳ]. If x̲ + y̲ is computed with downward rounding, and x̄ + ȳ is computed with upward rounding, then the resulting interval z = [z̲, z̄] that is represented in the machine must contain the exact range of x + y for x ∈ x and y ∈ y. We call the expansion of the interval from rounding the lower end point down and the upper end point up roundout error.

Interval arithmetic is only subdistributive. That is, if x, y, and z are intervals, then

    x(y + z) ⊆ xy + xz, but x(y + z) ≠ xy + xz in general.    (1.4)

As a result, algebraic expressions that would be equivalent if real values were substituted for the variables are not equivalent if interval values are used. For example, suppose that, instead of writing (x − 1)(x + 1) for f(x) in (1.3), we write

    f(x) = x^2 − 1,    (1.5)

and suppose we provide a routine that computes an enclosure for the range of x^2 that is the exact range to within roundoff error. Such a routine could be as follows.

ALGORITHM 1.1

(Computing an interval whose end points are machine numbers and which encloses the range of x^2)

⁵There are small differences in current definitions of extended interval arithmetic. For example, in some systems, −∞ and ∞ are not considered numbers, but just descriptive symbols. In those systems, [1, 2]/[−3, 4] = (−∞, −1/3] ∪ [1/4, ∞) = ℝ ∖ (−1/3, 1/4). See [31] for a theoretical analysis of extended arithmetic.
⁶who also was a major contributor to the IEEE 754 standard


INPUT: x = [x̲, x̄].

OUTPUT: a machine-representable interval that contains the range of x^2 over x.

IF x̲ ≥ 0 THEN
    RETURN [x̲^2, x̄^2], where x̲^2 is computed with downward rounding and x̄^2 is computed with upward rounding.
ELSE IF x̄ ≤ 0 THEN
    RETURN [x̄^2, x̲^2], where x̄^2 is computed with downward rounding and x̲^2 is computed with upward rounding.
ELSE
    1. Compute x̲^2 and x̄^2 with both downward and upward rounding; that is, compute machine representable numbers (x̲^2)_l and (x̲^2)_u such that x̲^2 ∈ [(x̲^2)_l, (x̲^2)_u], and machine representable numbers (x̄^2)_l and (x̄^2)_u such that x̄^2 ∈ [(x̄^2)_l, (x̄^2)_u].
    2. RETURN [0, max{(x̲^2)_u, (x̄^2)_u}].
END IF

END ALGORITHM 1.1.
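Algorithm 1.1 can be sketched in Python using math.nextafter (Python 3.9+) to simulate directed rounding: since round-to-nearest is accurate to within one unit in the last place, stepping the rounded product one float outward gives rigorous (slightly conservative) bounds. The function name `isq` is ours, and a full implementation would apply the same directed rounding to every subsequent operation as well:

```python
import math

def isq(lo, hi):
    # Enclosure of { x*x : lo <= x <= hi }.  Round-to-nearest is within one
    # ulp of the exact product, so widening each end point by one float
    # yields a rigorous, if slightly conservative, enclosure.
    down = lambda v: math.nextafter(v, -math.inf)
    up = lambda v: math.nextafter(v, math.inf)
    if lo >= 0.0:
        return (down(lo * lo), up(hi * hi))
    elif hi <= 0.0:
        return (down(hi * hi), up(lo * lo))
    else:  # the interval straddles zero, so the minimum of x*x is 0
        return (0.0, up(max(lo * lo, hi * hi)))

lo2, hi2 = isq(-2.0, 2.0)
print(lo2 - 1.0, hi2 - 1.0)  # about -1.0 and 3.0: encloses the exact range of x^2 - 1
```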

With Algorithm 1.1, and rewriting f(x) from (1.3) as in (1.5), we obtain

    f([−2, 2]) = [−2, 2]^2 − 1 = [0, 4] − 1 = [−1, 3],

which in this case is equal to the exact range of f over [−2, 2].

In fact, this illustrates a general principle: if each variable in the expression occurs only once, then interval arithmetic gives the exact range, to within roundout error. We state this formally as follows.

THEOREM 1.8

(Fundamental theorem of interval arithmetic) Suppose f(x_1, x_2, …, x_n) is an algebraic expression in the variables x_1 through x_n (or a computer program with inputs x_1 through x_n), and suppose that this expression is evaluated with interval arithmetic. The algebraic expression or computer program can contain the four elementary operations and operations such as x^n, sin(x), exp(x), log(x), etc., as long as the interval values of these functions contain their ranges over the input intervals. Then:

1. The interval value f(x_1, …, x_n) contains the range of f over the interval vector (or box) (x_1, …, x_n).


2. If the single functions (the elementary operations and functions x^n, etc.) have interval values that represent their exact ranges, and if each variable x_i, 1 ≤ i ≤ n, occurs only once in the expression for f, then the values of f obtained by interval arithmetic represent the exact ranges of f over the input intervals.

If the expression for f contains one or more variables more than once, then overestimation of the range can occur, due to interval dependency. For example, when we evaluate our example function f([−2, 2]) according to (1.3), the first factor, [−1, 3], is the exact range of x + 1 for x ∈ [−2, 2], while the second factor, [−3, 1], is the exact range of x − 1 for x ∈ [−2, 2]. Thus, [−9, 3] is the exact range of f(x_1, x_2) = (x_1 + 1)(x_2 − 1) for x_1 and x_2 independent, x_1 ∈ [−2, 2], x_2 ∈ [−2, 2].

We now present some definitions and theorems to clarify the practical consequences of interval dependency.

DEFINITION 1.6 An expression for f(x_1, …, x_n) which is written so that each variable occurs only once is called a single use expression, or SUE.

Fortunately, we do not need to transform every expression into a single use expression for interval computations to be of value. In particular, the interval dependency becomes less as the widths of the input intervals become smaller. The following formal definition will help us to describe this precisely.

DEFINITION 1.7 Suppose an interval evaluation f(x_1, …, x_n) gives [a, b] as a result interval, but the exact range {f(x_1, …, x_n) : x_i ∈ x_i, 1 ≤ i ≤ n} is [c, d] ⊆ [a, b]. We define the excess width E(f; x_1, …, x_n) in the interval evaluation f(x_1, …, x_n) by E(f; x_1, …, x_n) = (c − a) + (b − d).

For example, the excess width in evaluating f(x), represented as (x + 1)(x − 1), over x = [−2, 2] is (−1 − (−9)) + (3 − 3) = 8. In general, we have:

THEOREM 1.9

Suppose f(x_1, x_2, …, x_n) is an algebraic expression in the variables x_1 through x_n (or a computer program with inputs x_1 through x_n), and suppose that this expression is evaluated with interval arithmetic, as in Theorem 1.8, to obtain an interval enclosure f(x_1, …, x_n) of the range of f for x_i ∈ x_i, 1 ≤ i ≤ n. Then, if E(f; x_1, …, x_n) is as in Definition 1.7, we have

    E(f; x_1, …, x_n) = O( max_{1≤i≤n} w(x_i) ),

where w(x) denotes the width of the interval x.


That is, the overestimation becomes less as the uncertainty in the arguments to the function becomes smaller.

Interval evaluations as in Theorem 1.9 are termed first-order interval extensions. It is not difficult to obtain second-order extensions, where required. (See Exercise below.)

1.3.2 Application of Interval Arithmetic: Examples

We give one such example here.

Example 1.22
Using 4-digit decimal floating point arithmetic, compute an interval enclosure for the first two digits of e, and prove that these two digits are correct.

Solution: The fifth degree Taylor polynomial representation for e is

    e = 1 + 1 + 1/2! + 1/3! + 1/4! + 1/5! + (1/6!) e^ξ

for some ξ ∈ [0, 1]. If we assume we know e < 3, and we assume we know e^x is an increasing function of x, then the error term is bounded by

    | (1/6!) e^ξ | ≤ 3/6! < 0.005,

so this fifth-degree polynomial representation should be adequate. We will evaluate each term with interval arithmetic, and we will replace e^ξ with [1, 3]. We obtain the following computation:

[1.000, 1.000] + [1.000, 1.000] → [2.000, 2.000]
[1.000, 1.000]/[2.000, 2.000] → [0.5000, 0.5000]
[2.000, 2.000] + [0.5000, 0.5000] → [2.500, 2.500]
[1.000, 1.000]/[6.000, 6.000] → [0.1666, 0.1667]
[2.500, 2.500] + [0.1666, 0.1667] → [2.666, 2.667]
[1.000, 1.000]/[24.00, 24.00] → [0.04166, 0.04167]
[2.666, 2.667] + [0.04166, 0.04167] → [2.707, 2.709]
[1.000, 1.000]/[120.0, 120.0] → [0.008333, 0.008334]
[2.707, 2.709] + [0.008333, 0.008334] → [2.715, 2.718]
[1.000, 1.000]/[720.0, 720.0] → [0.001388, 0.001389]
[0.001388, 0.001389] × [1, 3] → [0.001388, 0.004167]
[2.715, 2.718] + [0.001388, 0.004167] → [2.716, 2.723]

Since we used outward rounding in these computations, this constitutes a mathematical proof that e ∈ [2.716, 2.723].

Note:


1. These computations can be done automatically on a computer, as simply as evaluating the function in floating point arithmetic. We will explain some programming techniques for this in Chapter 6, Section 6.2.

2. The solution is illustrative. More sophisticated methods, such as argument reduction, would be used in practice to bound values of e^x more accurately and with fewer operations.
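The enclosure above can also be cross-checked in exact rational arithmetic (a Python sketch of our own using the fractions module, which avoids rounding altogether; this is a verification device, not the 4-digit computation of the example):

```python
from fractions import Fraction
from math import factorial

# Degree-5 Taylor polynomial for e, plus the remainder term e^xi / 6!
# bounded using 1 <= e^xi <= 3 for xi in [0, 1].
poly = sum(Fraction(1, factorial(k)) for k in range(6))  # = 163/60
lo = poly + Fraction(1, 720)
hi = poly + Fraction(3, 720)
print(float(lo), float(hi))  # about 2.71806 and 2.72083, inside [2.716, 2.723]
```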

Proofs of the theorems, as well as greater detail, appear in various texts on interval arithmetic. A good book on interval arithmetic is R. E. Moore's classic text [27], although numerous more recent monographs and reviews are available. A World Wide Web search on the term "interval computations" will lead to some of these.

A general introduction to interval computations is [26]. That work gives not only a complete introduction, with numerous examples and explanations of pitfalls, but also provides examples with intlab, a free matlab toolbox for interval computations, and reference material for intlab. If you have matlab available, we recommend intlab for the exercises in this book involving interval computations.

1.4 Programming Environments

Modern scientific computing (with floating point numbers) is usually done with high-level "imperative" (as opposed to "functional") programming languages. Common programming environments in use for general scientific computing today are Fortran (FORmula TRANslation), C/C++, and matlab. Fortran is the original such language, with its origins in the late 1950's. There is a large body of high-quality, publicly available software in Fortran for common computations in numerical analysis and scientific computing. Such software can be found, for example, on NETLIB, at

    http://www.netlib.org

Fortran has evolved over the years, becoming a modern, multi-faceted language with the Fortran 2003 standard. Throughout, the emphasis by both the standardization committee and suppliers of compilers for Fortran has been on features that simplify programming of solutions to large problems in numerical analysis and scientific computing, and on features that enable high performance, especially on computers that can process vectors and matrices efficiently.

The "C" language, originally developed in conjunction with the Unix operating system, was originally meant to be a higher-level language for designing and accessing the operating system, but has become more ubiquitous since then. C++, appearing in the late 1980's, was the first widely available language⁷ to allow the object-oriented programming paradigm. In recent years, computer science departments have favored teaching C++ over teaching Fortran, and Fortran has fallen out of favor in relative terms. However, Fortran is still favored in certain large-scale applications, such as fluid dynamics (e.g., in weather prediction and similar simulations), and some courses are still offered in it in engineering schools. Some people, however, still think of Fortran as the now somewhat rudimentary language known as FORTRAN 77.

Reasonably high-quality compilers for both Fortran and C/C++ are available free of charge with Linux operating systems. Fortran 2003, largely implemented in these compilers, has a standardized interface to C, so functions written in C can be called from Fortran programs, and vice versa. These compilers include interactive graphical-user-interface-oriented debuggers, such as "insight," available with the Linux operating system. Commercially available compilation and debugging systems are also available under Windows.

The matlab system has become increasingly popular over the last two decades or so. matlab (or MATrix LABoratory) began in the early 1980's as a National Science Foundation project, written by Cleve Moler in FORTRAN 66, to provide an interactive environment for computing with matrices and vectors; it has since evolved to be both an interactive environment and a full-featured programming language. matlab is highly favored in courses such as this because of the ease of programming, debugging, and general use (such as graphing), and because of the numerous toolboxes, supplied by both MathWorks (Cleve Moler's company) and others, for many basic computing tasks and applications. The main drawback to the use of matlab in all scientific computations is that the language is interpretive; that is, matlab translates each line of a program to machine language each time that it executes the line. This makes complicated programs that involve nested iterations much slower (a factor of 60 or more) than comparable programs written in a compiled language such as C or Fortran. However, functions compiled from Fortran or C/C++ can be called from matlab. A common strategy has been to initially develop algorithms in matlab, then translate all or part of the program to a compilable language, as necessary, for efficiency.

One perceived disadvantage of matlab is that it is proprietary. Undesirable possible consequences are that it is not free, and that there is no official guarantee that it will be available forever, unchanged. However, its use has become so widespread in recent years that these concerns do not seem to be major. Several projects, including "Octave" and "Scilab," have produced free products that partially support the matlab programming language. The most widely distributed of these, "Octave," is integrated into Linux systems. However, the object-oriented features of Octave are rudimentary compared to those of matlab, and some toolboxes, such as intlab (which we will mention later), will not function with Octave.

^7 With others, including Fortran, to follow.

8132019 num chap 1

32 Applied Numerical Methods

Alternative systems sometimes used for scientific computing are computer algebra systems. Perhaps the most common of these are Mathematica and Maple, while a free such system under development is "SAGE." These systems admit a different way of thinking about programming, termed functional programming, in which rules are defined and available at all times to automatically simplify expressions that are presented to the system. (In contrast, in imperative programming, a sequence of commands is executed one after the other.) Although these systems have become comprehensive, they are based in computations of a different character, rather than in the floating point computations and linear algebra involved in numerical analysis and scientific computing.

We will use matlab in this book to illustrate the concepts, techniques, and applications. With newer versions of matlab, a student can study how to use the system and make programs largely by using the matlab help system. The first place to turn will be the "Getting started" demos, which in newer versions are presented as videos. There are also many books devoted to use of matlab. Furthermore, we will be giving examples throughout this book.

matlab programs can be written as matlab scripts and matlab functions.

Example 1.23
The matlab script we used to produce the table following Example 1.7 (on page 8) is:

    a = 2
    x = 2
    xold = x
    err_old = 1
    for k = 0:10
        k
        x
        err = x - sqrt(2)
        ratio = err / err_old^2
        err_old = err
        x = x/2 + 1/x
    end

Example 1.24
The matlab script we used to produce the table in Example 1.8 (on page 9) is:

    format long
    a = 2
    x = 2
    xold = x
    err_old = 1
    for k = 0:25
        k
        x
        err = x - sqrt(2)
        ratio = err / err_old
        err_old = err
        x = x - x^2/3.5 + 2/3.5
    end

An excellent alternative textbook that focuses on matlab functions is Cleve Moler's Numerical Computing with Matlab [25]. An on-line version, along with ".m" files, etc., is currently available at http://www.mathworks.com/moler/chapters.html.

1.5 Applications

The purpose of the methods and techniques in this book ultimately is to provide both accurate predictions and insight into practical problems. This includes understanding, predicting, and managing or controlling the evolution of ecological systems and epidemics; designing and constructing durable but inexpensive bridges, buildings, roads, and water control structures; understanding chemical and physical processes; designing chemical plants and electronic components and systems; and minimizing costs or maximizing delivery of products or services within companies and governments. To achieve these goals, the numerical methods are a small part of the overall modeling process, which can be viewed as consisting of the following steps.

Identify the problem. This is the first step in translating an often vague situation into a mathematical problem to be solved. What questions must be answered, and how can they be quantified?

Assumptions. Which factors are to be ignored, and which are important? The real world is usually significantly more complicated than mathematical models of it, and simplifications must be made: because some factors are poorly understood, because there isn't enough data to determine some minor factors, or because it is not practical to accurately solve the resulting equations unless the model is simplified. For example, the theory of relativity, and variations in the acceleration of gravity due to the fact that the earth is not exactly round and due to the fact that the density differs from point to point on the surface of the earth, will in principle affect the trajectory of a baseball as it leaves the bat. However, such effects can be ignored when we write down a model of the trajectory of the baseball. On the other hand, we need to include such effects if we are measuring the change in distance between two satellites in a tandem orbit to detect the location of mineral deposits on the surface of the earth.

Construction. In this step, we actually translate the problem into mathematical language.

Analysis. We solve the mathematical problem. Here is where the numerical techniques in this book come into play. With more complicated models, there is an interplay between the previous three steps and this solution process: we may need to simplify the model to enable practical solution. Presentation of the results is also important here, to maximize their usefulness. In the early days of scientific computing, printouts of numbers were used, but increasingly, results are presented as two- and three-dimensional graphs, and as movies.

Interpretation. The numerical solution is compared to the original problem. If it does not make sense, go back and reformulate the assumptions.

Validation. Compare the model to real data. For example, in climate models, the model might be used to predict climate changes in past years, before it is used to predict future climate changes.

Note that there is an intrinsic error introduced in the modeling process (such as when certain phenomena are ignored), which is outside the scope of our study of numerical methods. Such error can only be measured indirectly through the interpretation and validation steps. In the model solution process (the "analysis" step), errors are also introduced due to roundoff error and the approximation process. We have seen that such error consists of approximation error and roundoff error. In a study of numerical methods and numerical analysis, we quantify and find bounds on such errors. Although this may not be the major source of error, it is important to know. Consequences of this type of error might be that a good model is rejected, that incorrect conclusions are deduced about the process being modeled, etc. The authors of this book have personal experience with these events.

Errors in the modeling process can sometimes be quantified in the solution process. If the model depends on parameters that are not known precisely, but bounds on those parameters are known, knowledge of these bounds can sometimes be incorporated into the mathematical equations, and the set of possible solutions can sometimes be computed or bounded. One tool that sometimes works is interval arithmetic. Other tools, less mathematically definite but applicable in different situations, are statistical methods and computing solutions to the model for many different values of the parameters.

Throughout this book, we introduce applications from many areas.

Example 1.25

The formula for the net capacitance when two capacitors of values x and y are connected in series is

z = xy / (x + y).

Suppose the measured values of x and y are x = 1 and y = 2, respectively. Estimate the range of possible values of z, given that the true values of x and y are known to be within ±10% of the measured values.

In this example, the identification, assumptions, and construction have already been done. (It is well known how capacitances in a linear electrical circuit behave.) We are asked to analyze the error in the output of the computation, due to errors in the data. We may proceed using interval arithmetic, relying on the accuracy assumptions for the measured values. In particular, these assumptions imply that x ∈ [0.9, 1.1] and y ∈ [1.8, 2.2]. We will plug these intervals into the expression for z, but we first use Theorem 1.8, part (2), as a guide to rewrite the expression for z so that x and y only occur once. (We do this so we obtain sharp bounds on the range, without overestimation.) Dividing the numerator and denominator for z by xy, we obtain

z = 1 / (1/x + 1/y).

We use the intlab toolbox^8 for matlab to evaluate z. We have the following dialog in matlab's command window:

    >> intvalinit('DisplayInfsup')
    ===> Default display of intervals by infimum/supremum
    >> x = intval('[0.9,1.1]')
    intval x =
    [    0.8999,    1.1001]
    >> y = intval('[1.8,2.2]')
    intval y =
    [    1.7999,    2.2001]
    >> z = 1/(1/x + 1/y)
    intval z =
    [    0.5999,    0.7334]
    >> format long
    >> z
    intval z =
    [   0.59999999999999,    0.73333333333334]
    >>

Thus, the capacitance must lie between 0.5999 and 0.7334. Note that x and y are input as strings. This is to assure that roundoff errors in converting the decimal expressions 0.9, 1.1, 1.8, and 2.2 into internal binary format are taken into account. See [26] for more examples of the use of intlab.

^8 If one has matlab, intlab is available free of charge for non-commercial use from http://www.ti3.tu-harburg.de/~rump/intlab.

1.6 Exercises

1. Write down a polynomial p(x) such that |S(x) − p(x)| ≤ 10^{−10} for −0.2 ≤ x ≤ 0.2, where

       S(x) = sin(x)/x if x ≠ 0, and S(x) = 1 if x = 0.

   Note: sinc(x) = S(πx) = sin(πx)/(πx) is the "sinc" function (well-known in signal processing, etc.).

   (a) Show that your polynomial p satisfies the condition |S(x) − p(x)| ≤ 10^{−10} for x ∈ [−0.2, 0.2].
       Hint: You can obtain polynomial approximations with error terms for S(x) by writing down Taylor polynomials and corresponding error terms for sin(x), then dividing these by x. This can be easier than trying to differentiate S(x). For the proof part, you can use, for example, the Taylor polynomial remainder formula or the alternating series test, and you can use interval arithmetic to obtain bounds.

   (b) Plot your polynomial approximation and S(x) on the same graph,
       (i) over the interval [−0.2, 0.2],
       (ii) over the interval [−3, 3],
       (iii) over the interval [−10, 10].

2. Suppose f has a continuous third derivative. Show that

   \[ \left| \frac{f(x+h) - f(x-h)}{2h} - f'(x) \right| = O(h^2). \]

3. Suppose f has a continuous fourth derivative. Show that

   \[ \left| \frac{f(x+h) - 2f(x) + f(x-h)}{h^2} - f''(x) \right| = O(h^2). \]

4. Let a = 0.41, b = 0.36, and c = 0.7. Assuming a 2-digit decimal computer arithmetic with rounding, show that (a − b)/c ≠ (a/c) − (b/c) when using this arithmetic.

5. Write down a formula relating the unit roundoff δ of Definition 1.3 (page 14) and the machine epsilon ε_m defined on page 20.

6. Store and run the following matlab script. What are your results? What does the script compute? Can you say anything about the computer arithmetic underlying matlab?

       eps = 1;
       x = 1 + eps;
       while (x ~= 1)
           eps = eps/2;
           x = 1 + eps;
       end
       eps = eps + (2*eps)^2;
       y = 1 + eps;
       y - 1

7. Suppose, for illustration, we have a system with base β = 10, t = 3 decimal digits in the mantissa, and L = −9, U = 9 for the exponent. For example, 0.123 × 10⁴, that is, 1230, is a machine number in this system. Suppose also that "round to nearest" is used in this system.

   (a) What is HUGE for this system?

   (b) What is TINY for this system?

   (c) What is the machine epsilon ε_m for this system?

   (d) Let f(x) = sin(x) + 1.

       i. Write down fl(f(0)) and fl(f(0.0008)) in normalized format for this toy system.

       ii. Compute fl(fl(f(0.0008)) − fl(f(0))). On the other hand, what is the nearest machine number to the exact value of f(0.0008) − f(0)?

       iii. Compute fl(fl(f(0.0008)) − fl(f(0))) / fl(0.0008). Compare this to the nearest machine number to the exact value of (f(0.0008) − f(0))/0.0008, and to f′(0).

8. Let f(x) = ln(x + 1) − ln(x).

   (a) Use four-digit decimal arithmetic with rounding to evaluate f(100,000).

   (b) Use the Mean Value Theorem to approximate f(x) in a form that avoids the loss of significant digits. Use this form to evaluate f(x) for x = 100,000 once again.

   (c) Compare the relative errors for the answers obtained in (a) and (b).

9. Compute the condition number of f(x) = e^{√(x² − 1)}, x > 1, and discuss any possible ill-conditioning.

10. Let f(x) = (sin(x))² + x/2. Use interval arithmetic to prove that there are no solutions to f(x) = 0 for x ∈ [−1, −0.8].


With w(x) = x² and f(x) = e^{−x²}, the Mean Value Theorem for integrals tells us that

\[ \int_0^1 x^2 e^{-x^2}\,dx = e^{-c^2} \int_0^1 x^2\,dx \]

for some c ∈ [0, 1], so

\[ \frac{1}{3e} \le \int_0^1 x^2 e^{-x^2}\,dx = e^{-c^2} \int_0^1 x^2\,dx \le \frac{1}{3}. \]

The following is extremely important in scientific and engineering computing.

THEOREM 1.3

(Taylor's theorem) Suppose that f ∈ C^{n+1}[a, b]. Let x₀ ∈ [a, b]. Then, for any x ∈ [a, b],

\[ f(x) = P_n(x) + R_n(x), \]

where

\[ P_n(x) = f(x_0) + f'(x_0)(x - x_0) + \cdots + \frac{f^{(n)}(x_0)(x - x_0)^n}{n!} = \sum_{k=0}^{n} \frac{1}{k!} f^{(k)}(x_0)(x - x_0)^k, \]

and

\[ R_n(x) = \frac{1}{n!} \int_{x_0}^{x} f^{(n+1)}(t)(x - t)^n\,dt \quad \text{(integral form of remainder)}. \]

Furthermore, there is a ξ = ξ(x) between x₀ and x with

\[ R_n(x) = \frac{f^{(n+1)}(\xi(x))(x - x_0)^{n+1}}{(n+1)!} \quad \text{(Lagrange form of remainder)}. \]

PROOF Recall the integration by parts formula:

\[ \int u\,dv = uv - \int v\,du. \]

Thus,

\[
f(x) - f(x_0) = \int_{x_0}^{x} f'(t)\,dt \quad (\text{let } u = f'(t),\ v = t - x,\ dv = dt)
\]
\[
= f'(x_0)(x - x_0) + \int_{x_0}^{x} (x - t) f''(t)\,dt \quad (\text{let } u = f''(t),\ dv = (x - t)\,dt)
\]
\[
= f'(x_0)(x - x_0) - \left. \frac{(x - t)^2}{2} f''(t) \right|_{x_0}^{x} + \int_{x_0}^{x} \frac{(x - t)^2}{2} f'''(t)\,dt
\]
\[
= f'(x_0)(x - x_0) + \frac{(x - x_0)^2}{2} f''(x_0) + \int_{x_0}^{x} \frac{(x - t)^2}{2} f'''(t)\,dt.
\]

Continuing this procedure,

\[
f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{(x - x_0)^2}{2!} f''(x_0) + \cdots + \frac{(x - x_0)^n}{n!} f^{(n)}(x_0) + \int_{x_0}^{x} \frac{(x - t)^n}{n!} f^{(n+1)}(t)\,dt = P_n(x) + R_n(x).
\]

Now consider

\[ R_n(x) = \int_{x_0}^{x} \frac{(x - t)^n}{n!} f^{(n+1)}(t)\,dt, \]

and assume that x₀ < x (the same argument applies if x₀ > x). Then, by Theorem 1.2,

\[
R_n(x) = f^{(n+1)}(\xi(x)) \int_{x_0}^{x} \frac{(x - t)^n}{n!}\,dt = f^{(n+1)}(\xi(x)) \frac{(x - x_0)^{n+1}}{(n+1)!},
\]

where ξ is between x₀ and x, and thus ξ = ξ(x).

Example 1.3
Approximate sin(x) by a polynomial p(x) such that

|sin(x) − p(x)| ≤ 10^{−16} for −0.1 ≤ x ≤ 0.1.

For Example 1.3, Taylor polynomials about x₀ = 0 are appropriate, since that is the center of the interval over which we wish to approximate. We observe that the terms of even degree in such a polynomial are absent, so for

n even, Taylor's theorem gives:

    n    P_n(x)                     R_n(x)
    2    x                          −(x³/3!) cos(c₂)
    4    x − x³/3!                  (x⁵/5!) cos(c₄)
    6    x − x³/3! + x⁵/5!          −(x⁷/7!) cos(c₆)
    ⋮    ⋮                          ⋮
    n    —                          (−1)^{n/2} (x^{n+1}/(n+1)!) cos(c_n)

Observing that |cos(c_n)| ≤ 1, we see that

\[ |R_n(x)| \le \frac{|x|^{n+1}}{(n+1)!}. \]

We may thus form the following table:

    n     bound on error R_n
    2     1.67 × 10⁻⁴
    4     8.33 × 10⁻⁸
    6     1.98 × 10⁻¹¹
    8     2.76 × 10⁻¹⁵
    10    2.51 × 10⁻¹⁹

Thus, a polynomial with the required accuracy for x ∈ [−0.1, 0.1] is

p(x) = x − x³/3! + x⁵/5! − x⁷/7! + x⁹/9!.
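As a spot check (in Python rather than matlab, purely for illustration), we can sample this degree-9 polynomial against a built-in sine. Note that double-precision roundoff (about 10⁻¹⁶ in this range) masks the 2.51 × 10⁻¹⁹ truncation bound, so the sampled differences only confirm that the error is down at roundoff level:

```python
import math

def p(x):
    # degree-9 Taylor polynomial for sin about x0 = 0
    return (x - x**3 / math.factorial(3) + x**5 / math.factorial(5)
              - x**7 / math.factorial(7) + x**9 / math.factorial(9))

# sample the error at 201 points of [-0.1, 0.1]
worst = max(abs(math.sin(-0.1 + 0.001 * k) - p(-0.1 + 0.001 * k))
            for k in range(201))
print(worst)   # tiny: dominated by double-precision roundoff
```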

An important special case of Taylor's theorem is obtained with n = 0 (that is, directly from the Fundamental Theorem of Calculus):

THEOREM 1.4

(Mean value theorem) Suppose f ∈ C¹[a, b], x ∈ [a, b], and y ∈ [a, b] (and, without loss of generality, x ≤ y). Then there is a c ∈ [x, y] ⊆ [a, b] such that

f(y) − f(x) = f′(c)(y − x).

Example 1.4
Suppose f(1) = 1 and |f′(x)| ≤ 2 for x ∈ [1, 2]. What are an upper bound and a lower bound on f(2)?

The mean value theorem tells us that

f(2) = f(1) + f′(c)(2 − 1) = f(1) + f′(c)

for some c ∈ (1, 2). Furthermore, the fact |f′(x)| ≤ 2 is equivalent to −2 ≤ f′(x) ≤ 2. Combining these facts gives

1 − 2 = −1 ≤ f(2) ≤ 1 + 2 = 3.

1.1.2 Big "O" Notation

We study "rates of growth" and "rates of decrease" of errors. For example, if we approximate e^h by a first degree Taylor polynomial about x = 0, we get

e^h − (1 + h) = (1/2) h² e^ξ,

where ξ is some unknown quantity between 0 and h. Although we don't know exactly what e^ξ is, we know that it is nearly constant (in this case, approximately 1) for h near 0, so the error e^h − (1 + h) is roughly proportional to h² for h small. This approximate proportionality is often more important to know than the slowly-varying constant e^ξ. The big "O" and little "o" notation are used to describe and keep track of this approximate proportionality.

DEFINITION 1.1 Let E(h) be an expression that depends on a small quantity h. We say that E(h) = O(h^k) if there are an ε and a C such that

|E(h)| ≤ C h^k

for all |h| ≤ ε.

The "O" denotes "order." For example, if f(h) = O(h²), we say that "f exhibits order 2 convergence to 0 as h tends to 0."

Example 1.5
Let E(h) = e^h − h − 1. Then E(h) = O(h²).

PROOF By Taylor's Theorem,

e^h = e⁰ + e⁰(h − 0) + (h²/2) e^ξ

for some ξ between 0 and h. Thus,

E(h) = e^h − 1 − h = (h²/2) e^ξ ≤ h² (e/2)

and E(h) ≥ 0 for |h| ≤ 1; that is, ε = 1 and C = e/2 work.

Example 1.6
Show that |(f(x + h) − f(x))/h − f′(x)| = O(h) for x, x + h ∈ [a, b], assuming that f has two continuous derivatives at each point in [a, b].

PROOF

\[
\left| \frac{f(x+h) - f(x)}{h} - f'(x) \right|
= \left| \frac{f(x) + f'(x)h + \int_x^{x+h} (x + h - t) f''(t)\,dt - f(x)}{h} - f'(x) \right|
\]
\[
= \left| \frac{1}{h} \int_x^{x+h} (x + h - t) f''(t)\,dt \right|
\le \max_{a \le t \le b} |f''(t)|\, \frac{h}{2} = c\,h.
\]

1.1.3 Convergence Rates

DEFINITION 1.2 Let {x_k} be a sequence with limit x*. If there are constants C and α and an integer N such that |x_{k+1} − x*| ≤ C|x_k − x*|^α for k ≥ N, we say that the rate of convergence is of order at least α. If α = 1 (with C < 1), the rate is said to be linear. If α = 2, the rate is said to be quadratic.

Example 1.7
A sequence sometimes learned in elementary classes for computing the square root of a number a is

x_{k+1} = x_k/2 + a/(2x_k).

We have

\[
x_{k+1} - \sqrt{a} = \frac{x_k}{2} + \frac{a}{2x_k} - \sqrt{a}
= \frac{x_k^2 - 2\sqrt{a}\,x_k + a}{2x_k}
= \frac{1}{2x_k}\,(x_k - \sqrt{a})^2
\approx \frac{1}{2\sqrt{a}}\,(x_k - \sqrt{a})^2,
\]

thus showing that the convergence rate is quadratic.

Quadratic convergence is very fast. We can think of quadratic convergence with C ≈ 1 as doubling the number of significant figures on each iteration. (In contrast, linear convergence with C = 0.1 adds one decimal digit of accuracy to the approximation on each iteration.) For example, if we use the square root computation from Example 1.7 with a = 2, and starting with x₀ = 2, we obtain the following table:

    k    x_k                   x_k − √2           (x_k − √2)/(x_{k−1} − √2)²
    0    2                     0.5858 × 10⁰       —
    1    1.5                   0.8579 × 10⁻¹      0.2500
    2    1.416666666666667     0.2453 × 10⁻²      0.3333
    3    1.414215686274510     0.2123 × 10⁻⁵      0.3529
    4    1.414213562374690     0.1594 × 10⁻¹¹     0.3535
    5    1.414213562373095     0.2204 × 10⁻¹⁷     —

In this table, the correct digits are underlined. This table illustrates that the total number of correct digits more than doubles on each iteration. In fact, the multiplying factor C for the quadratic convergence appears to be approaching 0.3535. (The last error ratio is not meaningful in this sense, because only roughly 16 digits were carried in the computation.) Based on our analysis, the limiting value of C should be about 1/(2√2) ≈ 0.35355. (We explain how we computed the table at the end of this chapter.)
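This table can be regenerated in a few lines. Here is a Python version of the iteration (illustrative only; the matlab script we actually used appears at the end of the chapter):

```python
import math

a, x = 2.0, 2.0
err_old = x - math.sqrt(a)
ratios = []
for k in range(4):
    x = x / 2 + a / (2 * x)             # x_{k+1} = x_k/2 + a/(2 x_k)
    err = x - math.sqrt(a)
    ratios.append(err / err_old**2)     # should approach 1/(2*sqrt(a))
    err_old = err
print(ratios)   # first ratio 0.25, then approaching about 0.3536
```

Running this reproduces the ratio column of the table (we stop before machine precision makes the ratio meaningless).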


Example 1.8
As an example of linear convergence, consider the iteration

x_{k+1} = x_k − x_k²/3.5 + 2/3.5,

which converges to √2. We obtain the following table:

    k     x_k                   x_k − √2           (x_k − √2)/(x_{k−1} − √2)
    0     2                     0.5858 × 10⁰       —
    1     1.428571428571429     0.1436 × 10⁻¹      0.2451 × 10⁻¹
    2     1.416909620991254     0.2696 × 10⁻²      0.1878
    3     1.414728799831946     0.5152 × 10⁻³      0.1911
    4     1.414312349239392     0.9879 × 10⁻⁴      0.1917
    5     1.414232514607664     0.1895 × 10⁻⁴      0.1918
    6     1.414217198786659     0.3636 × 10⁻⁵      0.1919
    7     1.414214260116949     0.6955 × 10⁻⁶      0.1919
    8     1.414213696254626     0.1339 × 10⁻⁶      0.1919
    ⋮     ⋮                     ⋮                  ⋮
    19    1.414213562373097     0.1554 × 10⁻¹⁴     —

Here, the constant C in the linear convergence, to four significant digits, appears to be 0.1919 ≈ 1/5. That is, the error is reduced by approximately a factor of 5 on each iteration. We can think of this as obtaining one more correct base-5 digit on each iteration.
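A Python version of this iteration (for illustration; the book's matlab script is Example 1.24 at the end of the chapter) shows the ratio settling near 0.1919. That value is g′(√2) = 1 − 2√2/3.5 for the iteration function g(x) = x − x²/3.5 + 2/3.5, a standard fixed-point-iteration fact:

```python
import math

x = 2.0
err_old = x - math.sqrt(2)
ratio = None
for k in range(8):
    x = x - x**2 / 3.5 + 2 / 3.5    # x_{k+1} = g(x_k)
    err = x - math.sqrt(2)
    ratio = err / err_old           # -> g'(sqrt(2)) = 1 - 2*sqrt(2)/3.5
    err_old = err
print(ratio)   # about 0.1919
```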

1.2 Computer Arithmetic

In numerical solution of mathematical problems, two common types of error are:

1. Method (algorithm or truncation) error. This is the error due to approximations made in the numerical method.

2. Rounding error. This is the error made due to the finite number of digits available on a computer.


Example 1.9
By the mean value theorem for integrals (Theorem 1.2, as in Example 1.6 on page 7), if f ∈ C²[a, b], then

\[
f'(x) = \frac{f(x+h) - f(x)}{h} - \frac{1}{h} \int_x^{x+h} f''(t)(x + h - t)\,dt,
\]

and

\[
\left| \frac{1}{h} \int_x^{x+h} f''(t)(x + h - t)\,dt \right| \le c\,h.
\]

Thus, f′(x) ≈ (f(x + h) − f(x))/h, and the error is O(h). We will call this the method error, or truncation error, as opposed to roundoff errors due to using machine approximations.

Now consider f(x) = ln x, and approximate f′(3) ≈ (ln(3 + h) − ln 3)/h for h small, using a calculator having 11 digits. The following results were obtained:

    h        (ln(3 + h) − ln(3))/h     Error = 1/3 − (ln(3 + h) − ln(3))/h = O(h)
    10⁻¹     0.3278982                 5.44 × 10⁻³
    10⁻²     0.332779                  5.54 × 10⁻⁴
    10⁻³     0.3332778                 5.55 × 10⁻⁵
    10⁻⁴     0.333328                  5.33 × 10⁻⁶
    10⁻⁵     0.333330                  3.33 × 10⁻⁶
    10⁻⁶     0.333300                  3.33 × 10⁻⁵
    10⁻⁷     0.333                     3.33 × 10⁻⁴
    10⁻⁸     0.33                      3.33 × 10⁻³
    10⁻⁹     0.3                       3.33 × 10⁻²
    10⁻¹⁰    0.0                       3.33 × 10⁻¹

One sees that in the first four steps, the error decreases by a factor of 10 as h is decreased by a factor of 10. (That is, the method error dominates.) However, starting with h = 10⁻⁵, the error increases. (The error due to a finite number of digits, i.e., roundoff error, dominates.)

There are two possible ways to reduce the error:

1. The method error can be reduced by using a more accurate method. This allows a larger h to be used, thus avoiding roundoff error. Consider

   f′(x) = (f(x + h) − f(x − h))/(2h) + error, where the error is O(h²).

       h       (ln(3 + h) − ln(3 − h))/(2h)     error
       0.1     0.3334568                        1.24 × 10⁻⁴
       0.01    0.3333345                        1.23 × 10⁻⁶
       0.001   0.3333333                        1.91 × 10⁻⁸

   The error decreases by a factor of 100 as h is decreased by a factor of 10.

2. Rounding error can be reduced by using more digits of accuracy, such as using double precision (or multiple precision) arithmetic.
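The two difference quotients can be compared directly. This Python sketch (not the book's matlab) evaluates both for f(x) = ln x at x = 3:

```python
import math

def forward(f, x, h):
    # first-order forward difference, error O(h)
    return (f(x + h) - f(x)) / h

def central(f, x, h):
    # second-order central difference, error O(h^2)
    return (f(x + h) - f(x - h)) / (2 * h)

f, x, exact = math.log, 3.0, 1.0 / 3.0
for h in (1e-1, 1e-2, 1e-3):
    print(h, abs(forward(f, x, h) - exact), abs(central(f, x, h) - exact))
# forward error shrinks like h, central like h^2 (until roundoff dominates)
```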

To fully understand and avoid roundoff error, we should study some details of how computers and calculators represent and work with approximate numbers.

1.2.1 Floating Point Arithmetic and Rounding Error

Let β be a positive integer, the base of the computer system. (Usually β = 2 (binary) or β = 16 (hexadecimal).) Suppose a number x has the exact base-β representation

x = ±(0.α₁α₂α₃ ⋯ α_t α_{t+1} ⋯) × β^m = ±q × β^m,

where q is the mantissa, β is the base, m is the exponent, 1 ≤ α₁ ≤ β − 1, and 0 ≤ α_i ≤ β − 1 for i > 1.

On a computer, we are restricted to a finite set of floating-point numbers F = F(β, t, L, U) of the form x* = ±(0.a₁a₂ ⋯ a_t) × β^m, where 1 ≤ a₁ ≤ β − 1, 0 ≤ a_i ≤ β − 1 for 2 ≤ i ≤ t, L ≤ m ≤ U, and t is the number of digits. (In most floating point systems, L is about −64 to −1000, and U is about 64 to 1000.)

Example 1.10
(binary) β = 2:

x* = (0.1011)₂ × 2³ = (1 × 1/2 + 0 × 1/4 + 1 × 1/8 + 1 × 1/16) × 8 = 11/2 = 5.5 (decimal).
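This conversion can be checked mechanically; a quick Python fragment (illustrative only):

```python
# mantissa digits of (0.1011)_2, exponent m = 3
digits = [1, 0, 1, 1]
mantissa = sum(d * 2.0**-(i + 1) for i, d in enumerate(digits))
value = mantissa * 2**3
print(value)   # 5.5
```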

REMARK 1.1 Most numbers cannot be exactly represented on a computer. Consider x = 10.1 = (1010.000110011001100110011...)₂ (β = 2). If L = −127, U = 127, t = 24, and β = 2, then x ≈ x* = (0.101000011001100110011001)₂ × 2⁴.

Question: Given a real number x, how do we define a floating point number fl(x) in F such that fl(x) is close to x? On modern machines, one of the following four ways is used to approximate a real number x by a machine-representable number fl(x):

round down: fl(x) = x↓, the nearest machine representable number to the real number x that is less than or equal to x;

round up: fl(x) = x↑, the nearest machine number to the real number x that is greater than or equal to x;

round to nearest: fl(x) is the nearest machine number to the real number x;

round to zero, or "chopping": fl(x) is the nearest machine number to the real number x that is closer to 0 than x. The term "chopping" is used because we simply "chop" the expansion of the real number, that is, we simply ignore the digits in the expansion of x beyond the t-th one.

The default on modern systems is usually round to nearest, although chopping is faster or requires less circuitry. Round down and round up may be used, but with care, to produce results from a string of computations that are guaranteed to be less than or greater than the exact result.

Example 1.11
β = 10, t = 5, x = 0.12345666⋯ × 10⁷. Then

fl(x) = 0.12345 × 10⁷ (chopping),
fl(x) = 0.12346 × 10⁷ (round to nearest).

(In this case, round down corresponds to chopping, and round up corresponds to round to nearest.)

See Figure 1.2 for an example with β = 10 and t = 1. In that figure, the exhibited floating point numbers are 0.1 × 10¹, 0.2 × 10¹, ..., 0.9 × 10¹, 0.1 × 10².

[FIGURE 1.2: An example floating point system: β = 10, t = 1, and m = 1. Successive floating point numbers, spaced β^{m−t} = 10⁰ = 1 apart, run from β^{m−1} = 1 up to β^m = 10¹.]

Example 1.12
Let a = 0.410, b = 0.000135, and c = 0.000431. Assuming 3-digit decimal computer arithmetic with rounding to nearest, does a + (b + c) = (a + b) + c when using this arithmetic?

Following the "rounding to nearest" definition of fl, we emulate the operations a machine would do, as follows:

a ← 0.410 × 10⁰, b ← 0.135 × 10⁻³, c ← 0.431 × 10⁻³,

and

fl(b + c) = fl(0.135 × 10⁻³ + 0.431 × 10⁻³)
          = fl(0.566 × 10⁻³)
          = 0.566 × 10⁻³,

so

fl(a + 0.566 × 10⁻³) = fl(0.410 × 10⁰ + 0.566 × 10⁻³)
                     = fl(0.410 × 10⁰ + 0.000566 × 10⁰)
                     = fl(0.410566 × 10⁰)
                     = 0.411 × 10⁰.

On the other hand,

fl(a + b) = fl(0.410 × 10⁰ + 0.135 × 10⁻³)
          = fl(0.410000 × 10⁰ + 0.000135 × 10⁰)
          = fl(0.410135 × 10⁰)
          = 0.410 × 10⁰,

so

fl(0.410 × 10⁰ + c) = fl(0.410 × 10⁰ + 0.431 × 10⁻³)
                    = fl(0.410 × 10⁰ + 0.000431 × 10⁰)
                    = fl(0.410431 × 10⁰)
                    = 0.410 × 10⁰ ≠ 0.411 × 10⁰.

Thus, the associative law does not hold for floating point arithmetic with "round to nearest." Furthermore, this illustrates that accuracy is improved if numbers of like magnitude are added first in a sum.
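One can emulate this 3-digit rounded arithmetic in software. In the Python sketch below (our own hypothetical helper, not from the book), `fl` rounds to t significant decimal digits by way of exponent-notation formatting:

```python
def fl(x, t=3):
    # round x to t significant decimal digits ("round to nearest")
    return float(f"{x:.{t - 1}e}") if x != 0 else 0.0

a, b, c = 0.410, 0.000135, 0.000431
left = fl(a + fl(b + c))     # a + (b + c): rounds to 0.411
right = fl(fl(a + b) + c)    # (a + b) + c: rounds to 0.410
print(left, right)
```

Note that `fl` rounds only the final result of each addition, which is exactly how the hand emulation above proceeds.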

The following error bound is useful in some analyses.

THEOREM 1.5

|x − fl(x)| ≤ (1/2) |x| β^{1−t} p,

where p = 1 for rounding and p = 2 for chopping.

DEFINITION 1.3 δ = (p/2) β^{1−t} is called the unit roundoff error.

Let ε = (fl(x) − x)/x. Then fl(x) = (1 + ε)x, where |ε| ≤ δ. With this, we have the following.

THEOREM 1.6
Let ⊙ denote the operation +, −, ×, or ÷, and let x and y be machine numbers. Then

fl(x ⊙ y) = (x ⊙ y)(1 + ε), where |ε| ≤ δ = (p/2) β^{1−t}.

Roundoff error that accumulates as a result of a sequence of arithmetic operations can be analyzed using this theorem. Such an analysis is called forward error analysis.

Because of the properties of floating point arithmetic, it is unreasonable to demand strict tolerances when the exact result is too large.

Example 1.13
Suppose β = 10 and t = 3 (3-digit decimal arithmetic), and suppose we wish to compute 10⁴π with a computed value x such that |10⁴π − x| < 10⁻². The closest floating point number in our system to 10⁴π is x = 0.314 × 10⁵ = 31400. However, |10⁴π − x| = 15.926. Hence, it is impossible to find a number x in the system with |10⁴π − x| < 10⁻².

The error |10⁴π − x| in this example is called the absolute error in approximating 10⁴π. We see that absolute error is not an appropriate measure of error when using floating point arithmetic. For this reason, we use relative error.

DEFINITION 1.4 Let x* be an approximation to x. Then |x − x*| is called the absolute error, and |(x − x*)/x| is called the relative error.

For example,

|(x − fl(x))/x| ≤ δ = (p/2) β^{1−t} (unit roundoff error).

1.2.1.1 Expression Evaluation and Condition Numbers

We now examine some common situations in which roundoff error can become large, and explain how to avoid many of these situations.

Example 1.14
Let β = 10, t = 4, and p = 1 (thus δ = (1/2) × 10⁻³ = 0.0005), and let x = 0.5795 × 10⁵ and y = 0.6399 × 10⁵. Then

fl(x + y) = 0.1219 × 10⁶ = (x + y)(1 + ε₁), ε₁ ≈ −3.28 × 10⁻⁴, |ε₁| < δ,

and

fl(xy) = 0.3708 × 10¹⁰ = (xy)(1 + ε₂), ε₂ ≈ −5.95 × 10⁻⁵, |ε₂| < δ.

(Note: x + y = 0.12194 × 10⁶ and xy = 0.37082205 × 10¹⁰.)

Example 1.15
Suppose β = 10 and t = 4 (4-digit arithmetic), and suppose x₁ = 10000 and x₂ = x₃ = ⋯ = x₁₀₀₁ = 1. Then

fl(x₁ + x₂) = 10000,
fl(x₁ + x₂ + x₃) = 10000,
⋮
fl(x₁ + x₂ + ⋯ + x₁₀₀₁) = 10000,

when we sum forward, from x₁. But, going backwards,

fl(x₁₀₀₁ + x₁₀₀₀) = 2,
fl(x₁₀₀₁ + x₁₀₀₀ + x₉₉₉) = 3,
⋮
fl(x₁₀₀₁ + x₁₀₀₀ + ⋯ + x₁) = 11000,

which is the correct sum.

This example illustrates the point that large relative errors occur when many small numbers are added to a large number, or when a very large number of small, almost-equal numbers is added. To avoid such large relative errors, one can sum from the smallest number to the largest number. However, this will not work if the numbers are all approximately equal. In such cases, one possibility is to group the numbers into sets of two adjacent numbers, summing two almost equal numbers together. One then groups those results into sets of two and sums these together, continuing until the total sum is reached. In this scheme, two almost equal numbers are always being summed, and the large relative error from repeatedly summing a small number to a large number is avoided.
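This effect is easy to reproduce in binary arithmetic as well. The Python sketch below (illustrative, not from the book) uses the standard `struct` module to round every partial sum to IEEE single precision; in that format, 2²⁴ + 1 rounds back to 2²⁴, so the ones vanish when added to the big number first:

```python
import struct

def f32(x):
    # round a Python float to IEEE single precision
    return struct.unpack("f", struct.pack("f", x))[0]

big, ones = 16777216.0, [1.0] * 1000   # big = 2**24, plus a thousand 1's

s = big
for v in ones:            # large number first: every 1 is lost
    s = f32(s + v)

t = 0.0
for v in ones:            # small numbers first: they accumulate exactly
    t = f32(t + v)
t = f32(t + big)

print(s, t)   # 16777216.0 versus 16778216.0
```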


Example 1.16
Let x₁ = 1.5314768 and x₂ = 1.5314899, with β = 10 and t = 6 (6-digit decimal accuracy). Then x₂ − x₁ ≈ fl(x₂) − fl(x₁) = 1.53149 − 1.53148 = 0.00001. Thus,

|(x₂ − x₁ − (fl(x₂) − fl(x₁))) / (x₂ − x₁)| = (0.0000131 − 0.00001)/0.0000131 ≈ 0.237,

that is, 23.7% relative error.

This example illustrates that large relative errors can occur when two nearly equal numbers are subtracted on a computer. Sometimes, an algorithm can be modified to reduce rounding error occurring from this source, as the following example illustrates.

Example 117

Consider finding the roots of ax2 + bx + c = 0 where b2 is large comparedwith |4ac| The most common formula for the roots is

b2 minus 4ac

2a

Consider x2 + 100x + 1 = 0 β = 10 t = 4 p = 2 and 4-digit choppedarithmetic Then

x1 = minus100 +

2 x2 = minus100

minus

2

9996 asymp 9997 (4 digit arithmetic chopped) Thus

x1 asymp minus100 + 9997

2 x2 asymp minus100 minus 9997

2

Hence x1 asymp minus0015 x2 asymp minus9998 but x1 = minus0010001 and x2 = minus99989999so the relative errors in x1 and x2 are 50 and 001 respectively

Letrsquos change the algorithm Assume b ge 0 (can always make b ge 0) Then

x1 = minusb +

2a

b2 minus 4ac

minusb minus radic b2 minus 4ac

b2 minus 4ac)= minus2c

b2 minus 4ac

and

b2 minus 4ac

2a (the same as before)

8132019 num chap 1

Mathematical Review and Computer Arithmetic 17

Then, for the above values,

    x1 = −2(1) / (100 + √9996) ≈ −2/199.9 ≈ −0.01000.

Now, the relative error in x1 is also 0.01%.
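The rewritten formula translates directly into code. Below is a Python sketch (not the text's own program); for b < 0, the symmetric rewriting x2 = 2c/(−b + √(b^2 − 4ac)) avoids the cancellation in the other root:

```python
import math

def stable_quadratic_roots(a, b, c):
    """Real roots of a*x**2 + b*x + c = 0, computed so that -b and
    sqrt(b*b - 4*a*c) are never subtracted when they are nearly equal."""
    d = math.sqrt(b * b - 4 * a * c)    # assumes real roots (b*b >= 4*a*c)
    if b >= 0:
        x1 = -2 * c / (b + d)           # rewritten form: no cancellation
        x2 = (-b - d) / (2 * a)
    else:
        x1 = (-b + d) / (2 * a)
        x2 = 2 * c / (-b + d)           # rewritten form for the other root
    return x1, x2

# x**2 + 100*x + 1 = 0: roots are about -0.0100010 and -99.9899990.
print(stable_quadratic_roots(1.0, 100.0, 1.0))
```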

Let us now consider error in function evaluation. Consider a single-valued function f(x), and let x* = fl(x) be the floating point approximation of x. Therefore, the machine evaluates f(x*) = f(fl(x)), which is an approximate value of f(x) at x = x*. The perturbation in f(x) for small perturbations in x can then be computed via Taylor's formula. This is illustrated in the next theorem.

THEOREM 1.7
The relative error in functional evaluation is

    |(f(x) − f(x*)) / f(x)| ≈ |x f′(x) / f(x)| · |(x − x*) / x|.

PROOF The linear Taylor approximation of f(x*) about f(x), for small values of |x − x*|, is given by f(x*) ≈ f(x) + f′(x)(x* − x). Rearranging the terms immediately yields the result.

This leads us to the following definition

DEFINITION 1.5 The condition number of a function f(x) is

    κ_f(x) = |x f′(x) / f(x)|.

The condition number describes how large the relative error in function evaluation is with respect to the relative error in the machine representation of x. In other words, κ_f(x) is a measure of the degree of sensitivity of the function at x.

Example 1.18

Let f(x) = √x. The condition number of f(x) about x is

    κ_f(x) = |x · (1/(2√x)) / √x| = 1/2.

This suggests that f(x) is well-conditioned.


Example 1.19

Let f(x) = √(x − 2). The condition number of f(x) about x is

    κ_f(x) = |x / (2(x − 2))|.

This is not defined at x* = 2. Hence, the function f(x) is numerically unstable and ill-conditioned for values of x close to 2.
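The two preceding examples are easy to check numerically. The helper below (our own Python sketch; the name `kappa` and the sample points are ours) evaluates κ_f(x) = |x f′(x)/f(x)| from a user-supplied f and f′:

```python
import math

def kappa(f, fprime, x):
    """Condition number |x * f'(x) / f(x)| of f at x."""
    return abs(x * fprime(x) / f(x))

# f(x) = sqrt(x - 2), the ill-conditioned function of Example 1.19.
f = lambda x: math.sqrt(x - 2)
fp = lambda x: 1.0 / (2.0 * math.sqrt(x - 2))

print(kappa(f, fp, 2.0001))   # very large near x = 2: ill-conditioned
print(kappa(f, fp, 1000.0))   # modest far from 2: well-conditioned there
```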

REMARK 1.2 If x = f(x) = 0, then the condition number is simply |f′(x)|. If x = 0, f(x) ≠ 0 (or f(x) = 0, x ≠ 0), then it is more useful to consider the relation between absolute errors than relative errors; the condition number then becomes |f′(x)/f(x)|.

REMARK 1.3 Generally, if a numerical approximation ẑ to a quantity z is computed, the relative error is related to the number of significant digits that are correct. For example, if z = 0.0000123453 and ẑ = 0.00001234543, we say that ẑ is correct to 5 significant digits. Expressing z as 0.123453 × 10^-4 and ẑ as 0.123454 × 10^-4, we see that, if we round ẑ to the nearest number with five digits in its mantissa, all of those digits are correct, whereas, if we do the same with six digits, the sixth digit is not correct. Significant digits is the more logical way to talk about accuracy in a floating point computation, where we are interested in relative error, rather than "number of digits after the decimal point," which can have a different meaning. (Here, one might say that ẑ is correct to 9 digits after the decimal point.)

1.2.2 Practicalities and the IEEE Floating Point Standard

Prior to 1985, different machines used different word lengths and different bases, and different machines rounded, chopped, or did something else to form the internal representation fl(x) for real numbers x. For example, IBM mainframes generally used hexadecimal arithmetic (β = 16), with 8 hexadecimal digits total (for the sign, exponent, and mantissa) in "single precision" numbers and 16 hexadecimal digits total in "double precision" numbers. Machines such as the Univac 1108 and Honeywell Multics systems used base β = 2, with 36 binary digits (or "bits") total in single precision numbers and 72 binary digits total in double precision numbers. An unusual machine designed at Moscow State University from 1955 to 1965, the "Setun," even used base-3 (β = 3, or "ternary") numbers. Some computers had 32 bits total in single precision numbers and 64 bits total in double precision numbers, while some "supercomputers" (such as the Cray-1) had 64 bits total in single precision numbers and 128 bits total in double precision numbers.

Some hand-held calculators in existence today (such as some Texas Instruments calculators) can be viewed as implementing decimal (base 10, β = 10) arithmetic, say, with L = −999, U = 999, and t = 14 digits in the mantissa.

Except for the Setun (the value of whose ternary digits corresponded to "positive," "negative," and "neutral" in circuit elements or switches), digital computers are mostly based on binary switches or circuit elements (that is, "on" or "off"), so the base β is usually 2 or a power of 2. For example, the IBM hexadecimal digit can be viewed as a group of 4 binary digits.2

Older floating point implementations did not always fit exactly into the model we have previously described. For example, if x was a number in the system, then −x may not have been a number in the system, or, if x was a number in the system, then 1/x may have been too large to be representable in the system.

To promote predictability, portability, reliability, and rigorous error bounding in floating point computations, the Institute of Electrical and Electronics Engineers (IEEE) and American National Standards Institute (ANSI) published a standard for binary floating point arithmetic in 1985: IEEE/ANSI 754-1985, Standard for Binary Floating Point Arithmetic, often referenced as "IEEE-754," or simply "the IEEE standard."3 Almost all computers in existence today, including personal computers and workstations based on Intel, AMD, Motorola, etc. chips, implement most of the IEEE standard.

In this standard, β = 2; 32 bits total are used in a single precision number (an "IEEE single"), and 64 bits total are used for a double precision number ("IEEE double"). In a single precision number, 1 bit is used for the sign, 8 bits are used for the exponent, and t = 23 bits are used for the mantissa. In double precision numbers, 1 bit is used for the sign, 11 bits are used for the exponent, and 52 bits are used for the mantissa. Thus, for single precision numbers, the stored exponent is between 0 and (11111111)₂ = 255, and 127 is subtracted from this to get an exponent between −127 and 128. In IEEE numbers, the minimum and maximum exponents are used to denote special symbols (such as infinity and "unnormalized" numbers), so the exponent in single precision represents magnitudes between 2^-126 ≈ 10^-38 and 2^127 ≈ 10^38. The mantissa for single precision numbers represents numbers between 2^0 = 1 and 1 + 2^-1 + ··· + 2^-23 = 2(1 − 2^-24) ≈ 2. Similarly, the exponent for double precision numbers is effectively between 2^-1022 ≈ 10^-308 and 2^1023 ≈ 10^308, while the mantissa for double precision numbers represents numbers between 2^0 = 1 and 1 + 2^-1 + ··· + 2^-52 ≈ 2.

Summarizing, the parameters for IEEE arithmetic appear in Table 1.1.

In many numerical computations, such as solving the large linear systems arising from partial differential equation models, more digits or a larger exponent range is required than is available with IEEE single precision. For

2An exception is in some systems for business calculations, where base 10 is implemented.

3An update to the 1985 standard was made in 2008. This update gives clarifications of certain ambiguous points, provides certain extensions, and specifies a standard for decimal arithmetic.


TABLE 1.1: Parameters for IEEE arithmetic

    precision    β      L       U      t
    single       2    -126     127    24
    double       2   -1022    1023    53

this reason, many numerical analysts at present have adopted IEEE double precision as the default precision. For example, underlying computations in the popular computational environment matlab are done in IEEE double precision.

IEEE arithmetic provides four ways of defining fl(x), that is, four "rounding modes": "round down," "round up," "round to nearest," and "round to zero." The four elementary operations +, −, ×, and ÷ must be such that fl(x ⊙ y) is implemented for all four rounding modes, for ⊙ ∈ {−, +, ×, ÷}, and similarly for √·.

The default mode (if the rounding mode is not explicitly set) is normally "round to nearest," to give an approximation, after a long string of computations, that is hopefully near the exact value. If the mode is set to "round down" and a string of computations is done, then the result is less than or equal to the exact result. Similarly, if the mode is set to "round up," then the result of a string of computations is greater than or equal to the exact result. In this way, mathematically rigorous bounds on an exact result can be obtained. (This technique must be used astutely, since naive use could result in bounds that are too large to be meaningful.)
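Ordinary Python floats do not expose the IEEE rounding-mode switch portably, but the bracketing idea — run a computation once rounding every operation down and once rounding it up — can be emulated with the standard decimal module. A sketch (the function `bracket_sum` and the sample data are our own):

```python
from decimal import Decimal, getcontext, ROUND_FLOOR, ROUND_CEILING

def bracket_sum(terms, digits=6):
    """Add the (exactly representable) decimal strings in `terms` twice,
    once with every operation rounded down and once rounded up, giving
    a rigorous enclosure of the exact sum."""
    bounds = []
    for mode in (ROUND_FLOOR, ROUND_CEILING):
        getcontext().prec = digits
        getcontext().rounding = mode
        total = Decimal(0)
        for t in terms:
            total = total + Decimal(t)   # each + rounds in the current mode
        bounds.append(total)
    return bounds[0], bounds[1]

lo, hi = bracket_sum(["1"] * 3 + ["1e-7"] * 5)
print(lo, hi)   # the exact sum 3.0000005 lies between lo and hi
```

Note that the enclosure is rigorous but not tight: every intermediate rounding widens it slightly, the effect warned about in the paragraph above.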

Several parameters more directly related to numerical computations than L, U, and t are associated with any floating point number system. These are:

HUGE: the largest representable number in the floating point system;

TINY: the smallest positive representable number in the floating point system;

ε_m: the machine epsilon, the smallest positive number which, when added to 1, gives something other than 1 when using the rounding mode "round to nearest."

These so-called "machine constants" appear in Table 1.2 for the IEEE single and IEEE double precision number systems.

For IEEE arithmetic, 1/TINY < HUGE, but 1/HUGE < TINY. This brings up the question of what happens when the result of a computation has absolute value less than the smallest number representable in the system, or has absolute value greater than the largest number representable in the system. In the first case, an underflow occurs, while in the second case, an overflow occurs. In floating point computations, it is usually (but not always) reasonable to replace the result of an underflow by 0, but it is usually more problematical


TABLE 1.2: Machine constants for IEEE arithmetic

    Precision   HUGE                            TINY                      ε_m
    single      (2 − 2^-23)·2^127 ≈ 3.40·10^38    2^-126 ≈ 1.18·10^-38    2^-24 + 2^-45 ≈ 5.96·10^-8
    double      (2 − 2^-52)·2^1023 ≈ 1.79·10^308  2^-1022 ≈ 2.23·10^-308  2^-53 + 2^-105 ≈ 1.11·10^-16

when an overflow occurs. Many systems prior to the IEEE standard replaced an underflow by 0, but stopped when an overflow occurred.

The IEEE standard specifies representations for special numbers ∞, −∞, +0, −0, and NaN, where the latter represents "not a number." The standard specifies that computations do not stop when an overflow or underflow occurs, or when quantities such as 1/0, √−1, etc. are encountered (although many programming languages, by default or optionally, do stop). For example, the result of an overflow is set to ∞, whereas the result of √−1 is set to NaN, and computation continues. The standard also specifies "gradual underflow," that is, setting the result to a "denormalized" number, or a number in the floating point format whose first digit in the mantissa is equal to 0. Computation rules for these special numbers, such as NaN × any number = NaN and ∞ × any positive normalized number = ∞, allow such "nonstop" arithmetic.

Although IEEE nonstop arithmetic is useful in many contexts, the numerical analyst should be aware of it and be cautious in interpreting results. In particular, algorithms may not behave as expected if many intermediate results contain ∞ or NaN, and the accuracy is less than expected when denormalized numbers are used. In fact, many programming languages, by default or with a controllable option, stop if ∞ or NaN occurs, but implement IEEE nonstop arithmetic with an option.

Example 1.20

IEEE double precision floating point arithmetic underlies most computations in matlab. (This is true even if only four decimal digits are displayed.) One obtains the machine epsilon with the function eps, one obtains TINY with the function realmin, and one obtains HUGE with the function realmax. Observe the following matlab dialog:

>> epsm = eps(1d0)
epsm = 2.2204e-016
>> TINY = realmin
TINY = 2.2251e-308
>> HUGE = realmax
HUGE = 1.7977e+308
>> 1/TINY
ans = 4.4942e+307
>> 1/HUGE
ans = 5.5627e-309
>> HUGE^2
ans = Inf
>> TINY^2
ans = 0
>> new_val = 1+epsm
new_val = 1.0000
>> new_val - 1
ans = 2.2204e-016
>> too_small = epsm/2
too_small = 1.1102e-016
>> not_new = 1+too_small
not_new = 1
>> not_new - 1
ans = 0
>>
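The same experiments can be repeated outside matlab. In Python, for instance, the IEEE double precision machine constants are exposed in `sys.float_info` (a sketch of ours, not from the text; note that Python's `epsilon` is 2^-52, the spacing of doubles near 1, which matches matlab's eps):

```python
import sys

eps_m = sys.float_info.epsilon   # 2**-52, matlab's eps
tiny = sys.float_info.min        # smallest positive normalized double
huge = sys.float_info.max        # largest finite double

print(eps_m, tiny, huge)
print(huge * huge)               # overflows to inf (nonstop arithmetic)
print(tiny * tiny)               # underflows past the denormals to 0.0
print((1.0 + eps_m) - 1.0)       # eps_m survives the addition
print((1.0 + eps_m / 2) - 1.0)   # rounds back to 1.0, so the difference is 0.0
```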

Example 1.21

(Illustration of underflow and overflow) Suppose, for the purposes of illustration, we have a system with β = 10, t = 2, and one digit in the exponent, so that the positive numbers in the system range from 0.10 × 10^-9 to 0.99 × 10^9, and suppose we wish to compute N = √(x1^2 + x2^2), where x1 = x2 = 10^6. Then both x1 and x2 are exactly represented in the system, and the nearest floating point number in the system to N is 0.14 × 10^7, well within range. However, x1^2 = 10^12 is larger than the maximum floating point number in the system.

In older systems, an overflow usually would result in stopping the computation, while in IEEE arithmetic, the result would be assigned the symbol "Infinity." The result of adding "Infinity" to "Infinity," then taking the square root, would be "Infinity," so that N would be assigned "Infinity." Similarly, if x1 = x2 = 10^-6, then x1^2 = 10^-12 is smaller than the smallest representable machine number, causing an "underflow." On older systems, the result is usually set to 0. On IEEE systems, if "gradual underflow" is switched on, the result either becomes a denormalized number, with less than full accuracy, or is set to 0; without gradual underflow on IEEE systems, the result is set to 0. When the result is set to 0, a value of 0 is stored in N, whereas the closest floating point number in the system is 0.14 × 10^-5, well within range. To avoid this type of catastrophic underflow and overflow in the computation of N, we may use the following scheme:

1. s ← max{|x1|, |x2|}.
2. η1 ← x1/s; η2 ← x2/s.
3. N ← s√(η1^2 + η2^2).
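The three-step scaling scheme translates directly into code. A Python sketch (our own; in practice one would simply call the library function `math.hypot`, which handles this internally):

```python
import math

def scaled_norm2(x1, x2):
    """sqrt(x1**2 + x2**2) via the scaling scheme: the scaled values
    eta1, eta2 lie in [-1, 1], so squaring them cannot overflow."""
    s = max(abs(x1), abs(x2))
    if s == 0.0:
        return 0.0
    eta1, eta2 = x1 / s, x2 / s
    return s * math.sqrt(eta1 * eta1 + eta2 * eta2)

def naive_norm2(x1, x2):
    return math.sqrt(x1 * x1 + x2 * x2)

print(scaled_norm2(1e200, 1e200))   # about 1.414e200
print(naive_norm2(1e200, 1e200))    # intermediate squares overflow: inf
```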


1.2.2.1 Input and Output

For examining the output of large numerical computations arising from mathematical models, plots, graphs, and movies comprised of such plots and graphs are often preferred over tables of values. However, to develop such models and study numerical algorithms, it is necessary to examine individual numbers. Because humans are trained to comprehend decimal numbers more easily than binary numbers, the binary format used in the machine is usually converted to a decimal format for display or printing. In many programming languages and environments (such as all versions of Fortran, C, C++, and matlab), the format is of a form similar to ±d1.d2d3···dm e±δ1δ2δ3 or ±d1.d2d3···dm E±δ1δ2δ3, where the "e" or "E" denotes the "exponent" of 10. For example, -1.00e+003 denotes −1 × 10^3 = −1000. Numbers are usually also input either in a standard decimal form (such as 0.001) or in this exponential format (such as 1.0e-3). (This notation originates from the earliest computers, where the only output was a printer, and the printer could only print numerical digits and the 26 upper case letters in the Roman alphabet.)

Thus, for input, a decimal fraction needs to be converted to a binary floating point number, while, for output, a binary floating point number needs to be converted to a decimal fraction. This conversion necessarily is inexact. For example, the exact decimal fraction 0.1 converts to the infinitely repeating binary expansion (0.000110011001100...)₂, which needs to be rounded into the binary floating point system. The IEEE 754 standard specifies that the result of a decimal to binary conversion, within a specified range of input formats, be the nearest floating point number to the exact result, and that, within a specified range of formats, a binary to decimal conversion be the nearest number in the specified format (which depends on the number m of decimal digits requested to be printed).

Thus, the number that one sees as output is usually not exactly the number that is represented in the computer. Furthermore, while the floating point operations on binary numbers are usually implemented in hardware or "firmware" independently of the software system, the decimal to binary and binary to decimal conversions are usually implemented separately as part of the programming language (such as Fortran, C, C++, Java, etc.) or software system (such as matlab). The individual standards for these languages, if there are any, may not specify accuracy for such conversions, and the languages sometimes do not conform to the IEEE standard. That is, the number that one sees printed may not even be the closest number in that format to the actual number.

This inexactness in conversion usually does not cause a problem, but may cause much confusion in certain instances. In those instances (such as in "debugging," or finding programming blunders), one may need to examine the binary numbers directly. One way of doing this is in an "octal," or base-8, format, in which each digit (between 0 and 7) is interpreted as a group of three binary digits, or in hexadecimal format (where the digits are 0-9, A, B, C, D, E, F), in which each digit corresponds to a group of four binary digits.
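These effects are easy to observe interactively. In Python, for instance (a sketch of ours, not from the text), one can display the exact decimal value of the binary number stored for 0.1, as well as its hexadecimal bit pattern:

```python
from decimal import Decimal

print(Decimal(0.1))   # exact decimal expansion of the stored binary fl(0.1)
print((0.1).hex())    # hexadecimal view of the same bits
print(repr(0.1))      # shortest decimal string that converts back to fl(0.1)
```

The first line shows that fl(0.1) is not exactly one tenth, while the last line shows that the printed "0.1" is a rounded decimal conversion, not the stored value.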

1.2.2.2 Standard Functions

To enable accurate computation of elementary functions such as sin, cos, and exp, IEEE 754 specifies that a "long" 80-bit register (with "guard digits") be available for intermediate computations. Furthermore, IEEE 754-2008, an official update to IEEE 754-1985, provides a list of functions it recommends be implemented, and specifies accuracy requirements (in terms of correct rounding) for those functions a programming language elects to implement.

REMARK 1.4 Alternative number systems, such as variable precision arithmetic, multiple precision arithmetic, rational arithmetic, and combinations of approximate and symbolic arithmetic, have been investigated and implemented. These have various advantages over the traditional floating point arithmetic we have been discussing, but also have disadvantages, and usually require more time, more circuitry, or both. Eventually, with the advance of computer hardware and better understanding of these alternative systems, their use may become more ubiquitous. However, for the foreseeable future, traditional floating point number systems will be the primary tool in numerical computations.

1.3 Interval Computations

Interval computations are useful for two main purposes:

• to use floating point computations to compute mathematically rigorous bounds on an exact result (and hence to rigorously bound roundoff error);

• to use floating point computations to compute mathematically rigorous bounds on the ranges of functions over boxes.

In complicated traditional floating point algorithms, naive arrangement of interval computations usually gives bounds that are too wide to be of practical use. For this reason, interval computations have been ignored by many. However, used cleverly and where appropriate, interval computations are powerful, and provide rigor and validation when other techniques cannot.

Interval computations are based on interval arithmetic.


1.3.1 Interval Arithmetic

In interval arithmetic, we define operations on intervals, which can be considered as ordered pairs of real numbers. We can think of each interval as representing the range of possible values of a quantity. The result of an operation is then an interval that represents the range of all possible results of the operation, as the first argument ranges over all points in the first interval and the second argument ranges over all points in the second interval. To state this symbolically, let x = [x̲, x̄] and y = [y̲, ȳ], and define the four elementary operations by

    x ⊙ y = {x ⊙ y | x ∈ x and y ∈ y} for ⊙ ∈ {+, −, ×, ÷}.   (1.1)

Interval arithmetic's usefulness derives from the fact that the mathematical characterization in Equation (1.1) is equivalent to the following operational definitions:

    x + y = [x̲ + y̲, x̄ + ȳ],
    x − y = [x̲ − ȳ, x̄ − y̲],
    x × y = [min{x̲y̲, x̲ȳ, x̄y̲, x̄ȳ}, max{x̲y̲, x̲ȳ, x̄y̲, x̄ȳ}],
    1/x = [1/x̄, 1/x̲] if x̲ > 0 or x̄ < 0,
    x ÷ y = x × (1/y).   (1.2)

The ranges of the four elementary interval arithmetic operations are exactly the ranges of the corresponding real operations, but, if such operations are composed, bounds on the ranges of real functions can be obtained. For example, if

    f(x) = (x + 1)(x − 1),   (1.3)

then

    f([−2, 2]) = ([−2, 2] + 1)([−2, 2] − 1) = [−1, 3][−3, 1] = [−9, 3],

which contains the exact range [−1, 3].
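The operational definitions (1.2) are short enough to implement directly. The minimal Python class below (our own sketch; it omits division and, importantly, outward rounding, so it illustrates the arithmetic only, not rigorous enclosures) reproduces the evaluation of (1.3):

```python
class Interval:
    """Intervals [lo, hi] with +, -, * following (1.2); plain floating
    point endpoints, so no outward rounding in this sketch."""
    def __init__(self, lo, hi=None):
        self.lo = lo
        self.hi = lo if hi is None else hi

    def _coerce(self, other):
        return other if isinstance(other, Interval) else Interval(other)

    def __add__(self, other):
        o = self._coerce(other)
        return Interval(self.lo + o.lo, self.hi + o.hi)

    def __sub__(self, other):
        o = self._coerce(other)
        return Interval(self.lo - o.hi, self.hi - o.lo)

    def __mul__(self, other):
        o = self._coerce(other)
        p = [self.lo * o.lo, self.lo * o.hi, self.hi * o.lo, self.hi * o.hi]
        return Interval(min(p), max(p))

    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"

x = Interval(-2.0, 2.0)
print((x + 1) * (x - 1))   # [-9.0, 3.0]: encloses the true range [-1, 3]
```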

REMARK 1.5 In some definitions of interval arithmetic, division by intervals containing 0 is defined, consistent with (1.1). For example,

    [1, 2] / [−3, 4] = (−∞, −1/3] ∪ [1/4, ∞) = R* \ (−1/3, 1/4),

where R* is the extended real number system,4 consisting of the real numbers with the two additional numbers −∞ and ∞. This extended interval arithmetic5 was originally invented by William Kahan6 for computations with continued fractions, but has wider use than that. Although a closed system can be defined for the sets arising from this extended arithmetic, typically the complements of intervals (i.e., the unions of two semi-infinite intervals) are immediately intersected with intervals to obtain zero, one, or two intervals. Interval arithmetic can then proceed using (1.2).

4also known as the two-point compactification of the real numbers

The power of interval arithmetic lies in its implementation on computers. In particular, outwardly rounded interval arithmetic allows rigorous enclosures for the ranges of operations and functions. This makes a qualitative difference in scientific computations, since the results are now intervals in which the exact result must lie. It also enables use of floating point computations for automated theorem proving.

Outward rounding can be implemented on any machine that has downward rounding and upward rounding, such as any machine that complies with the IEEE 754 standard. For example, take x + y = [x̲ + y̲, x̄ + ȳ]. If x̲ + y̲ is computed with downward rounding, and x̄ + ȳ is computed with upward rounding, then the resulting interval z = [z̲, z̄] that is represented in the machine must contain the exact range of x + y for x ∈ x and y ∈ y. We call the expansion of the interval from rounding the lower end point down and the upper end point up roundout error.
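On a system where the rounding mode cannot be set directly, outward rounding can be approximated conservatively by widening each computed endpoint by one unit in the last place; `math.nextafter` (Python 3.9+) makes this a one-liner per endpoint. A sketch of ours, for addition only:

```python
import math

def outward_add(x, y):
    """Enclosure of {a + b : a in x, b in y} for intervals given as
    (lo, hi) tuples: each endpoint is computed in round-to-nearest and
    then pushed one ulp outward, which over-covers directed rounding."""
    lo = math.nextafter(x[0] + y[0], -math.inf)
    hi = math.nextafter(x[1] + y[1], math.inf)
    return lo, hi

lo, hi = outward_add((0.1, 0.1), (0.2, 0.2))
print(lo, hi)   # a rigorous enclosure of the real number 0.3
```

Since round-to-nearest errs by at most half an ulp, the one-ulp widening guarantees containment, at the cost of intervals slightly wider than true directed rounding would give.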

Interval arithmetic is only subdistributive. That is, if x, y, and z are intervals, then

    x(y + z) ⊆ xy + xz, but x(y + z) ≠ xy + xz in general.   (1.4)

As a result, algebraic expressions that would be equivalent if real values were substituted for the variables are not equivalent if interval values are used. For example, instead of writing (x + 1)(x − 1) for f(x) as in (1.3), suppose we write

    f(x) = x^2 − 1,   (1.5)

and suppose we provide a routine that computes an enclosure for the range of x^2 that is the exact range to within roundoff error. Such a routine could be as follows.

ALGORITHM 1.1

(Computing an interval whose end points are machine numbers and which encloses the range of x^2)

5There are small differences in current definitions of extended interval arithmetic. For example, in some systems, −∞ and ∞ are not considered numbers, but just descriptive symbols. In those systems, [1, 2]/[−3, 4] = (−∞, −1/3] ∪ [1/4, ∞) = R \ (−1/3, 1/4). See [31] for a theoretical analysis of extended arithmetic.

6who also was a major contributor to the IEEE 754 standard


INPUT: x = [x̲, x̄].

OUTPUT: a machine-representable interval that contains the range of x^2 over x.

IF x̲ ≥ 0 THEN
    RETURN [x̲^2, x̄^2], where x̲^2 is computed with downward rounding and x̄^2 is computed with upward rounding.
ELSE IF x̄ ≤ 0 THEN
    RETURN [x̄^2, x̲^2], where x̄^2 is computed with downward rounding and x̲^2 is computed with upward rounding.
ELSE
    1. Compute x̲^2 and x̄^2 with both downward and upward rounding; that is, compute machine representable numbers (x̲^2)_l, (x̲^2)_u, (x̄^2)_l, and (x̄^2)_u such that x̲^2 ∈ [(x̲^2)_l, (x̲^2)_u] and x̄^2 ∈ [(x̄^2)_l, (x̄^2)_u].
    2. RETURN [0, max{(x̲^2)_u, (x̄^2)_u}].
END IF

END ALGORITHM 1.1.
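Algorithm 1.1 can be sketched in Python, with a one-ulp widening via `math.nextafter` standing in for directed rounding (the name `enclose_square` is our own):

```python
import math

def enclose_square(xlo, xhi):
    """Enclosure of {x**2 : xlo <= x <= xhi}, following Algorithm 1.1."""
    down = lambda v: math.nextafter(v, -math.inf)   # emulate rounding down
    up = lambda v: math.nextafter(v, math.inf)      # emulate rounding up
    if xlo >= 0:
        return down(xlo * xlo), up(xhi * xhi)
    elif xhi <= 0:
        return down(xhi * xhi), up(xlo * xlo)
    else:                     # interval straddles 0: the minimum is exactly 0
        return 0.0, max(up(xlo * xlo), up(xhi * xhi))

print(enclose_square(-2.0, 2.0))   # encloses [0, 4]
```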

With Algorithm 1.1, and rewriting f(x) from (1.3) as in (1.5), we obtain

    f([−2, 2]) = [−2, 2]^2 − 1 = [0, 4] − 1 = [−1, 3],

which, in this case, is equal to the exact range of f over [−2, 2].

In fact, this illustrates a general principle: if each variable in the expression occurs only once, then interval arithmetic gives the exact range to within roundout error. We state this formally as follows.

THEOREM 1.8

(Fundamental theorem of interval arithmetic) Suppose f(x1, x2, ..., xn) is an algebraic expression in the variables x1 through xn (or a computer program with inputs x1 through xn), and suppose that this expression is evaluated with interval arithmetic. The algebraic expression or computer program can contain the four elementary operations and operations such as x^n, sin(x), exp(x), and log(x), etc., as long as the interval values of these functions contain their ranges over the input intervals. Then:

1. The interval value f(x1, ..., xn) contains the range of f over the interval vector (or box) (x1, ..., xn).


2. If the single functions (the elementary operations and functions x^n, etc.) have interval values that represent their exact ranges, and if each variable xi, 1 ≤ i ≤ n, occurs only once in the expression for f, then the values of f obtained by interval arithmetic represent the exact ranges of f over the input intervals.

If the expression for f contains one or more variables more than once, then overestimation of the range can occur, due to interval dependency. For example, when we evaluate our example function f([−2, 2]) according to (1.3), the first factor, [−1, 3], is the exact range of x + 1 for x ∈ [−2, 2], while the second factor, [−3, 1], is the exact range of x − 1 for x ∈ [−2, 2]. Thus, [−9, 3] is the exact range of f(x1, x2) = (x1 + 1)(x2 − 1) for x1 and x2 independent, x1 ∈ [−2, 2], x2 ∈ [−2, 2].

We now present some definitions and theorems to clarify the practical consequences of interval dependency.

DEFINITION 1.6 An expression for f(x1, ..., xn) that is written so that each variable occurs only once is called a single use expression, or SUE.

Fortunately, we do not need to transform every expression into a single use expression for interval computations to be of value. In particular, the interval dependency becomes less as the widths of the input intervals become smaller. The following formal definition will help us to describe this precisely.

DEFINITION 1.7 Suppose an interval evaluation f(x1, ..., xn) gives [a, b] as a result interval, but the exact range {f(x1, ..., xn) : xi ∈ xi, 1 ≤ i ≤ n} is [c, d] ⊆ [a, b]. We define the excess width E(f; x1, ..., xn) in the interval evaluation f(x1, ..., xn) by E(f; x1, ..., xn) = (c − a) + (b − d).

For example, the excess width in evaluating f(x) represented as (x + 1)(x − 1) over x = [−2, 2] is (−1 − (−9)) + (3 − 3) = 8. In general, we have the following.

THEOREM 1.9

Suppose f(x1, x2, ..., xn) is an algebraic expression in the variables x1 through xn (or a computer program with inputs x1 through xn), and suppose that this expression is evaluated with interval arithmetic, as in Theorem 1.8, to obtain an interval enclosure f(x1, ..., xn) of the range of f for xi ∈ xi, 1 ≤ i ≤ n. Then, if E(f; x1, ..., xn) is as in Definition 1.7,

    E(f; x1, ..., xn) = O( max_{1 ≤ i ≤ n} w(xi) ),

where w(x) denotes the width of the interval x.


That is, the overestimation becomes smaller as the uncertainty in the arguments to the function becomes smaller. Interval evaluations as in Theorem 1.9 are termed first-order interval extensions. It is not difficult to obtain second-order extensions, where required. (See the Exercise below.)

1.3.2 Application of Interval Arithmetic: Examples

We give one such example here.

Example 1.22

Using 4-digit decimal floating point arithmetic, compute an interval enclosure for the first two digits of e, and prove that these two digits are correct.

Solution: The fifth degree Taylor polynomial representation for e is

    e = 1 + 1 + 1/2! + 1/3! + 1/4! + 1/5! + (1/6!)e^ξ

for some ξ ∈ [0, 1]. If we assume we know e < 3, and we assume we know e^x is an increasing function of x, then the error term is bounded by

    |(1/6!)e^ξ| ≤ 3/6! < 0.005,

so this fifth-degree polynomial representation should be adequate. We will evaluate each term with interval arithmetic, and we will replace e^ξ with [1, 3]. We obtain the following computation:

    [1.000, 1.000] + [1.000, 1.000] → [2.000, 2.000]
    [1.000, 1.000]/[2.000, 2.000] → [0.5000, 0.5000]
    [2.000, 2.000] + [0.5000, 0.5000] → [2.500, 2.500]
    [1.000, 1.000]/[6.000, 6.000] → [0.1666, 0.1667]
    [2.500, 2.500] + [0.1666, 0.1667] → [2.666, 2.667]
    [1.000, 1.000]/[24.00, 24.00] → [0.04166, 0.04167]
    [2.666, 2.667] + [0.04166, 0.04167] → [2.707, 2.709]
    [1.000, 1.000]/[120.0, 120.0] → [0.008333, 0.008334]
    [2.707, 2.709] + [0.008333, 0.008334] → [2.715, 2.718]
    [1.000, 1.000]/[720.0, 720.0] → [0.001388, 0.001389]
    [0.001388, 0.001389] × [1, 3] → [0.001388, 0.004167]
    [2.715, 2.718] + [0.001388, 0.004167] → [2.716, 2.723]

Since we used outward rounding in these computations, this constitutes a mathematical proof that e ∈ [2.716, 2.723].

Note:


1. These computations can be done automatically on a computer, as simply as evaluating the function in floating point arithmetic. We will explain some programming techniques for this in Chapter 6, Section 6.2.

2. The solution is illustrative. More sophisticated methods, such as argument reduction, would be used in practice to bound values of e^x more accurately and with fewer operations.
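The enclosure of Example 1.22 can also be reproduced with exact rational arithmetic, which needs no outward rounding at all (a Python sketch of ours using the standard fractions module; compare the alternative number systems of Remark 1.4):

```python
from fractions import Fraction
from math import factorial

# Degree-5 Taylor polynomial of e, plus the remainder (1/6!)e^xi, xi in [0, 1],
# with e^xi enclosed by [1, 3].
poly = sum(Fraction(1, factorial(k)) for k in range(6))
lo = poly + Fraction(1, factorial(6)) * 1
hi = poly + Fraction(1, factorial(6)) * 3

print(float(lo), float(hi))   # encloses e = 2.71828...
```

Because every operation on Fraction values is exact, the pair (lo, hi) is a rigorous enclosure; its width, 2/720, comes entirely from the remainder term.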

Proofs of the theorems, as well as greater detail, appear in various texts on interval arithmetic. A good book on interval arithmetic is R. E. Moore's classic text [27], although numerous more recent monographs and reviews are available. A World Wide Web search on the term "interval computations" will lead to some of these.

A general introduction to interval computations is [26]. That work gives not only a complete introduction, with numerous examples and explanation of pitfalls, but also provides examples with intlab, a free matlab toolbox for interval computations, and reference material for intlab. If you have matlab available, we recommend intlab for the exercises in this book involving interval computations.

1.4 Programming Environments

Modern scientific computing (with floating point numbers) is usually done with high-level "imperative" (as opposed to "functional") programming languages. Common programming environments in use for general scientific computing today are Fortran (FORmula TRANslation), C/C++, and matlab. Fortran is the original such language, with its origins in the late 1950's. There is a large body of high-quality, publicly available software in Fortran for common computations in numerical analysis and scientific computing. Such software can be found, for example, on NETLIB, at

    http://www.netlib.org

Fortran has evolved over the years, becoming a modern, multi-faceted language with the Fortran 2003 standard. Throughout, the emphasis of both the standardization committee and suppliers of Fortran compilers has been on features that simplify programming of solutions to large problems in numerical analysis and scientific computing, and on features that enable high performance, especially on computers that can process vectors and matrices efficiently.

The "C" language, originally developed in conjunction with the Unix operating system, was meant to be a higher-level language for designing and accessing the operating system, but has become more ubiquitous since then. C++, appearing in the late 1980's, was the first widely available language7 to allow the object-oriented programming paradigm. In recent years,

computer science departments have favored teaching C++ over teaching For-tran and Fortran has fallen out of favor in relative terms However Fortran isstill favored in certain large-scale applications such as fluid dynamics (eg inweather prediction and similar simulations) and some courses are still offeredin it in engineering schools However some people still think of Fortran asthe now somewhat rudimentary language known as FORTRAN 77

Reasonably high-quality compilers for both Fortran and C/C++ are available free of charge with Linux operating systems. Fortran 2003, largely implemented in these compilers, has a standardized interface to C, so functions written in C can be called from Fortran programs, and vice versa. These compilers include interactive, graphical-user-interface-oriented debuggers, such as "insight," available with the Linux operating system. Commercially available compilation and debugging systems are also available under Windows.

The matlab system has become increasingly popular over the last two decades or so. matlab (for MATrix LABoratory) began in the early 1980's as a National Science Foundation project, written by Cleve Moler in FORTRAN 66, to provide an interactive environment for computing with matrices and vectors; it has since evolved to be both an interactive environment and a full-featured programming language. matlab is highly favored in courses such as this because of the ease of programming, debugging, and general use (such as graphing), and because of the numerous toolboxes, supplied by both Mathworks (Cleve Moler's company) and others, for many basic computing tasks and applications. The main drawback to use of matlab in all scientific computations is that the language is interpretive; that is, matlab translates each line of a program to machine language each time that it executes the line. This makes complicated programs that involve nested iterations much slower (a factor of 60 or more) than comparable programs written in a compiled language such as C or Fortran. However, functions compiled from Fortran or C/C++ can be called from matlab. A common strategy has been to initially develop algorithms in matlab, then translate all or part of the program to a compilable language, as necessary for efficiency.

One perceived disadvantage of matlab is that it is proprietary. Undesirable possible consequences are that it is not free, and that there is no official guarantee that it will be available forever, unchanged. However, its use has become so widespread in recent years that these concerns do not seem to be major. Several projects, including "Octave" and "Scilab," have produced free products that partially support the matlab programming language. The most widely distributed of these, "Octave," is integrated into Linux systems. However, the object-oriented features of Octave are rudimentary compared to those of matlab, and some toolboxes, such as intlab (which we will mention later), will not function with Octave.

⁷ With others, including Fortran, to follow.


Alternative systems sometimes used for scientific computing are computer algebra systems. Perhaps the most common of these are Mathematica and Maple, while a free such system under development is "SAGE." These systems admit a different way of thinking about programming, termed functional programming, in which rules are defined and are available at all times to automatically simplify expressions that are presented to the system. (In contrast, in imperative programming, a sequence of commands is executed one after the other.) Although these systems have become comprehensive, they are based in computations of a different character, rather than in the floating point computations and linear algebra involved in numerical analysis and scientific computing.

We will use matlab in this book to illustrate the concepts, techniques, and applications. With newer versions of matlab, a student can study how to use the system and make programs largely by using the matlab help system. The first place to turn will be the "Getting started" demos, which in newer versions are presented as videos. There are also many books devoted to use of matlab. Furthermore, we will be giving examples throughout this book.

matlab programs can be written as matlab scripts and matlab functions.

Example 1.23

The matlab script we used to produce the table following Example 1.7 (on page 8) is:

a = 2;
x = 2;
xold = x;
err_old = 1;
for k=0:10
    k
    x
    err = x - sqrt(2);
    err
    ratio = err/err_old^2
    err_old = err;
    x = x/2 + 1/x;
end

Example 1.24

The matlab script we used to produce the table in Example 1.8 (on page 9) is:

format long
a = 2;
x = 2;
xold = x;
err_old = 1;
for k=0:25
    k
    x
    err = x - sqrt(2);
    err
    ratio = err/err_old
    err_old = err;
    x = x - x^2/3.5 + 2/3.5;
end

An excellent alternative textbook that focuses on matlab functions is Cleve Moler's Numerical Computing with Matlab [25]. An on-line version, along with ".m" files, etc., is currently available at http://www.mathworks.com/moler/chapters.html

1.5 Applications

The purpose of the methods and techniques in this book, ultimately, is to provide both accurate predictions and insight into practical problems. This includes understanding and predicting, and managing or controlling, the evolution of ecological systems and epidemics; designing and constructing durable but inexpensive bridges, buildings, roads, and water control structures; understanding chemical and physical processes; designing chemical plants and electronic components and systems; minimizing costs or maximizing delivery of products or services within companies and governments; etc. To achieve these goals, the numerical methods are a small part of the overall modeling process, which can be viewed as consisting of the following steps:

Identify the problem: This is the first step in translating an often vague situation into a mathematical problem to be solved. What questions must be answered, and how can they be quantified?

Assumptions: Which factors are to be ignored, and which are important? The real world is usually significantly more complicated than mathematical models of it, and simplifications must be made, because some factors are poorly understood, because there isn't enough data to determine some minor factors, or because it is not practical to accurately solve the resulting equations unless the model is simplified. For example, the theory of relativity and variations in the acceleration of gravity, due to the fact that the earth is not exactly round and due to the fact that the density differs from point to point on the surface of the earth, in principle will affect the trajectory of a baseball as it leaves the bat. However, such effects can be ignored when we write down a model of the trajectory of the baseball. On the other hand, we need to include such effects if we are measuring the change in distance between two satellites in a tandem orbit to detect the location of mineral deposits on the surface of the earth.

Construction: In this step, we actually translate the problem into mathematical language.

Analysis: We solve the mathematical problem. Here is where the numerical techniques in this book come into play. With more complicated models, there is an interplay between the previous three steps and this solution process: we may need to simplify the process to enable practical solution. Presentation of the result is also important here, to maximize the usefulness of the results. In the early days of scientific computing, printouts of numbers were used, but increasingly, results are presented as two- and three-dimensional graphs and movies.

Interpretation: The numerical solution is compared to the original problem. If it does not make sense, go back and reformulate the assumptions.

Validation: Compare the model to real data. For example, in climate models, the model might be used to predict climate changes in past years, before it is used to predict future climate changes.

Note that there is an intrinsic error introduced in the modeling process (such as when certain phenomena are ignored) that is outside the scope of our study of numerical methods. Such error can only be measured indirectly, through the interpretation and validation steps. In the model solution process (the "analysis" step), errors are also introduced due to roundoff error and the approximation process. We have seen that such error consists of approximation error and roundoff error. In a study of numerical methods and numerical analysis, we quantify and find bounds on such errors. Although this may not be the major source of error, it is important to know. Consequences of this type of error might be that a good model is rejected, that incorrect conclusions are deduced about the process being modeled, etc. The authors of this book have personal experience with these events.

Errors in the modeling process can sometimes be quantified in the solution process. If the model depends on parameters that are not known precisely, but bounds on those parameters are known, knowledge of these bounds can sometimes be incorporated into the mathematical equations, and the set of possible solutions can sometimes be computed or bounded. One tool that sometimes works is interval arithmetic. Other tools, less mathematically definite but applicable in different situations, are statistical methods and computing solutions to the model for many different values of the parameters.

Throughout this book, we introduce applications from many areas.

Example 1.25

The formula for the net capacitance when two capacitors of values x and y are connected in series is

    z = xy/(x + y).

Suppose the measured values of x and y are x = 1 and y = 2, respectively. Estimate the range of possible values of z, given that the true values of x and y are known to be within ±10% of the measured values.

In this example, the identification, assumptions, and construction have already been done. (It is well known how capacitances in a linear electrical circuit behave.) We are asked to analyze the error in the output of the computation, due to errors in the data. We may proceed using interval arithmetic, relying on the accuracy assumptions for the measured values. In particular, these assumptions imply that x ∈ [0.9, 1.1] and y ∈ [1.8, 2.2]. We will plug these intervals into the expression for z, but we first use Theorem 1.8, part (2), as a guide to rewrite the expression for z so x and y only occur once. (We do this so we obtain sharp bounds on the range, without overestimation.) Dividing the numerator and denominator for z by xy, we obtain

    z = 1/(1/x + 1/y).

We use the intlab toolbox⁸ for matlab to evaluate z. We have the following dialog in matlab's command window:

>> intvalinit('DisplayInfsup')
===> Default display of intervals by infimum/supremum
>> x = intval('[0.9,1.1]')
intval x =
[    0.8999,    1.1001]
>> y = intval('[1.8,2.2]')
intval y =
[    1.7999,    2.2001]
>> z = 1/(1/x + 1/y)
intval z =
[    0.5999,    0.7334]
>> format long
>> z
intval z =
[   0.59999999999999,   0.73333333333334]
>>

⁸ If one has matlab, intlab is available free of charge for non-commercial use from http://www.ti3.tu-harburg.de/~rump/intlab/

Thus, the capacitance must lie between 0.5999 and 0.7334. Note that x and y are input as strings. This is to assure that roundoff errors in converting the decimal expressions 0.9, 1.1, 1.8, and 2.2 into internal binary format are taken into account. See [26] for more examples of the use of intlab.
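For readers without matlab or intlab, the same bound computation can be sketched in a few lines of ordinary code. The sketch below is in Python, and the helper names `recip` and `add` are ours, not from the book; it is only an illustration, since the endpoint operations use plain floating point and do not perform the outward rounding that makes intlab's enclosures rigorous.

```python
# Naive interval arithmetic sketch for z = 1/(1/x + 1/y),
# x in [0.9, 1.1], y in [1.8, 2.2].  Not rigorous: endpoints are not
# rounded outward, as a real interval package such as intlab would do.

def recip(iv):
    """Reciprocal of an interval (lo, hi) that does not contain 0."""
    lo, hi = iv
    return (1.0 / hi, 1.0 / lo)

def add(u, v):
    """Sum of two intervals, endpoint by endpoint."""
    return (u[0] + v[0], u[1] + v[1])

x = (0.9, 1.1)
y = (1.8, 2.2)
z = recip(add(recip(x), recip(y)))
print(z)   # approximately (0.6, 0.7333), matching the intlab enclosure
```

The rewritten form z = 1/(1/x + 1/y), with x and y each occurring once, is what makes these endpoint bounds sharp.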

1.6 Exercises

1. Write down a polynomial p(x) such that |S(x) − p(x)| ≤ 10⁻¹⁰ for −0.2 ≤ x ≤ 0.2, where

       S(x) = { sin(x)/x  if x ≠ 0,
              { 1         if x = 0.

   Note: sinc(x) = S(πx) = sin(πx)/(πx) is the "sinc" function (well-known in signal processing, etc.).

   (a) Show that your polynomial p satisfies the condition |sinc(x) − p(x)| ≤ 10⁻¹⁰ for x ∈ [−0.2, 0.2].
       Hint: You can obtain polynomial approximations with error terms for sinc(x) by writing down Taylor polynomials and corresponding error terms for sin(x), then dividing these by x. This can be easier than trying to differentiate sinc(x). For the proof part, you can use, for example, the Taylor polynomial remainder formula or the alternating series test, and you can use interval arithmetic to obtain bounds.

   (b) Plot your polynomial approximation and sinc(x) on the same graph,

       (i) over the interval [−0.2, 0.2];

       (ii) over the interval [−3, 3];

       (iii) over the interval [−10, 10].

2. Suppose f has a continuous third derivative. Show that

       |(f(x + h) − f(x − h))/(2h) − f′(x)| = O(h²).

3. Suppose f has a continuous fourth derivative. Show that

       |(f(x + h) − 2f(x) + f(x − h))/h² − f″(x)| = O(h²).


4. Let a = 0.41, b = 0.36, and c = 0.7. Assuming a 2-digit decimal computer arithmetic with rounding, show that (a − b)/c ≠ a/c − b/c when using this arithmetic.

5. Write down a formula relating the unit roundoff δ of Definition 1.3 (page 14) and the machine epsilon εₘ defined on page 20.

6. Store and run the following matlab script. What are your results? What does the script compute? Can you say anything about the computer arithmetic underlying matlab?

   eps = 1;
   x = 1+eps;
   while(x~=1)
       eps = eps/2;
       x = 1+eps;
   end
   eps = eps + (2*eps)^2;
   y = 1+eps;
   y-1

7. Suppose, for illustration, we have a system with base β = 10, t = 3 decimal digits in the mantissa, and L = −9, U = 9 for the exponent. For example, 0.123 × 10⁴, that is, 1230, is a machine number in this system. Suppose also that "round to nearest" is used in this system.

   (a) What is HUGE for this system?

   (b) What is TINY for this system?

   (c) What is the machine epsilon εₘ for this system?

   (d) Let f(x) = sin(x) + 1.

       i. Write down fl(f(0)) and fl(f(0.0008)) in normalized format for this toy system.

       ii. Compute fl(fl(f(0.0008)) − fl(f(0))). On the other hand, what is the nearest machine number to the exact value of f(0.0008) − f(0)?

       iii. Compute fl(fl(f(0.0008)) − fl(f(0))) / fl(0.0008). Compare this to the nearest machine number to the exact value of (f(0.0008) − f(0))/0.0008, and to f′(0).

8. Let f(x) = ln(x + 1) − ln(x).

   (a) Use four-digit decimal arithmetic with rounding to evaluate f(100,000).


   (b) Use the Mean Value Theorem to approximate f(x) in a form that avoids the loss of significant digits. Use this form to evaluate f(x) for x = 100,000 once again.

   (c) Compare the relative errors for the answers obtained in (a) and (b).

9. Compute the condition number of f(x) = e^√(x²−1), x > 1, and discuss any possible ill-conditioning.

10. Let f(x) = (sin(x))² + x/2. Use interval arithmetic to prove that there are no solutions to f(x) = 0 for x ∈ [−1, −0.8].


Thus,

    f(x) − f(x₀) = ∫_{x₀}^{x} f′(t) dt      (let u = f′(t), v = t − x, dv = dt)

                 = f′(x₀)(x − x₀) + ∫_{x₀}^{x} (x − t) f″(t) dt

                        (let u = f″(t), dv = (x − t) dt)

                 = f′(x₀)(x − x₀) − [(x − t)²/2] f″(t) |_{t=x₀}^{t=x} + ∫_{x₀}^{x} [(x − t)²/2] f‴(t) dt

                 = f′(x₀)(x − x₀) + [(x − x₀)²/2] f″(x₀) + ∫_{x₀}^{x} [(x − t)²/2] f‴(t) dt.

Continuing this procedure,

    f(x) = f(x₀) + f′(x₀)(x − x₀) + [(x − x₀)²/2!] f″(x₀)
           + ··· + [(x − x₀)ⁿ/n!] f⁽ⁿ⁾(x₀) + ∫_{x₀}^{x} [(x − t)ⁿ/n!] f⁽ⁿ⁺¹⁾(t) dt
         = Pₙ(x) + Rₙ(x).

Now consider Rₙ(x) = ∫_{x₀}^{x} [(x − t)ⁿ/n!] f⁽ⁿ⁺¹⁾(t) dt, and assume that x₀ < x (the same argument applies if x₀ > x). Then, by Theorem 1.2,

    Rₙ(x) = f⁽ⁿ⁺¹⁾(ξ(x)) ∫_{x₀}^{x} [(x − t)ⁿ/n!] dt = f⁽ⁿ⁺¹⁾(ξ(x)) (x − x₀)ⁿ⁺¹/(n + 1)!,

where ξ is between x₀ and x, and thus ξ = ξ(x).

Example 1.3

Approximate sin(x) by a polynomial p(x) such that

    |sin(x) − p(x)| ≤ 10⁻¹⁶ for −0.1 ≤ x ≤ 0.1.

For Example 1.3, Taylor polynomials about x₀ = 0 are appropriate, since that is the center of the interval about which we wish to approximate. We observe that the terms of even degree in such a polynomial are absent, so for n even, Taylor's theorem gives

    n    Pₙ(x)                        Rₙ(x)
    2    x                            −(x³/3!) cos(c₂)
    4    x − x³/3!                    (x⁵/5!) cos(c₄)
    6    x − x³/3! + x⁵/5!            −(x⁷/7!) cos(c₆)
    ⋮
    n    —                            (−1)^(n/2) (xⁿ⁺¹/(n + 1)!) cos(cₙ)

Observing that |cos(cₙ)| ≤ 1, we see that

    |Rₙ(x)| ≤ |x|ⁿ⁺¹/(n + 1)!.

We may thus form the following table:

    n     bound on error Rₙ
    2     1.67 × 10⁻⁴
    4     8.33 × 10⁻⁸
    6     1.98 × 10⁻¹¹
    8     2.76 × 10⁻¹⁵
    10    2.51 × 10⁻¹⁹

Thus, a polynomial with the required accuracy for x ∈ [−0.1, 0.1] is

    p(x) = x − x³/3! + x⁵/5! − x⁷/7! + x⁹/9!.
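As a quick numerical sanity check of this polynomial (in Python rather than the book's matlab, and limited by double precision, which is far coarser than the 2.51 × 10⁻¹⁹ remainder bound), one can compare p with a library sine on a grid:

```python
import math

def p(x):
    # Degree-9 Taylor polynomial of sin about x0 = 0.
    return (x - x**3 / math.factorial(3) + x**5 / math.factorial(5)
              - x**7 / math.factorial(7) + x**9 / math.factorial(9))

# On [-0.1, 0.1] the truncation error is below 2.51e-19, so any observed
# difference from math.sin is dominated by double precision rounding.
worst = max(abs(p(k / 1000.0) - math.sin(k / 1000.0))
            for k in range(-100, 101))
print(worst)   # tiny: near machine epsilon
```

In double precision we can only confirm agreement to roughly machine epsilon, not the full 10⁻¹⁶ bound; verifying the bound itself requires the remainder estimate above (or interval arithmetic).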

An important special case of Taylor's theorem is obtained with n = 0 (that is, directly from the Fundamental Theorem of Calculus):

THEOREM 1.4

(Mean value theorem) Suppose f ∈ C¹[a, b], x ∈ [a, b], and y ∈ [a, b] (and, without loss of generality, x ≤ y). Then there is a c ∈ [x, y] ⊆ [a, b] such that

    f(y) − f(x) = f′(c)(y − x).

Example 1.4

Suppose f(1) = 1 and |f′(x)| ≤ 2 for x ∈ [1, 2]. What are an upper bound and a lower bound on f(2)?


The mean value theorem tells us that

    f(2) = f(1) + f′(c)(2 − 1) = f(1) + f′(c)

for some c ∈ (1, 2). Furthermore, the fact that |f′(x)| ≤ 2 is equivalent to −2 ≤ f′(x) ≤ 2. Combining these facts gives

    1 − 2 = −1 ≤ f(2) ≤ 1 + 2 = 3.

1.1.2 Big "O" Notation

We study "rates of growth" and "rates of decrease" of errors. For example, if we approximate e^h by a first degree Taylor polynomial about x = 0, we get

    e^h − (1 + h) = (1/2) h² e^ξ,

where ξ is some unknown quantity between 0 and h. Although we don't know exactly what e^ξ is, we know that it is nearly constant (in this case, approximately 1) for h near 0, so the error e^h − (1 + h) is roughly proportional to h² for h small. This approximate proportionality is often more important to know than the slowly-varying constant e^ξ. The big "O" and little "o" notation are used to describe and keep track of this approximate proportionality.

DEFINITION 1.1 Let E(h) be an expression that depends on a small quantity h. We say that E(h) = O(hᵏ) if there are an ε and a C such that

    |E(h)| ≤ C|h|ᵏ for all |h| ≤ ε.

The "O" denotes "order." For example, if f(h) = O(h²), we say that "f exhibits order 2 convergence to 0 as h tends to 0."

Example 1.5

Let E(h) = e^h − h − 1. Then E(h) = O(h²).

PROOF By Taylor's Theorem,

    e^h = e⁰ + e⁰(h − 0) + (h²/2) e^ξ

for some ξ between 0 and h. Thus,

    E(h) = e^h − 1 − h = (h²/2) e^ξ ≤ h² (e/2)

and E(h) ≥ 0 for |h| ≤ 1; that is, ε = 1 and C = e/2 work.

Example 1.6

Show that |(f(x + h) − f(x))/h − f′(x)| = O(h) for x, x + h ∈ [a, b], assuming that f has two continuous derivatives at each point of [a, b].

PROOF

    |(f(x + h) − f(x))/h − f′(x)|
        = |( f(x) + f′(x)h + ∫_x^{x+h} (x + h − t) f″(t) dt − f(x) )/h − f′(x)|
        = (1/h) |∫_x^{x+h} (x + h − t) f″(t) dt|
        ≤ max_{a≤t≤b} |f″(t)| · (h/2) = ch.

1.1.3 Convergence Rates

DEFINITION 1.2 Let {xₖ} be a sequence with limit x*. If there are constants C and α and an integer N such that |xₖ₊₁ − x*| ≤ C|xₖ − x*|^α for k ≥ N, we say that the rate of convergence is of order at least α. If α = 1 (with C < 1), the rate is said to be linear. If α = 2, the rate is said to be quadratic.

Example 1.7

A sequence sometimes learned in elementary classes for computing the square root of a number a is

    xₖ₊₁ = xₖ/2 + a/(2xₖ).

We have

    xₖ₊₁ − √a = xₖ/2 + a/(2xₖ) − √a
              = (xₖ² − 2√a xₖ + a)/(2xₖ)
              = (1/(2xₖ)) (xₖ − √a)²
              ≈ (1/(2√a)) (xₖ − √a)²,

thus showing that the convergence rate is quadratic.

Quadratic convergence is very fast. We can think of quadratic convergence with C ≈ 1 as doubling the number of significant figures on each iteration. (In contrast, linear convergence with C = 0.1 adds one decimal digit of accuracy to the approximation on each iteration.) For example, if we use the square root computation from Example 1.7 with a = 2, starting with x₀ = 2, we obtain the following table:

    k    xₖ                     xₖ − √2            (xₖ − √2)/(xₖ₋₁ − √2)²
    0    2                      0.5858 × 10⁰       —
    1    1.5                    0.8579 × 10⁻¹      0.2500
    2    1.416666666666667      0.2453 × 10⁻²      0.3333
    3    1.414215686274510      0.2123 × 10⁻⁵      0.3529
    4    1.414213562374690      0.1594 × 10⁻¹¹     0.3535
    5    1.414213562373095      0.2204 × 10⁻¹⁷     —

In this table, the correct digits are underlined. The table illustrates that the total number of correct digits more than doubles on each iteration. In fact, the multiplying factor C for the quadratic convergence appears to be approaching 0.3535. (The last error ratio is not meaningful in this sense, because only roughly 16 digits were carried in the computation.) Based on our analysis, the limiting value of C should be about 1/(2√2) ≈ 0.3536. (We explain how we computed the table at the end of this chapter.)
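The error ratios are easy to regenerate; the book's own matlab script for this table is given in Example 1.23. Here is an equivalent sketch in Python (an illustration, not the book's code), showing the ratios approaching 1/(2√2) ≈ 0.3536:

```python
import math

# Square root iteration x_{k+1} = x_k/2 + a/(2*x_k) with a = 2.
# Quadratic convergence: err_k / err_{k-1}**2 approaches 1/(2*sqrt(2)).
a = 2.0
x = 2.0
err_old = x - math.sqrt(a)
ratios = []
for k in range(4):      # only a few steps before hitting double precision
    x = x / 2.0 + a / (2.0 * x)
    err = x - math.sqrt(a)
    ratios.append(err / err_old ** 2)
    err_old = err
print(ratios)   # 0.25, 0.3333..., then values approaching 0.3535...
```

Beyond four or five iterations the error falls below double precision resolution, which is why the ratios in the table above stop being meaningful.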


Example 1.8

As an example of linear convergence, consider the iteration

    xₖ₊₁ = xₖ − xₖ²/3.5 + 2/3.5

for computing √2. We obtain the following table:

    k    xₖ                     xₖ − √2            (xₖ − √2)/(xₖ₋₁ − √2)
    0    2                      0.5858 × 10⁰       —
    1    1.428571428571429      0.1436 × 10⁻¹      0.2451 × 10⁻¹
    2    1.416909620991254      0.2696 × 10⁻²      0.1878
    3    1.414728799831946      0.5152 × 10⁻³      0.1911
    4    1.414312349239392      0.9879 × 10⁻⁴      0.1917
    5    1.414232514607664      0.1895 × 10⁻⁴      0.1918
    6    1.414217198786659      0.3636 × 10⁻⁵      0.1919
    7    1.414214260116949      0.6955 × 10⁻⁶      0.1919
    8    1.414213696254626      0.1339 × 10⁻⁶      0.1919
    ⋮
    19   1.414213562373097      0.1554 × 10⁻¹⁴     —

Here, the constant C in the linear convergence, to four significant digits, appears to be 0.1919 ≈ 1/5. That is, the error is reduced by approximately a factor of 5 on each iteration. We can think of this as obtaining one more correct base-5 digit on each iteration.
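The limiting ratio can be checked in a few lines (the book's matlab version is Example 1.24). Writing the iteration as xₖ₊₁ = g(xₖ) with g(x) = x − x²/3.5 + 2/3.5, the ratio tends to |g′(√2)| = 1 − 2√2/3.5 ≈ 0.1919; the Python sketch below is an illustration of that:

```python
import math

# Linearly convergent iteration x_{k+1} = x_k - x_k**2/3.5 + 2/3.5.
x = 2.0
err_old = x - math.sqrt(2.0)
ratio = None
for k in range(8):
    x = x - x ** 2 / 3.5 + 2.0 / 3.5
    err = x - math.sqrt(2.0)
    ratio = err / err_old      # approaches 1 - 2*sqrt(2)/3.5
    err_old = err
print(ratio)   # about 0.1919
```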

1.2 Computer Arithmetic

In the numerical solution of mathematical problems, two common types of error are:

1. Method (algorithm or truncation) error. This is the error due to approximations made in the numerical method.

2. Rounding error. This is the error made due to the finite number of digits available on a computer.


Example 1.9

By the mean value theorem for integrals (Theorem 1.2, as in Example 1.6 on page 7), if f ∈ C²[a, b], then

    f′(x) = (f(x + h) − f(x))/h − (1/h) ∫_x^{x+h} f″(t)(x + h − t) dt,

and

    |(1/h) ∫_x^{x+h} f″(t)(x + h − t) dt| ≤ ch.

Thus, f′(x) ≈ (f(x + h) − f(x))/h, and the error is O(h). We will call this the method error or truncation error, as opposed to roundoff errors due to using machine approximations.

Now consider f(x) = ln x, and approximate f′(3) ≈ (ln(3 + h) − ln 3)/h for h small, using a calculator having 11 digits. The following results were obtained:

    h        (ln(3 + h) − ln(3))/h    Error = 1/3 − (ln(3 + h) − ln(3))/h
    10⁻¹     0.3278982                5.44 × 10⁻³
    10⁻²     0.332779                 5.54 × 10⁻⁴
    10⁻³     0.3332778                5.55 × 10⁻⁵
    10⁻⁴     0.333328                 5.33 × 10⁻⁶
    10⁻⁵     0.333330                 3.33 × 10⁻⁶
    10⁻⁶     0.333300                 3.33 × 10⁻⁵
    10⁻⁷     0.333                    3.33 × 10⁻⁴
    10⁻⁸     0.33                     3.33 × 10⁻³
    10⁻⁹     0.3                      3.33 × 10⁻²
    10⁻¹⁰    0.0                      3.33 × 10⁻¹

One sees that, in the first four steps, the error decreases by a factor of 10 as h is decreased by a factor of 10. (That is, the method error dominates.) However, starting with h = 0.00001, the error increases. (The error due to the finite number of digits, i.e., the roundoff error, dominates.)
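The same dip-then-rise behavior appears in standard double precision. The following Python sketch (an illustration, with step sizes chosen for double precision rather than the 11-digit calculator above) compares three choices of h:

```python
import math

# Forward-difference approximation to (ln x)' at x = 3; exact value is 1/3.
def fwd_diff(h):
    return (math.log(3.0 + h) - math.log(3.0)) / h

err_coarse = abs(fwd_diff(1e-1) - 1.0 / 3.0)    # truncation error dominates
err_best   = abs(fwd_diff(1e-8) - 1.0 / 3.0)    # near the sweet spot
err_tiny   = abs(fwd_diff(1e-13) - 1.0 / 3.0)   # roundoff error dominates
print(err_coarse, err_best, err_tiny)   # err_coarse is about 5.4e-3
```

Shrinking h below the sweet spot makes the computed derivative worse, not better, exactly as in the calculator table.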

There are two possible ways to reduce the rounding error:

1. The method error can be reduced by using a more accurate method. This allows a larger h to be used, thus avoiding roundoff error. Consider

       f′(x) = (f(x + h) − f(x − h))/(2h) + error, where the error is O(h²):

       h        (ln(3 + h) − ln(3 − h))/(2h)    error
       0.1      0.3334568                       1.24 × 10⁻⁴
       0.01     0.3333345                       1.23 × 10⁻⁶
       0.001    0.3333333                       1.91 × 10⁻⁸


   The error decreases by a factor of 100 as h is decreased by a factor of 10.

2. Rounding error can be reduced by using more digits of accuracy, such as by using double precision (or multiple precision) arithmetic.

To fully understand and avoid roundoff error, we should study some details of how computers and calculators represent and work with approximate numbers.

1.2.1 Floating Point Arithmetic and Rounding Error

Let β be a positive integer, the base of the computer system. (Usually, β = 2 (binary) or β = 16 (hexadecimal).) Suppose a number x has the exact base representation

    x = (±0.α₁α₂α₃ ··· αₜαₜ₊₁ ···) × β^m = ±q × β^m,

where q is the mantissa, β is the base, m is the exponent, 1 ≤ α₁ ≤ β − 1, and 0 ≤ αᵢ ≤ β − 1 for i > 1.

On a computer, we are restricted to a finite set of floating-point numbers F = F(β, t, L, U) of the form

    x* = (±0.a₁a₂ ··· aₜ) × β^m,

where 1 ≤ a₁ ≤ β − 1, 0 ≤ aᵢ ≤ β − 1 for 2 ≤ i ≤ t, L ≤ m ≤ U, and t is the number of digits. (In most floating point systems, L is about −64 to −1000, and U is about 64 to 1000.)

Example 1.10

(binary) β = 2:

    x* = (0.1011)₂ × 2³ = (1 × 1/2 + 0 × 1/4 + 1 × 1/8 + 1 × 1/16) × 8 = 11/2 = 5.5 (decimal).
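The digit-by-digit expansion can be checked mechanically. Here is the same computation as a short Python sketch (an illustration only):

```python
# (0.1011)_2 * 2**3: bits after the radix point, base beta = 2.
bits = [1, 0, 1, 1]
mantissa = sum(b * 2.0 ** -(i + 1) for i, b in enumerate(bits))  # 11/16
value = mantissa * 2 ** 3
print(value)   # 5.5
```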

REMARK 1.1 Most numbers cannot be exactly represented on a computer. Consider x = 10.1 = (1010.0001 1001 1001 ...)₂ (β = 2). If L = −127, U = 127, t = 24, and β = 2, then x ≈ x* = (0.1010 0001 1001 1001 1001 1001)₂ × 2⁴.

Question: Given a real number x, how do we define a floating point number fl(x) in F such that fl(x) is close to x?

On modern machines, one of the following four ways is used to approximate a real number x by a machine-representable number fl(x):

round down: fl(x) = x↓, the nearest machine-representable number to the real number x that is less than or equal to x;

round up: fl(x) = x↑, the nearest machine number to the real number x that is greater than or equal to x;

round to nearest: fl(x) is the nearest machine number to the real number x;

round to zero, or "chopping": fl(x) is the nearest machine number to the real number x that is closer to 0 than x. The term "chopping" is because we simply "chop" the expansion of the real number, that is, we simply ignore the digits in the expansion of x beyond the t-th one.

The default on modern systems is usually round to nearest, although chopping is faster or requires less circuitry. Round down and round up may be used, but with care, to produce results from a string of computations that are guaranteed to be less than or greater than the exact result.

Example 1.11

Let β = 10, t = 5, and x = 0.12345666··· × 10⁷. Then

    fl(x) = 0.12345 × 10⁷ (chopping),
    fl(x) = 0.12346 × 10⁷ (round to nearest).

(In this case, round down corresponds to chopping, and round up corresponds to round to nearest.)

See Figure 1.2 for an example with β = 10 and t = 1. In that figure, the exhibited floating point numbers are (0.1) × 10¹, (0.2) × 10¹, ..., (0.9) × 10¹, 0.1 × 10².

[FIGURE 1.2: An example floating point system: β = 10, t = 1, and m = 1. Successive floating point numbers are spaced β^(m−t) = 10⁰ = 1 apart, from β^(m−1) = 1 up to β^m = 10¹.]

Example 1.12

Let a = 0.410, b = 0.000135, and c = 0.000431. Assuming 3-digit decimal computer arithmetic with rounding to nearest, does a + (b + c) = (a + b) + c when using this arithmetic?


Following the "rounding to nearest" definition of fl, we emulate the operations a machine would do, as follows:

    a ← 0.410 × 10⁰, b ← 0.135 × 10⁻³, c ← 0.431 × 10⁻³,

and

    fl(b + c) = fl(0.135 × 10⁻³ + 0.431 × 10⁻³)
              = fl(0.566 × 10⁻³)
              = 0.566 × 10⁻³,

so

    fl(a + 0.566 × 10⁻³) = fl(0.410 × 10⁰ + 0.566 × 10⁻³)
                         = fl(0.410 × 10⁰ + 0.000566 × 10⁰)
                         = fl(0.410566 × 10⁰)
                         = 0.411 × 10⁰.

On the other hand,

    fl(a + b) = fl(0.410 × 10⁰ + 0.135 × 10⁻³)
              = fl(0.410000 × 10⁰ + 0.000135 × 10⁰)
              = fl(0.410135 × 10⁰)
              = 0.410 × 10⁰,

so

    fl(0.410 × 10⁰ + c) = fl(0.410 × 10⁰ + 0.431 × 10⁻³)
                        = fl(0.410 × 10⁰ + 0.000431 × 10⁰)
                        = fl(0.410431 × 10⁰)
                        = 0.410 × 10⁰ ≠ 0.411 × 10⁰.

Thus, the associative law does not hold for floating point arithmetic with "round to nearest." Furthermore, this illustrates that accuracy is improved if numbers of like magnitude are added first in a sum.
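Example 1.12 can be replayed in software. Here is a sketch using Python's decimal module with a 3-significant-digit context as a stand-in for the toy machine (an illustration only, not the book's matlab):

```python
import decimal

# 3-digit decimal arithmetic with round-to-nearest, as in Example 1.12.
decimal.getcontext().prec = 3
decimal.getcontext().rounding = decimal.ROUND_HALF_EVEN

a = decimal.Decimal("0.410")
b = decimal.Decimal("0.000135")
c = decimal.Decimal("0.000431")

left = a + (b + c)     # fl(a + fl(b + c)) = 0.411
right = (a + b) + c    # fl(fl(a + b) + c) = 0.410
print(left, right)     # 0.411 0.410
```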

The following error bound is useful in some analyses.

THEOREM 1.5

    |x − fl(x)| ≤ (p/2) |x| β^(1−t),

where p = 1 for rounding and p = 2 for chopping.


DEFINITION 1.3 The quantity δ = (p/2) β^(1−t) is called the unit roundoff error.

Let ε = (fl(x) − x)/x. Then fl(x) = (1 + ε)x, where |ε| ≤ δ. With this, we have the following:

THEOREM 1.6

Let ⊙ denote the operation +, −, ×, or ÷, and let x and y be machine numbers. Then

    fl(x ⊙ y) = (x ⊙ y)(1 + ε), where |ε| ≤ δ = (p/2) β^(1−t).

Roundoff error that accumulates as a result of a sequence of arithmetic operations can be analyzed using this theorem. Such an analysis is called forward error analysis.

Because of the properties of floating point arithmetic, it is unreasonable to demand strict absolute error tolerances when the exact result is large.

Example 1.13

Suppose β = 10 and t = 3 (3-digit decimal arithmetic), and suppose we wish to compute 10⁴π with a computed value x such that |10⁴π − x| < 10⁻². The closest floating point number in our system to 10⁴π is x = 0.314 × 10⁵ = 31400. However, |10⁴π − x| = 15.926···. Hence, it is impossible to find a number x in the system with

    |10⁴π − x| < 10⁻².

The error |10⁴π − x| in this example is called the absolute error in approximating 10⁴π. We see that absolute error is not an appropriate measure of error when using floating point arithmetic. For this reason, we use relative error.

DEFINITION 1.4 Let x* be an approximation to x. Then |x − x*| is called the absolute error, and |(x − x*)/x| is called the relative error.

For example,

    |(x − fl(x))/x| ≤ δ = (p/2) β^(1−t) (the unit roundoff error).

1.2.1.1 Expression Evaluation and Condition Numbers

We now examine some common situations in which roundoff error can become large, and explain how to avoid many of these situations.


Example 1.14

Let β = 10, t = 4, and p = 1. (Thus, δ = (1/2)10⁻³ = 0.0005.) Let x = 0.5795 × 10⁵ and y = 0.6399 × 10⁵. Then

    fl(x + y) = 0.1219 × 10⁶ = (x + y)(1 + ε₁), ε₁ ≈ −3.28 × 10⁻⁴, |ε₁| < δ,

and

    fl(xy) = 0.3708 × 10¹⁰ = (xy)(1 + ε₂), ε₂ ≈ −5.95 × 10⁻⁵, |ε₂| < δ.

(Note: x + y = 0.12194 × 10⁶ and xy = 0.37082205 × 10¹⁰.)

Example 1.15

Suppose β = 10 and t = 4 (4-digit arithmetic), and suppose x₁ = 10000 and x₂ = x₃ = ··· = x₁₀₀₁ = 1. Then

    fl(x₁ + x₂) = 10000,
    fl(x₁ + x₂ + x₃) = 10000,
    ...
    fl(x₁ + x₂ + ··· + x₁₀₀₁) = 10000,

when we sum forward from x₁. But, going backwards,

    fl(x₁₀₀₁ + x₁₀₀₀) = 2,
    fl(x₁₀₀₁ + x₁₀₀₀ + x₉₉₉) = 3,
    ...
    fl(x₁₀₀₁ + x₁₀₀₀ + ··· + x₁) = 11000,

which is the correct sum.

This example illustrates the point that large relative errors can occur when many small numbers are added to a large number, or when a very large number of small, almost-equal numbers is summed. To avoid such large relative errors, one can sum from the smallest number to the largest. However, this will not work if the numbers are all approximately equal. In such cases, one possibility is to group the numbers into sets of two adjacent numbers, summing two almost equal numbers together. One then groups those results into sets of two and sums these together, continuing until the total sum is reached. In this scheme, two almost equal numbers are always being summed, and the large relative error from repeatedly adding a small number to a large number is avoided.
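This pairwise grouping scheme is easy to sketch. The Python code below (the book's examples use MATLAB; `fl`, `sum_forward`, and `sum_pairwise` are hypothetical helper names) simulates 4-digit decimal arithmetic, as in Example 1.15, and sums 16384 copies of 1: the running forward sum stalls once it reaches 10000, while the pairwise scheme always adds nearly equal partial sums:

```python
import math

def fl(v, t=4):
    """Round v to t significant decimal digits (simulated decimal machine)."""
    if v == 0.0:
        return 0.0
    e = math.floor(math.log10(abs(v)))
    scale = 10.0 ** (e - t + 1)
    return round(v / scale) * scale

def sum_forward(xs):
    """Sum in the given order, rounding each partial sum to 4 digits."""
    s = 0.0
    for x in xs:
        s = fl(s + x)
    return s

def sum_pairwise(xs):
    """Recursively sum adjacent pairs, so near-equal values are combined."""
    if len(xs) == 1:
        return xs[0]
    pairs = [fl(xs[i] + xs[i + 1]) if i + 1 < len(xs) else xs[i]
             for i in range(0, len(xs), 2)]
    return sum_pairwise(pairs)

data = [1.0] * 16384              # 16384 equal summands; the true sum is 16384
print(sum_forward(data))          # 10000.0: partial sums stall at 10000
print(sum_pairwise(data))         # 16380.0: correct to the 4 digits carried
```

Here sorting would not have helped, since all summands are equal; only the pairwise grouping keeps the addends comparable in size.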


16 Applied Numerical Methods

Example 1.16

Let x₁ = 15.314768 and x₂ = 15.314899, with β = 10 and t = 6 (6-digit decimal arithmetic). Then x₂ − x₁ ≈ fl(x₂) − fl(x₁) = 15.3149 − 15.3148 = 0.0001, while the exact difference is x₂ − x₁ = 0.000131. Thus,

|x₂ − x₁ − (fl(x₂) − fl(x₁))| / |x₂ − x₁| = (0.000131 − 0.0001)/0.000131 ≈ 0.237,

a relative error of 23.7%.

This example illustrates that large relative errors can occur when two nearly equal numbers are subtracted on a computer. Sometimes an algorithm can be modified to reduce rounding error arising from this source, as the following example illustrates.

Example 1.17

Consider finding the roots of ax² + bx + c = 0, where b² is large compared with |4ac|. The most common formula for the roots is

x = (−b ± √(b² − 4ac)) / (2a).

Consider x² + 100x + 1 = 0, with β = 10, t = 4, p = 2, and 4-digit chopped arithmetic. Then

x₁ = (−100 + √9996)/2,  x₂ = (−100 − √9996)/2,

and √9996 ≈ 99.97 in 4-digit chopped arithmetic. Thus,

x₁ ≈ (−100 + 99.97)/2 = −0.015,  x₂ ≈ (−100 − 99.97)/2 = −99.98,

but x₁ = −0.010001…, x₂ = −99.989999…, so the relative errors in the computed x₁ and x₂ are 50% and 0.01%, respectively.

Let's change the algorithm. Assume b ≥ 0 (we can always arrange for b ≥ 0). Then

x₁ = (−b + √(b² − 4ac)) / (2a)
   = (−b + √(b² − 4ac))(−b − √(b² − 4ac)) / (2a(−b − √(b² − 4ac)))
   = −2c / (b + √(b² − 4ac)),

and

x₂ = (−b − √(b² − 4ac)) / (2a)  (the same as before).


Then, for the above values,

x₁ = −2(1)/(100 + √9996) ≈ −2/(100 + 99.97) = −0.01000.

Now the relative error in x₁ is also 0.01%.
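The same cancellation appears in IEEE double precision once b² ≫ |4ac|. A Python sketch with illustrative coefficients (not from the text): for x² + 10⁸x + 1 = 0, whose small root is approximately −10⁻⁸, the naive formula suffers a large relative error, while the rearranged formula does not:

```python
import math

a, b, c = 1.0, 1.0e8, 1.0            # x^2 + 1e8 x + 1 = 0; small root ~ -1e-8
s = math.sqrt(b * b - 4.0 * a * c)

x1_naive = (-b + s) / (2.0 * a)      # subtracts two nearly equal numbers
x1_stable = -2.0 * c / (b + s)       # rearranged form: no cancellation

print(x1_naive)    # about -7.45e-09: a relative error of roughly 25%
print(x1_stable)   # agrees with the true root to essentially full precision
```

The larger root −b − √(b² − 4ac) involves no cancellation, so it can safely be computed either way.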

Let us now consider error in function evaluation. Consider a single-valued function f(x), and let x* = fl(x) be the floating point approximation of x. The machine therefore evaluates f(x*) = f(fl(x)), which is an approximate value of f(x) at x = x*. The perturbation in f(x) for small perturbations in x can be computed via Taylor's formula. This is illustrated in the next theorem.

THEOREM 1.7 The relative error in function evaluation satisfies

|(f(x) − f(x*)) / f(x)| ≈ |x f′(x) / f(x)| · |(x − x*)/x|.

PROOF The linear Taylor approximation of f(x*) about x, for small values of |x − x*|, is given by f(x*) ≈ f(x) + f′(x)(x* − x). Rearranging the terms immediately yields the result.

This leads us to the following definition.

DEFINITION 1.5 The condition number of a function f(x) is

κ_f(x) = |x f′(x) / f(x)|.

The condition number describes how large the relative error in function evaluation is with respect to the relative error in the machine representation of x. In other words, κ_f(x) is a measure of the degree of sensitivity of the function at x.

Example 1.18

Let f(x) = √x. The condition number of f(x) about x is

κ_f(x) = |x · (1/(2√x)) / √x| = 1/2.

This suggests that f(x) is well-conditioned.


Example 1.19

Let f(x) = √(x − 2). The condition number of f(x) about x is

κ_f(x) = |x / (2(x − 2))|.

This is not defined at x* = 2. Hence, the function f(x) is numerically unstable and ill-conditioned for values of x close to 2.
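These two condition numbers are easy to tabulate numerically. A Python sketch (`kappa` is a hypothetical helper implementing Definition 1.5 directly, with the exact derivatives supplied by hand):

```python
import math

def kappa(f, fprime, x):
    """Condition number kappa_f(x) = |x * f'(x) / f(x)| (Definition 1.5)."""
    return abs(x * fprime(x) / f(x))

# f(x) = sqrt(x): kappa_f(x) = 1/2 for every x > 0, so f is well-conditioned.
k1 = kappa(math.sqrt, lambda x: 0.5 / math.sqrt(x), 100.0)

# g(x) = sqrt(x - 2): kappa_g(x) = |x / (2(x - 2))| blows up as x -> 2.
g = lambda x: math.sqrt(x - 2.0)
gp = lambda x: 0.5 / math.sqrt(x - 2.0)
k2 = kappa(g, gp, 2.0001)

print(k1)   # 0.5
print(k2)   # about 1.0e4: ill-conditioned this close to x = 2
```

At x = 2.0001 the condition number is about 10⁴, so roughly four significant digits are lost in evaluating g there.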

REMARK 1.2 If x = f(x) = 0, then the condition number is simply |f′(x)|. If x ≠ 0, f(x) = 0 (or f(x) ≠ 0, x = 0), then it is more useful to consider the relation between absolute errors than relative errors. The condition number then becomes |f′(x)/f(x)|.

REMARK 1.3 Generally, if a numerical approximation z̃ to a quantity z is computed, the relative error is related to the number of significant digits that are correct. For example, if z = 0.0000123453 and z̃ = 0.00001234543, we say that z̃ is correct to 5 significant digits. Expressing z as 0.123453 × 10⁻⁴ and z̃ as 0.123454 × 10⁻⁴, we see that, if we round z̃ to the nearest number with five digits in its mantissa, all of those digits are correct, whereas, if we do the same with six digits, the sixth digit is not correct. Significant digits is the more logical way to talk about accuracy in a floating point computation, where we are interested in relative error, rather than "number of digits after the decimal point," which can have a different meaning. (Here, one might say that z̃ is correct to 9 digits after the decimal point.)

1.2.2 Practicalities and the IEEE Floating Point Standard

Prior to 1985, different machines used different word lengths and different bases, and different machines rounded, chopped, or did something else to form the internal representation fl(x) for real numbers x. For example, IBM mainframes generally used hexadecimal arithmetic (β = 16), with 8 hexadecimal digits total (for the sign, exponent, and mantissa) in "single precision" numbers and 16 hexadecimal digits total in "double precision" numbers. Machines such as the Univac 1108 and Honeywell Multics systems used base β = 2, with 36 binary digits (or "bits") total in single precision numbers and 72 binary digits total in double precision numbers. An unusual machine designed at Moscow State University from 1955–1965, the "Setun," even used base-3 (β = 3, or "ternary") numbers. Some computers had 32 bits total in single precision numbers and 64 bits total in double precision numbers, while some "supercomputers" (such as the Cray-1) had 64 bits total in single precision numbers and 128 bits total in double precision numbers.

Some hand-held calculators in existence today (such as some Texas Instruments calculators) can be viewed as implementing decimal (base-10, β = 10) arithmetic, say with L = −999, U = 999, and t = 14 digits in the mantissa.

Except for the Setun (the value of whose ternary digits corresponded to "positive," "negative," and "neutral" in circuit elements or switches), digital computers are mostly based on binary switches or circuit elements (that is, "on" or "off"), so the base β is usually 2 or a power of 2. For example, the IBM hexadecimal digit can be viewed as a group of 4 binary digits.²

Older floating point implementations did not even always fit exactly into the model we have previously described. For example, if x was a number in the system, then −x may not have been a number in the system, or, if x was a number in the system, then 1/x may have been too large to be representable in the system.

To promote predictability, portability, reliability, and rigorous error bounding in floating point computations, the Institute of Electrical and Electronics Engineers (IEEE) and American National Standards Institute (ANSI) published a standard for binary floating point arithmetic in 1985: IEEE/ANSI 754-1985, Standard for Binary Floating Point Arithmetic, often referenced as "IEEE 754" or simply "the IEEE standard."³ Almost all computers in existence today, including personal computers and workstations based on Intel, AMD, Motorola, etc., chips, implement most of the IEEE standard.

In this standard, β = 2; 32 bits total are used in a single precision number (an "IEEE single"), and 64 bits total are used for a double precision number ("IEEE double"). In a single precision number, 1 bit is used for the sign, 8 bits are used for the exponent, and 23 bits are used for the mantissa (an implicit leading 1 bit gives t = 24 significant bits; see Table 1.1). In double precision numbers, 1 bit is used for the sign, 11 bits are used for the exponent, and 52 bits are used for the mantissa (t = 53 with the implicit bit). Thus, for single precision numbers, the stored exponent is between 0 and (11111111)₂ = 255, and 127 is subtracted from this to get an exponent between −127 and 128. In IEEE numbers, the minimum and maximum exponents are used to denote special symbols (such as infinity and "denormalized" numbers), so the exponent in single precision represents magnitudes between 2⁻¹²⁶ ≈ 10⁻³⁸ and 2¹²⁷ ≈ 10³⁸. The mantissa for single precision numbers represents numbers between 2⁰ = 1 and 1 + 2⁻¹ + ⋯ + 2⁻²³ = 2(1 − 2⁻²⁴) ≈ 2. Similarly, the exponent for double precision numbers is effectively between −1022 and 1023, representing magnitudes between 2⁻¹⁰²² ≈ 10⁻³⁰⁸ and 2¹⁰²³ ≈ 10³⁰⁸, while the mantissa for double precision numbers represents numbers between 2⁰ = 1 and 1 + 2⁻¹ + ⋯ + 2⁻⁵² ≈ 2.

Summarizing, the parameters for IEEE arithmetic appear in Table 1.1.

TABLE 1.1: Parameters for IEEE arithmetic

  precision   β     L       U      t
  single      2    −126     127    24
  double      2   −1022    1023    53

In many numerical computations, such as solving the large linear systems arising from partial differential equation models, more digits or a larger exponent range is required than is available with IEEE single precision. For this reason, many numerical analysts at present have adopted IEEE double precision as the default precision. For example, underlying computations in the popular computational environment MATLAB are done in IEEE double precision.

²An exception is in some systems for business calculations, where base 10 is implemented.
³An update to the 1985 standard was made in 2008. This update gives clarifications of certain ambiguous points, provides certain extensions, and specifies a standard for decimal arithmetic.

IEEE arithmetic provides four ways of defining fl(x), that is, four "rounding modes": "round down," "round up," "round to nearest," and "round to zero." The four elementary operations must be such that fl(x ⊙ y) is implemented for all four rounding modes, for ⊙ ∈ {+, −, ×, ÷}, and similarly for √·.

The default mode (if the rounding mode is not explicitly set) is normally "round to nearest," to give an approximation after a long string of computations that is hopefully near the exact value. If the mode is set to "round down" and a string of computations is done, then the result is less than or equal to the exact result. Similarly, if the mode is set to "round up," then the result of a string of computations is greater than or equal to the exact result. In this way, mathematically rigorous bounds on an exact result can be obtained. (This technique must be used astutely, since naive use could result in bounds that are too large to be meaningful.)

Several parameters more directly related to numerical computations than L, U, and t are associated with any floating point number system. These are:

HUGE: the largest representable number in the floating point system;

TINY: the smallest positive representable number in the floating point system;

εₘ: the machine epsilon, the smallest positive number which, when added to 1, gives something other than 1 when using the rounding mode "round to nearest."

These so-called "machine constants" appear in Table 1.2 for the IEEE single and IEEE double precision number systems.

For IEEE arithmetic, 1/TINY < HUGE, but 1/HUGE < TINY. This brings up the question of what happens when the result of a computation has absolute value less than the smallest number representable in the system, or has absolute value greater than the largest number representable in the system. In the first case, an underflow occurs, while in the second case, an overflow occurs. In floating point computations, it is usually (but not always) reasonable to replace the result of an underflow by 0, but it is usually more problematical


TABLE 1.2: Machine constants for IEEE arithmetic

  Precision   HUGE                      TINY                       εₘ
  single      2¹²⁷ ≈ 3.40 × 10³⁸        2⁻¹²⁶ ≈ 1.18 × 10⁻³⁸       2⁻²⁴ + 2⁻⁴⁵ ≈ 5.96 × 10⁻⁸
  double      2¹⁰²³ ≈ 1.79 × 10³⁰⁸      2⁻¹⁰²² ≈ 2.23 × 10⁻³⁰⁸     2⁻⁵³ + 2⁻¹⁰⁵ ≈ 1.11 × 10⁻¹⁶

when an overflow occurs. Many systems prior to the IEEE standard replaced an underflow by 0, but stopped when an overflow occurred.

The IEEE standard specifies representations for special numbers ∞, −∞, +0, −0, and NaN, where the latter represents "not a number." The standard specifies that computations do not stop when an overflow or underflow occurs, or when quantities such as 1/0 or √−1 are encountered (although many programming languages, by default or optionally, do stop). For example, the result of an overflow is set to ∞, whereas the result of √−1 is set to NaN, and computation continues. The standard also specifies "gradual underflow," that is, setting the result to a "denormalized" number, or a number in the floating point format whose first digit in the mantissa is equal to 0. Computation rules for these special numbers, such as NaN × any number = NaN and ∞ × any positive normalized number = ∞, allow such "nonstop" arithmetic.

Although IEEE nonstop arithmetic is useful in many contexts, the numerical analyst should be aware of it and be cautious in interpreting results. In particular, algorithms may not behave as expected if many intermediate results contain ∞ or NaN, and the accuracy is less than expected when denormalized numbers are used. In fact, many programming languages, by default or with a controllable option, stop if ∞ or NaN occurs, but implement IEEE nonstop arithmetic as an option.

Example 1.20

IEEE double precision floating point arithmetic underlies most computations in MATLAB. (This is true even if only four decimal digits are displayed.) One obtains the machine epsilon with the function eps, one obtains TINY with the function realmin, and one obtains HUGE with the function realmax. Observe the following MATLAB dialog:

>> epsm = eps(1d0)
epsm = 2.2204e-016
>> TINY = realmin
TINY = 2.2251e-308
>> HUGE = realmax
HUGE = 1.7977e+308
>> 1/TINY
ans = 4.4942e+307
>> 1/HUGE
ans = 5.5627e-309
>> HUGE^2
ans = Inf
>> TINY^2
ans = 0
>> new_val = 1+epsm
new_val = 1.0000
>> new_val - 1
ans = 2.2204e-016
>> too_small = epsm/2
too_small = 1.1102e-016
>> not_new = 1+too_small
not_new = 1
>> not_new - 1
ans = 0
>>

Example 1.21

(Illustration of underflow and overflow) Suppose, for the purposes of illustration, we have a system with β = 10, t = 2, and one digit in the exponent, so that the positive numbers in the system range from 0.10 × 10⁻⁹ to 0.99 × 10⁹, and suppose we wish to compute N = √(x₁² + x₂²), where x₁ = x₂ = 10⁶. Then both x₁ and x₂ are exactly represented in the system, and the nearest floating point number in the system to N is 0.14 × 10⁷, well within range. However, x₁² = 10¹², larger than the maximum floating point number in the system.

In older systems, an overflow usually would result in stopping the computation, while in IEEE arithmetic, the result would be assigned the symbol "Infinity." The result of adding "Infinity" to "Infinity," then taking the square root, would be "Infinity," so that N would be assigned "Infinity." Similarly, if x₁ = x₂ = 10⁻⁶, then x₁² = 10⁻¹², smaller than the smallest representable machine number, causing an "underflow." On older systems, the result is usually set to 0. On IEEE systems, if "gradual underflow" is switched on, the result either becomes a denormalized number, with less than full accuracy, or is set to 0; without gradual underflow on IEEE systems, the result is set to 0. When the result is set to 0, a value of 0 is stored in N, whereas the closest floating point number in the system is 0.14 × 10⁻⁵, well within range. To avoid this type of catastrophic underflow and overflow in the computation of N, we may use the following scheme:

1. s ← max{|x₁|, |x₂|};
2. η₁ ← x₁/s, η₂ ← x₂/s;
3. N ← s√(η₁² + η₂²).
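This scheme transcribes directly into code. A Python sketch (`scaled_norm2` is a hypothetical name; in practice one would call math.hypot, which exists precisely to handle this scaling):

```python
import math

def scaled_norm2(x1, x2):
    """Compute sqrt(x1^2 + x2^2) without intermediate overflow/underflow."""
    s = max(abs(x1), abs(x2))
    if s == 0.0:
        return 0.0
    eta1, eta2 = x1 / s, x2 / s     # scaled so that |eta_i| <= 1
    return s * math.sqrt(eta1 * eta1 + eta2 * eta2)

x1 = x2 = 1e300
print(math.sqrt(x1 * x1 + x2 * x2))   # inf: x1*x1 overflows first
print(scaled_norm2(x1, x2))           # about 1.4142e300, the correct value
```

Since the ηᵢ satisfy |ηᵢ| ≤ 1, the intermediate squares can neither overflow nor underflow catastrophically.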


1.2.2.1 Input and Output

For examining the output of large numerical computations arising from mathematical models, plots, graphs, and movies comprised of such plots and graphs are often preferred over tables of values. However, to develop such models and study numerical algorithms, it is necessary to examine individual numbers. Because humans are trained to comprehend decimal numbers more easily than binary numbers, the binary format used in the machine is usually converted to a decimal format for display or printing. In many programming languages and environments (such as all versions of Fortran, C, C++, and MATLAB), the format is of a form similar to ±d₁.d₂d₃…dₘe±δ₁δ₂δ₃ or ±d₁.d₂d₃…dₘE±δ₁δ₂δ₃, where the "e" or "E" denotes the "exponent" of 10. For example, -1.00e+003 denotes −1 × 10³ = −1000. Numbers are usually also input either in a standard decimal form (such as 0.001) or in this exponential format (such as 1.0e-3). (This notation originates from the earliest computers, where the only output was a printer, and the printer could only print numerical digits and the 26 upper-case letters in the Roman alphabet.)

Thus, for input, a decimal fraction needs to be converted to a binary floating point number, while, for output, a binary floating point number needs to be converted to a decimal fraction. This conversion is necessarily inexact. For example, the exact decimal fraction 0.1 converts to the infinitely repeating binary expansion (0.000110011001100…)₂, which needs to be rounded into the binary floating point system. The IEEE 754 standard specifies that the result of a decimal-to-binary conversion, within a specified range of input formats, be the nearest floating point number to the exact result, and that, within a specified range of formats, a binary-to-decimal conversion be the nearest number in the specified format (which depends on the number m of decimal digits requested to be printed).

Thus, the number that one sees as output is usually not exactly the number that is represented in the computer. Furthermore, while the floating point operations on binary numbers are usually implemented in hardware or "firmware," independently of the software system, the decimal-to-binary and binary-to-decimal conversions are usually implemented separately as part of the programming language (such as Fortran, C, C++, Java, etc.) or software system (such as MATLAB). The individual standards for these languages, if there are any, may not specify accuracy for such conversions, and the languages sometimes do not conform to the IEEE standard. That is, the number that one sees printed may not even be the closest number in that format to the actual number.

This inexactness in conversion usually does not cause a problem, but may cause much confusion in certain instances. In those instances (such as in "debugging," or finding programming blunders), one may need to examine the binary numbers directly. One way of doing this is in an "octal," or base-8, format, in which each digit (between 0 and 7) is interpreted as a group of three binary digits, or in hexadecimal format (where the digits are 0–9, A, B, C, D, E, and F), in which each digit corresponds to a group of four binary digits.
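Python makes this inexactness directly visible: float.hex prints the exact binary significand of a double in hexadecimal, and requesting more decimal digits than the default conversion shows reveals the value actually stored:

```python
# 0.1 cannot be represented exactly in binary floating point:
print(float.hex(0.1))        # 0x1.999999999999ap-4: the repeating hex pattern
print(format(0.1, ".20f"))   # 0.10000000000000000555...: the stored value
print(0.1 + 0.2 == 0.3)      # False: each side is separately rounded
```

The repeating hexadecimal digit 9 (binary 1001) is exactly the repeating block of the binary expansion of 1/10.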

1.2.2.2 Standard Functions

To enable accurate computation of elementary functions such as sin, cos, and exp, IEEE 754 specifies that a "long" 80-bit register (with "guard digits") be available for intermediate computations. Furthermore, IEEE 754-2008, an official update to IEEE 754-1985, provides a list of functions it recommends be implemented, and specifies accuracy requirements (in terms of correct rounding) for those functions a programming language elects to implement.

REMARK 1.4 Alternative number systems, such as variable precision arithmetic, multiple precision arithmetic, rational arithmetic, and combinations of approximate and symbolic arithmetic, have been investigated and implemented. These have various advantages over the traditional floating point arithmetic we have been discussing, but also have disadvantages, and usually require more time, more circuitry, or both. Eventually, with the advance of computer hardware and better understanding of these alternative systems, their use may become more ubiquitous. However, for the foreseeable future, traditional floating point number systems will remain the primary tool in numerical computations.

1.3 Interval Computations

Interval computations are useful for two main purposes:

• to use floating point computations to compute mathematically rigorous bounds on an exact result (and hence to rigorously bound roundoff error);

• to use floating point computations to compute mathematically rigorous bounds on the ranges of functions over boxes.

In complicated traditional floating point algorithms, naive arrangement of interval computations usually gives bounds that are too wide to be of practical use. For this reason, interval computations have been ignored by many. However, used cleverly and where appropriate, interval computations are powerful, and provide rigor and validation when other techniques cannot.

Interval computations are based on interval arithmetic.

1.3.1 Interval Arithmetic

In interval arithmetic, we define operations on intervals, which can be considered as ordered pairs of real numbers. We can think of each interval as representing the range of possible values of a quantity. The result of an operation is then an interval that represents the range of possible results of the operation, as the first argument ranges over all points in the first interval and the second argument ranges over all points in the second interval. To state this symbolically, let x = [x̲, x̄] and y = [y̲, ȳ], and define the four elementary operations by

x ⊙ y = {x ⊙ y | x ∈ x and y ∈ y} for ⊙ ∈ {+, −, ×, ÷}.  (1.1)

Interval arithmetic's usefulness derives from the fact that the mathematical characterization in Equation (1.1) is equivalent to the following operational definitions:

x + y = [x̲ + y̲, x̄ + ȳ],
x − y = [x̲ − ȳ, x̄ − y̲],
x × y = [min{x̲y̲, x̲ȳ, x̄y̲, x̄ȳ}, max{x̲y̲, x̲ȳ, x̄y̲, x̄ȳ}],
1/x = [1/x̄, 1/x̲] if x̲ > 0 or x̄ < 0,
x ÷ y = x × (1/y).  (1.2)

The ranges of the four elementary interval arithmetic operations are exactly the ranges of the corresponding real operations, but, if such operations are composed, bounds on the ranges of real functions can be obtained. For example, if

f(x) = (x + 1)(x − 1),  (1.3)

then

f([−2, 2]) = ([−2, 2] + 1)([−2, 2] − 1) = [−1, 3] × [−3, 1] = [−9, 3],

which contains the exact range [−1, 3].
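The operational definitions (1.2) translate almost line for line into code. A minimal Python sketch (a toy Interval class with ordinary rounding rather than outward rounding, so it is illustrative rather than rigorous like INTLAB):

```python
class Interval:
    """A closed interval [lo, hi] with the operations of (1.2) for +, -, x."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        o = _as_interval(other)
        return Interval(self.lo + o.lo, self.hi + o.hi)

    def __sub__(self, other):
        o = _as_interval(other)
        return Interval(self.lo - o.hi, self.hi - o.lo)

    def __mul__(self, other):
        o = _as_interval(other)
        p = [self.lo * o.lo, self.lo * o.hi, self.hi * o.lo, self.hi * o.hi]
        return Interval(min(p), max(p))

    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"

def _as_interval(v):
    """Treat a plain number as the degenerate interval [v, v]."""
    return v if isinstance(v, Interval) else Interval(v, v)

x = Interval(-2.0, 2.0)
print((x + 1) * (x - 1))   # [-9.0, 3.0]: contains the exact range [-1, 3]
```

Because x occurs twice in (x + 1)(x − 1), the result [−9, 3] overestimates the exact range [−1, 3], exactly as in the hand computation above.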

REMARK 1.5 In some definitions of interval arithmetic, division by intervals containing 0 is defined, consistent with (1.1). For example,

[1, 2] / [−3, 4] = (−∞, −1/3] ∪ [1/4, ∞) = ℝ* \ (−1/3, 1/4),

where ℝ* is the extended real number system,⁴ consisting of the real numbers with the two additional elements −∞ and ∞. This extended interval arithmetic⁵ was originally invented by William Kahan⁶ for computations with continued fractions, but has wider use than that. Although a closed system can be defined for the sets arising from this extended arithmetic, typically the complements of intervals (i.e., the unions of two semi-infinite intervals) are immediately intersected with intervals to obtain zero, one, or two intervals. Interval arithmetic can then proceed using (1.2).

⁴also known as the two-point compactification of the real numbers

The power of interval arithmetic lies in its implementation on computers. In particular, outwardly rounded interval arithmetic allows rigorous enclosures for the ranges of operations and functions. This makes a qualitative difference in scientific computations, since the results are now intervals in which the exact result must lie. It also enables the use of floating point computations for automated theorem proving.

Outward rounding can be implemented on any machine that has downward rounding and upward rounding, such as any machine that complies with the IEEE 754 standard. For example, take x + y = [x̲ + y̲, x̄ + ȳ]. If x̲ + y̲ is computed with downward rounding, and x̄ + ȳ is computed with upward rounding, then the resulting interval z = [z̲, z̄] that is represented in the machine must contain the exact range of x + y for x ∈ x and y ∈ y. We call the expansion of the interval from rounding the lower end point down and the upper end point up roundout error.
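The directed rounding modes are awkward to reach from high-level languages. A portable, slightly conservative substitute is to bump each computed end point outward by one unit in the last place using math.nextafter (Python 3.9+); the assumption in this sketch is that a single correctly rounded +, −, ×, or ÷ is within one ulp of the exact value, so the one-ulp widening yields a true (if slightly wider than necessary) enclosure:

```python
import math

def add_outward(x, y):
    """Enclosure of {a + b : a in x, b in y} for intervals given as (lo, hi)."""
    lo = math.nextafter(x[0] + y[0], -math.inf)   # push lower end point down
    hi = math.nextafter(x[1] + y[1], math.inf)    # push upper end point up
    return (lo, hi)

lo, hi = add_outward((0.1, 0.1), (0.2, 0.2))
print(lo <= 0.3 <= hi)   # True: the enclosure contains the exact sum
print(hi - lo)           # the roundout error: about two ulps wide
```

With true directed rounding (as in INTLAB), the resulting intervals would be at most one ulp wide per operation instead of two.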

Interval arithmetic is only subdistributive. That is, if x, y, and z are intervals, then

x(y + z) ⊆ xy + xz, but x(y + z) ≠ xy + xz in general.  (1.4)

As a result, algebraic expressions that would be equivalent if real values were substituted for the variables are not equivalent if interval values are used. For example, suppose that, instead of writing (x − 1)(x + 1) for f(x) in (1.3), we write

f(x) = x² − 1,  (1.5)

and suppose we provide a routine that computes an enclosure for the range of x² that is the exact range to within roundoff error. Such a routine could be as follows.

ALGORITHM 1.1

(Computing an interval whose end points are machine numbers and which encloses the range of x².)

⁵There are small differences in current definitions of extended interval arithmetic. For example, in some systems, −∞ and ∞ are not considered numbers, but just descriptive symbols. In those systems, [1, 2]/[−3, 4] = (−∞, −1/3] ∪ [1/4, ∞) = ℝ \ (−1/3, 1/4). See [31] for a theoretical analysis of extended arithmetic.
⁶who also was a major contributor to the IEEE 754 standard


INPUT: x = [x̲, x̄].

OUTPUT: a machine-representable interval that contains the range of x² over x.

IF x̲ ≥ 0 THEN
  RETURN [x̲², x̄²], where x̲² is computed with downward rounding and x̄² is computed with upward rounding.
ELSE IF x̄ ≤ 0 THEN
  RETURN [x̄², x̲²], where x̄² is computed with downward rounding and x̲² is computed with upward rounding.
ELSE
  1. Compute x̲² and x̄² with both downward and upward rounding; that is, compute machine-representable numbers (x̲²)ₗ and (x̲²)ᵤ such that x̲² ∈ [(x̲²)ₗ, (x̲²)ᵤ], and machine-representable numbers (x̄²)ₗ and (x̄²)ᵤ such that x̄² ∈ [(x̄²)ₗ, (x̄²)ᵤ].
  2. RETURN [0, max{(x̲²)ᵤ, (x̄²)ᵤ}].
END IF

END ALGORITHM 1.1.

With Algorithm 1.1, and rewriting f(x) from (1.3) as in (1.5), we obtain

f([−2, 2]) = [−2, 2]² − 1 = [0, 4] − 1 = [−1, 3],

which in this case is equal to the exact range of f over [−2, 2].

In fact, this illustrates a general principle: if each variable in the expression occurs only once, then interval arithmetic gives the exact range, to within roundout error. We state this formally as follows.

THEOREM 1.8

(Fundamental theorem of interval arithmetic) Suppose f(x₁, x₂, …, xₙ) is an algebraic expression in the variables x₁ through xₙ (or a computer program with inputs x₁ through xₙ), and suppose that this expression is evaluated with interval arithmetic. The algebraic expression or computer program can contain the four elementary operations and operations such as xⁿ, sin(x), exp(x), and log(x), etc., as long as the interval values of these functions contain their ranges over the input intervals. Then:

1. The interval value f(x₁, …, xₙ) contains the range of f over the interval vector (or box) (x₁, …, xₙ).


2. If the individual functions (the elementary operations and functions such as xⁿ, etc.) have interval values that represent their exact ranges, and if each variable xᵢ, 1 ≤ i ≤ n, occurs only once in the expression for f, then the values of f obtained by interval arithmetic represent the exact ranges of f over the input intervals.

If the expression for f contains one or more variables more than once, then overestimation of the range can occur, due to interval dependency. For example, when we evaluate our example function f([−2, 2]) according to (1.3), the first factor, [−1, 3], is the exact range of x + 1 for x ∈ [−2, 2], while the second factor, [−3, 1], is the exact range of x − 1 for x ∈ [−2, 2]. Thus, [−9, 3] is the exact range of f(x₁, x₂) = (x₁ + 1)(x₂ − 1) for x₁ and x₂ independent, x₁ ∈ [−2, 2], x₂ ∈ [−2, 2].

We now present some definitions and theorems to clarify the practical consequences of interval dependency.

DEFINITION 1.6 An expression for f(x₁, …, xₙ) which is written so that each variable occurs only once is called a single use expression, or SUE.

Fortunately, we do not need to transform every expression into a single use expression for interval computations to be of value. In particular, the interval dependency becomes less as the widths of the input intervals become smaller. The following formal definition will help us to describe this precisely.

DEFINITION 1.7 Suppose an interval evaluation f(x₁, …, xₙ) gives [a, b] as a result interval, but the exact range {f(x₁, …, xₙ) : xᵢ ∈ xᵢ, 1 ≤ i ≤ n} is [c, d] ⊆ [a, b]. We define the excess width E(f; x₁, …, xₙ) in the interval evaluation f(x₁, …, xₙ) by E(f; x₁, …, xₙ) = (c − a) + (b − d).

For example, the excess width in evaluating f(x), represented as (x + 1)(x − 1), over x = [−2, 2] is (−1 − (−9)) + (3 − 3) = 8. In general, we have the following.

THEOREM 1.9

Suppose f(x₁, x₂, …, xₙ) is an algebraic expression in the variables x₁ through xₙ (or a computer program with inputs x₁ through xₙ), and suppose that this expression is evaluated with interval arithmetic, as in Theorem 1.8, to obtain an interval enclosure f(x₁, …, xₙ) of the range of f for xᵢ ∈ xᵢ, 1 ≤ i ≤ n. Then, if E(f; x₁, …, xₙ) is as in Definition 1.7, we have

E(f; x₁, …, xₙ) = O( max_{1≤i≤n} w(xᵢ) ),

where w(x) denotes the width of the interval x.


That is, the overestimation becomes less as the uncertainty in the arguments to the function becomes smaller.

Interval evaluations as in Theorem 1.9 are termed first-order interval extensions. It is not difficult to obtain second-order extensions, where required. (See the exercises below.)

1.3.2 Application of Interval Arithmetic: Examples

We give one such example here.

Example 1.22

Using 4-digit decimal floating point arithmetic, compute an interval enclosure for the first two digits of e, and prove that these two digits are correct.

Solution: The fifth degree Taylor polynomial representation for e is

e = 1 + 1 + 1/2! + 1/3! + 1/4! + 1/5! + (1/6!)e^ξ

for some ξ ∈ [0, 1]. If we assume we know e < 3, and we assume we know eˣ is an increasing function of x, then the error term is bounded by

|(1/6!)e^ξ| ≤ 3/720 < 0.005,

so this fifth-degree polynomial representation should be adequate. We will evaluate each term with interval arithmetic, and we will replace e^ξ by [1, 3]. We obtain the following computation:

[1.000, 1.000] + [1.000, 1.000] → [2.000, 2.000]
[1.000, 1.000]/[2.000, 2.000] → [0.5000, 0.5000]
[2.000, 2.000] + [0.5000, 0.5000] → [2.500, 2.500]
[1.000, 1.000]/[6.000, 6.000] → [0.1666, 0.1667]
[2.500, 2.500] + [0.1666, 0.1667] → [2.666, 2.667]
[1.000, 1.000]/[24.00, 24.00] → [0.04166, 0.04167]
[2.666, 2.667] + [0.04166, 0.04167] → [2.707, 2.709]
[1.000, 1.000]/[120.0, 120.0] → [0.008333, 0.008334]
[2.707, 2.709] + [0.008333, 0.008334] → [2.715, 2.718]
[1.000, 1.000]/[720.0, 720.0] → [0.001388, 0.001389]
[0.001388, 0.001389] × [1, 3] → [0.001388, 0.004167]
[2.715, 2.718] + [0.001388, 0.004167] → [2.716, 2.723]

Since we used outward rounding in these computations, this constitutes a mathematical proof that e ∈ [2.716, 2.723]. Note:

8132019 num chap 1

30 Applied Numerical Methods

1. These computations can be done automatically on a computer, as simply as evaluating the function in floating point arithmetic. We will explain some programming techniques for this in Chapter 6, Section 6.2.

2. The solution is illustrative. More sophisticated methods, such as argument reduction, would be used in practice to bound values of e^x more accurately and with fewer operations.
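The outward-rounded computation above can be emulated in a few lines of Python, using the decimal module to model the 4-digit arithmetic. This is a sketch of the idea only, not intlab; the helper names (iadd, idiv, imul_pos) are our own.

```python
from decimal import Decimal, getcontext, ROUND_FLOOR, ROUND_CEILING
import math

getcontext().prec = 4          # 4-digit decimal arithmetic, as in Example 1.22

def iadd(a, b):
    # Interval addition with outward rounding.
    getcontext().rounding = ROUND_FLOOR
    lo = a[0] + b[0]
    getcontext().rounding = ROUND_CEILING
    hi = a[1] + b[1]
    return (lo, hi)

def idiv(a, n):
    # Divide an interval by a positive integer, rounding outward.
    getcontext().rounding = ROUND_FLOOR
    lo = a[0] / n
    getcontext().rounding = ROUND_CEILING
    hi = a[1] / n
    return (lo, hi)

def imul_pos(a, b):
    # Product of two intervals with nonnegative endpoints.
    getcontext().rounding = ROUND_FLOOR
    lo = a[0] * b[0]
    getcontext().rounding = ROUND_CEILING
    hi = a[1] * b[1]
    return (lo, hi)

one = (Decimal(1), Decimal(1))
s = (Decimal(2), Decimal(2))                    # 1 + 1, exact
for k in (2, 3, 4, 5):                          # add 1/2!, ..., 1/5!
    s = iadd(s, idiv(one, Decimal(math.factorial(k))))
# remainder term (1/6!) e^xi, with e^xi replaced by [1, 3]:
s = iadd(s, imul_pos(idiv(one, Decimal(720)), (Decimal(1), Decimal(3))))
print(s)   # (Decimal('2.716'), Decimal('2.723')), as in the hand computation
```

Each operation here is computed exactly and then rounded to 4 digits in the stated direction, which is precisely the outward rounding used in the example.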

Proofs of the theorems, as well as greater detail, appear in various texts on interval arithmetic. A good book on interval arithmetic is R. E. Moore's classic text [27], although numerous more recent monographs and reviews are available. A World Wide Web search on the term "interval computations" will lead to some of these.

A general introduction to interval computations is [26]. That work gives not only a complete introduction, with numerous examples and explanation of pitfalls, but also provides examples with intlab, a free matlab toolbox for interval computations, and reference material for intlab. If you have matlab available, we recommend intlab for the exercises in this book involving interval computations.

1.4 Programming Environments

Modern scientific computing (with floating point numbers) is usually done with high-level "imperative" (as opposed to "functional") programming languages. Common programming environments in use for general scientific computing today are Fortran (FORmula TRANslation), C/C++, and matlab. Fortran is the original such language, with its origins in the late 1950's. There is a large body of high-quality, publicly available software in Fortran for common computations in numerical analysis and scientific computing. Such software can be found, for example, on NETLIB, at

http://www.netlib.org

Fortran has evolved over the years, becoming a modern, multi-faceted language with the Fortran 2003 standard. Throughout, the emphasis by both the standardization committee and suppliers of compilers for Fortran has been on features that simplify programming of solutions to large problems in numerical analysis and scientific computing, and on features that enable high performance, especially on computers that can process vectors and matrices efficiently.

The "C" language, originally developed in conjunction with the Unix operating system, was meant to be a higher-level language for designing and accessing the operating system, but has become more ubiquitous since then. C++, appearing in the late 1980's, was the first widely available language⁷ to allow the object-oriented programming paradigm. In recent years, computer science departments have favored teaching C++ over teaching Fortran, and Fortran has fallen out of favor in relative terms. However, Fortran is still favored in certain large-scale applications, such as fluid dynamics (e.g., in weather prediction and similar simulations), and some courses are still offered in it in engineering schools. Some people, though, still think of Fortran as the now somewhat rudimentary language known as FORTRAN 77.

Reasonably high-quality compilers for both Fortran and C/C++ are available free of charge with Linux operating systems. Fortran 2003, largely implemented in these compilers, has a standardized interface to C, so functions written in C can be called from Fortran programs, and vice versa. These compilers include interactive graphical-user-interface-oriented debuggers, such as "insight," available with the Linux operating system. Commercially available compilation and debugging systems are also available under Windows.

The matlab system has become increasingly popular over the last two decades or so. matlab (or MATrix LABoratory) began in the early 1980's as a National Science Foundation project, written by Cleve Moler in FORTRAN 66, to provide an interactive environment for computing with matrices and vectors; it has since evolved to be both an interactive environment and a full-featured programming language. matlab is highly favored in courses such as this because of the ease of programming, debugging, and general use (such as graphing), and because of the numerous toolboxes, supplied by both Mathworks (Cleve Moler's company) and others, for many basic computing tasks and applications. The main drawback to use of matlab in all scientific computations is that the language is interpretive; that is, matlab translates each line of a program to machine language each time that it executes the line. This makes complicated programs that involve nested iterations much slower (a factor of 60 or more) than comparable programs written in a compiled language such as C or Fortran. However, functions compiled from Fortran or C/C++ can be called from matlab. A common strategy has been to initially develop algorithms in matlab, then translate all or part of the program to a compilable language, as necessary for efficiency.

One perceived disadvantage of matlab is that it is proprietary. Undesirable possible consequences are that it is not free, and that there is no official guarantee it will be available, forever and unchanged. However, its use has become so widespread in recent years that these concerns do not seem to be major. Several projects, including "Octave" and "Scilab," have produced free products that partially support the matlab programming language. The most widely distributed of these, "Octave," is integrated into Linux systems. However, the object-oriented features of Octave are rudimentary compared to those of matlab, and some toolboxes, such as intlab (which we will mention later), will not function with Octave.

⁷With others, including Fortran, to follow.


Alternative systems sometimes used for scientific computing are computer algebra systems. Perhaps the most common of these are Mathematica and Maple, while a free such system under development is "SAGE." These systems admit a different way of thinking about programming, termed functional programming, in which rules are defined and available at all times to automatically simplify expressions that are presented to the system. (In contrast, in imperative programming, a sequence of commands is executed one after the other.) Although these systems have become comprehensive, they are based in computations of a different character, rather than in the floating point computations and linear algebra involved in numerical analysis and scientific computing.

We will use matlab in this book to illustrate the concepts, techniques, and applications. With newer versions of matlab, a student can study how to use the system and make programs largely by using the matlab help system. The first place to turn will be the "Getting started" demos, which in newer versions are presented as videos. There are also many books devoted to the use of matlab. Furthermore, we will be giving examples throughout this book.

matlab programs can be written as matlab scripts and matlab functions.

Example 1.23

The matlab script we used to produce the table following Example 1.7 (on page 8) is:

a = 2
x = 2
xold = x
err_old = 1
for k=0:10
    k
    x
    err = x - sqrt(2)
    err
    ratio = err/err_old^2
    err_old = err
    x = x/2 + 1/x
end

Example 1.24

The matlab script we used to produce the table in Example 1.8 (on page 9) is:

format long
a = 2
x = 2
xold = x
err_old = 1
for k=0:25
    k
    x
    err = x - sqrt(2)
    err
    ratio = err/err_old
    err_old = err
    x = x - x^2/3.5 + 2/3.5
end

An excellent alternative textbook that focuses on matlab functions is Cleve Moler's Numerical Computing with MATLAB [25]. An on-line version, along with ".m" files, etc., is currently available at http://www.mathworks.com/moler/chapters.html

1.5 Applications

The purpose of the methods and techniques in this book, ultimately, is to provide both accurate predictions and insight into practical problems. This includes understanding and predicting, and managing or controlling, the evolution of ecological systems and epidemics; designing and constructing durable but inexpensive bridges, buildings, roads, and water control structures; understanding chemical and physical processes; designing chemical plants and electronic components and systems; and minimizing costs or maximizing delivery of products or services within companies and governments. To achieve these goals, the numerical methods are a small part of the overall modeling process, which can be viewed as consisting of the following steps:

Identify the problem. This is the first step in translating an often vague situation into a mathematical problem to be solved. What questions must be answered, and how can they be quantified?

Assumptions. Which factors are to be ignored, and which are important? The real world is usually significantly more complicated than mathematical models of it, and simplifications must be made: because some factors are poorly understood, because there isn't enough data to determine some minor factors, or because it is not practical to accurately solve the resulting equations unless the model is simplified. For example, the theory of relativity, and variations in the acceleration of gravity (due to the fact that the earth is not exactly round, and due to the fact that the density differs from point to point on the surface of the earth), will in principle affect the trajectory of a baseball as it leaves the bat. However, such effects can be ignored when we write down a model of the trajectory of the baseball. On the other hand, we need to include such effects if we are measuring the change in distance between two satellites in a tandem orbit, to detect the location of mineral deposits on the surface of the earth.

Construction. In this step, we actually translate the problem into mathematical language.

Analysis. We solve the mathematical problem. Here is where the numerical techniques in this book come into play. With more complicated models, there is an interplay between the previous three steps and this solution process: we may need to simplify the model to enable practical solution. Presentation of the result is also important here, to maximize the usefulness of the results. In the early days of scientific computing, printouts of numbers were used, but increasingly, results are presented as two- and three-dimensional graphs and movies.

Interpretation. The numerical solution is compared to the original problem. If it does not make sense, go back and reformulate the assumptions.

Validation. Compare the model to real data. For example, in climate models, the model might be used to predict climate changes in past years, before it is used to predict future climate changes.

Note that there is an intrinsic error introduced in the modeling process (such as when certain phenomena are ignored) that is outside the scope of our study of numerical methods. Such error can only be measured indirectly, through the interpretation and validation steps. In the model solution process (the "analysis" step), errors are also introduced due to roundoff error and the approximation process. We have seen that such error consists of approximation error and roundoff error. In a study of numerical methods and numerical analysis, we quantify and find bounds on such errors. Although this may not be the major source of error, it is important to know. Consequences of this type of error might be that a good model is rejected, that incorrect conclusions are deduced about the process being modeled, etc. The authors of this book have personal experience with these events.

Errors in the modeling process can sometimes be quantified in the solution process. If the model depends on parameters that are not known precisely, but bounds on those parameters are known, knowledge of these bounds can sometimes be incorporated into the mathematical equations, and the set of possible solutions can sometimes be computed or bounded. One tool that sometimes works is interval arithmetic. Other tools, less mathematically definite but applicable in different situations, are statistical methods, and computing solutions to the model for many different values of the parameters.

Throughout this book, we introduce applications from many areas.

Example 1.25

The formula for the net capacitance when two capacitors of values x and y are connected in series is

z = xy / (x + y).

Suppose the measured values of x and y are x = 1 and y = 2, respectively. Estimate the range of possible values of z, given that the true values of x and y are known to be within ±10% of the measured value.

In this example, the identification, assumptions, and construction have already been done. (It is well known how capacitances in a linear electrical circuit behave.) We are asked to analyze the error in the output of the computation, due to errors in the data. We may proceed using interval arithmetic, relying on the accuracy assumptions for the measured values. In particular, these assumptions imply that x ∈ [0.9, 1.1] and y ∈ [1.8, 2.2]. We will plug these intervals into the expression for z, but we first use Theorem 1.8, part (2), as a guide to rewrite the expression for z so x and y only occur once. (We do this so we obtain sharp bounds on the range, without overestimation.) Dividing the numerator and denominator for z by xy, we obtain

z = 1 / (1/x + 1/y).

We use the intlab toolbox⁸ for matlab to evaluate z. We have the following dialog in matlab's command window:

>> intvalinit('DisplayInfsup')
===> Default display of intervals by infimum/supremum
>> x = intval('[0.9,1.1]')
intval x =
[ 0.8999, 1.1001]
>> y = intval('[1.8,2.2]')
intval y =
[ 1.7999, 2.2001]
>> z = 1/(1/x + 1/y)
intval z =
[ 0.5999, 0.7334]
>> format long
>> z
intval z =
[ 0.59999999999999, 0.73333333333334]
>>

⁸If one has matlab, intlab is available free of charge for non-commercial use from http://www.ti3.tu-harburg.de/~rump/intlab

Thus, the capacitance must lie between 0.5999 and 0.7334. Note that x and y are input as strings. This is to assure that roundoff errors in converting the decimal expressions 0.9, 1.1, 1.8, and 2.2 into internal binary format are taken into account. See [26] for more examples of the use of intlab.
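Because z = 1/(1/x + 1/y) is increasing in each of x and y for positive arguments, the exact range can also be cross-checked in ordinary floating point by evaluating at the interval endpoints. The following plain-Python sketch (our own, without intlab's rigorous outward rounding) does exactly that:

```python
def series_capacitance(x, y):
    # Net capacitance of capacitors x and y in series, in the single-use
    # form z = 1/(1/x + 1/y) from the text.
    return 1.0 / (1.0 / x + 1.0 / y)

# z is increasing in each argument for x, y > 0, so its range over
# x in [0.9, 1.1], y in [1.8, 2.2] is attained at the corner points.
lo = series_capacitance(0.9, 1.8)
hi = series_capacitance(1.1, 2.2)
print(lo, hi)   # approximately 0.6 and 0.7333, inside intlab's enclosure
```

Note that this monotonicity argument gives the exact range only because each variable occurs once; it is a sanity check on the interval result, not a replacement for it.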

1.6 Exercises

1. Write down a polynomial p(x) such that |S(x) − p(x)| ≤ 10⁻¹⁰ for −0.2 ≤ x ≤ 0.2, where

S(x) = sin(x)/x if x ≠ 0, and S(x) = 1 if x = 0.

Note: sinc(x) = S(πx) = sin(πx)/(πx) is the "sinc" function (well-known in signal processing, etc.).

(a) Show that your polynomial p satisfies the condition |S(x) − p(x)| ≤ 10⁻¹⁰ for x ∈ [−0.2, 0.2].
Hint: You can obtain polynomial approximations with error terms for S(x) by writing down Taylor polynomials and corresponding error terms for sin(x), then dividing these by x. This can be easier than trying to differentiate S(x). For the proof part, you can use, for example, the Taylor polynomial remainder formula or the alternating series test, and you can use interval arithmetic to obtain bounds.

(b) Plot your polynomial approximation and S(x) on the same graph,

(i) over the interval [−0.2, 0.2],

(ii) over the interval [−3, 3],

(iii) over the interval [−10, 10].

2. Suppose f has a continuous third derivative. Show that

| (f(x + h) − f(x − h)) / (2h) − f′(x) | = O(h²).

3. Suppose f has a continuous fourth derivative. Show that

| (f(x + h) − 2f(x) + f(x − h)) / h² − f″(x) | = O(h²).


4. Let a = 0.41, b = 0.36, and c = 0.7. Assuming a 2-digit decimal computer arithmetic with rounding, show that (a − b)/c ≠ a/c − b/c when using this arithmetic.

5. Write down a formula relating the unit roundoff δ of Definition 1.3 (page 14) and the machine epsilon ε_m defined on page 20.

6. Store and run the following matlab script. What are your results? What does the script compute? Can you say anything about the computer arithmetic underlying matlab?

eps = 1;
x = 1+eps;
while(x~=1)
    eps = eps/2; x = 1+eps;
end
eps = eps+(2*eps)^2;
y = 1+eps;
y-1

7. Suppose, for illustration, we have a system with base β = 10, t = 3 decimal digits in the mantissa, and L = −9, U = 9 for the exponent. For example, 0.123 × 10⁴, that is, 1230, is a machine number in this system. Suppose also that "round to nearest" is used in this system.

(a) What is HUGE for this system?

(b) What is TINY for this system?

(c) What is the machine epsilon ε_m for this system?

(d) Let f(x) = sin(x) + 1.

i. Write down fl(f(0)) and fl(f(0.0008)) in normalized format for this toy system.

ii. Compute fl( fl(f(0.0008)) − fl(f(0)) ). On the other hand, what is the nearest machine number to the exact value of f(0.0008) − f(0)?

iii. Compute fl( fl(f(0.0008)) − fl(f(0)) ) / fl(0.0008). Compare this to the nearest machine number to the exact value of (f(0.0008) − f(0))/0.0008, and to f′(0).

8. Let f(x) = ln(x + 1) − ln(x).

(a) Use four-digit decimal arithmetic with rounding to evaluate f(100,000).


(b) Use the Mean Value Theorem to approximate f(x) in a form that avoids the loss of significant digits. Use this form to evaluate f(x) for x = 100,000 once again.

(c) Compare the relative errors for the answers obtained in (a) and (b).

9. Compute the condition number of f(x) = e^{√(x² − 1)}, x > 1, and discuss any possible ill-conditioning.

10. Let f(x) = (sin(x))² + x/2. Use interval arithmetic to prove that there are no solutions to f(x) = 0 for x ∈ [−1, −0.8].


For f(x) = sin(x) and n even, Taylor's theorem gives:

n    P_n(x)                        R_n(x)
2    x                             −(x³/3!) cos(c₂)
4    x − x³/3!                     (x⁵/5!) cos(c₄)
6    x − x³/3! + x⁵/5!             −(x⁷/7!) cos(c₆)
n    —                             (−1)^{n/2} (x^{n+1}/(n+1)!) cos(c_n)

Observing that |cos(c_n)| ≤ 1, we see that

|R_n(x)| ≤ |x|^{n+1} / (n + 1)!.

We may thus form the following table.

n     bound on error R_n
2     1.67 × 10⁻⁴
4     8.33 × 10⁻⁸
6     1.98 × 10⁻¹¹
8     2.76 × 10⁻¹⁵
10    2.51 × 10⁻¹⁹

Thus, a polynomial with the required accuracy for x ∈ [−0.1, 0.1] is

p(x) = x − x³/3! + x⁵/5! − x⁷/7! + x⁹/9!.
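The error bounds in the table are easy to spot-check numerically. The following Python sketch (our own illustrative code, not from the text) verifies |sin(x) − P_n(x)| ≤ |x|^{n+1}/(n+1)! at x = 0.1 for the first few even n; larger n fall below double-precision resolution:

```python
import math

def p_sin(x, n):
    # Taylor polynomial P_n of sin about 0, for n even: odd powers below n.
    return sum((-1) ** (k // 2) * x ** k / math.factorial(k)
               for k in range(1, n, 2))

x = 0.1
for n in (2, 4, 6):
    err = abs(math.sin(x) - p_sin(x, n))
    bound = abs(x) ** (n + 1) / math.factorial(n + 1)
    assert err <= bound      # matches the rows 1.67e-4, 8.33e-8, 1.98e-11
    print(n, err, bound)
```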

An important special case of Taylor's theorem is obtained with n = 0 (that is, directly from the Fundamental Theorem of Calculus):

THEOREM 1.4

(Mean value theorem) Suppose f ∈ C¹[a, b], x ∈ [a, b], and y ∈ [a, b] (and, without loss of generality, x ≤ y). Then there is a c ∈ [x, y] ⊆ [a, b] such that

f(y) − f(x) = f′(c)(y − x).

Example 1.4

Suppose f(1) = 1 and |f′(x)| ≤ 2 for x ∈ [1, 2]. What are an upper bound and a lower bound on f(2)?


The mean value theorem tells us that

f(2) = f(1) + f′(c)(2 − 1) = f(1) + f′(c)

for some c ∈ (1, 2). Furthermore, the fact that |f′(x)| ≤ 2 is equivalent to −2 ≤ f′(x) ≤ 2. Combining these facts gives

1 − 2 = −1 ≤ f(2) ≤ 1 + 2 = 3.

1.1.2 Big "O" Notation

We study "rates of growth" and "rates of decrease" of errors. For example, if we approximate e^h by a first degree Taylor polynomial about x = 0, we get

e^h − (1 + h) = (1/2) h² e^ξ,

where ξ is some unknown quantity between 0 and h. Although we don't know exactly what e^ξ is, we know that it is nearly constant (in this case, approximately 1) for h near 0, so the error e^h − (1 + h) is roughly proportional to h² for h small. This approximate proportionality is often more important to know than the slowly varying constant e^ξ. The big "O" and little "o" notation are used to describe and keep track of this approximate proportionality.

DEFINITION 1.1 Let E(h) be an expression that depends on a small quantity h. We say that E(h) = O(h^k) if there are an ε and a C such that

|E(h)| ≤ C h^k

for all |h| ≤ ε.

The "O" denotes "order." For example, if f(h) = O(h²), we say that "f exhibits order 2 convergence to 0 as h tends to 0."

Example 1.5

Let E(h) = e^h − h − 1. Then E(h) = O(h²).

PROOF By Taylor's Theorem,

e^h = e⁰ + e⁰(h − 0) + (h²/2) e^ξ

for some ξ between 0 and h. Thus,

E(h) = e^h − 1 − h ≤ (h²/2) e¹   and   E(h) ≥ 0

for |h| ≤ 1; that is, ε = 1 and C = e/2 work.

Example 1.6

Show that

| (f(x + h) − f(x)) / h − f′(x) | = O(h)

for x, x + h ∈ [a, b], assuming that f has two continuous derivatives at each point in [a, b].

PROOF

| (f(x + h) − f(x)) / h − f′(x) |
  = | ( f(x) + f′(x)h + ∫ₓ^{x+h} (x + h − t) f″(t) dt − f(x) ) / h − f′(x) |
  = | (1/h) ∫ₓ^{x+h} (x + h − t) f″(t) dt |
  ≤ max_{a≤t≤b} |f″(t)| · h/2 = c h.

1.1.3 Convergence Rates

DEFINITION 1.2 Let {x_k} be a sequence with limit x*. If there are constants C and α and an integer N such that |x_{k+1} − x*| ≤ C |x_k − x*|^α for k ≥ N, we say that the rate of convergence is of order at least α. If α = 1 (with C < 1), the rate is said to be linear. If α = 2, the rate is said to be quadratic.

Example 1.7

A sequence sometimes learned in elementary classes for computing the square root of a number a is

x_{k+1} = x_k/2 + a/(2 x_k).


We have

x_{k+1} − √a = x_k/2 + a/(2 x_k) − √a
             = (x_k² − 2√a x_k + a) / (2 x_k)
             = (x_k − √a)² / (2 x_k)
             ≈ (1/(2√a)) (x_k − √a)²,

thus showing that the convergence rate is quadratic.

Quadratic convergence is very fast. We can think of quadratic convergence with C ≈ 1 as doubling the number of significant figures on each iteration. (In contrast, linear convergence with C = 0.1 adds one decimal digit of accuracy to the approximation on each iteration.) For example, if we use the square root computation from Example 1.7 with a = 2, starting with x₀ = 2, we obtain the following table:

k    x_k                  x_k − √2          (x_k − √2)/(x_{k−1} − √2)²
0    2                    0.5858 × 10⁰      —
1    1.5                  0.8579 × 10⁻¹     0.2500
2    1.416666666666667    0.2453 × 10⁻²     0.3333
3    1.414215686274510    0.2123 × 10⁻⁵     0.3529
4    1.414213562374690    0.1594 × 10⁻¹¹    0.3535
5    1.414213562373095    0.2204 × 10⁻¹⁷    —

In this table, the correct digits are underlined. This table illustrates that the total number of correct digits more than doubles on each iteration. In fact, the multiplying factor C for the quadratic convergence appears to be approaching 0.3535. (The last error ratio is not meaningful in this sense, because only roughly 16 digits were carried in the computation.) Based on our analysis, the limiting value of C should be about 1/(2√2) ≈ 0.3536. (We explain how we computed the table at the end of this chapter.)
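The ratio column of the table is easy to reproduce in any language with double-precision arithmetic. Here is a short Python sketch (our own, paralleling the matlab script given later in this chapter) that also exhibits the approach to 1/(2√2):

```python
import math

x = 2.0
errs = [x - math.sqrt(2.0)]
for k in range(4):
    x = x / 2.0 + 1.0 / x           # x_{k+1} = x_k/2 + a/(2 x_k) with a = 2
    errs.append(x - math.sqrt(2.0))

# quadratic-convergence ratios e_{k+1} / e_k^2
ratios = [errs[k + 1] / errs[k] ** 2 for k in range(4)]
print(ratios)   # 0.25, 0.3333..., 0.3529..., then ~0.3535
```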


Example 1.8

As an example of linear convergence, consider the iteration

x_{k+1} = x_k − x_k²/3.5 + 2/3.5,

whose fixed point is √2. We obtain the following table:

k     x_k                  x_k − √2          (x_k − √2)/(x_{k−1} − √2)
0     2                    0.5858 × 10⁰      —
1     1.428571428571429    0.1436 × 10⁻¹     0.2451 × 10⁻¹
2     1.416909620991254    0.2696 × 10⁻²     0.1878
3     1.414728799831946    0.5152 × 10⁻³     0.1911
4     1.414312349239392    0.9879 × 10⁻⁴     0.1917
5     1.414232514607664    0.1895 × 10⁻⁴     0.1918
6     1.414217198786659    0.3636 × 10⁻⁵     0.1919
7     1.414214260116949    0.6955 × 10⁻⁶     0.1919
8     1.414213696254626    0.1339 × 10⁻⁶     0.1919
...
19    1.414213562373097    0.1554 × 10⁻¹⁴    —

Here, the constant C in the linear convergence, to four significant digits, appears to be 0.1919 ≈ 1/5. That is, the error is reduced by approximately a factor of 5 on each iteration. We can think of this as obtaining one more correct base-5 digit on each iteration.
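The observed constant 0.1919 is just |g′(√2)| for the iteration function g(x) = x − x²/3.5 + 2/3.5; the following Python sketch (our own check, not part of the text) confirms this:

```python
import math

def g(x):
    # Fixed-point iteration function; g(x) = x forces x^2 = 2.
    return x - x ** 2 / 3.5 + 2.0 / 3.5

x = 2.0
err_old = x - math.sqrt(2.0)
ratios = []
for k in range(8):
    x = g(x)
    err = x - math.sqrt(2.0)
    ratios.append(err / err_old)
    err_old = err

C = abs(1.0 - 2.0 * math.sqrt(2.0) / 3.5)   # |g'(sqrt(2))| ~ 0.191878
print(ratios[-1], C)
```

Differentiating g gives g′(x) = 1 − 2x/3.5, so the asymptotic error ratio is |1 − 2√2/3.5| ≈ 0.19188, matching the table's 0.1919.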

1.2 Computer Arithmetic

In numerical solution of mathematical problems, two common types of error are:

1. Method (algorithm or truncation) error. This is the error due to approximations made in the numerical method.

2. Rounding error. This is the error made due to the finite number of digits available on a computer.


Example 1.9

By the mean value theorem for integrals (Theorem 1.2, as in Example 1.6 on page 7), if f ∈ C²[a, b], then

f′(x) = (f(x + h) − f(x)) / h − (1/h) ∫ₓ^{x+h} f″(t)(x + h − t) dt,

and

| (1/h) ∫ₓ^{x+h} f″(t)(x + h − t) dt | ≤ c h.

Thus, f′(x) ≈ (f(x + h) − f(x))/h, and the error is O(h). We will call this the method error or truncation error, as opposed to roundoff errors due to using machine approximations.

Now consider f(x) = ln x, and approximate f′(3) ≈ (ln(3 + h) − ln 3)/h for h small, using a calculator having 11 digits. The following results were obtained:

h        (ln(3 + h) − ln(3))/h    Error = 1/3 − (ln(3 + h) − ln(3))/h, which is O(h)
10⁻¹     0.3278982                5.44 × 10⁻³
10⁻²     0.332779                 5.54 × 10⁻⁴
10⁻³     0.3332778                5.55 × 10⁻⁵
10⁻⁴     0.333328                 5.33 × 10⁻⁶
10⁻⁵     0.333330                 3.33 × 10⁻⁶
10⁻⁶     0.333300                 3.33 × 10⁻⁵
10⁻⁷     0.333                    3.33 × 10⁻⁴
10⁻⁸     0.33                     3.33 × 10⁻³
10⁻⁹     0.3                      3.33 × 10⁻²
10⁻¹⁰    0.0                      3.33 × 10⁻¹

One sees that, in the first four steps, the error decreases by a factor of 10 as h is decreased by a factor of 10. (That is, the method error dominates.) However, starting with h = 0.00001, the error increases. (The error due to a finite number of digits, i.e., roundoff error, dominates.)

There are two possible ways to reduce rounding error:

1. The method error can be reduced by using a more accurate method. This allows a larger h to be used, thus avoiding roundoff error. Consider

f′(x) = (f(x + h) − f(x − h)) / (2h) + error, where the error is O(h²).

h       (ln(3 + h) − ln(3 − h))/(2h)    error
0.1     0.3334568                       1.24 × 10⁻⁴
0.01    0.3333345                       1.23 × 10⁻⁶
0.001   0.3333333                       1.91 × 10⁻⁸
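In double precision (about 16 digits rather than the calculator's 11), the same experiment can be sketched as follows; with this many digits, roundoff takes over only at much smaller h than in the tables above:

```python
import math

f, x, fprime = math.log, 3.0, 1.0 / 3.0   # f(x) = ln x, f'(3) = 1/3

for h in (1e-1, 1e-2, 1e-3, 1e-4):
    fwd = (f(x + h) - f(x)) / h               # forward difference, O(h) error
    ctr = (f(x + h) - f(x - h)) / (2.0 * h)   # central difference, O(h^2) error
    print(h, abs(fprime - fwd), abs(fprime - ctr))
```

The printed forward-difference errors shrink by roughly a factor of 10 per step, and the central-difference errors by roughly a factor of 100, mirroring the O(h) and O(h²) behavior.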


The error decreases by a factor of 100 as h is decreased by a factor of 10.

2. Rounding error can be reduced by using more digits of accuracy, such as using double precision (or multiple precision) arithmetic.

To fully understand and avoid roundoff error, we should study some details of how computers and calculators represent and work with approximate numbers.

1.2.1 Floating Point Arithmetic and Rounding Error

Let β be a positive integer, the base of the computer system. (Usually β = 2 (binary) or β = 16 (hexadecimal).) Suppose a number x has the exact base-β representation

x = (±0.α₁α₂α₃ ··· αₜαₜ₊₁ ···) β^m = ±q β^m,

where q is the mantissa, β is the base, m is the exponent, 1 ≤ α₁ ≤ β − 1, and 0 ≤ αᵢ ≤ β − 1 for i > 1.

On a computer, we are restricted to a finite set of floating-point numbers F = F(β, t, L, U) of the form x* = (±0.a₁a₂ ··· aₜ) β^m, where 1 ≤ a₁ ≤ β − 1, 0 ≤ aᵢ ≤ β − 1 for 2 ≤ i ≤ t, L ≤ m ≤ U, and t is the number of digits. (In most floating point systems, L is about −64 to −1000, and U is about 64 to 1000.)

Example 1.10

(binary) β = 2:

x* = (0.1011)₂ · 2³ = (1 × 1/2 + 0 × 1/4 + 1 × 1/8 + 1 × 1/16) × 8 = 11/2 = 5.5 (decimal).

REMARK 1.1 Most numbers cannot be exactly represented on a computer. Consider x = 10.1 = (1010.0001 1001 1001 ···)₂ (β = 2). If L = −127, U = 127, t = 24, and β = 2, then x ≈ x* = (0.1010 0001 1001 1001 1001 1001)₂ · 2⁴.
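Both observations are easy to check in Python (an illustrative sketch of our own): the finite expansion of Example 1.10 evaluates exactly, while 10.1 is necessarily stored rounded.

```python
from decimal import Decimal

# (0.1011)_2 * 2^3, assembled digit by digit as in Example 1.10:
digits = [1, 0, 1, 1]
value = sum(d * 2.0 ** -(i + 1) for i, d in enumerate(digits)) * 2 ** 3
print(value)             # 5.5, exactly representable in binary

# 10.1 has no finite binary expansion, so the stored double differs slightly;
# Decimal(float) exposes the exact value of the nearest double:
print(Decimal(10.1))
```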

Question: Given a real number x, how do we define a floating point number fl(x) in F such that fl(x) is close to x? On modern machines, one of the following four ways is used to approximate a real number x by a machine-representable number fl(x):

round down: fl(x) = x↓, the nearest machine representable number to the real number x that is less than or equal to x;


round up: fl(x) = x↑, the nearest machine number to the real number x that is greater than or equal to x;

round to nearest: fl(x) is the nearest machine number to the real number x;

round to zero, or "chopping": fl(x) is the nearest machine number to the real number x that is closer to 0 than x. The term "chopping" is because we simply "chop" the expansion of the real number; that is, we simply ignore the digits in the expansion of x beyond the t-th one.

The default on modern systems is usually round to nearest, although chopping is faster or requires less circuitry. Round down and round up may be used, but with care, to produce results from a string of computations that are guaranteed to be less than or greater than the exact result.

Example 1.11

β = 10, t = 5, x = 0.12345666··· × 10⁷. Then

fl(x) = 0.12345 × 10⁷ (chopping),
fl(x) = 0.12346 × 10⁷ (round to nearest).

(In this case, round down corresponds to chopping, and round up corresponds to round to nearest.)
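The two roundings of Example 1.11 can be mimicked with Python's decimal module at precision t = 5 (a sketch; we use a truncated input value, and the unary + applies the context's precision and rounding):

```python
from decimal import Decimal, getcontext, ROUND_DOWN, ROUND_HALF_UP

getcontext().prec = 5                  # t = 5 decimal digits
x = Decimal("1234566.6")               # 0.12345666... * 10^7, truncated

getcontext().rounding = ROUND_DOWN     # chopping (round toward zero)
chopped = +x
getcontext().rounding = ROUND_HALF_UP  # round to nearest
nearest = +x

print(chopped, nearest)   # 1.2345E+6 and 1.2346E+6
```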

See Figure 1.2 for an example with β = 10 and t = 1. In that figure, the exhibited floating point numbers are (0.1) × 10¹, (0.2) × 10¹, ..., (0.9) × 10¹, 0.1 × 10².

FIGURE 1.2: An example floating point system: β = 10, t = 1, and m = 1. (Successive floating point numbers are spaced β^{m−t} = 10⁰ = 1 apart, from β^{m−1} = 1 up to β^m = 10.)

Example 1.12

Let a = 0.410, b = 0.000135, and c = 0.000431. Assuming 3-digit decimal computer arithmetic with rounding to nearest, does a + (b + c) = (a + b) + c when using this arithmetic?


Following the "rounding to nearest" definition of fl, we emulate the operations a machine would do, as follows:

a ← 0.410 × 10⁰, b ← 0.135 × 10⁻³, c ← 0.431 × 10⁻³,

and

fl(b + c) = fl(0.135 × 10⁻³ + 0.431 × 10⁻³)
          = fl(0.566 × 10⁻³)
          = 0.566 × 10⁻³,

so

fl(a + 0.566 × 10⁻³) = fl(0.410 × 10⁰ + 0.566 × 10⁻³)
                     = fl(0.410 × 10⁰ + 0.000566 × 10⁰)
                     = fl(0.410566 × 10⁰)
                     = 0.411 × 10⁰.

On the other hand,

fl(a + b) = fl(0.410 × 10⁰ + 0.135 × 10⁻³)
          = fl(0.410000 × 10⁰ + 0.000135 × 10⁰)
          = fl(0.410135 × 10⁰)
          = 0.410 × 10⁰,

so

fl(0.410 × 10⁰ + c) = fl(0.410 × 10⁰ + 0.431 × 10⁻³)
                    = fl(0.410 × 10⁰ + 0.000431 × 10⁰)
                    = fl(0.410431 × 10⁰)
                    = 0.410 × 10⁰ ≠ 0.411 × 10⁰.

Thus, the associative law of addition does not hold for floating point arithmetic with "round to nearest." Furthermore, this illustrates that accuracy is improved if numbers of like magnitude are added first in a sum.
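Example 1.12 can be replayed directly with Python's decimal module set to 3 significant digits; round-half-even ties do not arise with these values, so the default rounding behaves as "round to nearest" here (an illustrative sketch):

```python
from decimal import Decimal, getcontext

getcontext().prec = 3    # 3-digit decimal arithmetic

a, b, c = Decimal("0.410"), Decimal("0.000135"), Decimal("0.000431")

left = a + (b + c)       # each operation rounds its result to 3 digits
right = (a + b) + c
print(left, right)       # 0.411 and 0.410
```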

The following error bound is useful in some analyses.

THEOREM 1.5

|x − fl(x)| ≤ (p/2) |x| β^{1−t},

where p = 1 for rounding and p = 2 for chopping.


DEFINITION 1.3 δ = (p/2) β^{1−t} is called the unit roundoff error.

Let ε = (fl(x) − x)/x. Then fl(x) = (1 + ε)x, where |ε| ≤ δ. With this, we have the following.

THEOREM 1.6

Let ⊙ denote the operation +, −, ×, or ÷, and let x and y be machine numbers. Then

fl(x ⊙ y) = (x ⊙ y)(1 + ε), where |ε| ≤ δ = (p/2) β^{1−t}.

Roundoff error that accumulates as a result of a sequence of arithmetic operations can be analyzed using this theorem. Such an analysis is called forward error analysis.

Because of the properties of floating point arithmetic, it is unreasonable to demand strict tolerances when the exact result is large.

Example 1.13

Suppose $\beta = 10$ and $t = 3$ (3-digit decimal arithmetic), and suppose we wish to compute $10^4\pi$ with a computed value $x$ such that $|10^4\pi - x| < 10^{-2}$. The closest floating point number in our system to $10^4\pi$ is $x = 0.314 \times 10^5 = 31400$. However, $|10^4\pi - x| \approx 15.926$. Hence, it is impossible to find a number $x$ in the system with

$$|10^4\pi - x| < 10^{-2}.$$

The error $|10^4\pi - x|$ in this example is called the absolute error in approximating $10^4\pi$. We see that absolute error is not an appropriate measure of error when using floating point arithmetic. For this reason, we use relative error.

DEFINITION 1.4 Let $x^*$ be an approximation to $x$. Then $|x - x^*|$ is called the absolute error, and $\left|\dfrac{x - x^*}{x}\right|$ is called the relative error.

For example,

$$\left|\frac{x - fl(x)}{x}\right| \le \delta = \frac{p}{2}\beta^{1-t} \quad \text{(unit roundoff error)}.$$

1.2.1.1 Expression Evaluation and Condition Numbers

We now examine some common situations in which roundoff error can become large, and explain how to avoid many of these situations.


Mathematical Review and Computer Arithmetic 15

Example 1.14

Let $\beta = 10$, $t = 4$, and $p = 1$ (thus $\delta = \frac{1}{2} \times 10^{-3} = 0.0005$). Let $x = 0.5795 \times 10^5$ and $y = 0.6399 \times 10^5$. Then

$$fl(x + y) = 0.1219 \times 10^6 = (x + y)(1 + \epsilon_1), \quad \epsilon_1 \approx -3.28 \times 10^{-4}, \quad |\epsilon_1| < \delta,$$

and

$$fl(xy) = 0.3708 \times 10^{10} = (xy)(1 + \epsilon_2), \quad \epsilon_2 \approx -5.95 \times 10^{-5}, \quad |\epsilon_2| < \delta.$$

(Note: $x + y = 0.12194 \times 10^6$ and $xy = 0.37082205 \times 10^{10}$.)

Example 1.15

Suppose $\beta = 10$ and $t = 4$ (4-digit arithmetic), and suppose $x_1 = 10000$ and $x_2 = x_3 = \cdots = x_{1001} = 1$. Then

$$
\begin{aligned}
fl(x_1 + x_2) &= 10000,\\
fl(x_1 + x_2 + x_3) &= 10000,\\
&\ \,\vdots\\
fl\left(\sum_{i=1}^{1001} x_i\right) &= 10000
\end{aligned}
$$

when we sum forward, from $x_1$. But, going backwards,

$$
\begin{aligned}
fl(x_{1001} + x_{1000}) &= 2,\\
fl(x_{1001} + x_{1000} + x_{999}) &= 3,\\
&\ \,\vdots\\
fl\left(\sum_{i=1001}^{1} x_i\right) &= 11000,
\end{aligned}
$$

which is the correct sum.

This example illustrates the point that large relative errors can occur when a large number of small numbers is added to a large number. To avoid such large relative errors, one can sum from the smallest number to the largest number. However, this will not work if the numbers are all approximately equal. In such cases, one possibility is to group the numbers into sets of two adjacent numbers, summing two almost equal numbers together. One then groups those results into sets of two and sums these together, continuing until the total sum is reached. In this scheme, two almost equal numbers are always being summed, and the large relative error from repeatedly summing a small number to a large number is avoided.


Example 1.16

Let $x_1 = 15.314768$ and $x_2 = 15.314899$, with $\beta = 10$ and $t = 6$ (6-digit decimal arithmetic). Then $x_2 - x_1 \approx fl(x_2) - fl(x_1) = 15.3149 - 15.3148 = 0.0001$. Thus,

$$\left|\frac{(x_2 - x_1) - \bigl(fl(x_2) - fl(x_1)\bigr)}{x_2 - x_1}\right| = \frac{0.000131 - 0.0001}{0.000131} \approx 0.237 = 23.7\% \text{ relative error}.$$

This example illustrates that large relative errors can occur when two nearly equal numbers are subtracted on a computer. Sometimes an algorithm can be modified to reduce rounding error occurring from this source, as the following example illustrates.

Example 1.17

Consider finding the roots of $ax^2 + bx + c = 0$, where $b^2$ is large compared with $|4ac|$. The most common formula for the roots is

$$x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.$$

Consider $x^2 + 100x + 1 = 0$, with $\beta = 10$, $t = 4$, and $p = 2$ (4-digit chopped arithmetic). Then

$$x_1 = \frac{-100 + \sqrt{9996}}{2}, \qquad x_2 = \frac{-100 - \sqrt{9996}}{2},$$

and $\sqrt{9996} \approx 99.97$ (4-digit chopped arithmetic). Thus,

$$x_1 \approx \frac{-100 + 99.97}{2}, \qquad x_2 \approx \frac{-100 - 99.97}{2}.$$

Hence, $x_1 \approx -0.015$ and $x_2 \approx -99.98$, but $x_1 = -0.010001\ldots$ and $x_2 = -99.989999\ldots$, so the relative errors in $x_1$ and $x_2$ are 50\% and 0.01\%, respectively.

Let's change the algorithm. Assume $b \ge 0$ (we can always arrange $b \ge 0$). Then

$$x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a} = \frac{\bigl(-b + \sqrt{b^2 - 4ac}\bigr)\bigl(-b - \sqrt{b^2 - 4ac}\bigr)}{2a\bigl(-b - \sqrt{b^2 - 4ac}\bigr)} = \frac{-2c}{b + \sqrt{b^2 - 4ac}},$$

and

$$x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a} \quad \text{(the same as before)}.$$


Then, for the above values,

$$x_1 = \frac{-2(1)}{100 + \sqrt{9996}} \approx \frac{-2}{100 + 99.97} = -0.01000.$$

Now, the relative error in $x_1$ is also 0.01\%.
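The rearranged formula is straightforward to implement. The Python sketch below (`roots_stable` is a hypothetical name; real roots and $a \ne 0$ are assumed) computes the small-magnitude root without the cancellation-prone subtraction:

```python
import math

def roots_stable(a, b, c):
    """Roots of ax^2 + bx + c = 0 (real roots, a != 0), computed so that
    the smaller-magnitude root avoids the subtraction -b + sqrt(b^2 - 4ac)."""
    if b < 0:                     # arrange b >= 0; the roots are unchanged
        a, b, c = -a, -b, -c
    d = math.sqrt(b * b - 4 * a * c)
    x1 = -2 * c / (b + d)         # rationalized form: no cancellation
    x2 = (-b - d) / (2 * a)
    return x1, x2

x1, x2 = roots_stable(1.0, 100.0, 1.0)
print(x1, x2)   # close to -0.010001 and -99.989999
```

Note that $x_1 x_2 = c/a$ exactly in real arithmetic, which makes a convenient sanity check on the computed roots.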

Let us now consider error in function evaluation. Consider a single-valued function $f(x)$, and let $x^* = fl(x)$ be the floating point approximation of $x$. Therefore, the machine evaluates $f(x^*) = f(fl(x))$, which is an approximate value of $f(x)$ at $x = x^*$. The perturbation in $f(x)$ for small perturbations in $x$ can be computed via Taylor's formula. This is illustrated in the next theorem.

THEOREM 1.7

The relative error in functional evaluation is

$$\left|\frac{f(x) - f(x^*)}{f(x)}\right| \approx \left|\frac{x\, f'(x)}{f(x)}\right| \left|\frac{x - x^*}{x}\right|.$$

PROOF The linear Taylor approximation of $f(x^*)$ about $f(x)$, for small values of $|x - x^*|$, is given by $f(x^*) \approx f(x) + f'(x)(x^* - x)$. Rearranging the terms immediately yields the result.

This leads us to the following definition

DEFINITION 1.5 The condition number of a function $f(x)$ is

$$\kappa_f(x) = \left|\frac{x\, f'(x)}{f(x)}\right|.$$

The condition number describes how large the relative error in function evaluation is with respect to the relative error in the machine representation of $x$. In other words, $\kappa_f(x)$ is a measure of the degree of sensitivity of the function at $x$.

Example 1.18

Let $f(x) = \sqrt{x}$. The condition number of $f(x)$ about $x$ is

$$\kappa_f(x) = \left|\frac{x \cdot \frac{1}{2\sqrt{x}}}{\sqrt{x}}\right| = \frac{1}{2}.$$

This suggests that $f(x)$ is well-conditioned.


Example 1.19

Let $f(x) = \sqrt{x - 2}$. The condition number of $f(x)$ about $x$ is

$$\kappa_f(x) = \left|\frac{x}{2(x - 2)}\right|.$$

This is not defined at $x^* = 2$. Hence, the function $f(x)$ is numerically unstable and ill-conditioned for values of $x$ close to 2.

REMARK 1.2 If $x = f(x) = 0$, then the condition number is simply $|f'(x)|$. If $x = 0$, $f(x) \ne 0$ (or $f(x) = 0$, $x \ne 0$), then it is more useful to consider the relation between absolute errors than relative errors. The condition number then becomes $|f'(x)/f(x)|$.
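The condition number can also be estimated numerically from its definition, replacing $f'(x)$ by a finite difference. This is only a rough Python sketch; the helper name and the step size $h$ are assumptions, not a tuned implementation:

```python
import math

def condition_number(f, x, h=1e-6):
    """Numerical estimate of kappa_f(x) = |x f'(x) / f(x)| via a
    central difference; h is an assumed, untuned step size."""
    fprime = (f(x + h) - f(x - h)) / (2 * h)
    return abs(x * fprime / f(x))

print(condition_number(math.sqrt, 4.0))                    # about 0.5
print(condition_number(lambda x: math.sqrt(x - 2), 2.01))  # about 100.5
```

The two calls mirror Examples 1.18 and 1.19: $\sqrt{x}$ has $\kappa_f \equiv \frac12$, while $\sqrt{x-2}$ has $\kappa_f(2.01) = |2.01 / (2 \cdot 0.01)| = 100.5$, confirming the ill-conditioning near $x = 2$.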

REMARK 1.3 Generally, if a numerical approximation $\hat z$ to a quantity $z$ is computed, the relative error is related to the number of significant digits that are correct. For example, if $z = 0.0000123453$ and $\hat z = 0.00001234543$, we say that $\hat z$ is correct to 5 significant digits. Expressing $z$ as $0.123453 \times 10^{-4}$ and $\hat z$ as $0.1234543 \times 10^{-4}$, we see that if we round $\hat z$ to the nearest number with five digits in its mantissa, all of those digits are correct, whereas if we do the same with six digits, the sixth digit is not correct. Significant digits is the more logical way to talk about accuracy in a floating point computation, where we are interested in relative error, rather than "number of digits after the decimal point," which can have a different meaning. (Here, one might say that $\hat z$ is correct to 9 digits after the decimal point.)

1.2.2 Practicalities and the IEEE Floating Point Standard

Prior to 1985, different machines used different word lengths and different bases, and different machines rounded, chopped, or did something else to form the internal representation $fl(x)$ for real numbers $x$. For example, IBM mainframes generally used hexadecimal arithmetic ($\beta = 16$), with 8 hexadecimal digits total (for the sign, exponent, and mantissa) in "single precision" numbers and 16 hexadecimal digits total in "double precision" numbers. Machines such as the Univac 1108 and Honeywell Multics systems used base $\beta = 2$, with 36 binary digits (or "bits") total in single precision numbers and 72 binary digits total in double precision numbers. An unusual machine designed at Moscow State University from 1955–1965, the "Setun," even used base-3 ($\beta = 3$, or "ternary") numbers. Some computers had 32 bits total in single precision numbers and 64 bits total in double precision numbers, while some "supercomputers" (such as the Cray-1) had 64 bits total in single precision numbers and 128 bits total in double precision numbers.

Some hand-held calculators in existence today (such as some Texas Instruments calculators) can be viewed as implementing decimal (base 10, $\beta = 10$)


arithmetic, say, with $L = -999$, $U = 999$, and $t = 14$ digits in the mantissa.

Except for the Setun (the value of whose ternary digits corresponded to "positive," "negative," and "neutral" in circuit elements or switches), digital computers are mostly based on binary switches or circuit elements (that is, "on" or "off"), so the base $\beta$ is usually 2 or a power of 2. For example, the IBM hexadecimal digit can be viewed as a group of 4 binary digits.²

Older floating point implementations did not even always fit exactly into the model we have previously described. For example, if $x$ was a number in the system, then $-x$ may not have been a number in the system, or if $x$ was a number in the system, then $1/x$ may have been too large to be representable in the system.

To promote predictability, portability, reliability, and rigorous error bounding in floating point computations, the Institute of Electrical and Electronics Engineers (IEEE) and American National Standards Institute (ANSI) published a standard for binary floating point arithmetic in 1985: IEEE/ANSI 754-1985: Standard for Binary Floating Point Arithmetic, often referenced as "IEEE-754," or simply "the IEEE standard."³ Almost all computers in existence today, including personal computers and workstations based on Intel, AMD, Motorola, etc., chips, implement most of the IEEE standard.

In this standard, $\beta = 2$; 32 bits total are used in a single precision number (an "IEEE single"), and 64 bits total are used for a double precision number ("IEEE double"). In a single precision number, 1 bit is used for the sign, 8 bits are used for the exponent, and 23 bits are used for the mantissa (with the implicit leading bit, $t = 24$). In double precision numbers, 1 bit is used for the sign, 11 bits are used for the exponent, and 52 bits are used for the mantissa ($t = 53$). Thus, for single precision numbers, the stored exponent is between 0 and $(11111111)_2 = 255$, and 127 is subtracted from this to get an exponent between $-127$ and $128$. In IEEE numbers, the minimum and maximum exponents are used to denote special symbols (such as infinity and "unnormalized" numbers), so the exponent in single precision represents magnitudes between $2^{-126} \approx 10^{-38}$ and $2^{127} \approx 10^{38}$. The mantissa for single precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{23} 2^{-i} = 2(1 - 2^{-24}) \approx 2$. Similarly, the exponent for double precision numbers is effectively between $2^{-1022} \approx 10^{-308}$ and $2^{1023} \approx 10^{308}$, while the mantissa for double precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{52} 2^{-i} \approx 2$.

Summarizing, the parameters for IEEE arithmetic appear in Table 1.1.

In many numerical computations, such as solving the large linear systems arising from partial differential equation models, more digits or a larger exponent range is required than is available with IEEE single precision. For

²An exception is in some systems for business calculations, where base 10 is implemented.
³An update to the 1985 standard was made in 2008. This update gives clarifications of certain ambiguous points, provides certain extensions, and specifies a standard for decimal arithmetic.


TABLE 1.1: Parameters for IEEE arithmetic

  precision   β    L       U      t
  single      2    -126    127    24
  double      2    -1022   1023   53

this reason, many numerical analysts at present have adopted IEEE double precision as the default precision. For example, underlying computations in the popular computational environment MATLAB are done in IEEE double precision.

IEEE arithmetic provides four ways of defining $fl(x)$, that is, four "rounding modes," namely, "round down," "round up," "round to nearest," and "round to zero." The four elementary operations $+$, $-$, $\times$, and $\div$, as well as $\sqrt{\cdot}$, must be such that $fl(x \odot y)$ is implemented for all four rounding modes, for $\odot \in \{-, +, \times, \div\}$.

The default mode (if the rounding mode is not explicitly set) is normally "round to nearest," to give an approximation, after a long string of computations, that is hopefully near the exact value. If the mode is set to "round down" and a string of computations is done, then the result is less than or equal to the exact result. Similarly, if the mode is set to "round up," then the result of a string of computations is greater than or equal to the exact result. In this way, mathematically rigorous bounds on an exact result can be obtained. (This technique must be used astutely, since naive use could result in bounds that are too large to be meaningful.)
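The round-down/round-up technique can be sketched with decimal arithmetic, which Python's standard `decimal` module exposes with selectable rounding modes (the binary FPU rounding modes are not directly settable from pure Python, so 4-digit decimal stands in here):

```python
from decimal import Decimal, Context, ROUND_FLOOR, ROUND_CEILING

# Repeat a computation once with "round down" and once with "round up";
# the exact result is then trapped between the two machine results.
down = Context(prec=4, rounding=ROUND_FLOOR)
up = Context(prec=4, rounding=ROUND_CEILING)

xs = [Decimal("0.1234"), Decimal("0.005678"), Decimal("3.142")]
lo = hi = Decimal(0)
for x in xs:
    lo = down.add(lo, x)   # every operation rounded toward -infinity
    hi = up.add(hi, x)     # every operation rounded toward +infinity

exact = Decimal("3.271078")  # the exact sum of xs
assert lo <= exact <= hi
print(lo, hi)                # 3.271 and 3.272
```

This is exactly the "string of computations" idea: the downward-rounded run underestimates, the upward-rounded run overestimates, and together they bracket the exact result.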

Several parameters more directly related to numerical computations than $L$, $U$, and $t$ are associated with any floating point number system. These are:

HUGE: the largest representable number in the floating point system;

TINY: the smallest positive representable number in the floating point system;

$\epsilon_m$: the machine epsilon, the smallest positive number which, when added to 1, gives something other than 1 when using the rounding mode "round to nearest."

These so-called "machine constants" appear in Table 1.2 for the IEEE single and IEEE double precision number systems.

For IEEE arithmetic, $1/\text{TINY} < \text{HUGE}$, but $1/\text{HUGE} < \text{TINY}$. This brings up the question of what happens when the result of a computation has absolute value less than the smallest number representable in the system, or has absolute value greater than the largest number representable in the system. In the first case, an underflow occurs, while in the second case, an overflow occurs. In floating point computations, it is usually (but not always) reasonable to replace the result of an underflow by 0, but it is usually more problematical


TABLE 1.2: Machine constants for IEEE arithmetic

  Precision   HUGE                                  TINY                              εₘ
  single      (2 − 2⁻²³)·2¹²⁷ ≈ 3.40·10³⁸           2⁻¹²⁶ ≈ 1.18·10⁻³⁸                2⁻²⁴ + 2⁻⁴⁷ ≈ 5.96·10⁻⁸
  double      (2 − 2⁻⁵²)·2¹⁰²³ ≈ 1.79·10³⁰⁸         2⁻¹⁰²² ≈ 2.23·10⁻³⁰⁸              2⁻⁵³ + 2⁻¹⁰⁵ ≈ 1.11·10⁻¹⁶

when an overflow occurs. Many systems prior to the IEEE standard replaced an underflow by 0, but stopped when an overflow occurred.

The IEEE standard specifies representations for special numbers $\infty$, $-\infty$, $+0$, $-0$, and NaN, where the latter represents "not a number." The standard specifies that computations do not stop when an overflow or underflow occurs, or when quantities such as $1/0$, $\infty - \infty$, etc. are encountered (although many programming languages, by default or optionally, do stop). For example, the result of an overflow is set to $\infty$, whereas the result of $\sqrt{-1}$ is set to NaN, and computation continues. The standard also specifies "gradual underflow," that is, setting the result to a "denormalized" number, or a number in the floating point format whose first digit in the mantissa is equal to 0. Computation rules for these special numbers, such as $\text{NaN} \times \text{any number} = \text{NaN}$ and $\infty \times \text{any positive normalized number} = \infty$, allow such "nonstop" arithmetic.

Although the IEEE nonstop arithmetic is useful in many contexts, the numerical analyst should be aware of it and be cautious in interpreting results. In particular, algorithms may not behave as expected if many intermediate results contain $\infty$ or NaN, and the accuracy is less than expected when denormalized numbers are used. In fact, many programming languages, by default or with a controllable option, stop if $\infty$ or NaN occurs, but implement IEEE nonstop arithmetic as an option.

Example 1.20

IEEE double precision floating point arithmetic underlies most computations in MATLAB. (This is true even if only four decimal digits are displayed.) One obtains the machine epsilon with the function eps, one obtains TINY with the function realmin, and one obtains HUGE with the function realmax. Observe the following MATLAB dialog:

>> epsm = eps(1d0)
epsm = 2.2204e-016
>> TINY = realmin
TINY = 2.2251e-308
>> HUGE = realmax
HUGE = 1.7977e+308
>> 1/TINY
ans = 4.4942e+307
>> 1/HUGE
ans = 5.5627e-309
>> HUGE^2
ans = Inf
>> TINY^2
ans = 0
>> new_val = 1+epsm
new_val = 1.0000
>> new_val - 1
ans = 2.2204e-016
>> too_small = epsm/2
too_small = 1.1102e-016
>> not_new = 1+too_small
not_new = 1
>> not_new - 1
ans = 0

Example 1.21

(Illustration of underflow and overflow) Suppose, for the purposes of illustration, we have a system with $\beta = 10$, $t = 2$, and one digit in the exponent, so that the positive numbers in the system range from $0.10 \times 10^{-9}$ to $0.99 \times 10^9$, and suppose we wish to compute $N = \sqrt{x_1^2 + x_2^2}$, where $x_1 = x_2 = 10^6$. Then both $x_1$ and $x_2$ are exactly represented in the system, and the nearest floating point number in the system to $N$ is $0.14 \times 10^7$, well within range. However, $x_1^2 = 10^{12}$, larger than the maximum floating point number in the system.

In older systems, an overflow usually would result in stopping the computation, while in IEEE arithmetic, the result would be assigned the symbol "Infinity." The result of adding "Infinity" to "Infinity," then taking the square root, would be "Infinity," so that $N$ would be assigned "Infinity." Similarly, if $x_1 = x_2 = 10^{-6}$, then $x_1^2 = 10^{-12}$, smaller than the smallest representable machine number, causing an "underflow." On older systems, the result is usually set to 0. On IEEE systems, if "gradual underflow" is switched on, the result either becomes a denormalized number with less than full accuracy, or is set to 0; without gradual underflow on IEEE systems, the result is set to 0. When the result is set to 0, a value of 0 is stored in $N$, whereas the closest floating point number in the system is $0.14 \times 10^{-5}$, well within range. To avoid this type of catastrophic underflow and overflow in the computation of $N$, we may use the following scheme:

1. $s \leftarrow \max\{|x_1|, |x_2|\}$,
2. $\eta_1 \leftarrow x_1/s$, $\eta_2 \leftarrow x_2/s$,
3. $N \leftarrow s\sqrt{\eta_1^2 + \eta_2^2}$.
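The three-step scheme translates directly into code. A Python sketch (`safe_norm2` is a hypothetical name for this illustration):

```python
import math

def safe_norm2(x1, x2):
    """sqrt(x1^2 + x2^2) via the scaling in steps 1-3 above, so that
    squaring cannot overflow or underflow for extreme |x1|, |x2|."""
    s = max(abs(x1), abs(x2))
    if s == 0.0:
        return 0.0
    eta1, eta2 = x1 / s, x2 / s        # both scaled into [-1, 1]
    return s * math.sqrt(eta1 * eta1 + eta2 * eta2)

print(safe_norm2(1e200, 1e200))        # about 1.414e200, although 1e200 * 1e200 is inf
```

In IEEE double precision, squaring $10^{200}$ directly overflows to Inf, yet the scaled computation returns the correct magnitude.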


1.2.2.1 Input and Output

For examining the output of large numerical computations arising from mathematical models, plots, graphs, and movies comprised of such plots and graphs are often preferred over tables of values. However, to develop such models and study numerical algorithms, it is necessary to examine individual numbers. Because humans are trained to comprehend decimal numbers more easily than binary numbers, the binary format used in the machine is usually converted to a decimal format for display or printing. In many programming languages and environments (such as all versions of Fortran, C, C++, and in MATLAB), the format is of a form similar to $\pm d_1.d_2d_3\cdots d_m\mathrm{e}{\pm}\delta_1\delta_2\delta_3$ or $\pm d_1.d_2d_3\cdots d_m\mathrm{E}{\pm}\delta_1\delta_2\delta_3$, where the "e" or "E" denotes the "exponent" of 10. For example, -1.00e+003 denotes $-1 \times 10^3 = -1000$. Numbers are usually also input either in a standard decimal form (such as 0.001) or in this exponential format (such as 1.0e-3). (This notation originates from the earliest computers, where the only output was a printer, and the printer could only print numerical digits and the 26 upper case letters in the Roman alphabet.)

Thus, for input, a decimal fraction needs to be converted to a binary floating point number, while for output, a binary floating point number needs to be converted to a decimal fraction. This conversion necessarily is inexact. For example, the exact decimal fraction 0.1 converts to the infinitely repeating binary expansion $(0.0\overline{0011})_2$, which needs to be rounded into the binary floating point system. The IEEE 754 standard specifies that the result of a decimal to binary conversion, within a specified range of input formats, be the nearest floating point number to the exact result, over a specified range, and that, within a specified range of formats, a binary to decimal conversion be the nearest number in the specified format (which depends on the number $m$ of decimal digits requested to be printed).

Thus, the number that one sees as output is usually not exactly the number that is represented in the computer. Furthermore, while the floating point operations on binary numbers are usually implemented in hardware or "firmware," independently of the software system, the decimal to binary and binary to decimal conversions are usually implemented separately, as part of the programming language (such as Fortran, C, C++, Java, etc.) or software system (such as MATLAB). The individual standards for these languages, if there are any, may not specify accuracy for such conversions, and the languages sometimes do not conform to the IEEE standard. That is, the number that one sees printed may not even be the closest number in that format to the actual number.

This inexactness in conversion usually does not cause a problem, but may cause much confusion in certain instances. In those instances (such as in "debugging," or finding programming blunders), one may need to examine the binary numbers directly. One way of doing this is in an "octal," or base-8, format, in which each digit (between 0 and 7) is interpreted as a group of


three binary digits, or in hexadecimal format (where the digits are 0–9, A, B, C, D, E, F), in which each digit corresponds to a group of four binary digits.
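In many languages one can dump the underlying bits directly. A Python sketch that prints an IEEE double as 16 hexadecimal digits (each digit a group of four bits):

```python
import struct

def double_bits_hex(x):
    """The 64 bits of an IEEE double, as 16 hexadecimal digits (big-endian)."""
    return struct.pack(">d", x).hex()

print(double_bits_hex(1.0))   # 3ff0000000000000
print(double_bits_hex(0.1))   # 3fb999999999999a
```

The pattern 999...9a in the bits of 0.1 is the rounded tail of the repeating binary expansion discussed above.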

1.2.2.2 Standard Functions

To enable accurate computation of elementary functions such as sin, cos, and exp, IEEE 754 specifies that a "long" 80-bit register (with "guard digits") be available for intermediate computations. Furthermore, IEEE 754-2008, an official update to IEEE 754-1985, provides a list of functions it recommends be implemented, and specifies accuracy requirements (in terms of correct rounding) for those functions a programming language elects to implement.

REMARK 1.4 Alternative number systems, such as variable precision arithmetic, multiple precision arithmetic, rational arithmetic, and combinations of approximate and symbolic arithmetic, have been investigated and implemented. These have various advantages over the traditional floating point arithmetic we have been discussing, but also have disadvantages, and usually require more time, more circuitry, or both. Eventually, with the advance of computer hardware and better understanding of these alternative systems, their use may become more ubiquitous. However, for the foreseeable future, traditional floating point number systems will be the primary tool in numerical computations.

1.3 Interval Computations

Interval computations are useful for two main purposes:

• to use floating point computations to compute mathematically rigorous bounds on an exact result (and hence to rigorously bound roundoff error);

• to use floating point computations to compute mathematically rigorous bounds on the ranges of functions over boxes.

In complicated traditional floating point algorithms, naive arrangement of interval computations usually gives bounds that are too wide to be of practical use. For this reason, interval computations have been ignored by many. However, used cleverly and where appropriate, interval computations are powerful, and provide rigor and validation when other techniques cannot.

Interval computations are based on interval arithmetic.


1.3.1 Interval Arithmetic

In interval arithmetic, we define operations on intervals, which can be considered as ordered pairs of real numbers. We can think of each interval as representing the range of possible values of a quantity. The result of an operation is then an interval that represents the range of possible results of the operation, as the first argument ranges over all points in the first interval and the second argument ranges over all values in the second interval. To state this symbolically, let $\boldsymbol{x} = [\underline{x}, \overline{x}]$ and $\boldsymbol{y} = [\underline{y}, \overline{y}]$, and define the four elementary operations by

$$\boldsymbol{x} \odot \boldsymbol{y} = \{x \odot y \mid x \in \boldsymbol{x} \text{ and } y \in \boldsymbol{y}\} \quad \text{for } \odot \in \{+, -, \times, \div\}. \tag{1.1}$$

Interval arithmetic's usefulness derives from the fact that the mathematical characterization in Equation (1.1) is equivalent to the following operational definitions:

$$
\begin{aligned}
\boldsymbol{x} + \boldsymbol{y} &= [\underline{x} + \underline{y},\ \overline{x} + \overline{y}],\\
\boldsymbol{x} - \boldsymbol{y} &= [\underline{x} - \overline{y},\ \overline{x} - \underline{y}],\\
\boldsymbol{x} \times \boldsymbol{y} &= \bigl[\min\{\underline{x}\underline{y}, \underline{x}\overline{y}, \overline{x}\underline{y}, \overline{x}\overline{y}\},\ \max\{\underline{x}\underline{y}, \underline{x}\overline{y}, \overline{x}\underline{y}, \overline{x}\overline{y}\}\bigr],\\
\frac{1}{\boldsymbol{x}} &= \left[\frac{1}{\overline{x}},\ \frac{1}{\underline{x}}\right] \quad \text{if } \underline{x} > 0 \text{ or } \overline{x} < 0,\\
\boldsymbol{x} \div \boldsymbol{y} &= \boldsymbol{x} \times \frac{1}{\boldsymbol{y}}.
\end{aligned} \tag{1.2}
$$

The ranges of the four elementary interval arithmetic operations are exactly the ranges of the corresponding real operations, but if such operations are composed, bounds on the ranges of real functions can be obtained. For example, if

$$f(x) = (x + 1)(x - 1), \tag{1.3}$$

then

$$f([-2, 2]) = \bigl([-2, 2] + 1\bigr)\bigl([-2, 2] - 1\bigr) = [-1, 3] \times [-3, 1] = [-9, 3],$$

which contains the exact range $[-1, 3]$.
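The operational definitions (1.2) are simple to implement for finite end points. The minimal Python sketch below (class and names are ad hoc; outward rounding and division are omitted) reproduces the enclosure $[-9, 3]$ for (1.3):

```python
class Interval:
    """Minimal interval type implementing the operational definitions (1.2);
    outward rounding of end points is omitted in this sketch."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        p = [self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi]
        return Interval(min(p), max(p))

    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"

x, one = Interval(-2, 2), Interval(1, 1)
print((x + one) * (x - one))   # [-9, 3]: wider than the exact range [-1, 3]
```

A production implementation would round each lower end point down and each upper end point up, as discussed below.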

REMARK 1.5 In some definitions of interval arithmetic, division by intervals containing 0 is defined, consistent with (1.1). For example,

$$\frac{[1, 2]}{[-3, 4]} = \left[-\infty, -\frac{1}{3}\right] \cup \left[\frac{1}{4}, \infty\right] = \mathbb{R}^* \setminus \left(-\frac{1}{3}, \frac{1}{4}\right),$$

where $\mathbb{R}^*$ is the extended real number system,⁴ consisting of the real numbers with the two additional numbers $-\infty$ and $\infty$. This extended interval arithmetic⁵ was originally invented by William Kahan⁶ for computations with continued fractions, but has wider use than that. Although a closed system can be defined for the sets arising from this extended arithmetic, typically the complements of intervals (i.e., the unions of two semi-infinite intervals) are immediately intersected with intervals to obtain zero, one, or two intervals. Interval arithmetic can then proceed using (1.2).

⁴also known as the two-point compactification of the real numbers

The power of interval arithmetic lies in its implementation on computers. In particular, outwardly rounded interval arithmetic allows rigorous enclosures for the ranges of operations and functions. This makes a qualitative difference in scientific computations, since the results are now intervals in which the exact result must lie. It also enables the use of floating point computations for automated theorem proving.

Outward rounding can be implemented on any machine that has downward rounding and upward rounding, such as any machine that complies with the IEEE 754 standard. For example, take $\boldsymbol{x} + \boldsymbol{y} = [\underline{x} + \underline{y},\ \overline{x} + \overline{y}]$. If $\underline{x} + \underline{y}$ is computed with downward rounding, and $\overline{x} + \overline{y}$ is computed with upward rounding, then the resulting interval $\boldsymbol{z} = [\underline{z}, \overline{z}]$ that is represented in the machine must contain the exact range of $x + y$ for $x \in \boldsymbol{x}$ and $y \in \boldsymbol{y}$. We call the expansion of the interval from rounding the lower end point down and the upper end point up roundout error.

Interval arithmetic is only subdistributive. That is, if $\boldsymbol{x}$, $\boldsymbol{y}$, and $\boldsymbol{z}$ are intervals, then

$$\boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) \subseteq \boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z}, \quad \text{but } \boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) \ne \boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z} \text{ in general}. \tag{1.4}$$

As a result, algebraic expressions that would be equivalent if real values were substituted for the variables are not equivalent if interval values are used. For example, suppose that, instead of writing $(x + 1)(x - 1)$ for $f(x)$ as in (1.3), we write

$$f(x) = x^2 - 1, \tag{1.5}$$

and suppose we provide a routine that computes an enclosure for the range of $x^2$ that is the exact range to within roundoff error. Such a routine could be as follows.

⁵There are small differences in current definitions of extended interval arithmetic. For example, in some systems, $-\infty$ and $\infty$ are not considered numbers, but just descriptive symbols. In those systems, $[1, 2]/[-3, 4] = (-\infty, -1/3] \cup [1/4, \infty) = \mathbb{R} \setminus (-1/3, 1/4)$. See [31] for a theoretical analysis of extended arithmetic.
⁶who also was a major contributor to the IEEE 754 standard

ALGORITHM 1.1

(Computing an interval whose end points are machine numbers and which encloses the range of $x^2$)

INPUT: $\boldsymbol{x} = [\underline{x}, \overline{x}]$.

OUTPUT: a machine-representable interval that contains the range of $x^2$ over $\boldsymbol{x}$.

IF $\underline{x} \ge 0$ THEN
  RETURN $[\underline{x}^2, \overline{x}^2]$, where $\underline{x}^2$ is computed with downward rounding and $\overline{x}^2$ is computed with upward rounding.
ELSE IF $\overline{x} \le 0$ THEN
  RETURN $[\overline{x}^2, \underline{x}^2]$, where $\overline{x}^2$ is computed with downward rounding and $\underline{x}^2$ is computed with upward rounding.
ELSE
  1. Compute $\underline{x}^2$ and $\overline{x}^2$ with both downward and upward rounding; that is, compute machine representable numbers $\underline{x}^2_l$ and $\underline{x}^2_u$ such that $\underline{x}^2 \in [\underline{x}^2_l, \underline{x}^2_u]$, and machine representable numbers $\overline{x}^2_l$ and $\overline{x}^2_u$ such that $\overline{x}^2 \in [\overline{x}^2_l, \overline{x}^2_u]$.
  2. RETURN $[0, \max\{\underline{x}^2_u, \overline{x}^2_u\}]$.
END IF

END ALGORITHM 1.1.

With Algorithm 1.1, and rewriting $f(x)$ from (1.3) as in (1.5), we obtain

$$f([-2, 2]) = [-2, 2]^2 - 1 = [0, 4] - 1 = [-1, 3],$$

which in this case is equal to the exact range of $f$ over $[-2, 2]$.

In fact, this illustrates a general principle: if each variable in the expression occurs only once, then interval arithmetic gives the exact range, to within roundout error. We state this formally as follows.

THEOREM 1.8

(Fundamental theorem of interval arithmetic) Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with interval arithmetic. The algebraic expression or computer program can contain the four elementary operations and operations such as $x^n$, $\sin(x)$, $\exp(x)$, $\log(x)$, etc., as long as the interval values of these functions contain their ranges over the input intervals. Then:

1. The interval value $f(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ contains the range of $f$ over the interval vector (or box) $(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$.


2. If the single functions (the elementary operations and functions $x^n$, etc.) have interval values that represent their exact ranges, and if each variable $x_i$, $1 \le i \le n$, occurs only once in the expression for $f$, then the values of $f$ obtained by interval arithmetic represent the exact ranges of $f$ over the input intervals.

If the expression for $f$ contains one or more variables more than once, then overestimation of the range can occur, due to interval dependency. For example, when we evaluate our example function $f([-2, 2])$ according to (1.3), the first factor, $[-1, 3]$, is the exact range of $x + 1$ for $x \in [-2, 2]$, while the second factor, $[-3, 1]$, is the exact range of $x - 1$ for $x \in [-2, 2]$. Thus, $[-9, 3]$ is the exact range of $\tilde f(x_1, x_2) = (x_1 + 1)(x_2 - 1)$ for $x_1$ and $x_2$ independent, $x_1 \in [-2, 2]$, $x_2 \in [-2, 2]$.

We now present some definitions and theorems to clarify the practical consequences of interval dependency.

DEFINITION 1.6 An expression for $f(x_1, \ldots, x_n)$ which is written so that each variable occurs only once is called a single use expression, or SUE.

Fortunately, we do not need to transform every expression into a single use expression for interval computations to be of value. In particular, the interval dependency becomes less as the widths of the input intervals become smaller. The following formal definition will help us to describe this precisely.

DEFINITION 1.7 Suppose an interval evaluation $f(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ gives $[a, b]$ as a result interval, but the exact range $\{f(x_1, \ldots, x_n) \mid x_i \in \boldsymbol{x}_i,\ 1 \le i \le n\}$ is $[c, d] \subseteq [a, b]$. We define the excess width $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ in the interval evaluation $f(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ by $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = (c - a) + (b - d)$.

For example, the excess width in evaluating $f(x)$, represented as $(x + 1)(x - 1)$, over $\boldsymbol{x} = [-2, 2]$ is $(-1 - (-9)) + (3 - 3) = 8$. In general, we have:

THEOREM 1.9

Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with interval arithmetic, as in Theorem 1.8, to obtain an interval enclosure $f(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ of the range of $f$ for $x_i \in \boldsymbol{x}_i$, $1 \le i \le n$. Then, if $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ is as in Definition 1.7, we have

$$E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = O\left(\max_{1 \le i \le n} w(\boldsymbol{x}_i)\right),$$

where $w(\boldsymbol{x})$ denotes the width of the interval $\boldsymbol{x}$.


That is, the overestimation becomes less as the uncertainty in the arguments to the function becomes smaller.

Interval evaluations as in Theorem 1.9 are termed first-order interval extensions. It is not difficult to obtain second-order extensions, where required. (See Exercise below.)

1.3.2 Application of Interval Arithmetic: Examples

We give one such example here.

Example 1.22

Using 4-digit decimal floating point arithmetic, compute an interval enclosure for the first two digits of e, and prove that these two digits are correct.

Solution: The fifth-degree Taylor polynomial representation for e is

e = 1 + 1 + 1/2! + 1/3! + 1/4! + 1/5! + (1/6!) e^ξ

for some ξ ∈ [0, 1]. If we assume we know e < 3, and we assume we know e^x is an increasing function of x, then the error term is bounded by

| (1/6!) e^ξ | ≤ 3/6! < 0.005,

so this fifth-degree polynomial representation should be adequate. We will evaluate each term with interval arithmetic, and we will replace e^ξ with [1, 3]. We obtain the following computation:

[1.000, 1.000] + [1.000, 1.000] → [2.000, 2.000]
[1.000, 1.000] / [2.000, 2.000] → [0.5000, 0.5000]
[2.000, 2.000] + [0.5000, 0.5000] → [2.500, 2.500]
[1.000, 1.000] / [6.000, 6.000] → [0.1666, 0.1667]
[2.500, 2.500] + [0.1666, 0.1667] → [2.666, 2.667]
[1.000, 1.000] / [24.00, 24.00] → [0.04166, 0.04167]
[2.666, 2.667] + [0.04166, 0.04167] → [2.707, 2.709]
[1.000, 1.000] / [120.0, 120.0] → [0.008333, 0.008334]
[2.707, 2.709] + [0.008333, 0.008334] → [2.715, 2.718]
[1.000, 1.000] / [720.0, 720.0] → [0.001388, 0.001389]
[0.001388, 0.001389] × [1, 3] → [0.001388, 0.004167]
[2.715, 2.718] + [0.001388, 0.004167] → [2.716, 2.723]

Since we used outward rounding in these computations, this constitutes a mathematical proof that e ∈ [2.716, 2.723]. Note:


30 Applied Numerical Methods

1. These computations can be done automatically on a computer, as simply as evaluating the function in floating point arithmetic. We will explain some programming techniques for this in Chapter 6, Section 6.2.

2. The solution is illustrative. More sophisticated methods, such as argument reduction, would be used in practice to bound values of e^x more accurately and with fewer operations.
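The same enclosure can also be checked with exact rational arithmetic in place of 4-digit outward rounding. The Python sketch below (standard fractions module, not interval software) bounds the remainder term by substituting 1 and 3 for e^ξ:

```python
from fractions import Fraction
from math import factorial, e

# Degree-5 Taylor polynomial for e, plus the remainder (1/6!) e^xi with
# e^xi in [1, 3], all in exact rational arithmetic.
poly = sum(Fraction(1, factorial(k)) for k in range(6))  # 1 + 1 + 1/2! + ... + 1/5!
lo = poly + Fraction(1, 720)        # remainder with e^xi = 1
hi = poly + Fraction(3, 720)        # remainder with e^xi = 3
print(float(lo), float(hi))         # about 2.71806 and 2.72083
```

These exact bounds lie inside the outwardly rounded enclosure [2.716, 2.723] obtained above, and they contain e.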

Proofs of the theorems, as well as greater detail, appear in various texts on interval arithmetic. A good book on interval arithmetic is R. E. Moore's classic text [27], although numerous more recent monographs and reviews are available. A World Wide Web search on the term "interval computations" will lead to some of these.

A general introduction to interval computations is [26]. That work gives not only a complete introduction, with numerous examples and explanation of pitfalls, but also provides examples with intlab, a free matlab toolbox for interval computations, and reference material for intlab. If you have matlab available, we recommend intlab for the exercises in this book involving interval computations.

1.4 Programming Environments

Modern scientific computing (with floating point numbers) is usually done with high-level "imperative" (as opposed to "functional") programming languages. Common programming environments in use for general scientific computing today are Fortran (or FORmula TRANslation), C/C++, and matlab. Fortran is the original such language, with its origins in the late 1950's. There is a large body of high-quality, publicly available software in Fortran for common computations in numerical analysis and scientific computing. Such software can be found, for example, on NETLIB, at

http://www.netlib.org

Fortran has evolved over the years, becoming a modern, multi-faceted language with the Fortran 2003 standard. Throughout, the emphasis by both the standardization committee and suppliers of compilers for Fortran has been on features that simplify programming of solutions to large problems in numerical analysis and scientific computing, and on features that enable high performance, especially on computers that can process vectors and matrices efficiently.

The "C" language, originally developed in conjunction with the Unix operating system, was originally meant to be a higher-level language for designing and accessing the operating system, but has become more ubiquitous since then. C++, appearing in the late 1980's, was the first widely available language7 to allow the object-oriented programming paradigm. In recent years, computer science departments have favored teaching C++ over teaching Fortran, and Fortran has fallen out of favor in relative terms. However, Fortran is still favored in certain large-scale applications, such as fluid dynamics (e.g., in weather prediction and similar simulations), and some courses are still offered in it in engineering schools. However, some people still think of Fortran as the now somewhat rudimentary language known as FORTRAN 77.

Reasonably high-quality compilers for both Fortran and C/C++ are available free of charge with Linux operating systems. Fortran 2003, largely implemented in these compilers, has a standardized interface to C, so functions written in C can be called from Fortran programs, and vice versa. These compilers include interactive graphical-user-interface-oriented debuggers, such as "insight," available with the Linux operating system. Commercially available compilation and debugging systems are also available under Windows.

The matlab system has become increasingly popular over the last two decades or so. matlab (or MATrix LABoratory) began in the early 1980's as a National Science Foundation project, written by Cleve Moler in FORTRAN 66, to provide an interactive environment for computing with matrices and vectors, but has since evolved to be both an interactive environment and a full-featured programming language. matlab is highly favored in courses such as this because of the ease of programming, debugging, and general use (such as graphing), and because of the numerous toolboxes, supplied by both Mathworks (Cleve Moler's company) and others, for many basic computing tasks and applications. The main drawback to use of matlab in all scientific computations is that the language is interpretive; that is, matlab translates each line of a program to machine language each time that it executes the line. This makes complicated programs that involve nested iterations much slower (a factor of 60 or more) than comparable programs written in a compiled language such as C or Fortran. However, functions compiled from Fortran or C/C++ can be called from matlab. A common strategy has been to initially develop algorithms in matlab, then translate all or part of the program to a compilable language, as necessary for efficiency.

One perceived disadvantage of matlab is that it is proprietary. Undesirable possible consequences are that it is not free, and that there is no official guarantee that it will be available forever, unchanged. However, its use has become so widespread in recent years that these concerns do not seem to be major. Several projects, including "Octave" and "Scilab," have produced free products that partially support the matlab programming language. The most widely distributed of these, "Octave," is integrated into Linux systems. However, the object-oriented features of Octave are rudimentary compared to those of matlab, and some toolboxes, such as intlab (which we will mention later), will not function with Octave.

7 With others, including Fortran, to follow.


Alternative systems sometimes used for scientific computing are computer algebra systems. Perhaps the most common of these are Mathematica and Maple, while a free such system under development is "SAGE." These systems admit a different way of thinking about programming, termed functional programming, in which rules are defined and available at all times to automatically simplify expressions that are presented to the system. (In contrast, in imperative programming, a sequence of commands is executed one after the other.) Although these systems have become comprehensive, they are based in computations of a different character, rather than in the floating point computations and linear algebra involved in numerical analysis and scientific computing.

We will use matlab in this book to illustrate the concepts, techniques, and applications. With newer versions of matlab, a student can study how to use the system and make programs largely by using the matlab help system. The first place to turn will be the "Getting started" demos, which in newer versions are presented as videos. There are also many books devoted to use of matlab. Furthermore, we will be giving examples throughout this book.

matlab programs can be written as matlab scripts and matlab functions.

Example 1.23
The matlab script we used to produce the table following Example 1.7 (on page 8) is:

a = 2;
x = 2;
xold = x;
err_old = 1;
for k=0:10
   k
   x
   err = x - sqrt(2);
   err
   ratio = err/err_old^2
   err_old = err;
   x = x/2 + 1/x;
end

Example 1.24

The matlab script we used to produce the table in Example 1.8 (on page 9) is:

format long
a = 2;
x = 2;
xold = x;
err_old = 1;
for k=0:25
   k
   x
   err = x - sqrt(2);
   err
   ratio = err/err_old
   err_old = err;
   x = x - x^2/3.5 + 2/3.5;
end

An excellent alternative textbook that focuses on matlab functions is Cleve Moler's Numerical Computing with Matlab [25]. An on-line version, along with ".m" files, etc., is currently available at http://www.mathworks.com/moler/chapters.html.

1.5 Applications

The purpose of the methods and techniques in this book, ultimately, is to provide both accurate predictions and insight into practical problems. This includes understanding and predicting, and managing or controlling, the evolution of ecological systems and epidemics; designing and constructing durable but inexpensive bridges, buildings, roads, and water control structures; understanding chemical and physical processes; designing chemical plants and electronic components and systems; minimizing costs or maximizing delivery of products or services within companies and governments; etc. To achieve these goals, the numerical methods are a small part of the overall modeling process, which can be viewed as consisting of the following steps:

Identify the problem: This is the first step in translating an often vague situation into a mathematical problem to be solved. What questions must be answered, and how can they be quantified?

Assumptions: Which factors are to be ignored, and which are important? The real world is usually significantly more complicated than mathematical models of it, and simplifications must be made, because some factors are poorly understood, because there isn't enough data to determine some minor factors, or because it is not practical to accurately solve the resulting equations unless the model is simplified. For example, the theory of relativity and variations in the acceleration of gravity (due to the fact that the earth is not exactly round, and due to the fact that the density differs from point to point on the surface of the earth) in principle will affect the trajectory of a baseball as it leaves the bat. However, such effects can be ignored when we write down a model of the trajectory of the baseball. On the other hand, we need to include such effects if we are measuring the change in distance between two satellites in a tandem orbit to detect the location of mineral deposits on the surface of the earth.

Construction: In this step, we actually translate the problem into mathematical language.

Analysis: We solve the mathematical problem. Here is where the numerical techniques in this book come into play. With more complicated models, there is an interplay between the previous three steps and this solution process: we may need to simplify the process to enable practical solution. Presentation of the result is also important here, to maximize the usefulness of the results. In the early days of scientific computing, printouts of numbers were used, but increasingly, results are presented as two- and three-dimensional graphs and movies.

Interpretation: The numerical solution is compared to the original problem. If it does not make sense, go back and reformulate the assumptions.

Validation: Compare the model to real data. For example, in climate models, the model might be used to predict climate changes in past years, before it is used to predict future climate changes.

Note that there is an intrinsic error introduced in the modeling process (such as when certain phenomena are ignored) that is outside the scope of our study of numerical methods. Such error can only be measured indirectly, through the interpretation and validation steps. In the model solution process (the "analysis" step), errors are also introduced due to roundoff error and the approximation process. We have seen that such error consists of approximation error and roundoff error. In a study of numerical methods and numerical analysis, we quantify and find bounds on such errors. Although this may not be the major source of error, it is important to know. Consequences of this type of error might be that a good model is rejected, that incorrect conclusions are deduced about the process being modeled, etc. The authors of this book have personal experience with these events.

Errors in the modeling process can sometimes be quantified in the solution process. If the model depends on parameters that are not known precisely, but bounds on those parameters are known, knowledge of these bounds can sometimes be incorporated into the mathematical equations, and the set of possible solutions can sometimes be computed or bounded. One tool that sometimes works is interval arithmetic. Other tools, less mathematically definite but applicable in different situations, are statistical methods and computing solutions to the model for many different values of the parameter.

Throughout this book, we introduce applications from many areas.

Example 1.25

The formula for the net capacitance when two capacitors of values x and y are connected in series is

z = xy / (x + y).

Suppose the measured values of x and y are x = 1 and y = 2, respectively. Estimate the range of possible values of z, given that the true values of x and y are known to be within ±10% of the measured values.

In this example, the identification, assumptions, and construction have already been done. (It is well known how capacitances in a linear electrical circuit behave.) We are asked to analyze the error in the output of the computation, due to errors in the data. We may proceed using interval arithmetic, relying on the accuracy assumptions for the measured values. In particular, these assumptions imply that x ∈ [0.9, 1.1] and y ∈ [1.8, 2.2]. We will plug these intervals into the expression for z, but we first use Theorem 1.8, part (2), as a guide to rewrite the expression for z so x and y only occur once. (We do this so we obtain sharp bounds on the range, without overestimation.) Dividing the numerator and denominator for z by xy, we obtain

z = 1 / (1/x + 1/y).

We use the intlab toolbox8 for matlab to evaluate z. We have the following dialog in matlab's command window:

>> intvalinit('DisplayInfsup')
===> Default display of intervals by infimum/supremum
>> x = intval('[0.9,1.1]')
intval x =
[    0.8999,    1.1001]
>> y = intval('[1.8,2.2]')
intval y =
[    1.7999,    2.2001]
>> z = 1/(1/x + 1/y)
intval z =
[    0.5999,    0.7334]
>> format long
>> z
intval z =
[   0.59999999999999,   0.73333333333334]
>>

Thus, the capacitance must lie between 0.5999 and 0.7334.

Note that x and y are input as strings. This is to assure that roundoff errors in converting the decimal expressions 0.9, 1.1, 1.8, and 2.2 into internal binary format are taken into account. See [26] for more examples of the use of intlab.

8 If one has matlab, intlab is available free of charge for non-commercial use from http://www.ti3.tu-harburg.de/~rump/intlab.
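For readers without intlab, the spirit of this computation can be imitated with a naive Python sketch (hypothetical helper functions, no outward rounding, so unlike intlab the bounds are not rigorously guaranteed):

```python
# Naive interval evaluation of z = 1/(1/x + 1/y) for x = [0.9, 1.1],
# y = [1.8, 2.2]; valid since neither interval contains 0.
def recip(a):                      # 1/[lo, hi] when 0 is not in [lo, hi]
    return (1.0 / a[1], 1.0 / a[0])

def iadd(a, b):
    return (a[0] + b[0], a[1] + b[1])

x = (0.9, 1.1)
y = (1.8, 2.2)
z = recip(iadd(recip(x), recip(y)))
print(z)                           # close to (0.6, 0.7333...)
```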

1.6 Exercises

1. Write down a polynomial p(x) such that |S(x) − p(x)| ≤ 10^-10 for −0.2 ≤ x ≤ 0.2, where

S(x) = { sin(x)/x  if x ≠ 0,
       { 1         if x = 0.

Note: sinc(x) = S(πx) = sin(πx)/(πx) is the "sinc" function (well-known in signal processing, etc.).

(a) Show that your polynomial p satisfies the condition |sinc(x) − p(x)| ≤ 10^-10 for x ∈ [−0.2, 0.2].
Hint: You can obtain polynomial approximations with error terms for sinc(x) by writing down Taylor polynomials and corresponding error terms for sin(x), then dividing these by x. This can be easier than trying to differentiate sinc(x). For the proof part, you can use, for example, the Taylor polynomial remainder formula or the alternating series test, and you can use interval arithmetic to obtain bounds.

(b) Plot your polynomial approximation and sinc(x) on the same graph,

(i) over the interval [−0.2, 0.2],

(ii) over the interval [−3, 3],

(iii) over the interval [−10, 10].

2. Suppose f has a continuous third derivative. Show that

| (f(x + h) − f(x − h)) / (2h) − f′(x) | = O(h^2).

3. Suppose f has a continuous fourth derivative. Show that

| (f(x + h) − 2f(x) + f(x − h)) / h^2 − f″(x) | = O(h^2).


4. Let a = 0.41, b = 0.36, and c = 0.70. Assuming a 2-digit decimal computer arithmetic with rounding, show that (a − b)/c ≠ a/c − b/c when using this arithmetic.

5. Write down a formula relating the unit roundoff δ of Definition 1.3 (page 14) and the machine epsilon ϵ_m defined on page 20.

6. Store and run the following matlab script. What are your results? What does the script compute? Can you say anything about the computer arithmetic underlying matlab?

eps = 1;
x = 1+eps;
while(x~=1)
   eps = eps/2;
   x = 1+eps;
end
eps = eps+(2*eps)^2;
y = 1+eps;
y-1

7. Suppose, for illustration, we have a system with base β = 10, t = 3 decimal digits in the mantissa, and L = −9, U = 9 for the exponent. For example, 0.123 × 10^4, that is, 1230, is a machine number in this system. Suppose also that "round to nearest" is used in this system.

(a) What is HUGE for this system?

(b) What is TINY for this system?

(c) What is the machine epsilon ϵ_m for this system?

(d) Let f(x) = sin(x) + 1.

i. Write down fl(f(0)) and fl(f(0.0008)) in normalized format for this toy system.

ii. Compute fl( fl(f(0.0008)) − fl(f(0)) ). On the other hand, what is the nearest machine number to the exact value of f(0.0008) − f(0)?

iii. Compute fl( fl(f(0.0008)) − fl(f(0)) ) / fl(0.0008). Compare this to the nearest machine number to the exact value of (f(0.0008) − f(0)) / 0.0008, and to f′(0).

8. Let f(x) = ln(x + 1) − ln(x).

(a) Use four-digit decimal arithmetic with rounding to evaluate f(100,000).


(b) Use the Mean Value Theorem to approximate f(x) in a form that avoids the loss of significant digits. Use this form to evaluate f(x) for x = 100,000 once again.

(c) Compare the relative errors for the answers obtained in (a) and (b).

9. Compute the condition number of f(x) = e^sqrt(x^2 − 1), x > 1, and discuss any possible ill-conditioning.

10. Let f(x) = (sin(x))^2 + x^2. Use interval arithmetic to prove that there are no solutions to f(x) = 0 for x ∈ [−1, −0.8].


The mean value theorem tells us that

f(2) = f(1) + f′(c)(2 − 1) = f(1) + f′(c)

for some c ∈ (1, 2). Furthermore, the fact |f′(x)| ≤ 2 is equivalent to −2 ≤ f′(x) ≤ 2. Combining these facts gives

1 − 2 = −1 ≤ f(2) ≤ 1 + 2 = 3.

1.1.2 Big "O" Notation

We study "rates of growth" and "rates of decrease" of errors. For example, if we approximate e^h by a first-degree Taylor polynomial about x = 0, we get

e^h − (1 + h) = (1/2) h^2 e^ξ,

where ξ is some unknown quantity between 0 and h. Although we don't know exactly what e^ξ is, we know that it is nearly constant (in this case, approximately 1) for h near 0, so the error e^h − (1 + h) is roughly proportional to h^2 for h small. This approximate proportionality is often more important to know than the slowly varying constant e^ξ. The big "O" and little "o" notation are used to describe and keep track of this approximate proportionality.

DEFINITION 1.1 Let E(h) be an expression that depends on a small quantity h. We say that E(h) = O(h^k) if there are an ϵ and a C such that

|E(h)| ≤ C h^k for all |h| ≤ ϵ.

The "O" denotes "order." For example, if f(h) = O(h^2), we say that "f exhibits order 2 convergence to 0 as h tends to 0."

Example 1.5
E(h) = e^h − h − 1. Then E(h) = O(h^2).

PROOF By Taylor's Theorem,

e^h = e^0 + e^0 (h − 0) + (h^2/2) e^ξ

for some ξ between 0 and h. Thus,

E(h) = e^h − 1 − h = (h^2/2) e^ξ ≤ (e/2) h^2

and E(h) ≥ 0 for |h| ≤ 1; that is, ϵ = 1 and C = e/2 work.

Example 1.6
Show that

| (f(x + h) − f(x))/h − f′(x) | = O(h)

for x, x + h ∈ [a, b], assuming that f has two continuous derivatives at each point in [a, b].

PROOF

| (f(x + h) − f(x))/h − f′(x) |

= | [ f(x) + f′(x)h + ∫_x^{x+h} (x + h − t) f″(t) dt − f(x) ] / h − f′(x) |

= (1/h) | ∫_x^{x+h} (x + h − t) f″(t) dt |

≤ max_{a≤t≤b} |f″(t)| (h/2) = ch.
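The O(h) behavior just proved is easy to observe numerically. A quick sketch, using f(x) = e^x at x = 0 (so f′(0) = 1) and keeping h large enough that roundoff does not interfere:

```python
import math

# Forward-difference error for f(x) = exp(x) at x = 0; it shrinks by about
# a factor of 10 each time h shrinks by a factor of 10, i.e., O(h).
def fwd_err(h):
    return abs((math.exp(h) - 1.0) / h - 1.0)

for h in (1e-1, 1e-2, 1e-3):
    print(h, fwd_err(h))
```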

1.1.3 Convergence Rates

DEFINITION 1.2 Let {x_k} be a sequence with limit x*. If there are constants C and α and an integer N such that |x_{k+1} − x*| ≤ C |x_k − x*|^α for k ≥ N, we say that the rate of convergence is of order at least α. If α = 1 (with C < 1), the rate is said to be linear. If α = 2, the rate is said to be quadratic.

Example 1.7
A sequence sometimes learned in elementary classes for computing the square root of a number a is

x_{k+1} = x_k/2 + a/(2 x_k).


We have

x_{k+1} − √a = x_k/2 + a/(2 x_k) − √a
             = (x_k^2 − 2 √a x_k + a) / (2 x_k)
             = (1/(2 x_k)) (x_k − √a)^2
             ≈ (1/(2 √a)) (x_k − √a)^2,

thus showing that the convergence rate is quadratic.

Quadratic convergence is very fast. We can think of quadratic convergence with C ≈ 1 as doubling the number of significant figures on each iteration. (In contrast, linear convergence with C = 0.1 adds one decimal digit of accuracy to the approximation on each iteration.) For example, if we use the square root computation from Example 1.7 with a = 2, starting with x0 = 2, we obtain the following table:

k   x_k                 |x_k − √2|         |x_k − √2| / |x_{k−1} − √2|^2
0   2                   0.5858 × 10^0      —
1   1.5                 0.8579 × 10^-1     0.2500
2   1.416666666666667   0.2453 × 10^-2     0.3333
3   1.414215686274510   0.2123 × 10^-5     0.3529
4   1.414213562374690   0.1594 × 10^-11    0.3535
5   1.414213562373095   0.2204 × 10^-17    —

In this table, the correct digits are underlined. This table illustrates that the total number of digits more than doubles on each iteration. In fact, the multiplying factor C for the quadratic convergence appears to be approaching 0.3535. (The last error ratio is not meaningful in this sense, because only roughly 16 digits were carried in the computation.) Based on our analysis, the limiting value of C should be about 1/(2 √2) ≈ 0.3536. (We explain how we computed the table at the end of this chapter.)
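The first rows of the table are easy to regenerate. The following is a Python sketch of what the book's matlab script in Example 1.23 does (double precision, so the last rows would differ slightly from a higher-precision computation):

```python
import math

# Square-root iteration x_{k+1} = x_k/2 + 1/x_k for a = 2, starting at x0 = 2.
x = 2.0
errs = []
for k in range(5):
    errs.append(abs(x - math.sqrt(2.0)))
    x = x / 2.0 + 1.0 / x
ratio = errs[3] / errs[2] ** 2
print(ratio)   # about 0.3529, heading toward 1/(2*sqrt(2)) ~ 0.3536
```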


Example 1.8
As an example of linear convergence, consider the iteration

x_{k+1} = x_k − x_k^2/3.5 + 2/3.5

for computing √2. We obtain the following table:

k    x_k                 |x_k − √2|         |x_k − √2| / |x_{k−1} − √2|
0    2                   0.5858 × 10^0      —
1    1.428571428571429   0.1436 × 10^-1     0.2451 × 10^-1
2    1.416909620991254   0.2696 × 10^-2     0.1878
3    1.414728799831946   0.5152 × 10^-3     0.1911
4    1.414312349239392   0.9879 × 10^-4     0.1917
5    1.414232514607664   0.1895 × 10^-4     0.1918
6    1.414217198786659   0.3636 × 10^-5     0.1919
7    1.414214260116949   0.6977 × 10^-6     0.1919
8    1.414213696254626   0.1339 × 10^-6     0.1919
...
19   1.414213562373097   0.1554 × 10^-14    —

Here, the constant C in the linear convergence, to four significant digits, appears to be 0.1919 ≈ 1/5. That is, the error is reduced by approximately a factor of 5 on each iteration. We can think of this as obtaining one more correct base-5 digit on each iteration.
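Again, a short Python sketch (mirroring the matlab script of Example 1.24, in double precision) reproduces the limiting ratio; the factor 0.1919 is g′(√2) = 1 − 2√2/3.5 for the iteration function g(x) = x − x²/3.5 + 2/3.5:

```python
import math

# Linearly convergent iteration toward sqrt(2); successive error ratios
# approach C = 1 - 2*sqrt(2)/3.5, about 0.1919.
x = 2.0
err_old = abs(x - math.sqrt(2.0))
ratio = None
for k in range(8):
    x = x - x * x / 3.5 + 2.0 / 3.5
    err = abs(x - math.sqrt(2.0))
    ratio = err / err_old
    err_old = err
print(ratio)   # about 0.1919
```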

1.2 Computer Arithmetic

In numerical solution of mathematical problems, two common types of error are:

1. Method (algorithm, or truncation) error. This is the error due to approximations made in the numerical method.

2. Rounding error. This is the error made due to the finite number of digits available on a computer.


Example 1.9
By the mean value theorem for integrals (Theorem 1.2, as in Example 1.6 on page 7), if f ∈ C^2[a, b], then

f′(x) = (f(x + h) − f(x))/h − (1/h) ∫_x^{x+h} f″(t)(x + h − t) dt,

and

| (1/h) ∫_x^{x+h} f″(t)(x + h − t) dt | ≤ ch.

Thus, f′(x) ≈ (f(x + h) − f(x))/h, and the error is O(h). We will call this the method error or truncation error, as opposed to roundoff errors due to using machine approximations.

Now consider f(x) = ln x, and approximate f′(3) ≈ (ln(3 + h) − ln 3)/h for h small, using a calculator having 11 digits. The following results were obtained:

h        (ln(3 + h) − ln(3))/h    Error = 1/3 − (ln(3 + h) − ln(3))/h
10^-1    0.3278982                5.44 × 10^-3
10^-2    0.332779                 5.54 × 10^-4
10^-3    0.3332778                5.55 × 10^-5
10^-4    0.333328                 5.33 × 10^-6
10^-5    0.333330                 3.33 × 10^-6
10^-6    0.333300                 3.33 × 10^-5
10^-7    0.333                    3.33 × 10^-4
10^-8    0.33                     3.33 × 10^-3
10^-9    0.3                      3.33 × 10^-2
10^-10   0.0                      3.33 × 10^-1

One sees that, in the first four steps, the error decreases by a factor of 10 as h is decreased by a factor of 10. (That is, the method error dominates.) However, starting with h = 0.00001, the error increases. (The error due to a finite number of digits, i.e., roundoff error, dominates.)
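The same experiment can be repeated in double precision (about 16 digits rather than the calculator's 11, so the turnaround happens at much smaller h); a sketch:

```python
import math

# Error in the forward-difference approximation of (ln x)' = 1/3 at x = 3.
# Method error dominates for moderate h; roundoff dominates for tiny h.
def diff_err(h):
    return abs((math.log(3.0 + h) - math.log(3.0)) / h - 1.0 / 3.0)

print(diff_err(1e-4))    # small: method error regime
print(diff_err(1e-14))   # much larger: roundoff error regime
```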

There are two possible ways to reduce the rounding error:

1. The method error can be reduced by using a more accurate method. This allows larger h to be used, thus avoiding roundoff error. Consider

f′(x) = (f(x + h) − f(x − h))/(2h) + error, where the error is O(h^2):

h       (ln(3 + h) − ln(3 − h))/(2h)    error
0.1     0.3334568                       1.24 × 10^-4
0.01    0.3333345                       1.23 × 10^-6
0.001   0.3333333                       1.91 × 10^-8


The error decreases by a factor of 100 as h is decreased by a factor of 10.

2. Rounding error can be reduced by using more digits of accuracy, such as using double precision (or multiple precision) arithmetic.

To fully understand and avoid roundoff error, we should study some details of how computers and calculators represent and work with approximate numbers.

1.2.1 Floating Point Arithmetic and Rounding Error

Let β be a positive integer, the base of the computer system. (Usually, β = 2 (binary) or β = 16 (hexadecimal).) Suppose a number x has the exact base representation

x = (±0.α1 α2 α3 ··· αt αt+1 ···) β^m = ±q β^m,

where q is the mantissa, β is the base, m is the exponent, 1 ≤ α1 ≤ β − 1, and 0 ≤ αi ≤ β − 1 for i > 1.

On a computer, we are restricted to a finite set of floating-point numbers F = F(β, t, L, U), of the form x* = (±0.a1 a2 ··· at) β^m, where 1 ≤ a1 ≤ β − 1, 0 ≤ ai ≤ β − 1 for 2 ≤ i ≤ t, L ≤ m ≤ U, and t is the number of digits. (In most floating point systems, L is about −64 to −1000, and U is about 64 to 1000.)

Example 1.10
(binary) β = 2:

x* = (0.1011)_2 × 2^3 = (1 × 1/2 + 0 × 1/4 + 1 × 1/8 + 1 × 1/16) × 8 = 11/2 = 5.5 (decimal).
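A one-line check of this conversion (plain Python, just expanding the digits of the mantissa):

```python
# Evaluate (0.1011)_2 * 2^3 by expanding the mantissa digits a_1 a_2 a_3 a_4.
digits = [1, 0, 1, 1]
mantissa = sum(d * 2.0 ** -(i + 1) for i, d in enumerate(digits))
x = mantissa * 2 ** 3
print(x)   # 5.5
```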

REMARK 1.1 Most numbers cannot be exactly represented on a computer. Consider x = 10.1 = (1010.0001 1001 1001 ...)_2 (β = 2). If L = −127, U = 127, t = 24, and β = 2, then x ≈ x* = (0.1010 0001 1001 1001 1001 1001)_2 × 2^4.

Question: Given a real number x, how do we define a floating point number fl(x) in F such that fl(x) is close to x?

On modern machines, one of the following four ways is used to approximate a real number x by a machine-representable number fl(x):

round down: fl(x) = x↓, the nearest machine representable number to the real number x that is less than or equal to x;


round up: fl(x) = x↑, the nearest machine number to the real number x that is greater than or equal to x;

round to nearest: fl(x) is the nearest machine number to the real number x;

round to zero, or "chopping": fl(x) is the nearest machine number to the real number x that is closer to 0 than x. The term "chopping" is because we simply "chop" the expansion of the real number; that is, we simply ignore the digits in the expansion of x beyond the t-th one.

The default on modern systems is usually round to nearest, although chopping is faster or requires less circuitry. Round down and round up may be used, but with care, to produce results from a string of computations that are guaranteed to be less than or greater than the exact result.

Example 1.11
β = 10, t = 5, x = 0.12345666··· × 10^7. Then

fl(x) = 0.12345 × 10^7 (chopping),
fl(x) = 0.12346 × 10^7 (round to nearest).

(In this case, round down corresponds to chopping, and round up corresponds to round to nearest.)
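Python's standard decimal module can emulate such a base-10, t-digit system; a sketch of Example 1.11 with t = 5 (for this positive x, ROUND_DOWN plays the role of chopping, and ROUND_HALF_EVEN stands in for round to nearest):

```python
from decimal import Decimal, Context, ROUND_DOWN, ROUND_HALF_EVEN

x = Decimal("0.12345666e7")   # 1234566.6
chop = Context(prec=5, rounding=ROUND_DOWN).plus(x)       # keeps 5 digits
near = Context(prec=5, rounding=ROUND_HALF_EVEN).plus(x)  # rounds 5th digit
print(chop, near)             # 1.2345E+6 1.2346E+6
```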

See Figure 1.2 for an example with β = 10 and t = 1. In that figure, the exhibited floating point numbers are (0.1) × 10^1, (0.2) × 10^1, ..., (0.9) × 10^1, 0.1 × 10^2.

FIGURE 1.2: An example floating point system: β = 10, t = 1, and m = 1. (The figure shows tick marks at the successive floating point numbers, spaced β^{m−t} = 10^0 = 1 apart, from β^{m−1} = 1 up to β^m = 10^1.)

Example 1.12
Let a = 0.410, b = 0.000135, and c = 0.000431. Assuming 3-digit decimal computer arithmetic with rounding to nearest, does a + (b + c) = (a + b) + c when using this arithmetic?


Following the "rounding to nearest" definition of fl, we emulate the operations a machine would do as follows:

a ← 0.410 × 10^0, b ← 0.135 × 10^-3, c ← 0.431 × 10^-3,

and

fl(b + c) = fl(0.135 × 10^-3 + 0.431 × 10^-3)
          = fl(0.566 × 10^-3)
          = 0.566 × 10^-3,

so

fl(a + 0.566 × 10^-3) = fl(0.410 × 10^0 + 0.566 × 10^-3)
                      = fl(0.410 × 10^0 + 0.000566 × 10^0)
                      = fl(0.410566 × 10^0)
                      = 0.411 × 10^0.

On the other hand,

fl(a + b) = fl(0.410 × 10^0 + 0.135 × 10^-3)
          = fl(0.410000 × 10^0 + 0.000135 × 10^0)
          = fl(0.410135 × 10^0)
          = 0.410 × 10^0,

so

fl(0.410 × 10^0 + c) = fl(0.410 × 10^0 + 0.431 × 10^-3)
                     = fl(0.410 × 10^0 + 0.000431 × 10^0)
                     = fl(0.410431 × 10^0)
                     = 0.410 × 10^0 ≠ 0.411 × 10^0.

Thus, the associative law does not hold for floating point arithmetic with "round to nearest." Furthermore, this illustrates that accuracy is improved if numbers of like magnitude are added first in a sum.
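The whole example can be replayed with Python's decimal module (prec=3; decimal's default ROUND_HALF_EVEN agrees with round to nearest on these operands, since no ties occur):

```python
from decimal import Decimal, localcontext

a, b, c = Decimal("0.410"), Decimal("0.000135"), Decimal("0.000431")
with localcontext() as ctx:
    ctx.prec = 3                 # 3-digit decimal machine
    left = (a + b) + c           # each + rounds to 3 significant digits
    right = a + (b + c)
print(left, right)               # 0.410 0.411
```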

The following error bound is useful in some analyses.

THEOREM 1.5

|x − fl(x)| ≤ (p/2) |x| β^{1−t},

where p = 1 for rounding and p = 2 for chopping.


DEFINITION 1.3 δ = (p/2) β^{1−t} is called the unit roundoff error.

Let ϵ = (fl(x) − x)/x. Then fl(x) = (1 + ϵ)x, where |ϵ| ≤ δ. With this, we have the following:

THEOREM 1.6

Let ⊙ denote the operation +, −, ×, or ÷, and let x and y be machine numbers. Then

fl(x ⊙ y) = (x ⊙ y)(1 + ϵ), where |ϵ| ≤ δ = (p/2) β^{1−t}.

Roundoff error that accumulates as a result of a sequence of arithmetic operations can be analyzed using this theorem. Such an analysis is called forward error analysis.

Because of the properties of floating point arithmetic, it is unreasonable to demand strict tolerances when the exact result is too large.

Example 1.13
Suppose β = 10 and t = 3 (3-digit decimal arithmetic), and suppose we wish to compute 10^4 π with a computed value x such that |10^4 π − x| < 10^-2. The closest floating point number in our system to 10^4 π is x = 0.314 × 10^5 = 31400. However, |10^4 π − x| = 15.926.... Hence, it is impossible to find a number x in the system with

|10^4 π − x| < 10^-2.

The error |10⁴π − x| in this example is called the absolute error in approximating 10⁴π. We see that absolute error is not an appropriate measure of error when using floating point arithmetic. For this reason, we use relative error.

DEFINITION 1.4  Let x* be an approximation to x. Then |x − x*| is called the absolute error, and |(x − x*)/x| is called the relative error.

For example,

|(x − fl(x))/x| ≤ δ = (p/2) β^(1−t)  (unit roundoff error).

1.2.1.1 Expression Evaluation and Condition Numbers

We now examine some common situations in which roundoff error can become large, and explain how to avoid many of these situations.


Example 1.14

β = 10, t = 4, p = 1. (Thus, δ = (1/2)10⁻³ = 0.0005.) Let x = 0.5795 × 10⁵ and y = 0.6399 × 10⁵. Then

fl(x + y) = 0.1219 × 10⁶ = (x + y)(1 + ε₁), ε₁ ≈ −3.28 × 10⁻⁴, |ε₁| < δ, and
fl(xy) = 0.3708 × 10¹⁰ = (xy)(1 + ε₂), ε₂ ≈ −5.95 × 10⁻⁵, |ε₂| < δ.

(Note: x + y = 0.12194 × 10⁶, xy = 0.37082205 × 10¹⁰.)

Example 1.15

Suppose β = 10 and t = 4 (4-digit arithmetic), and suppose x₁ = 10000 and x₂ = x₃ = ⋯ = x₁₀₀₁ = 1. Then

fl(x₁ + x₂) = 10000,
fl(x₁ + x₂ + x₃) = 10000,
…
fl(x₁ + x₂ + ⋯ + x₁₀₀₁) = 10000

when we sum forward from x₁. But, going backwards,

fl(x₁₀₀₁ + x₁₀₀₀) = 2,
fl(x₁₀₀₁ + x₁₀₀₀ + x₉₉₉) = 3,
…
fl(x₁₀₀₁ + x₁₀₀₀ + ⋯ + x₁) = 11000,

which is the correct sum.

This example illustrates the point that large relative errors can occur when many small numbers are added to a single large number, or when a very large number of small, almost-equal numbers is added together. To avoid such large relative errors, one can sum from the smallest number to the largest number. However, this will not work if the numbers are all approximately equal. In such cases, one possibility is to group the numbers into sets of two adjacent numbers, summing two almost-equal numbers together. One then groups those results into sets of two and sums these together, continuing until the total sum is reached. In this scheme, two almost-equal numbers are always being summed, and the large relative error from repeatedly adding a small number to a large number is avoided.
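The summation orders of Example 1.15 can be reproduced with Python's decimal module, setting a context with a 4-digit precision to mimic β = 10, t = 4 (an illustrative sketch of ours; the text does not prescribe any software here):

```python
from decimal import Decimal, Context

ctx = Context(prec=4)                  # beta = 10, t = 4 arithmetic
xs = [Decimal(10000)] + [Decimal(1)] * 1000

fwd = Decimal(0)
for x in xs:                           # forward: each 1 is lost against 10000
    fwd = ctx.add(fwd, x)

bwd = Decimal(0)
for x in reversed(xs):                 # backward: the small terms accumulate first
    bwd = ctx.add(bwd, x)

assert fwd == 10000                    # all 1000 additions of 1 were rounded away
assert bwd == 11000                    # the correct sum
```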


Example 1.16

x₁ = 1.5314768 and x₂ = 1.5314899, with β = 10 and t = 6 (6-digit decimal accuracy). Then

x₂ − x₁ ≈ fl(x₂) − fl(x₁) = 1.53149 − 1.53148 = 0.00001.

Thus,

|(x₂ − x₁ − (fl(x₂) − fl(x₁))) / (x₂ − x₁)| = (0.0000131 − 0.00001)/0.0000131 ≈ 0.237,

that is, 23.7% relative accuracy.

This example illustrates that large relative errors can occur when two nearly equal numbers are subtracted on a computer. Sometimes an algorithm can be modified to reduce rounding error occurring from this source, as the following example illustrates.

Example 1.17

Consider finding the roots of ax² + bx + c = 0, where b² is large compared with |4ac|. The most common formula for the roots is

x₁,₂ = (−b ± √(b² − 4ac)) / (2a).

Consider x² + 100x + 1 = 0, with β = 10, t = 4, p = 2 (4-digit chopped arithmetic). Then

x₁ = (−100 + √9996) / 2,  x₂ = (−100 − √9996) / 2,

and √9996 ≈ 99.97 (4-digit arithmetic, chopped). Thus,

x₁ ≈ (−100 + 99.97)/2 = −0.015,  x₂ ≈ (−100 − 99.97)/2 = −99.98.

However, x₁ = −0.010001… and x₂ = −99.989999…, so the relative errors in x₁ and x₂ are 50% and 0.01%, respectively.

Let's change the algorithm. Assume b ≥ 0 (we can always arrange b ≥ 0). Then

x₁ = (−b + √(b² − 4ac)) / (2a)
   = (−b + √(b² − 4ac))(−b − √(b² − 4ac)) / (2a(−b − √(b² − 4ac)))
   = −2c / (b + √(b² − 4ac)),

and

x₂ = (−b − √(b² − 4ac)) / (2a)  (the same as before).


Then, for the above values,

x₁ = −2(1) / (100 + √9996) ≈ −2 / (100 + 99.97) = −0.0100.

Now the relative error in x₁ is also 0.01%.
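The same rearrangement matters in IEEE double precision. In the sketch below (ours, not the book's; the function names are made up), the quadratic x² + 10⁸x + 1 = 0 plays the role of the example above, and a 50-digit decimal computation supplies a reference value for the small root:

```python
import math
from decimal import Decimal, getcontext

def roots_naive(a, b, c):
    """Textbook formula: cancellation when b*b >> |4ac|."""
    d = math.sqrt(b*b - 4*a*c)
    return (-b + d) / (2*a), (-b - d) / (2*a)

def roots_stable(a, b, c):
    """Rearranged formula, assuming b >= 0: no cancellation."""
    d = math.sqrt(b*b - 4*a*c)
    return -2*c / (b + d), (-b - d) / (2*a)

a, b, c = 1.0, 1e8, 1.0              # b^2 is large compared with |4ac|
getcontext().prec = 50               # 50-digit reference computation
db = Decimal(b)
x1_true = (-db + (db*db - 4).sqrt()) / 2

x1_naive, _ = roots_naive(a, b, c)
x1_stable, _ = roots_stable(a, b, c)
err_naive = abs(Decimal(x1_naive) - x1_true)
err_stable = abs(Decimal(x1_stable) - x1_true)
assert err_stable < err_naive        # cancellation hurts the naive formula
```

The subtraction −b + d loses most of its significant digits, while −2c/(b + d) computes the same root from an addition of like-signed quantities.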

Let us now consider error in function evaluation. Consider a single-valued function f(x), and let x* = fl(x) be the floating point approximation of x. The machine therefore evaluates f(x*) = f(fl(x)), which is an approximate value of f(x) at x = x*. The perturbation in f(x) for small perturbations in x can be computed via Taylor's formula. This is illustrated in the next theorem.

THEOREM 1.7

The relative error in function evaluation is

|(f(x) − f(x*)) / f(x)| ≈ |x f′(x) / f(x)| · |(x − x*) / x|.

PROOF  The linear Taylor approximation of f(x*) about f(x) for small values of |x − x*| is given by f(x*) ≈ f(x) + f′(x)(x* − x). Rearranging the terms immediately yields the result.

This leads us to the following definition

DEFINITION 1.5  The condition number of a function f(x) is

κ_f(x) = |x f′(x) / f(x)|.

The condition number describes how large the relative error in function evaluation is with respect to the relative error in the machine representation of x. In other words, κ_f(x) is a measure of the degree of sensitivity of the function at x.

Example 1.18

Let f(x) = √x. The condition number of f(x) about x is

κ_f(x) = |x · (1/(2√x)) / √x| = 1/2.

This suggests that f(x) is well-conditioned.


Example 1.19

Let f(x) = √(x − 2). The condition number of f(x) about x is

κ_f(x) = |x / (2(x − 2))|.

This is not defined at x = 2. Hence, the function f(x) is numerically unstable and ill-conditioned for values of x close to 2.
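One can observe both condition numbers numerically (our Python sketch, not the book's): perturb x by a relative amount h and compare the relative change in f with κ_f(x)·h:

```python
import math

def rel_change(f, x, h):
    """Relative change in f when x is perturbed to x(1 + h)."""
    return abs(f(x * (1 + h)) - f(x)) / abs(f(x))

h = 1e-8

# f(x) = sqrt(x): kappa = 1/2 everywhere, so the function is well conditioned
assert abs(rel_change(math.sqrt, 100.0, h) / h - 0.5) < 1e-3

# f(x) = sqrt(x - 2): kappa = |x/(2(x - 2))| blows up near x = 2
g = lambda t: math.sqrt(t - 2)
x = 2.0001
kappa = abs(x / (2 * (x - 2)))       # roughly 10^4 here
assert rel_change(g, x, h) / h > 1000
```

The amplification factor observed for √(x − 2) near x = 2 tracks the condition number of Example 1.19, while √x amplifies relative error by only 1/2.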

REMARK 1.2  If x = f(x) = 0, then the condition number is simply |f′(x)|. If x = 0, f(x) ≠ 0 (or f(x) = 0, x ≠ 0), then it is more useful to consider the relation between absolute errors than relative errors; the condition number then becomes |f′(x)/f(x)|.

REMARK 1.3  Generally, if a numerical approximation ẑ to a quantity z is computed, the relative error is related to the number of significant digits that are correct. For example, if z = 0.00001234543 and ẑ = 0.0000123453, we say that ẑ is correct to 5 significant digits. Expressing z as 0.123454 × 10⁻⁴ and ẑ as 0.123453 × 10⁻⁴, we see that if we round ẑ to the nearest number with five digits in its mantissa, all of those digits are correct, whereas if we do the same with six digits, the sixth digit is not correct. Significant digits is the more logical way to talk about accuracy in a floating point computation, where we are interested in relative error, rather than "number of digits after the decimal point," which can have a different meaning. (Here, one might say that ẑ is correct to 9 digits after the decimal point.)

1.2.2 Practicalities and the IEEE Floating Point Standard

Prior to 1985, different machines used different word lengths and different bases, and different machines rounded, chopped, or did something else to form the internal representation fl(x) for real numbers x. For example, IBM mainframes generally used hexadecimal arithmetic (β = 16), with 8 hexadecimal digits total (for the sign, exponent, and mantissa) in "single precision" numbers and 16 hexadecimal digits total in "double precision" numbers. Machines such as the Univac 1108 and Honeywell Multics systems used base β = 2, with 36 binary digits (or "bits") total in single precision numbers and 72 binary digits total in double precision numbers. An unusual machine designed at Moscow State University from 1955–1965, the "Setun," even used base-3 (β = 3, or "ternary") numbers. Some computers had 32 bits total in single precision numbers and 64 bits total in double precision numbers, while some "supercomputers" (such as the Cray-1) had 64 bits total in single precision numbers and 128 bits total in double precision numbers.

Some hand-held calculators in existence today (such as some Texas Instruments calculators) can be viewed as implementing decimal (base 10, β = 10)


arithmetic, say, with L = −999 and U = 999, and t = 14 digits in the mantissa.

Except for the Setun (the value of whose ternary digits corresponded to "positive," "negative," and "neutral" in circuit elements or switches), digital computers are mostly based on binary switches or circuit elements (that is, "on" or "off"), so the base β is usually 2 or a power of 2. For example, the IBM hexadecimal digit can be viewed as a group of 4 binary digits.²

Older floating point implementations did not always fit exactly into the model we have previously described. For example, if x was a number in the system, then −x may not have been a number in the system; or, if x were a number in the system, then 1/x may have been too large to be representable in the system.

To promote predictability, portability, reliability, and rigorous error bounding in floating point computations, the Institute of Electrical and Electronics Engineers (IEEE) and American National Standards Institute (ANSI) published a standard for binary floating point arithmetic in 1985: IEEE/ANSI 754-1985, Standard for Binary Floating Point Arithmetic, often referenced as "IEEE-754" or simply "the IEEE standard."³ Almost all computers in existence today, including personal computers and workstations based on Intel, AMD, Motorola, etc., chips, implement most of the IEEE standard.

In this standard, β = 2; 32 bits total are used in a single precision number (an "IEEE single"), and 64 bits total are used for a double precision number (an "IEEE double"). In a single precision number, 1 bit is used for the sign, 8 bits are used for the exponent, and t = 23 bits are used for the mantissa. In double precision numbers, 1 bit is used for the sign, 11 bits are used for the exponent, and 52 bits are used for the mantissa. Thus, for single precision numbers, the stored exponent is between 0 and (11111111)₂ = 255, and 127 is subtracted from this to get an exponent between −127 and 128. In IEEE numbers, the minimum and maximum exponents are used to denote special symbols (such as infinity and "unnormalized" numbers), so the exponent in single precision represents magnitudes between 2⁻¹²⁶ ≈ 10⁻³⁸ and 2¹²⁷ ≈ 10³⁸. The mantissa for single precision numbers represents numbers between 2⁰ = 1 and 2⁰ + 2⁻¹ + ⋯ + 2⁻²³ = 2(1 − 2⁻²⁴) ≈ 2. Similarly, the exponent for double precision numbers effectively represents magnitudes between 2⁻¹⁰²² ≈ 10⁻³⁰⁸ and 2¹⁰²³ ≈ 10³⁰⁸, while the mantissa for double precision numbers represents numbers between 2⁰ = 1 and 2⁰ + 2⁻¹ + ⋯ + 2⁻⁵² ≈ 2.
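The sign/exponent/mantissa fields of an IEEE double can be inspected directly. A sketch (ours, in Python) that unpacks the 64 bits:

```python
import struct

def fields(x):
    """Split a double into sign bit, biased exponent, and mantissa bits."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF       # 11 exponent bits, bias 1023
    mantissa = bits & ((1 << 52) - 1)     # 52 explicit mantissa bits
    return sign, exponent, mantissa

# 1.0 = +1.0 x 2^0: the stored exponent equals the bias 1023, mantissa bits 0
assert fields(1.0) == (0, 1023, 0)
# -2.0 = -1.0 x 2^1
assert fields(-2.0) == (1, 1024, 0)
```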

Summarizing, the parameters for IEEE arithmetic appear in Table 1.1.

In many numerical computations, such as solving the large linear systems arising from partial differential equation models, more digits or a larger exponent range is required than is available with IEEE single precision. For

²An exception is in some systems for business calculations, where base 10 is implemented.
³An update to the 1985 standard was made in 2008. This update gives clarifications of certain ambiguous points, provides certain extensions, and specifies a standard for decimal arithmetic.


TABLE 1.1: Parameters for IEEE arithmetic

precision   β    L       U      t
single      2   −126    127    24
double      2   −1022   1023   53

this reason, many numerical analysts at present have adopted IEEE double precision as the default precision. For example, underlying computations in the popular computational environment matlab are done in IEEE double precision.

IEEE arithmetic provides four ways of defining fl(x), that is, four "rounding modes": "round down," "round up," "round to nearest," and "round to zero." The four elementary operations +, −, ×, and ÷, as well as √·, must be such that fl(x ⊙ y) (and fl(√x)) is implemented for all four rounding modes, for ⊙ ∈ {+, −, ×, ÷}.

The default mode (if the rounding mode is not explicitly set) is normally "round to nearest," to give an approximation after a long string of computations that is hopefully near the exact value. If the mode is set to "round down" and a string of computations is done, then the result is less than or equal to the exact result. Similarly, if the mode is set to "round up," then the result of a string of computations is greater than or equal to the exact result. In this way, mathematically rigorous bounds on an exact result can be obtained. (This technique must be used astutely, since naive use could result in bounds that are too large to be meaningful.)

Several parameters more directly related to numerical computations than L, U, and t are associated with any floating point number system. These are:

HUGE: the largest representable number in the floating point system;

TINY: the smallest positive representable number in the floating point system;

ε_m: the machine epsilon, the smallest positive number which, when added to 1, gives something other than 1 when using the rounding mode "round to nearest."

These so-called "machine constants" appear in Table 1.2 for the IEEE single and IEEE double precision number systems.

For IEEE arithmetic, 1/TINY < HUGE, but 1/HUGE < TINY. This brings up the question of what happens when the result of a computation has absolute value less than the smallest number representable in the system, or has absolute value greater than the largest number representable in the system. In the first case, an underflow occurs, while in the second case, an overflow occurs. In floating point computations, it is usually (but not always) reasonable to replace the result of an underflow by 0, but it is usually more problematical


TABLE 1.2: Machine constants for IEEE arithmetic

precision   HUGE                               TINY                      ε_m
single      (2 − 2⁻²³)·2¹²⁷ ≈ 3.40 · 10³⁸      2⁻¹²⁶ ≈ 1.18 · 10⁻³⁸      2⁻²⁴ + 2⁻⁴⁵ ≈ 5.96 · 10⁻⁸
double      (2 − 2⁻⁵²)·2¹⁰²³ ≈ 1.79 · 10³⁰⁸    2⁻¹⁰²² ≈ 2.23 · 10⁻³⁰⁸    2⁻⁵³ + 2⁻¹⁰⁵ ≈ 1.11 · 10⁻¹⁶

when an overflow occurs. Many systems prior to the IEEE standard replaced an underflow by 0, but stopped when an overflow occurred.

The IEEE standard specifies representations for special numbers ∞, −∞, +0, −0, and NaN, where the latter represents "not a number." The standard specifies that computations do not stop when an overflow or underflow occurs, or when quantities such as √(−1), 1/0, or ∞ − ∞ are encountered (although many programming languages, by default or optionally, do stop). For example, the result of an overflow is set to ∞, whereas the result of √(−1) is set to NaN, and computation continues. The standard also specifies "gradual underflow," that is, setting the result to a "denormalized" number, or a number in the floating point format whose first digit in the mantissa is equal to 0. Computation rules for these special numbers, such as NaN × any number = NaN and ∞ × any positive normalized number = ∞, allow such "nonstop" arithmetic.

Although the IEEE nonstop arithmetic is useful in many contexts, the numerical analyst should be aware of it and be cautious in interpreting results. In particular, algorithms may not behave as expected if many intermediate results contain ∞ or NaN, and the accuracy is less than expected when denormalized numbers are used. In fact, many programming languages, by default or with a controllable option, stop if ∞ or NaN occurs, but implement IEEE nonstop arithmetic with an option.

Example 1.20

IEEE double precision floating point arithmetic underlies most computations in matlab. (This is true even if only four decimal digits are displayed.) One obtains the machine epsilon with the function eps, one obtains TINY with the function realmin, and one obtains HUGE with the function realmax. Observe the following matlab dialog:

>> epsm = eps(1d0)
epsm = 2.2204e-016
>> TINY = realmin
TINY = 2.2251e-308
>> HUGE = realmax
HUGE = 1.7977e+308
>> 1/TINY
ans = 4.4942e+307
>> 1/HUGE


ans = 5.5627e-309
>> HUGE^2
ans = Inf
>> TINY^2
ans = 0
>> new_val = 1+epsm
new_val = 1.0000
>> new_val - 1
ans = 2.2204e-016
>> too_small = epsm/2
too_small = 1.1102e-016
>> not_new = 1+too_small
not_new = 1
>> not_new - 1
ans = 0
>>
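A similar dialog can be carried out in Python (our sketch; the constants live in sys.float_info, and Python's epsilon is the spacing 2⁻⁵² just above 1, matching matlab's eps rather than the ε_m of Table 1.2):

```python
import sys

eps = sys.float_info.epsilon          # 2^-52, spacing of doubles just above 1.0
HUGE = sys.float_info.max             # about 1.7977e+308
TINY = sys.float_info.min             # about 2.2251e-308 (smallest normal)

assert (1.0 + eps) - 1.0 == eps       # eps moves 1.0 to the next double
assert (1.0 + eps / 2) - 1.0 == 0.0   # half of eps is lost to rounding
assert 1.0 / TINY < HUGE              # 1/TINY stays in range
assert 0.0 < 1.0 / HUGE < TINY        # 1/HUGE underflows to a denormal
```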

Example 1.21

(Illustration of underflow and overflow.) Suppose, for the purposes of illustration, we have a system with β = 10, t = 2, and one digit in the exponent, so that the positive numbers in the system range from 0.10 × 10⁻⁹ to 0.99 × 10⁹, and suppose we wish to compute N = √(x₁² + x₂²), where x₁ = x₂ = 10⁶. Then both x₁ and x₂ are exactly represented in the system, and the nearest floating point number in the system to N is 0.14 × 10⁷, well within range. However, x₁² = 10¹², larger than the maximum floating point number in the system.

In older systems, an overflow usually would result in stopping the computation, while in IEEE arithmetic, the result would be assigned the symbol "Infinity." The result of adding "Infinity" to "Infinity," then taking the square root, would be "Infinity," so that N would be assigned "Infinity." Similarly, if x₁ = x₂ = 10⁻⁶, then x₁² = 10⁻¹², smaller than the smallest representable machine number, causing an "underflow." On older systems, the result is usually set to 0. On IEEE systems, if "gradual underflow" is switched on, the result either becomes a denormalized number with less than full accuracy, or is set to 0; without gradual underflow on IEEE systems, the result is set to 0. When the result is set to 0, a value of 0 is stored in N, whereas the closest floating point number in the system is 0.14 × 10⁻⁵, well within range. To avoid this type of catastrophic underflow and overflow in the computation of N, we may use the following scheme:

1. s ← max{|x₁|, |x₂|};
2. η₁ ← x₁/s, η₂ ← x₂/s;
3. N ← s√(η₁² + η₂²).
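A sketch of this scaling scheme in Python (ours; since IEEE doubles overflow only near 1.8 × 10³⁰⁸, we use larger inputs than in the example to trigger the same failure):

```python
import math

def norm_naive(x1, x2):
    """Direct formula: x1*x1 can overflow to Inf or underflow to 0."""
    return math.sqrt(x1*x1 + x2*x2)

def norm_scaled(x1, x2):
    s = max(abs(x1), abs(x2))             # step 1: scale factor
    if s == 0.0:
        return 0.0
    e1, e2 = x1 / s, x2 / s               # step 2: scaled values, at most 1
    return s * math.sqrt(e1*e1 + e2*e2)   # step 3: no overflow or underflow

x = 1e200                                 # x*x would overflow
assert math.isinf(norm_naive(x, x))
assert abs(norm_scaled(x, x) - math.sqrt(2.0) * 1e200) < 1e185

assert norm_naive(3e-300, 4e-300) == 0.0  # squares underflow to 0
assert abs(norm_scaled(3e-300, 4e-300) - 5e-300) < 1e-310
```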


1.2.2.1 Input and Output

For examining the output of large numerical computations arising from mathematical models, plots, graphs, and movies comprised of such plots and graphs are often preferred over tables of values. However, to develop such models and study numerical algorithms, it is necessary to examine individual numbers. Because humans are trained to comprehend decimal numbers more easily than binary numbers, the binary format used in the machine is usually converted to a decimal format for display or printing. In many programming languages and environments (such as all versions of Fortran, C, C++, and in matlab), the format is of a form similar to ±d₁.d₂d₃⋯dₘe±δ₁δ₂δ₃ or ±d₁.d₂d₃⋯dₘE±δ₁δ₂δ₃, where the "e" or "E" denotes the "exponent" of 10. For example, -1.00e+003 denotes −1.00 × 10³ = −1000. Numbers are usually also input either in a standard decimal form (such as 0.001) or in this exponential format (such as 1.0e-3). (This notation originates from the earliest computers, where the only output was a printer, and the printer could only print numerical digits and the 26 upper case letters in the Roman alphabet.)

Thus, for input, a decimal fraction needs to be converted to a binary floating point number, while, for output, a binary floating point number needs to be converted to a decimal fraction. This conversion necessarily is inexact. For example, the exact decimal fraction 0.1 converts to the infinitely repeating binary expansion (0.00011001100110011…)₂, which needs to be rounded into the binary floating point system. The IEEE 754 standard specifies that the result of a decimal to binary conversion, within a specified range of input formats, be the nearest floating point number to the exact result, and that, within a specified range of formats, a binary to decimal conversion be the nearest number in the specified format (which depends on the number m of decimal digits requested to be printed).

Thus, the number that one sees as output is usually not exactly the number that is represented in the computer. Furthermore, while the floating point operations on binary numbers are usually implemented in hardware or "firmware" independently of the software system, the decimal to binary and binary to decimal conversions are usually implemented separately, as part of the programming language (such as Fortran, C, C++, Java, etc.) or software system (such as matlab). The individual standards for these languages, if there are any, may not specify accuracy for such conversions, and the languages sometimes do not conform to the IEEE standard. That is, the number that one sees printed may not even be the closest number in that format to the actual number.

This inexactness in conversion usually does not cause a problem, but may cause much confusion in certain instances. In those instances (such as in "debugging," or finding programming blunders), one may need to examine the binary numbers directly. One way of doing this is in an "octal," or base-8, format, in which each digit (between 0 and 7) is interpreted as a group of


three binary digits, or in hexadecimal format (where the digits are 0–9, A, B, C, D, E, and F), in which each digit corresponds to a group of four binary digits.

1.2.2.2 Standard Functions

To enable accurate computation of elementary functions such as sin, cos, and exp, IEEE 754 specifies that a "long" 80-bit register (with "guard digits") be available for intermediate computations. Furthermore, IEEE 754-2008, an official update to IEEE 754-1985, provides a list of functions it recommends be implemented, and specifies accuracy requirements (in terms of correct rounding) for those functions a programming language elects to implement.

REMARK 1.4  Alternative number systems, such as variable precision arithmetic, multiple precision arithmetic, rational arithmetic, and combinations of approximate and symbolic arithmetic, have been investigated and implemented. These have various advantages over the traditional floating point arithmetic we have been discussing, but also have disadvantages, and usually require more time, more circuitry, or both. Eventually, with the advance of computer hardware and better understanding of these alternative systems, their use may become more ubiquitous. However, for the foreseeable future, traditional floating point number systems will be the primary tool in numerical computations.

1.3 Interval Computations

Interval computations are useful for two main purposes:

• to use floating point computations to compute mathematically rigorous bounds on an exact result (and hence to rigorously bound roundoff error);

• to use floating point computations to compute mathematically rigorous bounds on the ranges of functions over boxes.

In complicated traditional floating point algorithms, naive arrangement of interval computations usually gives bounds that are too wide to be of practical use. For this reason, interval computations have been ignored by many. However, used cleverly and where appropriate, interval computations are powerful and provide rigor and validation when other techniques cannot.

Interval computations are based on interval arithmetic.


1.3.1 Interval Arithmetic

In interval arithmetic, we define operations on intervals, which can be considered as ordered pairs of real numbers. We can think of each interval as representing the range of possible values of a quantity. The result of an operation is then an interval that represents the range of all possible results of the operation, as the first argument ranges over all points in the first interval and the second argument ranges over all values in the second interval. To state this symbolically, let x = [x̲, x̄] and y = [y̲, ȳ], and define the four elementary operations by

x ⊙ y = {x ⊙ y | x ∈ x and y ∈ y} for ⊙ ∈ {+, −, ×, ÷}.   (1.1)

Interval arithmetic's usefulness derives from the fact that the mathematical characterization in Equation (1.1) is equivalent to the following operational definitions:

x + y = [x̲ + y̲, x̄ + ȳ],
x − y = [x̲ − ȳ, x̄ − y̲],
x × y = [min{x̲y̲, x̲ȳ, x̄y̲, x̄ȳ}, max{x̲y̲, x̲ȳ, x̄y̲, x̄ȳ}],
1/x = [1/x̄, 1/x̲] if x̲ > 0 or x̄ < 0,
x ÷ y = x × (1/y).   (1.2)

The ranges of the four elementary interval arithmetic operations are exactly the ranges of the corresponding real operations, but, if such operations are composed, bounds on the ranges of real functions can be obtained. For example, if

f(x) = (x + 1)(x − 1),   (1.3)

then

f([−2, 2]) = ([−2, 2] + 1)([−2, 2] − 1) = [−1, 3] × [−3, 1] = [−9, 3],

which contains the exact range [−1, 3].
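The operational definitions (1.2) are easy to prototype (a toy Python class of ours, with exact endpoint arithmetic and no outward rounding, covering only +, −, and ×):

```python
class Interval:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    @staticmethod
    def _coerce(v):
        return v if isinstance(v, Interval) else Interval(v, v)

    def __add__(self, other):
        o = Interval._coerce(other)
        return Interval(self.lo + o.lo, self.hi + o.hi)

    def __sub__(self, other):
        o = Interval._coerce(other)
        return Interval(self.lo - o.hi, self.hi - o.lo)

    def __mul__(self, other):
        o = Interval._coerce(other)
        p = [self.lo * o.lo, self.lo * o.hi, self.hi * o.lo, self.hi * o.hi]
        return Interval(min(p), max(p))

x = Interval(-2, 2)
y = (x + 1) * (x - 1)          # evaluates (1.3) over [-2, 2]
assert (y.lo, y.hi) == (-9, 3)  # matches the enclosure computed above
```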

REMARK 1.5  In some definitions of interval arithmetic, division by intervals containing 0 is defined, consistent with (1.1). For example,

[1, 2] / [−3, 4] = (−∞, −1/3] ∪ [1/4, ∞) = R* \ (−1/3, 1/4),

where R* is the extended real number system,⁴ consisting of the real numbers with the two additional numbers −∞ and ∞. This extended interval arithmetic⁵ was originally invented by William Kahan⁶ for computations with continued fractions, but has wider use than that. Although a closed system can be defined for the sets arising from this extended arithmetic, typically the complements of intervals (i.e., the unions of two semi-infinite intervals) are immediately intersected with intervals, to obtain zero, one, or two intervals. Interval arithmetic can then proceed using (1.2).

⁴Also known as the two-point compactification of the real numbers.

The power of interval arithmetic lies in its implementation on computers. In particular, outwardly rounded interval arithmetic allows rigorous enclosures for the ranges of operations and functions. This makes a qualitative difference in scientific computations, since the results are now intervals in which the exact result must lie. It also enables the use of floating point computations for automated theorem proving.

Outward rounding can be implemented on any machine that has downward rounding and upward rounding, such as any machine that complies with the IEEE 754 standard. For example, take x + y = [x̲ + y̲, x̄ + ȳ]. If x̲ + y̲ is computed with downward rounding and x̄ + ȳ is computed with upward rounding, then the resulting interval z = [z̲, z̄] that is represented in the machine must contain the exact range of x + y for x ∈ x and y ∈ y. We call the expansion of the interval from rounding the lower end point down and the upper end point up roundout error.

Interval arithmetic is only subdistributive. That is, if x, y, and z are intervals, then

x(y + z) ⊆ xy + xz, but x(y + z) ≠ xy + xz in general.   (1.4)

As a result, algebraic expressions that would be equivalent if real values were substituted for the variables are not equivalent if interval values are used. For example, instead of writing (x + 1)(x − 1) for f(x) in (1.3), suppose we write

f(x) = x² − 1,   (1.5)

and suppose we provide a routine that computes an enclosure for the range of x² that is the exact range, to within roundoff error. Such a routine could be as follows.

ALGORITHM 1.1

(Computing an interval whose end points are machine numbers and which encloses the range of x²)

⁵There are small differences in current definitions of extended interval arithmetic. For example, in some systems, −∞ and ∞ are not considered numbers, but just descriptive symbols. In those systems, [1, 2]/[−3, 4] = (−∞, −1/3] ∪ [1/4, ∞) = R \ (−1/3, 1/4). See [31] for a theoretical analysis of extended arithmetic.
⁶Who also was a major contributor to the IEEE 754 standard.


INPUT: x = [x̲, x̄].

OUTPUT: a machine-representable interval that contains the range of x² over x.

IF x̲ ≥ 0 THEN
  RETURN [x̲², x̄²], where x̲² is computed with downward rounding and x̄² is computed with upward rounding.
ELSE IF x̄ ≤ 0 THEN
  RETURN [x̄², x̲²], where x̄² is computed with downward rounding and x̲² is computed with upward rounding.
ELSE
  1. Compute x̲² and x̄² with both downward and upward rounding; that is, compute machine-representable numbers (x̲²)ₗ and (x̲²)ᵤ such that x̲² ∈ [(x̲²)ₗ, (x̲²)ᵤ], and machine-representable numbers (x̄²)ₗ and (x̄²)ᵤ such that x̄² ∈ [(x̄²)ₗ, (x̄²)ᵤ].
  2. RETURN [0, max{(x̲²)ᵤ, (x̄²)ᵤ}].
END IF

END ALGORITHM 1.1.
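Algorithm 1.1 can be sketched in Python. Real implementations set the processor's rounding direction; here (an emulation of ours, requiring Python ≥ 3.9 for math.nextafter) we imitate outward rounding conservatively by widening each round-to-nearest result one ulp in each direction:

```python
import math

def down(v):
    """One ulp toward -infinity: a lower bound on the exact result."""
    return math.nextafter(v, -math.inf)

def up(v):
    """One ulp toward +infinity: an upper bound on the exact result."""
    return math.nextafter(v, math.inf)

def isqr(lo, hi):
    """Machine interval enclosing {x^2 : lo <= x <= hi} (Algorithm 1.1)."""
    if lo >= 0:
        return down(lo * lo), up(hi * hi)
    if hi <= 0:
        return down(hi * hi), up(lo * lo)
    return 0.0, max(up(lo * lo), up(hi * hi))   # interval straddles 0

lo, hi = isqr(-2.0, 2.0)
assert lo == 0.0 and hi >= 4.0       # encloses the exact range [0, 4]
lo, hi = isqr(0.1, 0.3)
assert lo <= 0.01 and hi >= 0.09     # the widening absorbs rounding error
```

The one-ulp widening is wider than true directed rounding would give, but it preserves the essential property: the returned interval always contains the exact range.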

With Algorithm 1.1, and rewriting f(x) from (1.3) as in (1.5), we obtain

f([−2, 2]) = [−2, 2]² − 1 = [0, 4] − 1 = [−1, 3],

which in this case is equal to the exact range of f over [−2, 2].

In fact, this illustrates a general principle: if each variable in the expression occurs only once, then interval arithmetic gives the exact range, to within roundout error. We state this formally as follows.

THEOREM 1.8

(Fundamental theorem of interval arithmetic) Suppose f(x₁, x₂, …, xₙ) is an algebraic expression in the variables x₁ through xₙ (or a computer program with inputs x₁ through xₙ), and suppose that this expression is evaluated with interval arithmetic. The algebraic expression or computer program can contain the four elementary operations and operations such as xⁿ, sin(x), exp(x), and log(x), etc., as long as the interval values of these functions contain their ranges over the input intervals. Then:

1. The interval value f(x₁, …, xₙ) contains the range of f over the interval vector (or box) (x₁, …, xₙ).


2. If the single functions (the elementary operations and functions xⁿ, etc.) have interval values that represent their exact ranges, and if each variable xᵢ, 1 ≤ i ≤ n, occurs only once in the expression for f, then the values of f obtained by interval arithmetic represent the exact ranges of f over the input intervals.

If the expression for f contains one or more variables more than once, then overestimation of the range can occur, due to interval dependency. For example, when we evaluate our example function f([−2, 2]) according to (1.3), the first factor, [−1, 3], is the exact range of x + 1 for x ∈ [−2, 2], while the second factor, [−3, 1], is the exact range of x − 1 for x ∈ [−2, 2]. Thus, [−9, 3] is the exact range of f(x₁, x₂) = (x₁ + 1)(x₂ − 1) for x₁ and x₂ independent, x₁ ∈ [−2, 2], x₂ ∈ [−2, 2].

We now present some definitions and theorems to clarify the practical consequences of interval dependency.

DEFINITION 1.6  An expression for f(x₁, …, xₙ) which is written so that each variable occurs only once is called a single use expression, or SUE.

Fortunately, we do not need to transform every expression into a single use expression for interval computations to be of value. In particular, the interval dependency becomes less as the widths of the input intervals become smaller. The following formal definition will help us to describe this precisely.

DEFINITION 1.7  Suppose an interval evaluation f(x₁, …, xₙ) gives [a, b] as a result interval, but the exact range {f(x₁, …, xₙ) : xᵢ ∈ xᵢ, 1 ≤ i ≤ n} is [c, d] ⊆ [a, b]. We define the excess width E(f; x₁, …, xₙ) in the interval evaluation f(x₁, …, xₙ) by E(f; x₁, …, xₙ) = (c − a) + (b − d).

For example, the excess width in evaluating f(x), represented as (x + 1)(x − 1), over x = [−2, 2] is (−1 − (−9)) + (3 − 3) = 8. In general, we have the following.

THEOREM 1.9

Suppose f(x₁, x₂, …, xₙ) is an algebraic expression in the variables x₁ through xₙ (or a computer program with inputs x₁ through xₙ), and suppose that this expression is evaluated with interval arithmetic, as in Theorem 1.8, to obtain an interval enclosure f(x₁, …, xₙ) of the range of f for xᵢ ∈ xᵢ, 1 ≤ i ≤ n. Then, if E(f; x₁, …, xₙ) is as in Definition 1.7, we have

E(f; x₁, …, xₙ) = O( max_{1≤i≤n} w(xᵢ) ),

where w(x) denotes the width of the interval x.


That is, the overestimation becomes less as the uncertainty in the arguments to the function becomes smaller.

Interval evaluations as in Theorem 1.9 are termed first-order interval extensions. It is not difficult to obtain second-order extensions, where required. (See the exercises below.)
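The dependency effect, and how a single use expression removes it, can be sketched with a toy interval class. The sketch below is in Python rather than matlab; the class and helper names (`Interval`, `sq`, `_as_interval`) are ours, and outward rounding is omitted for brevity:

```python
from itertools import product

class Interval:
    """Toy interval arithmetic (outward rounding omitted), for illustration only."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
    def __add__(self, other):
        other = _as_interval(other)
        return Interval(self.lo + other.lo, self.hi + other.hi)
    def __sub__(self, other):
        other = _as_interval(other)
        return Interval(self.lo - other.hi, self.hi - other.lo)
    def __mul__(self, other):
        other = _as_interval(other)
        ps = [a * b for a, b in product((self.lo, self.hi), (other.lo, other.hi))]
        return Interval(min(ps), max(ps))
    def sq(self):
        """The exact range of x^2; x*x would treat the two factors as independent."""
        if self.lo <= 0 <= self.hi:
            return Interval(0, max(self.lo ** 2, self.hi ** 2))
        lo, hi = sorted((self.lo ** 2, self.hi ** 2))
        return Interval(lo, hi)
    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"

def _as_interval(v):
    return v if isinstance(v, Interval) else Interval(v, v)

x = Interval(-2, 2)
dependent = (x + 1) * (x - 1)   # two occurrences of x treated independently
sue = x.sq() - 1                # single use form x^2 - 1
print(dependent, sue)           # [-9, 3] [-1, 3]
```

The naive product returns $[-9, 3]$, the range for independent arguments, while the single-use form $x^2 - 1$ returns the exact range $[-1, 3]$; the excess width 8 is exactly that computed above.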

1.3.2 Application of Interval Arithmetic: Examples

We give one such example here

Example 1.22

Using 4-digit decimal floating point arithmetic, compute an interval enclosure for the first two digits of $e$, and prove that these two digits are correct.

Solution: The fifth degree Taylor polynomial representation for $e$ is

$$e = 1 + 1 + \frac{1}{2!} + \frac{1}{3!} + \frac{1}{4!} + \frac{1}{5!} + \frac{1}{6!}e^{\xi}$$

for some $\xi \in [0, 1]$. If we assume we know $e < 3$, and we assume we know $e^x$ is an increasing function of $x$, then the error term is bounded by

$$\left|\frac{1}{6!}e^{\xi}\right| \le \frac{3}{6!} < 0.005,$$

so this fifth-degree polynomial representation should be adequate. We will evaluate each term with interval arithmetic, and we will replace $e^{\xi}$ with $[1, 3]$. We obtain the following computation:

[1.000, 1.000] + [1.000, 1.000] → [2.000, 2.000]
[1.000, 1.000] / [2.000, 2.000] → [0.5000, 0.5000]
[2.000, 2.000] + [0.5000, 0.5000] → [2.500, 2.500]
[1.000, 1.000] / [6.000, 6.000] → [0.1666, 0.1667]
[2.500, 2.500] + [0.1666, 0.1667] → [2.666, 2.667]
[1.000, 1.000] / [24.00, 24.00] → [0.04166, 0.04167]
[2.666, 2.667] + [0.04166, 0.04167] → [2.707, 2.709]
[1.000, 1.000] / [120.0, 120.0] → [0.008333, 0.008334]
[2.707, 2.709] + [0.008333, 0.008334] → [2.715, 2.718]
[1.000, 1.000] / [720.0, 720.0] → [0.001388, 0.001389]
[0.001388, 0.001389] × [1, 3] → [0.001388, 0.004167]
[2.715, 2.718] + [0.001388, 0.004167] → [2.716, 2.723]

Since we used outward rounding in these computations, this constitutes a mathematical proof that $e \in [2.716, 2.723]$. Note:


30 Applied Numerical Methods

1. These computations can be done automatically on a computer, as simply as evaluating the function in floating point arithmetic. We will explain some programming techniques for this in Chapter 6, Section 6.2.

2. The solution is illustrative. More sophisticated methods, such as argument reduction, would be used in practice to bound values of $e^x$ more accurately and with fewer operations.
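To illustrate note 1, the outward-rounded 4-digit computation above can be emulated with Python's `decimal` module, whose directed rounding modes stand in for intlab-style outward rounding; the helper names here are ours, not from the text:

```python
from decimal import Decimal, Context, ROUND_FLOOR, ROUND_CEILING

# 4 significant decimal digits, with directed (outward) rounding:
down = Context(prec=4, rounding=ROUND_FLOOR)
up = Context(prec=4, rounding=ROUND_CEILING)

def iadd(x, y):
    """Interval addition: round the lower bound down and the upper bound up."""
    return (down.add(x[0], y[0]), up.add(x[1], y[1]))

def idiv_pos(x, y):
    """Interval division, assuming all endpoints are positive."""
    return (down.divide(x[0], y[1]), up.divide(x[1], y[0]))

one = (Decimal(1), Decimal(1))
s = iadd(one, one)                       # 1 + 1
for d in (2, 6, 24, 120):                # + 1/2! + 1/3! + 1/4! + 1/5!
    s = iadd(s, idiv_pos(one, (Decimal(d), Decimal(d))))

# remainder term (1/6!) e^xi, with e^xi replaced by [1, 3]:
r = idiv_pos(one, (Decimal(720), Decimal(720)))
r = (down.multiply(r[0], Decimal(1)), up.multiply(r[1], Decimal(3)))
s = iadd(s, r)
print(s)
```

Running this reproduces the enclosure $[2.716, 2.723]$ obtained above.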

Proofs of the theorems, as well as greater detail, appear in various texts on interval arithmetic. A good book on interval arithmetic is R. E. Moore's classic text [27], although numerous more recent monographs and reviews are available. A World Wide Web search on the term "interval computations" will lead to some of these.

A general introduction to interval computations is [26]. That work gives not only a complete introduction, with numerous examples and explanation of pitfalls, but also provides examples with intlab, a free matlab toolbox for interval computations, and reference material for intlab. If you have matlab available, we recommend intlab for the exercises in this book involving interval computations.

1.4 Programming Environments

Modern scientific computing (with floating point numbers) is usually done with high-level "imperative" (as opposed to "functional") programming languages. Common programming environments in use for general scientific computing today are Fortran (FORmula TRANslation), C/C++, and matlab. Fortran is the original such language, with its origins in the late 1950's. There is a large body of high-quality, publicly available software in Fortran for common computations in numerical analysis and scientific computing. Such software can be found, for example, on NETLIB, at

http://www.netlib.org

Fortran has evolved over the years, becoming a modern, multi-faceted language with the Fortran 2003 standard. Throughout, the emphasis by both the standardization committee and suppliers of compilers for Fortran has been on features that simplify programming of solutions to large problems in numerical analysis and scientific computing, and on features that enable high performance,

especially on computers that can process vectors and matrices efficiently.

The "C" language, originally developed in conjunction with the Unix operating system, was originally meant to be a higher-level language for designing and accessing the operating system, but has become more ubiquitous since then. C++, appearing in the late 1980's, was the first widely available language⁷ to allow the object-oriented programming paradigm. In recent years, computer science departments have favored teaching C++ over teaching Fortran, and Fortran has fallen out of favor, in relative terms. However, Fortran is still favored in certain large-scale applications, such as fluid dynamics (e.g., in weather prediction and similar simulations), and some courses are still offered in it in engineering schools. However, some people still think of Fortran as the now somewhat rudimentary language known as FORTRAN 77.

Reasonably high-quality compilers for both Fortran and C/C++ are available free of charge with Linux operating systems. Fortran 2003, largely implemented in these compilers, has a standardized interface to C, so functions written in C can be called from Fortran programs, and vice versa. These compilers include interactive graphical-user-interface-oriented debuggers, such as "insight," available with the Linux operating system. Commercially available compilation and debugging systems are also available under Windows.

The matlab system has become increasingly popular over the last two decades or so. matlab (or MATrix LABoratory) began in the early 1980's as a National Science Foundation project, written by Cleve Moler in FORTRAN 66, to provide an interactive environment for computing with matrices and vectors; it has since evolved to be both an interactive environment and a full-featured programming language. matlab is highly favored in courses such as this one because of the ease of programming, debugging, and general use (such as graphing), and because of the numerous toolboxes, supplied by both Mathworks (Cleve Moler's company) and others, for many basic computing tasks and applications. The main drawback to use of matlab in all scientific computations is that the language is interpretive; that is, matlab translates each line of a program to machine language each time that it executes the line. This makes complicated programs that involve nested iterations much slower (a factor of 60 or more) than comparable programs written in a compiled language such as C or Fortran. However, functions compiled from Fortran or C/C++ can be called from matlab. A common strategy has been to initially develop algorithms in matlab, then translate all or part of the program to a compilable language, as necessary for efficiency.

One perceived disadvantage of matlab is that it is proprietary. Undesirable possible consequences are that it is not free, and that there is no official guarantee that it will be available, forever unchanged. However, its use has become so widespread in recent years that these concerns do not seem to be major. Several projects, including "Octave" and "Scilab," have produced free products that partially support the matlab programming language. The most widely distributed of these, "Octave," is integrated into Linux systems. However, the object-oriented features of Octave are rudimentary compared to those of matlab, and some toolboxes, such as intlab (which we will mention later), will not function with Octave.

⁷With others, including Fortran, to follow.


Alternative systems sometimes used for scientific computing are computer algebra systems. Perhaps the most common of these are Mathematica and Maple, while a free such system under development is "SAGE." These systems admit a different way of thinking about programming, termed functional programming, in which rules are defined and available at all times, to automatically simplify expressions that are presented to the system. (In contrast, in imperative programming, a sequence of commands is executed one after the other.) Although these systems have become comprehensive, they are based in computations of a different character, rather than in the floating point computations and linear algebra involved in numerical analysis and scientific computing.

We will use matlab in this book to illustrate the concepts, techniques, and applications. With newer versions of matlab, a student can study how to use the system and make programs largely by using the matlab help system. The first place to turn will be the "Getting started" demos, which in newer versions are presented as videos. There are also many books devoted to the use of matlab. Furthermore, we will be giving examples throughout this book.

matlab programs can be written as matlab scripts and matlab functions.

Example 1.23
The matlab script we used to produce the table following Example 1.7 (on page 8) is:

a = 2;
x = 2;
xold = x;
err_old = 1;
for k = 0:10
    k
    x
    err = x - sqrt(2);
    err
    ratio = err / err_old^2
    err_old = err;
    x = x/2 + 1/x;
end

Example 1.24
The matlab script we used to produce the table in Example 1.8 (on page 9) is:

format long
a = 2;
x = 2;
xold = x;
err_old = 1;
for k = 0:25
    k
    x
    err = x - sqrt(2);
    err
    ratio = err / err_old
    err_old = err;
    x = x - x^2/3.5 + 2/3.5;
end

An excellent alternative textbook that focuses on matlab functions is Cleve Moler's Numerical Computing with MATLAB [25]. An on-line version, along with ".m" files, etc., is currently available at http://www.mathworks.com/moler/chapters.html.

1.5 Applications

The purpose of the methods and techniques in this book, ultimately, is to provide both accurate predictions and insight into practical problems. This includes understanding, predicting, and managing or controlling the evolution of ecological systems and epidemics; designing and constructing durable but inexpensive bridges, buildings, roads, and water control structures; understanding chemical and physical processes; designing chemical plants and electronic components and systems; minimizing costs or maximizing delivery of products or services within companies and governments; etc. To achieve these goals, the numerical methods are a small part of the overall modeling process, which can be viewed as consisting of the following steps:

Identify the problem. This is the first step in translating an often vague situation into a mathematical problem to be solved. What questions must be answered, and how can they be quantified?

Assumptions. Which factors are to be ignored, and which are important? The real world is usually significantly more complicated than mathematical models of it, and simplifications must be made, because some factors are poorly understood, because there isn't enough data to determine some minor factors, or because it is not practical to accurately solve the resulting equations unless the model is simplified. For example, the theory of relativity and variations in the acceleration of gravity, due to the fact that the earth is not exactly round and due to the fact that the density differs from point to point on the surface of the earth, in principle will affect the trajectory of a baseball as it leaves the bat. However, such effects can be ignored when we write down a model of the trajectory of the baseball. On the other hand, we need to include such effects if we are measuring the change in distance between two satellites in a tandem orbit, to detect the location of mineral deposits on the surface of the earth.

Construction. In this step, we actually translate the problem into mathematical language.

Analysis. We solve the mathematical problem. Here is where the numerical techniques in this book come into play. With more complicated models, there is an interplay between the previous three steps and this solution process: we may need to simplify the process to enable practical solution. Also, presentation of the result is important here, to maximize the usefulness of the results. In the early days of scientific computing, printouts of numbers were used, but increasingly, results are presented as two- and three-dimensional graphs and movies.

Interpretation. The numerical solution is compared to the original problem. If it does not make sense, go back and reformulate the assumptions.

Validation. Compare the model to real data. For example, in climate models, the model might be used to predict climate changes in past years, before it is used to predict future climate changes.

Note that there is an intrinsic error introduced in the modeling process (such as when certain phenomena are ignored) that is outside the scope of our study of numerical methods. Such error can only be measured indirectly, through the interpretation and validation steps. In the model solution process (the "analysis" step), errors are also introduced due to roundoff error and the approximation process. We have seen that such error consists of approximation error and roundoff error. In a study of numerical methods and numerical analysis, we quantify and find bounds on such errors. Although this may not be the major source of error, it is important to know. Consequences of this type of error might be that a good model is rejected, that incorrect conclusions are deduced about the process being modeled, etc. The authors of this book have personal experience with these events.

Errors in the modeling process can sometimes be quantified in the solution process. If the model depends on parameters that are not known precisely, but bounds on those parameters are known, knowledge of these bounds can sometimes be incorporated into the mathematical equations, and the set of possible solutions can sometimes be computed or bounded. One tool that sometimes works is interval arithmetic. Other tools, less mathematically definite but applicable in different situations, are statistical methods and computing solutions to the model for many different values of the parameter.

Throughout this book, we introduce applications from many areas.

Example 1.25
The formula for the net capacitance when two capacitors of values $x$ and $y$ are connected in series is

$$z = \frac{xy}{x + y}.$$

Suppose the measured values of $x$ and $y$ are $x = 1$ and $y = 2$, respectively. Estimate the range of possible values of $z$, given that the true values of $x$ and $y$ are known to be within $\pm 10\%$ of the measured value.

In this example, the identification, assumptions, and construction have already been done. (It is well known how capacitances in a linear electrical circuit behave.) We are asked to analyze the error in the output of the computation, due to errors in the data. We may proceed using interval arithmetic, relying on the accuracy assumptions for the measured values. In particular, these assumptions imply that $x \in [0.9, 1.1]$ and $y \in [1.8, 2.2]$. We will plug these intervals into the expression for $z$, but we first use Theorem 1.8, part (2), as a guide to rewrite the expression for $z$ so $x$ and $y$ only occur once. (We do this so we obtain sharp bounds on the range, without overestimation.) Dividing the numerator and denominator for $z$ by $xy$, we obtain

$$z = \frac{1}{\dfrac{1}{x} + \dfrac{1}{y}}.$$

We use the intlab toolbox⁸ for matlab to evaluate $z$. We have the following dialog in matlab's command window:

>> intvalinit('DisplayInfsup')
===> Default display of intervals by infimum/supremum
>> x = intval('[0.9,1.1]')
intval x =
[    0.8999,    1.1001]
>> y = intval('[1.8,2.2]')
intval y =
[    1.7999,    2.2001]
>> z = 1/(1/x + 1/y)
intval z =
[    0.5999,    0.7334]
>> format long
>> z
intval z =
[   0.59999999999999,   0.73333333333334]
>>

⁸If one has matlab, intlab is available free of charge for non-commercial use from http://www.ti3.tu-harburg.de/~rump/intlab.

Thus, the capacitance must lie between 0.5999 and 0.7334. Note that $x$ and $y$ are input as strings. This is to assure that roundoff errors in converting the decimal expressions 0.9, 1.1, 1.8, and 2.2 into internal binary format are taken into account. See [26] for more examples of the use of intlab.
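For readers without intlab, the same single-use evaluation can be sketched with plain tuples in Python. Outward rounding is ignored here, so the endpoints are only ordinary floating point approximations of the exact bounds $0.6$ and $0.7\overline{3}$; the function names are ours:

```python
def recip(iv):
    """Reciprocal of an interval (lo, hi) that does not contain zero."""
    lo, hi = iv
    return (1.0 / hi, 1.0 / lo)

def iadd(a, b):
    """Interval addition."""
    return (a[0] + b[0], a[1] + b[1])

x = (0.9, 1.1)
y = (1.8, 2.2)
# Single-use form z = 1/(1/x + 1/y): x and y each occur once,
# so the interval result is the exact range (up to ordinary rounding).
z = recip(iadd(recip(x), recip(y)))
print(z)
```

The exact endpoints are $z_{\min} = (0.9)(1.8)/2.7 = 0.6$ and $z_{\max} = (1.1)(2.2)/3.3 = 0.7\overline{3}$, matching the intlab enclosure above.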

1.6 Exercises

1. Write down a polynomial $p(x)$ such that $|S(x) - p(x)| \le 10^{-10}$ for $-0.2 \le x \le 0.2$, where

$$S(x) = \begin{cases} \dfrac{\sin(x)}{x} & \text{if } x \ne 0, \\[1ex] 1 & \text{if } x = 0. \end{cases}$$

Note: $\operatorname{sinc}(x) = S(\pi x) = \sin(\pi x)/(\pi x)$ is the "sinc" function (well-known in signal processing, etc.).

(a) Show that your polynomial $p$ satisfies the condition $|\operatorname{sinc}(x) - p(x)| \le 10^{-10}$ for $x \in [-0.2, 0.2]$.
Hint: You can obtain polynomial approximations with error terms for $\operatorname{sinc}(x)$ by writing down Taylor polynomials and corresponding error terms for $\sin(x)$, then dividing these by $x$. This can be easier than trying to differentiate $\operatorname{sinc}(x)$. For the proof part, you can use, for example, the Taylor polynomial remainder formula or the alternating series test, and you can use interval arithmetic to obtain bounds.

(b) Plot your polynomial approximation and $\operatorname{sinc}(x)$ on the same graph,

 (i) over the interval $[-0.2, 0.2]$;

 (ii) over the interval $[-3, 3]$;

 (iii) over the interval $[-10, 10]$.

2. Suppose $f$ has a continuous third derivative. Show that

$$\left|\frac{f(x + h) - f(x - h)}{2h} - f'(x)\right| = O(h^2).$$

3. Suppose $f$ has a continuous fourth derivative. Show that

$$\left|\frac{f(x + h) - 2f(x) + f(x - h)}{h^2} - f''(x)\right| = O(h^2).$$


4. Let $a = 0.41$, $b = 0.36$, and $c = 0.7$. Assuming a 2-digit decimal computer arithmetic with rounding, show that

$$\frac{a - b}{c} \ne \frac{a}{c} - \frac{b}{c}$$

when using this arithmetic.

5. Write down a formula relating the unit roundoff $\delta$ of Definition 1.3 (page 14) and the machine epsilon $\epsilon_m$ defined on page 20.

6. Store and run the following matlab script. What are your results? What does the script compute? Can you say anything about the computer arithmetic underlying matlab?

eps = 1;
x = 1 + eps;
while (x ~= 1)
    eps = eps/2;
    x = 1 + eps;
end
eps = eps + (2*eps)^2;
y = 1 + eps;
y - 1

7. Suppose, for illustration, we have a system with base $\beta = 10$, $t = 3$ decimal digits in the mantissa, and $L = -9$, $U = 9$ for the exponent. For example, $0.123 \times 10^4$, that is, $1230$, is a machine number in this system. Suppose also that "round to nearest" is used in this system.

(a) What is HUGE for this system?

(b) What is TINY for this system?

(c) What is the machine epsilon $\epsilon_m$ for this system?

(d) Let $f(x) = \sin(x) + 1$.

 i. Write down $\mathrm{fl}(f(0))$ and $\mathrm{fl}(f(0.0008))$ in normalized format for this toy system.

 ii. Compute $\mathrm{fl}(\mathrm{fl}(f(0.0008)) - \mathrm{fl}(f(0)))$. On the other hand, what is the nearest machine number to the exact value of $f(0.0008) - f(0)$?

 iii. Compute $\mathrm{fl}(\mathrm{fl}(f(0.0008)) - \mathrm{fl}(f(0)))/\mathrm{fl}(0.0008)$. Compare this to the nearest machine number to the exact value of $(f(0.0008) - f(0))/0.0008$, and to $f'(0)$.

8. Let $f(x) = \ln(x + 1) - \ln(x)$.

(a) Use four-digit decimal arithmetic with rounding to evaluate $f(100{,}000)$.


(b) Use the Mean Value Theorem to approximate $f(x)$ in a form that avoids the loss of significant digits. Use this form to evaluate $f(x)$ for $x = 100{,}000$ once again.

(c) Compare the relative errors for the answers obtained in (a) and (b).

9. Compute the condition number of $f(x) = e^{\sqrt{x^2 - 1}}$, $x > 1$, and discuss any possible ill-conditioning.

10. Let $f(x) = (\sin(x))^2 + x^2$. Use interval arithmetic to prove that there are no solutions to $f(x) = 0$ for $x \in [-1, -0.8]$.


for $h \le 1$; that is, $\epsilon = 1$ and $C = e^2$ work.

Example 1.6
Show that

$$\left|\frac{f(x + h) - f(x)}{h} - f'(x)\right| = O(h)$$

for $x, x + h \in [a, b]$, assuming that $f$ has two continuous derivatives at each point in $[a, b]$.

PROOF

$$\left|\frac{f(x + h) - f(x)}{h} - f'(x)\right| = \left|\frac{f(x) + f'(x)h + \displaystyle\int_x^{x+h} (x + h - t) f''(t)\,dt - f(x)}{h} - f'(x)\right|$$

$$= \left|\frac{1}{h}\int_x^{x+h} (x + h - t) f''(t)\,dt\right| \le \max_{a \le t \le b} |f''(t)|\,\frac{h}{2} = ch.$$

1.1.3 Convergence Rates

DEFINITION 1.2 Let $\{x_k\}$ be a sequence with limit $x^*$. If there are constants $C$ and $\alpha$ and an integer $N$ such that $|x_{k+1} - x^*| \le C|x_k - x^*|^{\alpha}$ for $k \ge N$, we say that the rate of convergence is of order at least $\alpha$. If $\alpha = 1$ (with $C < 1$), the rate is said to be linear. If $\alpha = 2$, the rate is said to be quadratic.

Example 1.7
A sequence sometimes learned in elementary classes for computing the square root of a number $a$ is

$$x_{k+1} = \frac{x_k}{2} + \frac{a}{2x_k}.$$


We have

$$x_{k+1} - \sqrt{a} = \frac{x_k}{2} + \frac{a}{2x_k} - \sqrt{a} = \frac{x_k^2 - 2\sqrt{a}\,x_k + a}{2x_k} = \frac{1}{2x_k}\left(x_k - \sqrt{a}\right)^2 \approx \frac{1}{2\sqrt{a}}\left(x_k - \sqrt{a}\right)^2,$$

thus showing that the convergence rate is quadratic.

Quadratic convergence is very fast. We can think of quadratic convergence, with $C \approx 1$, as doubling the number of significant figures on each iteration. (In contrast, linear convergence with $C = 0.1$ adds one decimal digit of accuracy to the approximation on each iteration.) For example, if we use the square root computation from Example 1.7 with $a = 2$ and starting with $x_0 = 2$, we obtain the following table:

k    x_k                   x_k − √2          (x_k − √2)/(x_{k−1} − √2)²
0    2                     0.5858 × 10⁰      —
1    1.5                   0.8579 × 10⁻¹     0.2500
2    1.416666666666667     0.2453 × 10⁻²     0.3333
3    1.414215686274510     0.2123 × 10⁻⁵     0.3529
4    1.414213562374690     0.1594 × 10⁻¹¹    0.3535
5    1.414213562373095     0.2204 × 10⁻¹⁷    —

In this table, the correct digits are underlined. The table illustrates that the total number of correct digits more than doubles on each iteration. In fact, the multiplying factor $C$ for the quadratic convergence appears to be approaching $0.3535$. (The last error ratio is not meaningful in this sense, because only roughly 16 digits were carried in the computation.) Based on our analysis, the limiting value of $C$ should be about $1/(2\sqrt{2}) \approx 0.35355$. (We explain how we computed the table at the end of this chapter.)
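The iteration and its error ratios can be reproduced in a few lines; here is a sketch in Python rather than matlab (variable names are ours):

```python
import math

# Example 1.7 with a = 2: x_{k+1} = x_k/2 + a/(2 x_k); for a = 2, a/(2 x_k) = 1/x_k.
x = 2.0
err_old = x - math.sqrt(2.0)
ratios = []
for k in range(5):
    x = x / 2.0 + 1.0 / x
    err = x - math.sqrt(2.0)
    ratios.append(err / err_old**2)   # should approach 1/(2*sqrt(2)) ~ 0.3535
    err_old = err
print(x)           # converged to sqrt(2) to machine precision
print(ratios[:3])  # approximately 0.25, 0.3333, 0.3529, as in the table
```

As in the table, the last ratios are not meaningful, since IEEE double precision carries only about 16 digits.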


Example 1.8
As an example of linear convergence, consider the iteration

$$x_{k+1} = x_k - \frac{x_k^2}{3.5} + \frac{2}{3.5}$$

for finding $\sqrt{2}$. We obtain the following table:

k    x_k                   x_k − √2          (x_k − √2)/(x_{k−1} − √2)
0    2                     0.5858 × 10⁰      —
1    1.428571428571429     0.1436 × 10⁻¹     0.2451 × 10⁻¹
2    1.416909620991254     0.2696 × 10⁻²     0.1878
3    1.414728799831946     0.5152 × 10⁻³     0.1911
4    1.414312349239392     0.9879 × 10⁻⁴     0.1917
5    1.414232514607664     0.1895 × 10⁻⁴     0.1918
6    1.414217198786659     0.3636 × 10⁻⁵     0.1919
7    1.414214260116949     0.6977 × 10⁻⁶     0.1919
8    1.414213696254626     0.1339 × 10⁻⁶     0.1919
⋮
19   1.414213562373097     0.1554 × 10⁻¹⁴    —

Here, the constant $C$ in the linear convergence, to four significant digits, appears to be $0.1919 \approx 1/5$. That is, the error is reduced by approximately a factor of 5 on each iteration. We can think of this as obtaining one more correct base-5 digit on each iteration.
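This linearly convergent iteration is just as easy to reproduce; a sketch in Python rather than matlab (names are ours):

```python
import math

# Example 1.8: x_{k+1} = x_k - x_k^2/3.5 + 2/3.5, a linearly convergent
# iteration for sqrt(2); the error ratio tends to C ~ 0.1919.
x = 2.0
err_old = x - math.sqrt(2.0)
ratio = None
for k in range(8):
    x = x - x**2 / 3.5 + 2.0 / 3.5
    err = x - math.sqrt(2.0)
    ratio = err / err_old
    err_old = err
print(x, ratio)   # x still only ~7 digits correct after 8 iterations; ratio near 0.1919
```

The limiting ratio is $|1 - 2\sqrt{2}/3.5| \approx 0.19188$, consistent with the table.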

1.2 Computer Arithmetic

In numerical solution of mathematical problems, two common types of error are:

1. Method (algorithm or truncation) error. This is the error due to approximations made in the numerical method.

2. Rounding error. This is the error made due to the finite number of digits available on a computer.


Example 1.9
By the mean value theorem for integrals (Theorem 1.2, as in Example 1.6 on page 7), if $f \in C^2[a, b]$, then

$$f'(x) = \frac{f(x + h) - f(x)}{h} - \frac{1}{h}\int_x^{x+h} f''(t)(x + h - t)\,dt,$$

and

$$\left|\frac{1}{h}\int_x^{x+h} f''(t)(x + h - t)\,dt\right| \le ch.$$

Thus, $f'(x) \approx (f(x + h) - f(x))/h$, and the error is $O(h)$. We will call this the method error, or truncation error, as opposed to roundoff errors due to using machine approximations.

Now consider $f(x) = \ln x$, and approximate $f'(3) \approx \dfrac{\ln(3 + h) - \ln 3}{h}$ for $h$ small, using a calculator having 11 digits. The following results were obtained:

h        (ln(3 + h) − ln(3))/h    Error = 1/3 − (ln(3 + h) − ln(3))/h
10⁻¹     0.3278982                5.44 × 10⁻³
10⁻²     0.332779                 5.54 × 10⁻⁴
10⁻³     0.3332778                5.55 × 10⁻⁵
10⁻⁴     0.333328                 5.33 × 10⁻⁶
10⁻⁵     0.333330                 3.33 × 10⁻⁶
10⁻⁶     0.333300                 3.33 × 10⁻⁵
10⁻⁷     0.333                    3.33 × 10⁻⁴
10⁻⁸     0.33                     3.33 × 10⁻³
10⁻⁹     0.3                      3.33 × 10⁻²
10⁻¹⁰    0.0                      3.33 × 10⁻¹

One sees that, in the first four steps, the error decreases by a factor of 10 as $h$ is decreased by a factor of 10. (That is, the method error dominates.) However, starting with $h = 10^{-5}$, the error increases. (The error due to a finite number of digits, i.e., roundoff error, dominates.)

There are two possible ways to reduce rounding error:

1. The method error can be reduced by using a more accurate method. This allows larger $h$ to be used, thus avoiding roundoff error. Consider

$$f'(x) = \frac{f(x + h) - f(x - h)}{2h} + \text{error}, \quad \text{where the error is } O(h^2):$$

h        (ln(3 + h) − ln(3 − h))/(2h)    error
0.1      0.3334568                       1.24 × 10⁻⁴
0.01     0.3333345                       1.23 × 10⁻⁶
0.001    0.3333333                       1.91 × 10⁻⁸


The error decreases by a factor of 100 as $h$ is decreased by a factor of 10.

2. Rounding error can be reduced by using more digits of accuracy, such as using double precision (or multiple precision) arithmetic.
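Both effects, the $O(h)$ versus $O(h^2)$ method error and the eventual dominance of roundoff, can be observed directly. The following Python sketch uses IEEE double precision rather than the 11-digit calculator above, so the crossover where roundoff takes over occurs at smaller $h$:

```python
import math

x = 3.0
exact = 1.0 / 3.0   # f'(3) for f(x) = ln(x)

errs_fwd = {}   # forward difference: O(h) method error
errs_ctr = {}   # central difference: O(h^2) method error
for k in range(1, 11):
    h = 10.0 ** (-k)
    errs_fwd[k] = abs(exact - (math.log(x + h) - math.log(x)) / h)
    errs_ctr[k] = abs(exact - (math.log(x + h) - math.log(x - h)) / (2 * h))
    print(f"h=1e-{k}: fwd err={errs_fwd[k]:.2e}, ctr err={errs_ctr[k]:.2e}")
```

For moderate $h$ the forward-difference error shrinks by a factor of 10 per step and the central-difference error by a factor of 100, matching the tables; for very small $h$ the printed errors stop decreasing as roundoff dominates.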

To fully understand and avoid roundoff error, we should study some details of how computers and calculators represent and work with approximate numbers.

1.2.1 Floating Point Arithmetic and Rounding Error

Let $\beta$ be a positive integer, the base of the computer system. (Usually $\beta = 2$ (binary) or $\beta = 16$ (hexadecimal).) Suppose a number $x$ has the exact base-$\beta$ representation

$$x = \pm(0.\alpha_1\alpha_2\alpha_3\cdots\alpha_t\alpha_{t+1}\cdots)\,\beta^m = \pm\,q\,\beta^m,$$

where $q$ is the mantissa, $\beta$ is the base, $m$ is the exponent, $1 \le \alpha_1 \le \beta - 1$, and $0 \le \alpha_i \le \beta - 1$ for $i > 1$.

On a computer, we are restricted to a finite set of floating-point numbers $F = F(\beta, t, L, U)$ of the form $x^* = \pm(0.a_1a_2\cdots a_t)\,\beta^m$, where $1 \le a_1 \le \beta - 1$, $0 \le a_i \le \beta - 1$ for $2 \le i \le t$, $L \le m \le U$, and $t$ is the number of digits. (In most floating point systems, $L$ is about $-64$ to $-1000$, and $U$ is about $64$ to $1000$.)

Example 1.10
(binary) $\beta = 2$:

$$x^* = (0.1011)_2 \times 2^3 = \left(1 \times \frac{1}{2} + 0 \times \frac{1}{4} + 1 \times \frac{1}{8} + 1 \times \frac{1}{16}\right) \times 8 = \frac{11}{2} = 5.5 \text{ (decimal)}.$$
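A one-line check of Example 1.10, as a Python sketch (variable names are ours):

```python
# Example 1.10: the machine number x* = (0.1011)_2 x 2^3.
bits = "1011"   # mantissa digits a1 a2 a3 a4
m = 3           # exponent
# digit a_i contributes a_i * 2^(-i); the whole mantissa is then scaled by 2^m
value = sum(int(b) * 2.0 ** (-(i + 1)) for i, b in enumerate(bits)) * 2 ** m
print(value)  # 5.5
```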

REMARK 1.1 Most numbers cannot be exactly represented on a computer. Consider $x = 10.1 = (1010.0001\,1001\,1001\ldots)_2$. If $L = -127$, $U = 127$, $t = 24$, and $\beta = 2$, then $x \approx x^* = (0.1010\,0001\,1001\,1001\,1001\,1001)_2 \times 2^4$.

Question: Given a real number $x$, how do we define a floating point number $\mathrm{fl}(x)$ in $F$ such that $\mathrm{fl}(x)$ is close to $x$? On modern machines, one of the following four ways is used to approximate a real number $x$ by a machine-representable number $\mathrm{fl}(x)$:

round down: $\mathrm{fl}(x) = x{\downarrow}$, the nearest machine representable number to the real number $x$ that is less than or equal to $x$;


round up: $\mathrm{fl}(x) = x{\uparrow}$, the nearest machine number to the real number $x$ that is greater than or equal to $x$;

round to nearest: $\mathrm{fl}(x)$ is the nearest machine number to the real number $x$;

round to zero, or "chopping": $\mathrm{fl}(x)$ is the nearest machine number to the real number $x$ that is closer to 0 than $x$. The term "chopping" is because we simply "chop" the expansion of the real number; that is, we simply ignore the digits in the expansion of $x$ beyond the $t$-th one.

The default on modern systems is usually round to nearest, although chopping is faster or requires less circuitry. Round down and round up may be used, but with care, to produce results from a string of computations that are guaranteed to be less than or greater than the exact result.

Example 1.11
$\beta = 10$, $t = 5$, $x = 0.12345666\cdots \times 10^7$. Then

$\mathrm{fl}(x) = 0.12345 \times 10^7$ (chopping),
$\mathrm{fl}(x) = 0.12346 \times 10^7$ (round to nearest).

(In this case, round down corresponds to chopping, and round up corresponds to round to nearest.)

See Figure 1.2 for an example with $\beta = 10$ and $t = 1$. In that figure, the exhibited floating point numbers are $(0.1) \times 10^1$, $(0.2) \times 10^1$, $\ldots$, $(0.9) \times 10^1$, $0.1 \times 10^2$.

[FIGURE 1.2: An example floating point system: $\beta = 10$, $t = 1$, and $m = 1$. The figure marks the successive floating point numbers from $\beta^{m-1} = 1$ up to $\beta^m = 10^1$, spaced $\beta^{m-t} = 10^0 = 1$ apart.]

Example 1.12
Let $a = 0.410$, $b = 0.000135$, and $c = 0.000431$. Assuming 3-digit decimal computer arithmetic with rounding to nearest, does $a + (b + c) = (a + b) + c$ when using this arithmetic?


Following the "rounding to nearest" definition of fl, we emulate the operations a machine would do, as follows:

$a \leftarrow 0.410 \times 10^0$, $b \leftarrow 0.135 \times 10^{-3}$, $c \leftarrow 0.431 \times 10^{-3}$,

and

$\mathrm{fl}(b + c) = \mathrm{fl}(0.135 \times 10^{-3} + 0.431 \times 10^{-3}) = \mathrm{fl}(0.566 \times 10^{-3}) = 0.566 \times 10^{-3}$,

so

$\mathrm{fl}(a + 0.566 \times 10^{-3}) = \mathrm{fl}(0.410 \times 10^0 + 0.566 \times 10^{-3}) = \mathrm{fl}(0.410 \times 10^0 + 0.000566 \times 10^0) = \mathrm{fl}(0.410566 \times 10^0) = 0.411 \times 10^0$.

On the other hand,

$\mathrm{fl}(a + b) = \mathrm{fl}(0.410 \times 10^0 + 0.135 \times 10^{-3}) = \mathrm{fl}(0.410000 \times 10^0 + 0.000135 \times 10^0) = \mathrm{fl}(0.410135 \times 10^0) = 0.410 \times 10^0$,

so

$\mathrm{fl}(0.410 \times 10^0 + c) = \mathrm{fl}(0.410 \times 10^0 + 0.431 \times 10^{-3}) = \mathrm{fl}(0.410 \times 10^0 + 0.000431 \times 10^0) = \mathrm{fl}(0.410431 \times 10^0) = 0.410 \times 10^0 \ne 0.411 \times 10^0$.

Thus, the associative law does not hold for floating point arithmetic with "round to nearest." Furthermore, this illustrates that accuracy is improved if numbers of like magnitude are added first in a sum.
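The 3-digit computation of Example 1.12 can be emulated with Python's `decimal` module, whose context precision counts significant digits just as $t$ does here (names are ours):

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

# Emulate 3-digit decimal arithmetic with round-to-nearest (Example 1.12).
c3 = Context(prec=3, rounding=ROUND_HALF_EVEN)
a, b, c = Decimal("0.410"), Decimal("0.000135"), Decimal("0.000431")

left = c3.add(a, c3.add(b, c))    # a + (b + c) -> 0.411
right = c3.add(c3.add(a, b), c)   # (a + b) + c -> 0.410
print(left, right)
```

The two groupings give $0.411$ and $0.410$, as in the hand computation above.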

The following error bound is useful in some analyses

THEOREM 1.5

$$|x - \mathrm{fl}(x)| \le \frac{p}{2}\,|x|\,\beta^{1-t},$$

where $p = 1$ for rounding and $p = 2$ for chopping.


DEFINITION 1.3 $\delta = \dfrac{p}{2}\,\beta^{1-t}$ is called the unit roundoff error.

Let $\epsilon = \dfrac{\mathrm{fl}(x) - x}{x}$. Then $\mathrm{fl}(x) = (1 + \epsilon)x$, where $|\epsilon| \le \delta$. With this, we have the following.

THEOREM 1.6
Let $\odot$ denote the operation $+$, $-$, $\times$, or $\div$, and let $x$ and $y$ be machine numbers. Then

$$\mathrm{fl}(x \odot y) = (x \odot y)(1 + \epsilon), \quad \text{where } |\epsilon| \le \delta = \frac{p}{2}\,\beta^{1-t}.$$

Roundoff error that accumulates as a result of a sequence of arithmetic operations can be analyzed using this theorem. Such an analysis is called forward error analysis.

Because of the properties of floating point arithmetic it is unreasonable todemand strict tolerances when the exact result is too large

Example 1.13
Suppose $\beta = 10$ and $t = 3$ (3-digit decimal arithmetic), and suppose we wish to compute $10^4\pi$ with a computed value $x$ such that $|10^4\pi - x| < 10^{-2}$. The closest floating point number in our system to $10^4\pi$ is $x = 0.314 \times 10^5 = 31400$. However, $|10^4\pi - x| = 15.926\ldots$. Hence, it is impossible to find a number $x$ in the system with

$$|10^4\pi - x| < 10^{-2}.$$

The error $|10^4\pi - x|$ in this example is called the absolute error in approximating $10^4\pi$. We see that absolute error is not an appropriate measure of error when using floating point arithmetic. For this reason, we use relative error.

DEFINITION 1.4 Let $x^*$ be an approximation to $x$. Then $|x - x^*|$ is called the absolute error, and $\left|\dfrac{x - x^*}{x}\right|$ is called the relative error.

For example,

$$\left|\frac{x - \mathrm{fl}(x)}{x}\right| \le \delta = \frac{p}{2}\,\beta^{1-t} \quad \text{(unit roundoff error)}.$$

1.2.1.1 Expression Evaluation and Condition Numbers

We now examine some common situations in which roundoff error can become large, and explain how to avoid many of these situations.


Example 1.14
$\beta = 10$, $t = 4$, $p = 1$. (Thus, $\delta = \frac{1}{2} \times 10^{-3} = 0.0005$.) Let $x = 0.5795 \times 10^5$ and $y = 0.6399 \times 10^5$. Then

$\mathrm{fl}(x + y) = 0.1219 \times 10^6 = (x + y)(1 + \epsilon_1)$, $\epsilon_1 \approx -3.28 \times 10^{-4}$, $|\epsilon_1| < \delta$, and

$\mathrm{fl}(xy) = 0.3708 \times 10^{10} = (xy)(1 + \epsilon_2)$, $\epsilon_2 \approx -5.95 \times 10^{-5}$, $|\epsilon_2| < \delta$.

(Note: $x + y = 0.12194 \times 10^6$, $xy = 0.37082205 \times 10^{10}$.)

Example 115

Suppose β = 10 and t = 4 (4 digit arithmetic) suppose x1 = 10000 andx2 = x3 = middot middot middot = x1001 = 1 Then

fl (x1 + x2) = 10000

fl (x1 + x2 + x3) = 10000

fl

10011048573i=1

xi

= 10000

when we sum forward from x1 But going backwards

fl (x1001 + x1000) = 2

fl (x1001 + x1000 + x999) = 3

fl

11048573

i=1001

xi

= 11000

which is the correct sum

This example illustrates the point that large relative errors can occur when a very large number of small, almost-equal numbers is added to a large number. To avoid such large relative errors, one can sum from the smallest number to the largest number. However, this will not work if the numbers are all approximately equal. In such cases, one possibility is to group the numbers into sets of two adjacent numbers, summing two almost equal numbers together. One then groups those results into sets of two and sums these together, continuing until the total sum is reached. In this scheme, two almost equal numbers are always being summed, and the large relative error from repeatedly summing a small number to a large number is avoided.
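The forward/backward contrast in Example 1.15 can be reproduced by simulating 4-digit decimal rounding in software. The sketch below is in Python rather than the book's MATLAB, and the helper names `fl` and `fl_sum` are ours, not the book's:

```python
import math

def fl(v, t=4):
    """Round v to t significant decimal digits (simulated beta = 10 machine)."""
    if v == 0:
        return 0.0
    e = math.floor(math.log10(abs(v)))
    return round(v / 10**e, t - 1) * 10**e

def fl_sum(values):
    """Sum left to right, rounding after every addition, as the machine would."""
    s = 0.0
    for v in values:
        s = fl(s + v)
    return s

xs = [10000.0] + [1.0] * 1000   # x1 = 10000, x2 = ... = x1001 = 1

forward = fl_sum(xs)        # each fl(10000 + 1) rounds back to 10000
backward = fl_sum(xs[::-1]) # the 1's accumulate to 1000 first, then meet 10000
```

Here `forward` stays at 10000 while `backward` reaches the correct sum 11000. The same ordering effects occur in IEEE double precision, just at the 16th significant digit rather than the 4th.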


16 Applied Numerical Methods

Example 1.16

Let x1 = 15.314768 and x2 = 15.314899, with β = 10 and t = 6 (6-digit decimal accuracy). Then x2 − x1 ≈ fl(x2) − fl(x1) = 15.3149 − 15.3148 = 0.0001. Thus,

|(x2 − x1) − (fl(x2) − fl(x1))| / |x2 − x1| = (0.000131 − 0.0001) / 0.000131 ≈ 0.237,

that is, 23.7% relative error.

This example illustrates that large relative errors can occur when two nearly equal numbers are subtracted on a computer. Sometimes, an algorithm can be modified to reduce rounding error occurring from this source, as the following example illustrates.

Example 1.17

Consider finding the roots of ax² + bx + c = 0, where b² is large compared with |4ac|. The most common formula for the roots is

x = (−b ± √(b² − 4ac)) / (2a).

Consider x² + 100x + 1 = 0, with β = 10, t = 4, p = 2, and 4-digit chopped arithmetic. Then

x1 = (−100 + √9996)/2,  x2 = (−100 − √9996)/2,

and √9996 ≈ 99.97 (4-digit arithmetic, chopped). Thus

x1 ≈ (−100 + 99.97)/2,  x2 ≈ (−100 − 99.97)/2.

Hence x1 ≈ −0.015 and x2 ≈ −99.98, but x1 = −0.010001... and x2 = −99.989999..., so the relative errors in x1 and x2 are about 50% and 0.01%, respectively.

Let's change the algorithm. Assume b ≥ 0 (we can always arrange b ≥ 0). Then

x1 = (−b + √(b² − 4ac)) / (2a)
   = (−b + √(b² − 4ac))(−b − √(b² − 4ac)) / (2a(−b − √(b² − 4ac)))
   = −2c / (b + √(b² − 4ac)),

and

x2 = (−b − √(b² − 4ac)) / (2a)  (the same as before).


Then, for the above values,

x1 = −2(1) / (100 + √9996) ≈ −2 / (100 + 99.97) = −0.01000.

Now the relative error in x1 is also 0.01%.
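In IEEE double precision the same cancellation occurs, only at the 16th significant digit. A sketch of the rewritten formula in Python (the function name is ours; the sketch assumes real roots, a ≠ 0, and c ≠ 0):

```python
import math

def stable_roots(a, b, c):
    """Real roots of ax^2 + bx + c = 0, avoiding subtraction of nearly
    equal numbers (assumes b^2 - 4ac >= 0, a != 0, c != 0)."""
    d = math.sqrt(b * b - 4 * a * c)
    if b >= 0:
        x2 = (-b - d) / (2 * a)   # b and d have like signs: no cancellation
        x1 = (2 * c) / (-b - d)   # = -2c / (b + d), from x1 * x2 = c/a
    else:
        x1 = (-b + d) / (2 * a)
        x2 = (2 * c) / (-b + d)
    return x1, x2

x1, x2 = stable_roots(1.0, 100.0, 1.0)   # x^2 + 100x + 1 = 0
```

For x² + 100x + 1 = 0 this yields x1 ≈ −0.010001 and x2 ≈ −99.99, both accurate to full working precision.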

Let us now consider error in function evaluation. Consider a single valued function f(x), and let x* = fl(x) be the floating point approximation of x. Therefore, the machine evaluates f(x*) = f(fl(x)), which is an approximate value of f(x) at x = x*. The perturbation in f(x) for small perturbations in x can then be computed via Taylor's formula. This is illustrated in the next theorem.

THEOREM 1.7 The relative error in functional evaluation satisfies

|(f(x) − f(x*)) / f(x)| ≈ |x f′(x) / f(x)| · |(x − x*) / x|.

PROOF The linear Taylor approximation of f(x*) about f(x), for small values of |x − x*|, is given by f(x*) ≈ f(x) + f′(x)(x* − x). Rearranging the terms immediately yields the result.

This leads us to the following definition.

DEFINITION 1.5 The condition number of a function f(x) is

κ_f(x) = |x f′(x) / f(x)|.

The condition number describes how large the relative error in function evaluation is with respect to the relative error in the machine representation of x. In other words, κ_f(x) is a measure of the degree of sensitivity of the function at x.

Example 1.18

Let f(x) = √x. The condition number of f(x) about x is

κ_f(x) = |x · (1/(2√x)) / √x| = 1/2.

This suggests that f(x) is well-conditioned.


Example 1.19

Let f(x) = √(x − 2). The condition number of f(x) about x is

κ_f(x) = |x / (2(x − 2))|.

This is not defined at x* = 2. Hence, the function f(x) is numerically unstable and ill-conditioned for values of x close to 2.
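The blow-up of κ_f near x = 2 is easy to observe numerically; the sketch below is in Python, with f and f′ supplied analytically (the helper name `kappa` is ours):

```python
import math

def kappa(f, fprime, x):
    """Condition number |x * f'(x) / f(x)| of f at x."""
    return abs(x * fprime(x) / f(x))

# f(x) = sqrt(x): kappa is identically 1/2, i.e., well-conditioned everywhere
k_sqrt = kappa(math.sqrt, lambda x: 0.5 / math.sqrt(x), 9.0)

# f(x) = sqrt(x - 2): kappa = |x / (2(x - 2))| blows up as x approaches 2
f = lambda x: math.sqrt(x - 2.0)
fp = lambda x: 0.5 / math.sqrt(x - 2.0)
k_far = kappa(f, fp, 11.0)       # 11 / (2 * 9): mild, about 0.61
k_near = kappa(f, fp, 2.0001)    # about 10^4: ill-conditioned
```

A relative input error of 10^−16 at x = 2.0001 is thus amplified to roughly 10^−12 in the result, and the amplification grows without bound as x → 2.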

REMARK 1.2 If x = f(x) = 0, then the condition number is simply |f′(x)|. If x = 0, f(x) ≠ 0 (or f(x) = 0, x ≠ 0), then it is more useful to consider the relation between absolute errors than relative errors. The condition number then becomes |f′(x)/f(x)|.

REMARK 1.3 Generally, if a numerical approximation z* to a quantity z is computed, the relative error is related to the number of significant digits that are correct. For example, if z = 0.0000123453 and z* = 0.0000123454, we say that z* is correct to 5 significant digits. Expressing z as 0.123453 × 10^−4 and z* as 0.123454 × 10^−4, we see that if we round z* to the nearest number with five digits in its mantissa, all of those digits are correct, whereas if we do the same with six digits, the sixth digit is not correct. Significant digits is the more logical way to talk about accuracy in a floating point computation, where we are interested in relative error, rather than "number of digits after the decimal point," which can have a different meaning. (Here, one might say that z* is correct to 9 digits after the decimal point.)

1.2.2 Practicalities and the IEEE Floating Point Standard

Prior to 1985, different machines used different word lengths and different bases, and different machines rounded, chopped, or did something else to form the internal representation fl(x) for real numbers x. For example, IBM mainframes generally used hexadecimal arithmetic (β = 16), with 8 hexadecimal digits total (for the base, sign, and exponent) in "single precision" numbers and 16 hexadecimal digits total in "double precision" numbers. Machines such as the Univac 1108 and Honeywell Multics systems used base β = 2, with 36 binary digits (or "bits") total in single precision numbers and 72 binary digits total in double precision numbers. An unusual machine designed at Moscow State University from 1955–1965, the "Setun," even used base-3 (β = 3, or "ternary") numbers. Some computers had 32 bits total in single precision numbers and 64 bits total in double precision numbers, while some "supercomputers" (such as the Cray-1) had 64 bits total in single precision numbers and 128 bits total in double precision numbers.

Some hand-held calculators in existence today (such as some Texas Instruments calculators) can be viewed as implementing decimal (base 10, β = 10)


arithmetic, say, with L = −999, U = 999, and t = 14 digits in the mantissa.

Except for the Setun (the value of whose ternary digits corresponded to "positive," "negative," and "neutral" in circuit elements or switches), digital computers are mostly based on binary switches or circuit elements (that is, "on" or "off"), so the base β is usually 2 or a power of 2. For example, the IBM hexadecimal digit could be viewed as a group of 4 binary digits.²

Older floating point implementations did not even always fit exactly into the model we have previously described. For example, if x were a number in the system, then −x may not have been a number in the system, or 1/x may have been too large to be representable in the system.

To promote predictability, portability, reliability, and rigorous error bounding in floating point computations, the Institute of Electrical and Electronics Engineers (IEEE) and American National Standards Institute (ANSI) published a standard for binary floating point arithmetic in 1985: IEEE/ANSI 754-1985, Standard for Binary Floating Point Arithmetic, often referenced as "IEEE-754" or simply "the IEEE standard."³ Almost all computers in existence today, including personal computers and workstations based on Intel, AMD, Motorola, etc. chips, implement most of the IEEE standard.

In this standard, β = 2; 32 bits total are used in a single precision number (an "IEEE single"), and 64 bits total are used for a double precision number ("IEEE double"). In a single precision number, 1 bit is used for the sign, 8 bits are used for the exponent, and t = 23 bits are used for the mantissa. In double precision numbers, 1 bit is used for the sign, 11 bits are used for the exponent, and 52 bits are used for the mantissa. Thus, for single precision numbers, the stored exponent is between 0 and (11111111)₂ = 255, and 127 is subtracted from this to get an exponent between −127 and 128. In IEEE numbers, the minimum and maximum exponents are used to denote special symbols (such as infinity and "unnormalized" numbers), so the exponent in single precision represents magnitudes between 2^−126 ≈ 10^−38 and 2^127 ≈ 10^38. The mantissa for single precision numbers represents numbers between 2^0 = 1 and Σ_{i=0}^{23} 2^−i = 2(1 − 2^−24) ≈ 2. Similarly, the exponent for double precision numbers is effectively between 2^−1022 ≈ 10^−308 and 2^1023 ≈ 10^308, while the mantissa for double precision numbers represents numbers between 2^0 = 1 and Σ_{i=0}^{52} 2^−i ≈ 2.

Summarizing, the parameters for IEEE arithmetic appear in Table 1.1.

In many numerical computations, such as solving the large linear systems arising from partial differential equation models, more digits or a larger exponent range is required than is available with IEEE single precision. For

²An exception is in some systems for business calculations, where base 10 is implemented.
³An update to the 1985 standard was made in 2008. This update gives clarifications of certain ambiguous points, provides certain extensions, and specifies a standard for decimal arithmetic.


TABLE 1.1: Parameters for IEEE arithmetic

precision  β  L      U     t
single     2  −126   127   24
double     2  −1022  1023  53

this reason, many numerical analysts at present have adopted IEEE double precision as the default precision. For example, underlying computations in the popular computational environment matlab are done in IEEE double precision.

IEEE arithmetic provides four ways of defining fl(x), that is, four "rounding modes": "round down," "round up," "round to nearest," and "round to zero." The four elementary operations +, −, ×, and ÷, as well as √·, must be such that fl(x ⊙ y) is implemented for all four rounding modes, for ⊙ ∈ {+, −, ×, ÷}.

The default mode (if the rounding mode is not explicitly set) is normally "round to nearest," to give an approximation, after a long string of computations, that is hopefully near the exact value. If the mode is set to "round down" and a string of computations is done, then the result is less than or equal to the exact result. Similarly, if the mode is set to "round up," then the result of a string of computations is greater than or equal to the exact result. In this way, mathematically rigorous bounds on an exact result can be obtained. (This technique must be used astutely, since naive use could result in bounds that are too large to be meaningful.)

Several parameters more directly related to numerical computations than L, U, and t are associated with any floating point number system. These are:

HUGE: the largest representable number in the floating point system;

TINY: the smallest positive representable number in the floating point system;

ϵm: the machine epsilon, the smallest positive number which, when added to 1, gives something other than 1 when using the rounding mode round-to-nearest.

These so-called "machine constants" appear in Table 1.2 for the IEEE single and IEEE double precision number systems.

For IEEE arithmetic, 1/TINY < HUGE, but 1/HUGE < TINY. This brings up the question of what happens when the result of a computation has absolute value less than the smallest number representable in the system, or has absolute value greater than the largest number representable in the system. In the first case, an underflow occurs, while in the second case, an overflow occurs. In floating point computations, it is usually (but not always) reasonable to replace the result of an underflow by 0, but it is usually more problematical


TABLE 1.2: Machine constants for IEEE arithmetic

Precision  HUGE                      TINY                       ϵm
single     2^127 ≈ 3.40 · 10^38     2^−126 ≈ 1.18 · 10^−38     2^−24 + 2^−45 ≈ 5.96 · 10^−8
double     2^1023 ≈ 1.79 · 10^308   2^−1022 ≈ 2.23 · 10^−308   2^−53 + 2^−105 ≈ 1.11 · 10^−16

when an overflow occurs. Many systems prior to the IEEE standard replaced an underflow by 0, but stopped when an overflow occurred.

The IEEE standard specifies representations for the special numbers ∞, −∞, +0, −0, and NaN, where the latter represents "not a number." The standard specifies that computations do not stop when an overflow or underflow occurs, or when quantities such as √−1, 1/0, or −1/0 are encountered (although many programming languages, by default or optionally, do stop). For example, the result of an overflow is set to ∞, whereas the result of √−1 is set to NaN, and computation continues. The standard also specifies "gradual underflow," that is, setting the result to a "denormalized" number, a number in the floating point format whose first digit in the mantissa is equal to 0. Computation rules for these special numbers, such as NaN × any number = NaN and ∞ × any positive normalized number = ∞, allow such "nonstop" arithmetic.

Although IEEE nonstop arithmetic is useful in many contexts, the numerical analyst should be aware of it and be cautious in interpreting results. In particular, algorithms may not behave as expected if many intermediate results contain ∞ or NaN, and the accuracy is less than expected when denormalized numbers are used. In fact, many programming languages, by default or with a controllable option, stop if ∞ or NaN occurs, but implement IEEE nonstop arithmetic with an option.

Example 1.20

IEEE double precision floating point arithmetic underlies most computations in matlab. (This is true even if only four decimal digits are displayed.) One obtains the machine epsilon with the function eps, TINY with the function realmin, and HUGE with the function realmax. Observe the following matlab dialog:

>> epsm = eps(1d0)
epsm = 2.2204e-016
>> TINY = realmin
TINY = 2.2251e-308
>> HUGE = realmax
HUGE = 1.7977e+308
>> 1/TINY
ans = 4.4942e+307
>> 1/HUGE
ans = 5.5627e-309
>> HUGE^2
ans = Inf
>> TINY^2
ans = 0
>> new_val = 1+epsm
new_val = 1.0000
>> new_val - 1
ans = 2.2204e-016
>> too_small = epsm/2
too_small = 1.1102e-016
>> not_new = 1+too_small
not_new = 1
>> not_new - 1
ans = 0
>>
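The same machine constants can be examined outside matlab; for instance, the sketch below uses Python's standard `sys.float_info` (values are for IEEE double precision; note that `sys.float_info.epsilon` is 2^−52, the spacing between 1 and the next larger double, matching matlab's eps):

```python
import math
import sys

eps_m = sys.float_info.epsilon   # 2^-52 ~ 2.22e-16: spacing of doubles at 1.0
TINY = sys.float_info.min        # 2^-1022 ~ 2.23e-308
HUGE = sys.float_info.max        # ~1.80e+308

print(1.0 / TINY)                # ~4.49e+307: representable, since 1/TINY < HUGE
print(1.0 / HUGE)                # ~5.56e-309: a denormalized (subnormal) number
print(HUGE * HUGE)               # inf (overflow)
print(TINY * TINY)               # 0.0 (underflow, past even the subnormal range)
print((1.0 + eps_m) - 1.0)       # eps_m: 1 + eps_m is distinguishable from 1
print((1.0 + eps_m / 2) - 1.0)   # 0.0: 1 + eps_m/2 rounds back to 1
```

The last two lines reproduce the `new_val`/`not_new` experiments from the matlab dialog.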

Example 1.21

(Illustration of underflow and overflow) Suppose, for the purposes of illustration, we have a system with β = 10, t = 2, and one digit in the exponent, so that the positive numbers in the system range from 0.10 × 10^−9 to 0.99 × 10^9, and suppose we wish to compute N = √(x1² + x2²), where x1 = x2 = 10^6. Then both x1 and x2 are exactly represented in the system, and the nearest floating point number in the system to N is 0.14 × 10^7, well within range. However, x1² = 10^12, larger than the maximum floating point number in the system.

In older systems, an overflow usually would result in stopping the computation, while in IEEE arithmetic, the result would be assigned the symbol "Infinity." The result of adding "Infinity" to "Infinity," then taking the square root, would be "Infinity," so that N would be assigned "Infinity." Similarly, if x1 = x2 = 10^−6, then x1² = 10^−12, smaller than the smallest representable machine number, causing an "underflow." On older systems, the result is usually set to 0. On IEEE systems, if "gradual underflow" is switched on, the result either becomes a denormalized number, with less than full accuracy, or is set to 0; without gradual underflow on IEEE systems, the result is set to 0. When the result is set to 0, a value of 0 is stored in N, whereas the closest floating point number in the system is 0.14 × 10^−5, well within range. To avoid this type of catastrophic underflow and overflow in the computation of N, we may use the following scheme:

1. s ← max{|x1|, |x2|}.

2. η1 ← x1/s; η2 ← x2/s.

3. N ← s √(η1² + η2²).
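The three-step scaling scheme translates directly into code; below is a sketch in Python for IEEE double precision (the function name is ours), where the naive formula overflows already for |x| around 10^154:

```python
import math

def scaled_norm(x1, x2):
    """sqrt(x1^2 + x2^2) computed as s * sqrt((x1/s)^2 + (x2/s)^2),
    with s = max(|x1|, |x2|), so the squares never over- or underflow."""
    s = max(abs(x1), abs(x2))
    if s == 0.0:
        return 0.0
    eta1, eta2 = x1 / s, x2 / s          # both lie in [-1, 1]
    return s * math.sqrt(eta1**2 + eta2**2)

naive = math.sqrt(1e200 * 1e200 + 1e200 * 1e200)  # squares overflow to inf
safe = scaled_norm(1e200, 1e200)                  # ~1.414e200, correct
```

Library routines such as Python's `math.hypot` perform a similarly robust computation internally, so in practice one would simply call those.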


1.2.2.1 Input and Output

For examining the output of large numerical computations arising from mathematical models, plots, graphs, and movies comprised of such plots and graphs are often preferred over tables of values. However, to develop such models and study numerical algorithms, it is necessary to examine individual numbers. Because humans are trained to comprehend decimal numbers more easily than binary numbers, the binary format used in the machine is usually converted to a decimal format for display or printing. In many programming languages and environments (such as all versions of Fortran, C, C++, and matlab), the format is of a form similar to ±d1.d2d3...dm e±δ1δ2δ3 or ±d1.d2d3...dm E±δ1δ2δ3, where the "e" or "E" denotes the "exponent" of 10. For example, -1.00e+003 denotes −1 × 10^3 = −1000. Numbers are usually also input either in a standard decimal form (such as 0.001) or in this exponential format (such as 1.0e-3). (This notation originates from the earliest computers, where the only output was a printer, and the printer could only print numerical digits and the 26 upper case letters in the Roman alphabet.)

Thus, for input, a decimal fraction needs to be converted to a binary floating point number, while, for output, a binary floating point number needs to be converted to a decimal fraction. This conversion necessarily is inexact. For example, the exact decimal fraction 0.1 converts to the infinitely repeating binary expansion (0.0001100110011...)₂, which needs to be rounded into the binary floating point system. The IEEE 754 standard specifies that the result of a decimal to binary conversion, within a specified range of input formats, be the nearest floating point number to the exact result, over a specified range, and that, within a specified range of formats, a binary to decimal conversion be the nearest number in the specified format (which depends on the number m of decimal digits requested to be printed).

Thus, the number that one sees as output is usually not exactly the number that is represented in the computer. Furthermore, while the floating point operations on binary numbers are usually implemented in hardware or "firmware," independently of the software system, the decimal to binary and binary to decimal conversions are usually implemented separately, as part of the programming language (such as Fortran, C, C++, Java, etc.) or software system (such as matlab). The individual standards for these languages, if there are any, may not specify accuracy for such conversions, and the languages sometimes do not conform to the IEEE standard. That is, the number that one sees printed may not even be the closest number in that format to the actual number.

This inexactness in conversion usually does not cause a problem, but may cause much confusion in certain instances. In those instances (such as in "debugging," or finding programming blunders), one may need to examine the binary numbers directly. One way of doing this is in an "octal," or base-8, format, in which each digit (between 0 and 7) is interpreted as a group of


three binary digits, or in hexadecimal format (where the digits are 0-9, A, B, C, D, E, F), in which each digit corresponds to a group of four binary digits.

1.2.2.2 Standard Functions

To enable accurate computation of elementary functions such as sin, cos, and exp, IEEE 754 specifies that a "long" 80-bit register (with "guard digits") be available for intermediate computations. Furthermore, IEEE 754-2008, an official update to IEEE 754-1985, provides a list of functions it recommends be implemented, and specifies accuracy requirements (in terms of correct rounding) for those functions a programming language elects to implement.

REMARK 1.4 Alternative number systems, such as variable precision arithmetic, multiple precision arithmetic, rational arithmetic, and combinations of approximate and symbolic arithmetic, have been investigated and implemented. These have various advantages over the traditional floating point arithmetic we have been discussing, but also have disadvantages, and usually require more time, more circuitry, or both. Eventually, with the advance of computer hardware and better understanding of these alternative systems, their use may become more ubiquitous. However, for the foreseeable future, traditional floating point number systems will be the primary tool in numerical computations.

1.3 Interval Computations

Interval computations are useful for two main purposes:

• to use floating point computations to compute mathematically rigorous bounds on an exact result (and hence to rigorously bound roundoff error);

• to use floating point computations to compute mathematically rigorous bounds on the ranges of functions over boxes.

In complicated traditional floating point algorithms, naive arrangement of interval computations usually gives bounds that are too wide to be of practical use. For this reason, interval computations have been ignored by many. However, used cleverly and where appropriate, interval computations are powerful, and provide rigor and validation when other techniques cannot.

Interval computations are based on interval arithmetic.


1.3.1 Interval Arithmetic

In interval arithmetic, we define operations on intervals, which can be considered as ordered pairs of real numbers. We can think of each interval as representing the range of possible values of a quantity. The result of an operation is then an interval that represents the range of possible results of the operation, as the first argument ranges over all points in the first interval and the second argument ranges over all values in the second interval. To state this symbolically, let x = [x̲, x̄] and y = [y̲, ȳ], and define the four elementary operations by

x ⊙ y = {x ⊙ y | x ∈ x and y ∈ y} for ⊙ ∈ {+, −, ×, ÷}.  (1.1)

Interval arithmetic's usefulness derives from the fact that the mathematical characterization in Equation (1.1) is equivalent to the following operational definitions:

x + y = [x̲ + y̲, x̄ + ȳ],
x − y = [x̲ − ȳ, x̄ − y̲],
x × y = [min{x̲y̲, x̲ȳ, x̄y̲, x̄ȳ}, max{x̲y̲, x̲ȳ, x̄y̲, x̄ȳ}],
1/x = [1/x̄, 1/x̲], if x̲ > 0 or x̄ < 0,
x ÷ y = x × (1/y).  (1.2)

The ranges of the four elementary interval arithmetic operations are exactly the ranges of the corresponding real operations, and if such operations are composed, bounds on the ranges of real functions can be obtained. For example, if

f(x) = (x + 1)(x − 1),  (1.3)

then

f([−2, 2]) = ([−2, 2] + 1)([−2, 2] − 1) = [−1, 3] × [−3, 1] = [−9, 3],

which contains the exact range [−1, 3].
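The operational definitions (1.2) take only a few lines to implement; below is a minimal sketch in Python (the class name is ours, and there is no outward rounding yet, so these enclosures are rigorous only up to ordinary floating point roundoff):

```python
class Interval:
    """Closed interval [lo, hi] with the operations of (1.2)."""
    def __init__(self, lo, hi=None):
        self.lo, self.hi = lo, (hi if hi is not None else lo)

    def __add__(self, o):
        o = o if isinstance(o, Interval) else Interval(o)
        return Interval(self.lo + o.lo, self.hi + o.hi)

    def __sub__(self, o):
        o = o if isinstance(o, Interval) else Interval(o)
        return Interval(self.lo - o.hi, self.hi - o.lo)

    def __mul__(self, o):
        o = o if isinstance(o, Interval) else Interval(o)
        p = [self.lo * o.lo, self.lo * o.hi, self.hi * o.lo, self.hi * o.hi]
        return Interval(min(p), max(p))

x = Interval(-2.0, 2.0)
f = (x + 1) * (x - 1)   # [-1, 3] * [-3, 1] = [-9, 3], wider than the true range
```

Evaluating (x + 1)(x − 1) at x = [−2, 2] with this class reproduces the enclosure [−9, 3] computed above.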

REMARK 1.5 In some definitions of interval arithmetic, division by intervals containing 0 is defined, consistent with (1.1). For example,

[1, 2] / [−3, 4] = (−∞, −1/3] ∪ [1/4, ∞) = R* \ (−1/3, 1/4),

where R* is the extended real number system,⁴ consisting of the real numbers with the two additional elements −∞ and ∞. This extended interval arithmetic⁵ was originally invented by William Kahan⁶ for computations with continued fractions, but has wider use than that. Although a closed system can be defined for the sets arising from this extended arithmetic, typically the complements of intervals (i.e., the unions of two semi-infinite intervals) are immediately intersected with intervals, to obtain zero, one, or two intervals. Interval arithmetic can then proceed using (1.2).

⁴also known as the two-point compactification of the real numbers

The power of interval arithmetic lies in its implementation on computers. In particular, outwardly rounded interval arithmetic allows rigorous enclosures for the ranges of operations and functions. This makes a qualitative difference in scientific computations, since the results are now intervals in which the exact result must lie. It also enables the use of floating point computations for automated theorem proving.

Outward rounding can be implemented on any machine that has downward rounding and upward rounding, such as any machine that complies with the IEEE 754 standard. For example, take x + y = [x̲ + y̲, x̄ + ȳ]. If x̲ + y̲ is computed with downward rounding, and x̄ + ȳ is computed with upward rounding, then the resulting interval z = [z̲, z̄] that is represented in the machine must contain the exact range of x + y for x ∈ x and y ∈ y. We call the expansion of the interval from rounding the lower end point down and the upper end point up roundout error.

Interval arithmetic is only subdistributive. That is, if x, y, and z are intervals, then

x(y + z) ⊆ xy + xz, but x(y + z) ≠ xy + xz in general.  (1.4)

As a result, algebraic expressions that would be equivalent if real values were substituted for the variables are not equivalent if interval values are used. For example, instead of writing (x + 1)(x − 1) for f(x) in (1.3), suppose we write

f(x) = x² − 1,  (1.5)

and suppose we provide a routine that computes an enclosure for the range of x² that is the exact range to within roundoff error. Such a routine could be as follows.

ALGORITHM 1.1
(Computing an interval whose end points are machine numbers and which encloses the range of x²)

⁵There are small differences in current definitions of extended interval arithmetic. For example, in some systems, −∞ and ∞ are not considered numbers, but just descriptive symbols. In those systems, [1, 2]/[−3, 4] = (−∞, −1/3] ∪ [1/4, ∞) = R \ (−1/3, 1/4). See [31] for a theoretical analysis of extended arithmetic.
⁶who also was a major contributor to the IEEE 754 standard


INPUT: x = [x̲, x̄].

OUTPUT: a machine-representable interval that contains the range of x² over x.

IF x̲ ≥ 0 THEN
  RETURN [x̲², x̄²], where x̲² is computed with downward rounding and x̄² is computed with upward rounding.
ELSE IF x̄ ≤ 0 THEN
  RETURN [x̄², x̲²], where x̄² is computed with downward rounding and x̲² is computed with upward rounding.
ELSE
  1. Compute x̲² and x̄² with both downward and upward rounding; that is, compute machine representable numbers a and b such that x̲² ∈ [a, b], and machine representable numbers c and d such that x̄² ∈ [c, d].
  2. RETURN [0, max{b, d}].
END IF

END ALGORITHM 1.1.
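A sketch of Algorithm 1.1 in Python (the function name is ours; it requires Python ≥ 3.9 for `math.nextafter`, which we use to simulate directed rounding by widening each round-to-nearest endpoint one ulp outward; the endpoint 0 in the mixed-sign case is exact, so it is not widened):

```python
import math

def isqr(lo, hi):
    """Machine enclosure of {x^2 : x in [lo, hi]}, following Algorithm 1.1."""
    if lo >= 0:
        a, b = lo * lo, hi * hi
    elif hi <= 0:
        a, b = hi * hi, lo * lo
    else:                         # interval straddles 0: range is [0, max(...)]
        a, b = 0.0, max(lo * lo, hi * hi)
    # Outward rounding: a round-to-nearest product is within half an ulp of
    # the exact square, so one ulp outward on each side gives an enclosure.
    a = a if a == 0.0 else math.nextafter(a, -math.inf)
    b = math.nextafter(b, math.inf)
    return a, b

lo2, hi2 = isqr(-2.0, 2.0)             # encloses [0, 4]
f_lo, f_hi = lo2 - 1.0, hi2 - 1.0      # x^2 - 1 over [-2, 2]: encloses [-1, 3]
```

On hardware with switchable IEEE rounding modes, one would instead compute the two endpoint products with "round down" and "round up" directly.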

With Algorithm 1.1, and rewriting f(x) from (1.3) as in (1.5), we obtain

f([−2, 2]) = [−2, 2]² − 1 = [0, 4] − 1 = [−1, 3],

which in this case is equal to the exact range of f over [−2, 2]. In fact, this illustrates a general principle: if each variable in the expression occurs only once, then interval arithmetic gives the exact range, to within roundout error. We state this formally as follows.

THEOREM 1.8
(Fundamental theorem of interval arithmetic) Suppose f(x1, x2, ..., xn) is an algebraic expression in the variables x1 through xn (or a computer program with inputs x1 through xn), and suppose that this expression is evaluated with interval arithmetic. The algebraic expression or computer program can contain the four elementary operations and operations such as xⁿ, sin(x), exp(x), and log(x), etc., as long as the interval values of these functions contain their ranges over the input intervals. Then:

1. The interval value f(x1, ..., xn) contains the range of f over the interval vector (or box) (x1, ..., xn).

2. If the single functions (the elementary operations and functions xⁿ, etc.) have interval values that represent their exact ranges, and if each variable xi, 1 ≤ i ≤ n, occurs only once in the expression for f, then the values of f obtained by interval arithmetic represent the exact ranges of f over the input intervals.

If the expression for f contains one or more variables more than once, then overestimation of the range can occur, due to interval dependency. For example, when we evaluate our example function f([−2, 2]) according to (1.3), the first factor, [−1, 3], is the exact range of x + 1 for x ∈ [−2, 2], while the second factor, [−3, 1], is the exact range of x − 1 for x ∈ [−2, 2]. Thus, [−9, 3] is the exact range of f(x1, x2) = (x1 + 1)(x2 − 1) for x1 and x2 independent, x1 ∈ [−2, 2], x2 ∈ [−2, 2].

We now present some definitions and theorems to clarify the practical consequences of interval dependency.

DEFINITION 1.6 An expression for f(x1, ..., xn) which is written so that each variable occurs only once is called a single use expression, or SUE.

Fortunately, we do not need to transform every expression into a single use expression for interval computations to be of value. In particular, the interval dependency becomes less as the widths of the input intervals become smaller. The following formal definition will help us to describe this precisely.

DEFINITION 1.7 Suppose an interval evaluation f(x1, ..., xn) gives [a, b] as a result interval, but the exact range {f(x1, ..., xn) : xi ∈ xi, 1 ≤ i ≤ n} is [c, d] ⊆ [a, b]. We define the excess width E(f; x1, ..., xn) in the interval evaluation f(x1, ..., xn) by E(f; x1, ..., xn) = (c − a) + (b − d).

For example, the excess width in evaluating f(x), represented as (x + 1)(x − 1), over x = [−2, 2] is (−1 − (−9)) + (3 − 3) = 8. In general, we have:

THEOREM 1.9
Suppose f(x1, x2, ..., xn) is an algebraic expression in the variables x1 through xn (or a computer program with inputs x1 through xn), and suppose that this expression is evaluated with interval arithmetic, as in Theorem 1.8, to obtain an interval enclosure f(x1, ..., xn) of the range of f for xi ∈ xi, 1 ≤ i ≤ n. Then, if E(f; x1, ..., xn) is as in Definition 1.7, we have

E(f; x1, ..., xn) = O( max_{1 ≤ i ≤ n} w(xi) ),

where w(x) denotes the width of the interval x.


That is, the overestimation becomes less as the uncertainty in the arguments to the function becomes smaller.

Interval evaluations as in Theorem 1.9 are termed first-order interval extensions. It is not difficult to obtain second-order extensions, where required. (See Exercise below.)

1.3.2 Application of Interval Arithmetic: Examples

We give one such example here.

Example 1.22

Using 4-digit decimal floating point arithmetic, compute an interval enclosure for the first two digits of e, and prove that these two digits are correct.

Solution: The fifth degree Taylor polynomial representation for e is

e = 1 + 1 + 1/2! + 1/3! + 1/4! + 1/5! + (1/6!) e^ξ

for some ξ ∈ [0, 1]. If we assume we know e < 3, and we assume we know e^x is an increasing function of x, then the error term is bounded by

|(1/6!) e^ξ| ≤ 3/6! < 0.005,

so this fifth-degree polynomial representation should be adequate. We will evaluate each term with interval arithmetic, and we will replace e^ξ with [1, 3]. We obtain the following computation:

[1.000, 1.000] + [1.000, 1.000] → [2.000, 2.000]
[1.000, 1.000] / [2.000, 2.000] → [0.5000, 0.5000]
[2.000, 2.000] + [0.5000, 0.5000] → [2.500, 2.500]
[1.000, 1.000] / [6.000, 6.000] → [0.1666, 0.1667]
[2.500, 2.500] + [0.1666, 0.1667] → [2.666, 2.667]
[1.000, 1.000] / [24.00, 24.00] → [0.04166, 0.04167]
[2.666, 2.667] + [0.04166, 0.04167] → [2.707, 2.709]
[1.000, 1.000] / [120.0, 120.0] → [0.008333, 0.008334]
[2.707, 2.709] + [0.008333, 0.008334] → [2.715, 2.718]
[1.000, 1.000] / [720.0, 720.0] → [0.001388, 0.001389]
[0.001388, 0.001389] × [1, 3] → [0.001388, 0.004167]
[2.715, 2.718] + [0.001388, 0.004167] → [2.716, 2.723]

Since we used outward rounding in these computations, this constitutes a mathematical proof that e ∈ [2.716, 2.723].

Note:


30 Applied Numerical Methods

1. These computations can be done automatically on a computer, as simply as evaluating the function in floating point arithmetic. We will explain some programming techniques for this in Chapter 6, Section 6.2.

2. The solution is illustrative. More sophisticated methods, such as argument reduction, would be used in practice to bound values of e^x more accurately and with fewer operations.
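For readers without interval software at hand, the computation of Example 1.22 can be re-enacted with Python's decimal module; this is our illustrative sketch, not code from the text, and the helper names (iadd, idiv, imul) are ours. Lower endpoints are rounded down and upper endpoints rounded up at 4 significant digits, mimicking outward rounding:

```python
from decimal import Decimal, Context, ROUND_FLOOR, ROUND_CEILING

# Re-enactment of Example 1.22: interval endpoints kept as 4-significant-digit
# decimals; lower endpoints round toward -inf, upper toward +inf.
down = Context(prec=4, rounding=ROUND_FLOOR)
up = Context(prec=4, rounding=ROUND_CEILING)

def iadd(a, b):
    return (down.add(a[0], b[0]), up.add(a[1], b[1]))

def idiv(a, b):  # valid here since all endpoints are positive
    return (down.divide(a[0], b[1]), up.divide(a[1], b[0]))

def imul(a, b):  # valid here since all endpoints are nonnegative
    return (down.multiply(a[0], b[0]), up.multiply(a[1], b[1]))

one = (Decimal(1), Decimal(1))
acc = iadd(one, one)                  # 1 + 1
for k in (2, 6, 24, 120):             # + 1/2! + 1/3! + 1/4! + 1/5!
    acc = iadd(acc, idiv(one, (Decimal(k), Decimal(k))))
# remainder term: (1/6!) * e^xi, with e^xi enclosed by [1, 3]
rem = imul(idiv(one, (Decimal(720), Decimal(720))),
           (Decimal(1), Decimal(3)))
acc = iadd(acc, rem)
print(acc)   # an enclosure of e; the text obtains [2.716, 2.723]
```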

Proofs of the theorems, as well as greater detail, appear in various texts on interval arithmetic. A good book on interval arithmetic is R. E. Moore's classic text [27], although numerous more recent monographs and reviews are available. A World Wide Web search on the term "interval computations" will lead to some of these.

A general introduction to interval computations is [26]. That work not only gives a complete introduction, with numerous examples and explanation of pitfalls, but also provides examples with intlab, a free matlab toolbox for interval computations, and reference material for intlab. If you have matlab available, we recommend intlab for the exercises in this book involving interval computations.

1.4 Programming Environments

Modern scientific computing (with floating point numbers) is usually done with high-level "imperative" (as opposed to "functional") programming languages. Common programming environments in use for general scientific computing today are Fortran (for FORmula TRANslation), C/C++, and matlab. Fortran is the original such language, with its origins in the late 1950s. There is a large body of high-quality, publicly available software in Fortran for common computations in numerical analysis and scientific computing. Such software can be found, for example, on NETLIB, at

http://www.netlib.org

Fortran has evolved over the years, becoming a modern, multi-faceted language with the Fortran 2003 standard. Throughout, the emphasis by both the standardization committee and suppliers of compilers for Fortran has been on features that simplify programming of solutions to large problems in numerical analysis and scientific computing, and on features that enable high performance, especially on computers that can process vectors and matrices efficiently.

The "C" language, originally developed in conjunction with the Unix operating system, was meant to be a higher-level language for designing and accessing the operating system, but has become more ubiquitous since then. C++, appearing in the late 1980s, was the first widely available language^7 to allow the object-oriented programming paradigm. In recent years, computer science departments have favored teaching C++ over teaching Fortran, and Fortran has fallen out of favor in relative terms. However, Fortran is still favored in certain large-scale applications, such as fluid dynamics (e.g., in weather prediction and similar simulations), and some courses are still offered in it in engineering schools. Some people still think of Fortran as the now somewhat rudimentary language known as FORTRAN 77.

Reasonably high-quality compilers for both Fortran and C/C++ are available free of charge with Linux operating systems. Fortran 2003, largely implemented in these compilers, has a standardized interface to C, so functions written in C can be called from Fortran programs, and vice versa. These compilers include interactive graphical-user-interface-oriented debuggers, such as "insight," available with the Linux operating system. Commercially available compilation and debugging systems are also available under Windows.

The matlab system has become increasingly popular over the last two decades or so. matlab (for MATrix LABoratory) began in the early 1980s as a National Science Foundation project, written by Cleve Moler in FORTRAN 66, to provide an interactive environment for computing with matrices and vectors; it has since evolved to be both an interactive environment and a full-featured programming language. matlab is highly favored in courses such as this because of the ease of programming, debugging, and general use (such as graphing), and because of the numerous toolboxes, supplied by both Mathworks (Cleve Moler's company) and others, for many basic computing tasks and applications. The main drawback to use of matlab in all scientific computations is that the language is interpretive; that is, matlab translates each line of a program to machine language each time that it executes the line. This makes complicated programs that involve nested iterations much slower (a factor of 60 or more) than comparable programs written in a compiled language such as C or Fortran. However, functions compiled from Fortran or C/C++ can be called from matlab. A common strategy has been to initially develop algorithms in matlab, then translate all or part of the program to a compilable language, as necessary for efficiency.

One perceived disadvantage of matlab is that it is proprietary. Undesirable possible consequences are that it is not free, and that there is no official guarantee that it will be available, forever unchanged. However, its use has become so widespread in recent years that these concerns do not seem to be major. Several projects, including "Octave" and "Scilab," have produced free products that partially support the matlab programming language. The most widely distributed of these, "Octave," is integrated into Linux systems. However, the object-oriented features of Octave are rudimentary compared to those of matlab, and some toolboxes, such as intlab (which we will mention later), will not function with Octave.

^7 With others, including Fortran, to follow.


Alternative systems sometimes used for scientific computing are computer algebra systems. Perhaps the most common of these are Mathematica and Maple, while a free such system under development is "SAGE." These systems admit a different way of thinking about programming, termed functional programming, in which rules are defined and available at all times to automatically simplify expressions that are presented to the system. (In contrast, in imperative programming, a sequence of commands is executed one after the other.) Although these systems have become comprehensive, they are based in computations of a different character, rather than in the floating point computations and linear algebra involved in numerical analysis and scientific computing.

We will use matlab in this book to illustrate the concepts, techniques, and applications. With newer versions of matlab, a student can study how to use the system and make programs largely by using the matlab help system. The first place to turn will be the "Getting started" demos, which in newer versions are presented as videos. There are also many books devoted to use of matlab. Furthermore, we will be giving examples throughout this book.

matlab programs can be written as matlab scripts and matlab functions.

Example 1.23

The matlab script we used to produce the table following Example 1.7 (on page 8) is:

a = 2;
x = 2;
xold = x;
err_old = 1;
for k=0:10
   k
   x
   err = x - sqrt(2);
   err
   ratio = err/err_old^2
   err_old = err;
   x = x/2 + 1/x;
end

Example 1.24

The matlab script we used to produce the table in Example 1.8 (on page 9) is:

format long
a = 2;
x = 2;
xold = x;
err_old = 1;
for k=0:25
   k
   x
   err = x - sqrt(2);
   err
   ratio = err/err_old
   err_old = err;
   x = x - x^2/3.5 + 2/3.5;
end

An excellent alternative textbook that focuses on matlab functions is Cleve Moler's Numerical Computing with MATLAB [25]. An on-line version, along with ".m" files, etc., is currently available at http://www.mathworks.com/moler/chapters.html

1.5 Applications

The purpose of the methods and techniques in this book, ultimately, is to provide both accurate predictions and insight into practical problems. This includes understanding, predicting, and managing or controlling the evolution of ecological systems and epidemics; designing and constructing durable but inexpensive bridges, buildings, roads, and water control structures; understanding chemical and physical processes; designing chemical plants and electronic components and systems; minimizing costs or maximizing delivery of products or services within companies and governments; etc. To achieve these goals, the numerical methods are a small part of the overall modeling process, which can be viewed as consisting of the following steps:

Identify the problem. This is the first step in translating an often vague situation into a mathematical problem to be solved. What questions must be answered, and how can they be quantified?

Assumptions. Which factors are to be ignored, and which are important? The real world is usually significantly more complicated than mathematical models of it, and simplifications must be made, because some factors are poorly understood, because there isn't enough data to determine some minor factors, or because it is not practical to accurately solve the resulting equations unless the model is simplified. For example, the theory of relativity, and variations in the acceleration of gravity due to the fact that the earth is not exactly round and due to the fact that the density differs from point to point on the surface of the earth, in principle will affect the trajectory of a baseball as it leaves the bat. However, such effects can be ignored when we write down a model of the trajectory of the baseball. On the other hand, we need to include such effects if we are measuring the change in distance between two satellites in a tandem orbit, to detect the location of mineral deposits on the surface of the earth.

Construction. In this step, we actually translate the problem into mathematical language.

Analysis. We solve the mathematical problem. Here is where the numerical techniques in this book come into play. With more complicated models, there is an interplay between the previous three steps and this solution process. We may need to simplify the process to enable practical solution. Presentation of the result is also important here, to maximize the usefulness of the results. In the early days of scientific computing, printouts of numbers were used, but increasingly, results are presented as two- and three-dimensional graphs and movies.

Interpretation. The numerical solution is compared to the original problem. If it does not make sense, go back and reformulate the assumptions.

Validation. Compare the model to real data. For example, in climate models, the model might be used to predict climate changes in past years, before it is used to predict future climate changes.

Note that there is an intrinsic error introduced in the modeling process (such as when certain phenomena are ignored) that is outside the scope of our study of numerical methods. Such error can only be measured indirectly, through the interpretation and validation steps. In the model solution process (the "analysis" step), errors are also introduced due to roundoff error and the approximation process. We have seen that such error consists of approximation error and roundoff error. In a study of numerical methods and numerical analysis, we quantify and find bounds on such errors. Although this may not be the major source of error, it is important to know. Consequences of this type of error might be that a good model is rejected, that incorrect conclusions are deduced about the process being modeled, etc. The authors of this book have personal experience with these events.

Errors in the modeling process can sometimes be quantified in the solution process. If the model depends on parameters that are not known precisely, but bounds on those parameters are known, knowledge of these bounds can sometimes be incorporated into the mathematical equations, and the set of possible solutions can sometimes be computed or bounded. One tool that sometimes works is interval arithmetic. Other tools, less mathematically definite but applicable in different situations, are statistical methods and computing solutions to the model for many different values of the parameters.

Throughout this book, we introduce applications from many areas.

Example 1.25

The formula for the net capacitance when two capacitors of values x and y are connected in series is

z = xy / (x + y).

Suppose the measured values of x and y are x = 1 and y = 2, respectively. Estimate the range of possible values of z, given that the true values of x and y are known to be within ±10% of the measured values.

In this example, the identification, assumptions, and construction have already been done. (It is well known how capacitances in a linear electrical circuit behave.) We are asked to analyze the error in the output of the computation, due to errors in the data. We may proceed using interval arithmetic, relying on the accuracy assumptions for the measured values. In particular, these assumptions imply that x ∈ [0.9, 1.1] and y ∈ [1.8, 2.2]. We will plug these intervals into the expression for z, but we first use Theorem 1.8, part (2), as a guide to rewrite the expression for z so x and y only occur once. (We do this so we obtain sharp bounds on the range, without overestimation.) Dividing the numerator and denominator for z by xy, we obtain

z = 1 / (1/x + 1/y).

We use the intlab toolbox^8 for matlab to evaluate z. We have the following dialog in matlab's command window:

>> intvalinit('DisplayInfsup')
===> Default display of intervals by infimum/supremum
>> x = intval('[0.9,1.1]')
intval x =
[    0.8999,    1.1001]
>> y = intval('[1.8,2.2]')
intval y =
[    1.7999,    2.2001]
>> z = 1/(1/x + 1/y)
intval z =
[    0.5999,    0.7334]
>> format long
>> z
intval z =
[   0.59999999999999,   0.73333333333334]
>>

Thus, the capacitance must lie between 0.5999 and 0.7334. Note that x and y are input as strings. This is to assure that roundoff errors in converting the decimal expressions 0.9, 1.1, 1.8, and 2.2 into internal binary format are taken into account. See [26] for more examples of the use of intlab.

^8 If one has matlab, intlab is available free of charge for non-commercial use from http://www.ti3.tu-harburg.de/~rump/intlab

1.6 Exercises

1. Write down a polynomial p(x) such that |S(x) − p(x)| ≤ 10^-10 for −0.2 ≤ x ≤ 0.2, where

S(x) = { sin(x)/x  if x ≠ 0,
       { 1         if x = 0.

Note: sinc(x) = S(πx) = sin(πx)/(πx) is the "sinc" function (well-known in signal processing, etc.).

(a) Show that your polynomial p satisfies the condition |sinc(x) − p(x)| ≤ 10^-10 for x ∈ [−0.2, 0.2].
Hint: You can obtain polynomial approximations with error terms for sinc(x) by writing down Taylor polynomials and corresponding error terms for sin(x), then dividing these by x. This can be easier than trying to differentiate sinc(x). For the proof part, you can use, for example, the Taylor polynomial remainder formula or the alternating series test, and you can use interval arithmetic to obtain bounds.

(b) Plot your polynomial approximation and sinc(x) on the same graph,
   (i) over the interval [−0.2, 0.2];
   (ii) over the interval [−3, 3];
   (iii) over the interval [−10, 10].

2. Suppose f has a continuous third derivative. Show that

| (f(x + h) − f(x − h)) / (2h) − f′(x) | = O(h²).

3. Suppose f has a continuous fourth derivative. Show that

| (f(x + h) − 2f(x) + f(x − h)) / h² − f″(x) | = O(h²).


4. Let a = 0.41, b = 0.36, and c = 0.7. Assuming a 2-digit decimal computer arithmetic with rounding, show that (a − b)/c ≠ a/c − b/c when using this arithmetic.

5. Write down a formula relating the unit roundoff δ of Definition 1.3 (page 14) and the machine epsilon ε_m defined on page 20.

6. Store and run the following matlab script. What are your results? What does the script compute? Can you say anything about the computer arithmetic underlying matlab?

eps = 1;
x = 1 + eps;
while(x ~= 1)
   eps = eps/2;
   x = 1 + eps;
end
eps = eps + (2*eps)^2;
y = 1 + eps;
y - 1

7. Suppose, for illustration, we have a system with base β = 10, t = 3 decimal digits in the mantissa, and L = −9, U = 9 for the exponent. For example, 0.123 × 10⁴, that is, 1230, is a machine number in this system. Suppose also that "round to nearest" is used in this system.

(a) What is HUGE for this system?

(b) What is TINY for this system?

(c) What is the machine epsilon ε_m for this system?

(d) Let f(x) = sin(x) + 1.

   i. Write down fl(f(0)) and fl(f(0.0008)) in normalized format for this toy system.
   ii. Compute fl( fl(f(0.0008)) − fl(f(0)) ). On the other hand, what is the nearest machine number to the exact value of f(0.0008) − f(0)?
   iii. Compute fl( fl(f(0.0008)) − fl(f(0)) ) / fl(0.0008). Compare this to the nearest machine number to the exact value of (f(0.0008) − f(0))/0.0008, and to f′(0).

8. Let f(x) = ln(x + 1) − ln(x).

(a) Use four-digit decimal arithmetic with rounding to evaluate f(100,000).

(b) Use the Mean Value Theorem to approximate f(x) in a form that avoids the loss of significant digits. Use this form to evaluate f(x) for x = 100,000 once again.

(c) Compare the relative errors for the answers obtained in (a) and (b).

9. Compute the condition number of f(x) = e^{√(x²−1)}, x > 1, and discuss any possible ill-conditioning.

10. Let f(x) = (sin(x))² + x². Use interval arithmetic to prove that there are no solutions to f(x) = 0 for x ∈ [−1, −0.8].


We have

x_{k+1} − √a = x_k − (x_k² − a)/(2 x_k) − √a
             = (x_k² − 2 √a x_k + a) / (2 x_k)
             = (1/(2 x_k)) (x_k − √a)²
             ≈ (1/(2 √a)) (x_k − √a)²,

thus showing that the convergence rate is quadratic.

Quadratic convergence is very fast. We can think of quadratic convergence, with C ≈ 1, as doubling the number of significant figures on each iteration. (In contrast, linear convergence with C = 0.1 adds one decimal digit of accuracy to the approximation on each iteration.) For example, if we use the square root computation from Example 1.7 with a = 2, and starting with x_0 = 2, we obtain the following table:

k   x_k                  x_k − √2            (x_k − √2)/(x_{k−1} − √2)²
0   2                    0.5858 × 10^0       —
1   1.5                  0.8579 × 10^-1      0.2500
2   1.416666666666667    0.2453 × 10^-2      0.3333
3   1.414215686274510    0.2123 × 10^-5      0.3529
4   1.414213562374690    0.1594 × 10^-11     0.3535
5   1.414213562373095    0.2204 × 10^-17     —

In this table, the correct digits are underlined. This table illustrates that the total number of correct digits more than doubles on each iteration. In fact, the multiplying factor C for the quadratic convergence appears to be approaching 0.3535. (The last error ratio is not meaningful in this sense, because only roughly 16 digits were carried in the computation.) Based on our analysis, the limiting value of C should be about 1/(2√2) ≈ 0.3536. (We explain how we computed the table at the end of this chapter.)
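The table above is easy to reproduce approximately in double precision; the text's own matlab script for it appears in Example 1.23, and the following is an illustrative Python equivalent of ours:

```python
import math

# The Newton iteration for sqrt(2) from Example 1.7: x_{k+1} = x_k/2 + 1/x_k.
# The ratio err_{k+1} / err_k^2 should approach 1/(2*sqrt(2)) ~ 0.35355.
x = 2.0
errors = [abs(x - math.sqrt(2))]
for k in range(5):
    x = x / 2 + 1 / x
    errors.append(abs(x - math.sqrt(2)))

# ratios for k = 1..4; the k = 5 error is at roundoff level, so we stop there
ratios = [errors[k + 1] / errors[k] ** 2 for k in range(4)]
print(ratios)   # 0.25, 0.3333..., then approaching 1/(2*sqrt(2))
```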


Example 1.8

As an example of linear convergence, consider the iteration

x_{k+1} = x_k − x_k²/3.5 + 2/3.5,

which converges to √2. We obtain the following table:

k    x_k                  x_k − √2            (x_k − √2)/(x_{k−1} − √2)
0    2                    0.5858 × 10^0       —
1    1.428571428571429    0.1436 × 10^-1      0.2451 × 10^-1
2    1.416909620991254    0.2696 × 10^-2      0.1878
3    1.414728799831946    0.5152 × 10^-3      0.1911
4    1.414312349239392    0.9879 × 10^-4      0.1917
5    1.414232514607664    0.1895 × 10^-4      0.1918
6    1.414217198786659    0.3636 × 10^-5      0.1919
7    1.414214260116949    0.6955 × 10^-6      0.1919
8    1.414213696254626    0.1339 × 10^-6      0.1919
...
19   1.414213562373097    0.1554 × 10^-14     —

Here, the constant C in the linear convergence, to four significant digits, appears to be 0.1919 ≈ 1/5. That is, the error is reduced by approximately a factor of 5 on each iteration. We can think of this as obtaining one more correct base-5 digit on each iteration.
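An illustrative Python version of this iteration (the text's matlab script is in Example 1.24; the fixed-point-analysis value quoted in the comment is our own observation, not from the text) shows the error-reduction factor settling near 0.1919:

```python
import math

# The linearly convergent iteration of Example 1.8 in double precision:
# x_{k+1} = x_k - x_k^2/3.5 + 2/3.5, with fixed point sqrt(2).
# Writing g(x) = x - x^2/3.5 + 2/3.5, the reduction factor should approach
# |g'(sqrt(2))| = 1 - 2*sqrt(2)/3.5 ~ 0.1919.
x = 2.0
prev_err = abs(x - math.sqrt(2))
ratio = None
for k in range(8):
    x = x - x ** 2 / 3.5 + 2 / 3.5
    err = abs(x - math.sqrt(2))
    ratio = err / prev_err
    prev_err = err
print(ratio)   # ~0.1919, as in the table
```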

1.2 Computer Arithmetic

In numerical solution of mathematical problems, two common types of error are:

1. Method (algorithm or truncation) error. This is the error due to approximations made in the numerical method.

2. Rounding error. This is the error made due to the finite number of digits available on a computer.


Example 1.9

By the mean value theorem for integrals (Theorem 1.2, as in Example 1.6 on page 7), if f ∈ C²[a, b], then

f′(x) = (f(x + h) − f(x))/h − (1/h) ∫_x^{x+h} f″(t)(x + h − t) dt,

and

| (1/h) ∫_x^{x+h} f″(t)(x + h − t) dt | ≤ c h.

Thus, f′(x) ≈ (f(x + h) − f(x))/h, and the error is O(h). We will call this the method error or truncation error, as opposed to roundoff errors due to using machine approximations.

Now consider f(x) = ln x, and approximate f′(3) ≈ (ln(3 + h) − ln 3)/h for h small, using a calculator having 11 digits. The following results were obtained:

h        (ln(3 + h) − ln(3))/h    Error = 1/3 − (ln(3 + h) − ln(3))/h = O(h)
10^-1    0.3278982                5.44 × 10^-3
10^-2    0.332779                 5.54 × 10^-4
10^-3    0.3332778                5.55 × 10^-5
10^-4    0.333328                 5.33 × 10^-6
10^-5    0.333330                 3.33 × 10^-6
10^-6    0.333300                 3.33 × 10^-5
10^-7    0.333                    3.33 × 10^-4
10^-8    0.33                     3.33 × 10^-3
10^-9    0.3                      3.33 × 10^-2
10^-10   0.0                      3.33 × 10^-1

One sees that, in the first four steps, the error decreases by a factor of 10 as h is decreased by a factor of 10. (That is, the method error dominates.) However, starting with h = 10^-5, the error increases. (The error due to a finite number of digits, i.e., roundoff error, dominates.)

There are two possible ways to reduce rounding error:

1. The method error can be reduced by using a more accurate method. This allows a larger h to be used, thus avoiding roundoff error. Consider

f′(x) = (f(x + h) − f(x − h))/(2h) + error, where the error is O(h²):

h       (ln(3 + h) − ln(3 − h))/(2h)    error
0.1     0.3334568                       1.24 × 10^-4
0.01    0.3333345                       1.23 × 10^-6
0.001   0.3333333                       1.91 × 10^-8

The error decreases by a factor of 100 as h is decreased by a factor of 10.

2. Rounding error can be reduced by using more digits of accuracy, such as using double precision (or multiple precision) arithmetic.
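Both difference quotients are simple to experiment with. The following Python sketch (ours, not from the text) compares the forward and centered formulas for f(x) = ln x at x = 3 in double precision:

```python
import math

# Forward vs. centered difference quotients for f(x) = ln(x) at x = 3;
# the centered formula's method error is O(h^2) rather than O(h).
def forward(f, x, h):
    return (f(x + h) - f(x)) / h

def centered(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)

exact = 1.0 / 3.0
for h in (1e-1, 1e-2, 1e-3):
    print(h,
          abs(forward(math.log, 3.0, h) - exact),    # decreases like h
          abs(centered(math.log, 3.0, h) - exact))   # decreases like h^2
```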

To fully understand and avoid roundoff error, we should study some details of how computers and calculators represent and work with approximate numbers.

1.2.1 Floating Point Arithmetic and Rounding Error

Let β be a positive integer, the base of the computer system. (Usually, β = 2 (binary) or β = 16 (hexadecimal).) Suppose a number x has the exact base-β representation

x = (±0.α_1 α_2 α_3 ··· α_t α_{t+1} ···) β^m = ±q β^m,

where q is the mantissa, β is the base, m is the exponent, 1 ≤ α_1 ≤ β − 1, and 0 ≤ α_i ≤ β − 1 for i > 1.

On a computer, we are restricted to a finite set of floating-point numbers F = F(β, t, L, U) of the form x* = (±0.a_1 a_2 ··· a_t) β^m, where 1 ≤ a_1 ≤ β − 1, 0 ≤ a_i ≤ β − 1 for 2 ≤ i ≤ t, L ≤ m ≤ U, and t is the number of digits. (In most floating point systems, L is about −64 to −1000, and U is about 64 to 1000.)

Example 1.10

(binary) β = 2:

x* = (0.1011)_2 × 2³ = (1 × (1/2) + 0 × (1/4) + 1 × (1/8) + 1 × (1/16)) × 8 = 11/2 = 5.5 (decimal).

REMARK 1.1  Most numbers cannot be exactly represented on a computer. Consider x = 1/10 = (0.0001100110011001...)_2 (β = 2). If L = −127, U = 127, t = 24, and β = 2, then x ≈ x* = (0.110011001100110011001100)_2 × 2^-3.
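Python's decimal module can display the exact value of the binary double nearest 1/10, which makes the remark concrete (an illustrative aside of ours, not from the text):

```python
from decimal import Decimal

# Decimal(0.1) converts the stored binary double exactly, exposing the
# fact that the machine number is not exactly one tenth.
print(Decimal(0.1))       # 0.1000000000000000055511151231257827...
print(0.1 + 0.2 == 0.3)   # False: each term is already a rounded binary value
```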

Question: Given a real number x, how do we define a floating point number fl(x) in F such that fl(x) is close to x? On modern machines, one of the following four ways is used to approximate a real number x by a machine-representable number fl(x):

round down: fl(x) = x↓, the nearest machine representable number to the real number x that is less than or equal to x;

round up: fl(x) = x↑, the nearest machine number to the real number x that is greater than or equal to x;

round to nearest: fl(x) is the nearest machine number to the real number x;

round to zero, or "chopping": fl(x) is the nearest machine number to the real number x that is closer to 0 than x. The term "chopping" is used because we simply "chop" the expansion of the real number; that is, we simply ignore the digits in the expansion of x beyond the t-th one.

The default on modern systems is usually round to nearest, although chopping is faster or requires less circuitry. Round down and round up may be used, but with care, to produce results from a string of computations that are guaranteed to be less than or greater than the exact result.

Example 1.11

β = 10, t = 5, x = 0.12345666··· × 10⁷. Then

fl(x) = 0.12345 × 10⁷   (chopping),
fl(x) = 0.12346 × 10⁷   (round to nearest).

(In this case, round down corresponds to chopping, and round up corresponds to round to nearest.)

See Figure 1.2 for an example with β = 10 and t = 1. In that figure, the exhibited floating point numbers are (0.1) × 10¹, (0.2) × 10¹, ..., (0.9) × 10¹, 0.1 × 10².

[Figure: a number line marking the successive floating point numbers, equally spaced at distance β^{m−t} = 10⁰ = 1, from β^{m−1} = 1 up to β^m = 10¹.]

FIGURE 1.2: An example floating point system: β = 10, t = 1, and m = 1.

Example 1.12

Let a = 0.410, b = 0.000135, and c = 0.000431. Assuming 3-digit decimal computer arithmetic with rounding to nearest, does a + (b + c) = (a + b) + c when using this arithmetic?

Following the "rounding to nearest" definition of fl, we emulate the operations a machine would do, as follows:

a ← 0.410 × 10⁰,  b ← 0.135 × 10^-3,  c ← 0.431 × 10^-3,

and

fl(b + c) = fl(0.135 × 10^-3 + 0.431 × 10^-3)
          = fl(0.566 × 10^-3)
          = 0.566 × 10^-3,

so

fl(a + 0.566 × 10^-3) = fl(0.410 × 10⁰ + 0.566 × 10^-3)
                      = fl(0.410 × 10⁰ + 0.000566 × 10⁰)
                      = fl(0.410566 × 10⁰)
                      = 0.411 × 10⁰.

On the other hand,

fl(a + b) = fl(0.410 × 10⁰ + 0.135 × 10^-3)
          = fl(0.410000 × 10⁰ + 0.000135 × 10⁰)
          = fl(0.410135 × 10⁰)
          = 0.410 × 10⁰,

so

fl(0.410 × 10⁰ + c) = fl(0.410 × 10⁰ + 0.431 × 10^-3)
                    = fl(0.410 × 10⁰ + 0.000431 × 10⁰)
                    = fl(0.410431 × 10⁰)
                    = 0.410 × 10⁰ ≠ 0.411 × 10⁰.

Thus, the associative law of addition does not hold for floating point arithmetic with "round to nearest." Furthermore, this illustrates that accuracy is improved if numbers of like magnitude are added first in a sum.
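Example 1.12 can be re-enacted with Python's decimal module, using a 3-significant-digit context with round-to-nearest (an illustrative sketch of ours, not from the text):

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

# 3 significant decimal digits with round-to-nearest, as in Example 1.12.
ctx = Context(prec=3, rounding=ROUND_HALF_EVEN)
a, b, c = Decimal("0.410"), Decimal("0.000135"), Decimal("0.000431")

left = ctx.add(a, ctx.add(b, c))    # a + (b + c): small terms combine first
right = ctx.add(ctx.add(a, b), c)   # (a + b) + c: each small term is absorbed
print(left, right)                  # 0.411 vs 0.410
```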

The following error bound is useful in some analyses.

THEOREM 1.5

|x − fl(x)| ≤ (p/2) |x| β^{1−t},

where p = 1 for rounding and p = 2 for chopping.


DEFINITION 1.3  δ = (p/2) β^{1−t} is called the unit roundoff error.

Let ε = (fl(x) − x)/x. Then fl(x) = (1 + ε)x, where |ε| ≤ δ. With this, we have the following.

THEOREM 1.6

Let ⊙ denote the operation +, −, ×, or ÷, and let x and y be machine numbers. Then

fl(x ⊙ y) = (x ⊙ y)(1 + ε), where |ε| ≤ δ = (p/2) β^{1−t}.

Roundoff error that accumulates as a result of a sequence of arithmetic operations can be analyzed using this theorem. Such an analysis is called forward error analysis.

Because of the properties of floating point arithmetic, it is unreasonable to demand strict absolute tolerances when the exact result is large in magnitude.

Example 1.13

Suppose β = 10 and t = 3 (3-digit decimal arithmetic), and suppose we wish to compute 10⁴π with a computed value x such that |10⁴π − x| < 10^-2. The closest floating point number in our system to 10⁴π is x = 0.314 × 10⁵ = 31400. However, |10⁴π − x| = 15.926···. Hence, it is impossible to find a number x in the system with |10⁴π − x| < 10^-2.

The error |10⁴π − x| in this example is called the absolute error in approximating 10⁴π. We see that absolute error is not an appropriate measure of error when using floating point arithmetic. For this reason, we use relative error.

DEFINITION 1.4  Let x* be an approximation to x. Then |x − x*| is called the absolute error, and |(x − x*)/x| is called the relative error.

For example,

|(x − fl(x))/x| ≤ δ = (p/2) β^{1−t}   (unit roundoff error).

1.2.1.1 Expression Evaluation and Condition Numbers

We now examine some common situations in which roundoff error can become large, and we explain how to avoid many of these situations.


Example 1.14

Let β = 10, t = 4, and p = 1 (thus, δ = (1/2) × 10^-3 = 0.0005), and let

x = 0.5795 × 10⁵,  y = 0.6399 × 10⁵.

Then

fl(x + y) = 0.1219 × 10⁶ = (x + y)(1 + ε_1),  ε_1 ≈ −3.28 × 10^-4,  |ε_1| < δ,

and

fl(xy) = 0.3708 × 10¹⁰ = (xy)(1 + ε_2),  ε_2 ≈ −5.95 × 10^-5,  |ε_2| < δ.

(Note: x + y = 0.12194 × 10⁶ and xy = 0.37082205 × 10¹⁰.)

Example 115

Suppose β = 10 and t = 4 (4 digit arithmetic) suppose x1 = 10000 andx2 = x3 = middot middot middot = x1001 = 1 Then

fl (x1 + x2) = 10000

fl (x1 + x2 + x3) = 10000

fl

10011048573i=1

xi

= 10000

when we sum forward from x1 But going backwards

fl (x1001 + x1000) = 2

fl (x1001 + x1000 + x999) = 3

fl

11048573

i=1001

xi

= 11000

which is the correct sum

This example illustrates the point that large relative errors can occur when many small numbers are added to a large number, or when a very large number of small, almost-equal numbers is added together. To avoid such large relative errors, one can sum from the smallest number to the largest number. However, this will not work if the numbers are all approximately equal. In such cases, one possibility is to group the numbers into sets of two adjacent numbers, summing two almost-equal numbers together. One then groups those results into sets of two and sums these together, continuing until the total sum is reached. In this scheme, two almost-equal numbers are always being summed, and the large relative error from repeatedly summing a small number to a large number is avoided.
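Example 1.15 can likewise be re-enacted in Python with a 4-significant-digit decimal context (illustrative, not from the text):

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

# 4 significant decimal digits, round-to-nearest: one term of 10000
# followed by a thousand 1's, summed in both directions.
ctx = Context(prec=4, rounding=ROUND_HALF_EVEN)
terms = [Decimal(10000)] + [Decimal(1)] * 1000

total_fwd = Decimal(0)
for t in terms:
    total_fwd = ctx.add(total_fwd, t)    # 10000 + 1 rounds back to 10000

total_bwd = Decimal(0)
for t in reversed(terms):
    total_bwd = ctx.add(total_bwd, t)    # the 1's accumulate exactly first

print(total_fwd, total_bwd)              # 1.000E+4 vs 1.100E+4
```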


Example 1.16

Let x_1 = 1.5314768 and x_2 = 1.5314899, with β = 10 and t = 6 (6-digit decimal accuracy). Then

x_2 − x_1 ≈ fl(x_2) − fl(x_1) = 1.53149 − 1.53148 = 0.00001.

Thus,

| (x_2 − x_1 − (fl(x_2) − fl(x_1))) / (x_2 − x_1) | = (0.0000131 − 0.00001) / 0.0000131 ≈ 0.237,

that is, a 23.7% relative error.

This example illustrates that large relative errors can occur when two nearly equal numbers are subtracted on a computer. Sometimes an algorithm can be modified to reduce rounding error occurring from this source, as the following example illustrates.

Example 117

Consider finding the roots of ax2 + bx + c = 0 where b2 is large comparedwith |4ac| The most common formula for the roots is

b2 minus 4ac

2a

Consider x2 + 100x + 1 = 0 β = 10 t = 4 p = 2 and 4-digit choppedarithmetic Then

x1 = minus100 +

2 x2 = minus100

minus

2

9996 asymp 9997 (4 digit arithmetic chopped) Thus

x1 asymp minus100 + 9997

2 x2 asymp minus100 minus 9997

2

Hence x1 asymp minus0015 x2 asymp minus9998 but x1 = minus0010001 and x2 = minus99989999so the relative errors in x1 and x2 are 50 and 001 respectively

Letrsquos change the algorithm Assume b ge 0 (can always make b ge 0) Then

x1 = minusb +

2a

b2 minus 4ac

minusb minus radic b2 minus 4ac

b2 minus 4ac)= minus2c

b2 minus 4ac

and

b2 minus 4ac

2a (the same as before)

8132019 num chap 1

Mathematical Review and Computer Arithmetic 17

Then, for the above values,

x1 = −2(1)/(100 + 99.97) = −2/199.97 ≈ −0.0100.

Now the relative error in x1 is also about 0.01%.
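The rearranged formula can be sketched as follows, generalized with the usual sign trick so it works for either sign of b (an illustrative sketch, not the book's code):

```python
import math

def stable_roots(a, b, c):
    """Real roots of a*x**2 + b*x + c = 0, computed so that -b and the
    square root of the discriminant are never subtracted from each other.
    Assumes a != 0 and b*b - 4*a*c >= 0."""
    d = math.sqrt(b * b - 4 * a * c)
    q = -0.5 * (b + math.copysign(d, b))   # same sign as -b: no cancellation
    return q / a, c / q                    # second root via x1 * x2 = c / a

x2, x1 = stable_roots(1.0, 100.0, 1.0)     # large root, then small root
```

In double precision the naive formula for the small root of x² + 100x + 1 is only mildly inaccurate, but the same cancellation that ruined the 4-digit computation is still present; the rearranged form avoids it entirely.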

Let us now consider error in function evaluation. Consider a single-valued function f(x), and let x* = fl(x) be the floating point approximation of x. The machine therefore evaluates f(x*) = f(fl(x)), which is an approximate value of f(x) at x = x*. The perturbation in f(x) for small perturbations in x can be computed via Taylor's formula. This is illustrated in the next theorem.

THEOREM 1.7
The relative error in functional evaluation is

|f(x) − f(x*)| / |f(x)| ≈ |x f′(x) / f(x)| · |x − x*| / |x|.

PROOF The linear Taylor approximation of f(x*) about f(x) for small values of |x − x*| is given by f(x*) ≈ f(x) + f′(x)(x* − x). Rearranging the terms immediately yields the result.

This leads us to the following definition.

DEFINITION 1.5 The condition number of a function f(x) is

κf(x) = |x f′(x) / f(x)|.

The condition number describes how large the relative error in function evaluation is with respect to the relative error in the machine representation of x. In other words, κf(x) is a measure of the degree of sensitivity of the function at x.

Example 1.18
Let f(x) = √x. The condition number of f(x) about x is

κf(x) = |x · (1/(2√x)) / √x| = 1/2.

This suggests that f(x) is well-conditioned.


Example 1.19
Let f(x) = √(x − 2). The condition number of f(x) about x is

κf(x) = |x / (2(x − 2))|.

This is not defined at x* = 2. Hence, the function f(x) is numerically unstable and ill-conditioned for values of x close to 2.
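Definition 1.5 and Examples 1.18 and 1.19 can be checked numerically (a sketch; the function name `condition_number` is ours):

```python
import math

def condition_number(f, fprime, x):
    """kappa_f(x) = |x * f'(x) / f(x)| from Definition 1.5."""
    return abs(x * fprime(x) / f(x))

# f(x) = sqrt(x): kappa is 1/2 for every x > 0 -- well-conditioned.
k_well = condition_number(math.sqrt, lambda x: 0.5 / math.sqrt(x), 4.0)

# f(x) = sqrt(x - 2): kappa = |x / (2(x - 2))| blows up near x = 2.
k_ill = condition_number(lambda x: math.sqrt(x - 2.0),
                         lambda x: 0.5 / math.sqrt(x - 2.0),
                         2.0001)
```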

REMARK 1.2 If x = f(x) = 0, then the condition number is simply |f′(x)|. If x = 0 and f(x) ≠ 0 (or f(x) = 0 and x ≠ 0), then it is more useful to consider the relation between absolute errors than relative errors; the condition number then becomes |f′(x)/f(x)|.

REMARK 1.3 Generally, if a numerical approximation z̃ to a quantity z is computed, the relative error is related to the number of significant digits that are correct. For example, if z = 0.0000123453 and z̃ = 0.00001234543, we say that z̃ is correct to 5 significant digits. Expressing z as 0.123453 × 10^−4 and z̃ as 0.1234543 × 10^−4, we see that if we round z̃ to the nearest number with five digits in its mantissa, all of those digits are correct, whereas if we do the same with six digits, the sixth digit is not correct. Significant digits is the more logical way to talk about accuracy in a floating point computation, where we are interested in relative error, rather than "number of digits after the decimal point," which can have a different meaning. (Here, one might say that z̃ is correct to 9 digits after the decimal point.)

1.2.2 Practicalities and the IEEE Floating Point Standard

Prior to 1985, different machines used different word lengths and different bases, and different machines rounded, chopped, or did something else to form the internal representation fl(x) for real numbers x. For example, IBM mainframes generally used hexadecimal arithmetic (β = 16), with 8 hexadecimal digits total (for the base, sign, and exponent) in "single precision" numbers and 16 hexadecimal digits total in "double precision" numbers. Machines such as the Univac 1108 and Honeywell Multics systems used base β = 2, with 36 binary digits (or "bits") total in single precision numbers and 72 binary digits total in double precision numbers. An unusual machine designed at Moscow State University from 1955 to 1965, the "Setun," even used base-3 (β = 3, or "ternary") numbers. Some computers had 32 bits total in single precision numbers and 64 bits total in double precision numbers, while some "supercomputers" (such as the Cray-1) had 64 bits total in single precision numbers and 128 bits total in double precision numbers.

Some hand-held calculators in existence today (such as some Texas Instruments calculators) can be viewed as implementing decimal (base 10, β = 10) arithmetic, say, with L = −999, U = 999, and t = 14 digits in the mantissa.

Except for the Setun (the value of whose ternary digits corresponded to "positive," "negative," and "neutral" in circuit elements or switches), digital computers are mostly based on binary switches or circuit elements (that is, "on" or "off"), so the base β is usually 2 or a power of 2. For example, the IBM hexadecimal digit could be viewed as a group of 4 binary digits.²

Older floating point implementations did not even always fit exactly into the model we have previously described. For example, if x was a number in the system, then −x may not have been a number in the system, or, if x was a number in the system, then 1/x may have been too large to be representable in the system.

To promote predictability, portability, reliability, and rigorous error bounding in floating point computations, the Institute of Electrical and Electronics Engineers (IEEE) and American National Standards Institute (ANSI) published a standard for binary floating point arithmetic in 1985: IEEE/ANSI 754-1985, Standard for Binary Floating Point Arithmetic, often referenced as "IEEE-754," or simply "the IEEE standard."³ Almost all computers in existence today, including personal computers and workstations based on Intel, AMD, Motorola, etc., chips, implement most of the IEEE standard.

In this standard, β = 2, 32 bits total are used in a single precision number (an "IEEE single"), and 64 bits total are used for a double precision number ("IEEE double"). In a single precision number, 1 bit is used for the sign, 8 bits are used for the exponent, and t = 23 bits are used for the mantissa. In double precision numbers, 1 bit is used for the sign, 11 bits are used for the exponent, and 52 bits are used for the mantissa. Thus, for single precision numbers, the stored exponent is between 0 and (11111111)₂ = 255, and 127 is subtracted from this to get an exponent between −127 and 128. In IEEE numbers, the minimum and maximum exponents are used to denote special symbols (such as infinity and "unnormalized" numbers), so the exponent in single precision represents magnitudes between 2^−126 ≈ 10^−38 and 2^127 ≈ 10^38. The mantissa for single precision numbers represents numbers between 2^0 = 1 and Σ_{i=0}^{23} 2^−i = 2(1 − 2^−24) ≈ 2. Similarly, the exponent for double precision numbers is effectively between 2^−1022 ≈ 10^−308 and 2^1023 ≈ 10^308, while the mantissa for double precision numbers represents numbers between 2^0 = 1 and Σ_{i=0}^{52} 2^−i ≈ 2.

Summarizing, the parameters for IEEE arithmetic appear in Table 1.1.

In many numerical computations, such as solving the large linear systems arising from partial differential equation models, more digits or a larger exponent range is required than is available with IEEE single precision. For

2An exception is in some systems for business calculations where base 10 is implemented3An update to the 1985 standard was made in 2008 This update gives clarifications of certain ambiguous points provides certain extensions and specifies a standard for decimalarithmetic


TABLE 1.1: Parameters for IEEE arithmetic

  precision   β     L      U     t
  single      2   -126    127   24
  double      2  -1022   1023   53

this reason, many numerical analysts at present have adopted IEEE double precision as the default precision. For example, underlying computations in the popular computational environment matlab are done in IEEE double precision.

IEEE arithmetic provides four ways of defining fl(x), that is, four "rounding modes": "round down," "round up," "round to nearest," and "round to zero." The four elementary operations +, −, ×, and ÷, as well as √·, must be such that fl(x ⊙ y) is computed according to the rounding mode in effect, for ⊙ ∈ {+, −, ×, ÷}.

The default mode (if the rounding mode is not explicitly set) is normally "round to nearest," to give an approximation after a long string of computations that is hopefully near the exact value. If the mode is set to "round down" and a string of computations is done, then the result is less than or equal to the exact result. Similarly, if the mode is set to "round up," then the result of a string of computations is greater than or equal to the exact result. In this way, mathematically rigorous bounds on an exact result can be obtained. (This technique must be used astutely, since naive use could result in bounds that are too large to be meaningful.)
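The round-down/round-up bracketing idea can be illustrated with Python's `decimal` module, which exposes directed rounding modes analogous to IEEE's (a sketch; the 4-digit precision and the function name are ours):

```python
from decimal import Decimal, localcontext, ROUND_FLOOR, ROUND_CEILING

def sum_one_thirds(mode):
    """Add 1/3 + 1/3 + 1/3 in 4-digit decimal arithmetic under the
    given rounding mode."""
    with localcontext() as ctx:
        ctx.prec = 4
        ctx.rounding = mode
        third = Decimal(1) / Decimal(3)   # 0.3333 rounded down, 0.3334 up
        return third + third + third

lo = sum_one_thirds(ROUND_FLOOR)     # 0.9999
hi = sum_one_thirds(ROUND_CEILING)   # 1.001
# The exact value 1 = 3 * (1/3) is guaranteed to lie in [lo, hi].
```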

Several parameters more directly related to numerical computations than L, U, and t are associated with any floating point number system. These are:

HUGE: the largest representable number in the floating point system;

TINY: the smallest positive representable number in the floating point system;

ϵm: the machine epsilon, the smallest positive number which, when added to 1, gives something other than 1 when using the round-to-nearest rounding mode.

These so-called "machine constants" appear in Table 1.2 for the IEEE single and IEEE double precision number systems.

For IEEE arithmetic, 1/TINY < HUGE, but 1/HUGE < TINY. This brings up the question of what happens when the result of a computation has absolute value less than the smallest number representable in the system, or has absolute value greater than the largest number representable in the system. In the first case, an underflow occurs, while in the second case, an overflow occurs. In floating point computations, it is usually (but not always) reasonable to replace the result of an underflow by 0, but it is usually more problematical


TABLE 1.2: Machine constants for IEEE arithmetic

  Precision   HUGE                       TINY                         ϵm
  single      2^127  ≈ 3.40 · 10^38      2^−126  ≈ 1.18 · 10^−38      2^−24 + 2^−45  ≈ 5.96 · 10^−8
  double      2^1023 ≈ 1.79 · 10^308     2^−1022 ≈ 2.23 · 10^−308     2^−53 + 2^−105 ≈ 1.11 · 10^−16

when an overflow occurs. Many systems prior to the IEEE standard replaced an underflow by 0, but stopped when an overflow occurred.

The IEEE standard specifies representations for special numbers ∞, −∞, +0, −0, and NaN, where the latter represents "not a number." The standard specifies that computations do not stop when an overflow or underflow occurs, or when quantities such as 1/0, 0/0, etc., are encountered (although many programming languages, by default or optionally, do stop). For example, the result of an overflow is set to ∞, whereas the result of √−1 is set to NaN, and computation continues. The standard also specifies "gradual underflow," that is, setting the result to a "denormalized" number, or a number in the floating point format whose first digit in the mantissa is equal to 0. Computation rules for these special numbers, such as NaN × any number = NaN and ∞ × any positive normalized number = ∞, allow such "nonstop" arithmetic.
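Python's floats follow the IEEE double precision nonstop rules for these operations, so the special values are easy to observe (a sketch):

```python
import math
import sys

inf = 1e308 * 10            # overflow: the result is set to inf, not an error
nan = inf - inf             # an undefined operation yields NaN
denormal = 1e-308 / 1e10    # gradual underflow: a denormalized number

# Computation continues: NaN propagates through arithmetic,
# and inf follows rules such as inf * 2.0 == inf.
print(math.isinf(inf), math.isnan(nan), 0.0 < denormal < sys.float_info.min)
```

Note that Python itself illustrates the languages that "by default do stop" for some operations: dividing a float by the exact zero `0.0` raises an exception rather than returning ∞.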

Although the IEEE nonstop arithmetic is useful in many contexts, the numerical analyst should be aware of it and be cautious in interpreting results. In particular, algorithms may not behave as expected if many intermediate results contain ∞ or NaN, and the accuracy is less than expected when denormalized numbers are used. In fact, many programming languages, by default or with a controllable option, stop if ∞ or NaN occurs, but implement IEEE nonstop arithmetic with an option.

Example 1.20
IEEE double precision floating point arithmetic underlies most computations in matlab. (This is true even if only four decimal digits are displayed.) One obtains the machine epsilon with the function eps, one obtains TINY with the function realmin, and one obtains HUGE with the function realmax. Observe the following matlab dialog:

>> epsm = eps(1d0)
epsm = 2.2204e-016
>> TINY = realmin
TINY = 2.2251e-308
>> HUGE = realmax
HUGE = 1.7977e+308
>> 1/TINY
ans = 4.4942e+307
>> 1/HUGE
ans = 5.5627e-309
>> HUGE^2
ans = Inf
>> TINY^2
ans = 0
>> new_val = 1+epsm
new_val = 1.0000
>> new_val - 1
ans = 2.2204e-016
>> too_small = epsm/2
too_small = 1.1102e-016
>> not_new = 1+too_small
not_new = 1
>> not_new - 1
ans = 0
>>

Example 1.21
(Illustration of underflow and overflow) Suppose, for the purposes of illustration, we have a system with β = 10, t = 2, and one digit in the exponent, so that the positive numbers in the system range from 0.10 × 10^−9 to 0.99 × 10^9, and suppose we wish to compute N = √(x1² + x2²), where x1 = x2 = 10^6. Then both x1 and x2 are exactly represented in the system, and the nearest floating point number in the system to N is 0.14 × 10^7, well within range. However, x1² = 10^12, larger than the maximum floating point number in the system.

In older systems, an overflow usually would result in stopping the computation, while in IEEE arithmetic, the result would be assigned the symbol "Infinity." The result of adding "Infinity" to "Infinity," then taking the square root, would be "Infinity," so that N would be assigned "Infinity." Similarly, if x1 = x2 = 10^−6, then x1² = 10^−12, smaller than the smallest representable machine number, causing an "underflow." On older systems, the result is usually set to 0. On IEEE systems, if "gradual underflow" is switched on, the result either becomes a denormalized number with less than full accuracy, or is set to 0; without gradual underflow on IEEE systems, the result is set to 0. When the result is set to 0, a value of 0 is stored in N, whereas the closest floating point number in the system is 0.14 × 10^−5, well within range. To avoid this type of catastrophic underflow and overflow in the computation of N, we may use the following scheme:

1. s ← max{|x1|, |x2|}.
2. η1 ← x1/s; η2 ← x2/s.
3. N ← s √(η1² + η2²).
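The three-step scaling scheme can be sketched directly (a sketch; the function name is ours, and a production code would typically call a library routine such as hypot):

```python
import math

def scaled_norm(x1, x2):
    """sqrt(x1**2 + x2**2) via the scaling scheme above: the scaled
    values eta1, eta2 lie in [-1, 1], so squaring them cannot overflow."""
    s = max(abs(x1), abs(x2))
    if s == 0.0:
        return 0.0
    eta1, eta2 = x1 / s, x2 / s
    return s * math.sqrt(eta1 * eta1 + eta2 * eta2)

# Direct evaluation overflows in double precision: 1e200 * 1e200 is inf,
# but the scaled computation stays in range.
n = scaled_norm(1e200, 1e200)
```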


1.2.2.1 Input and Output

For examining the output of large numerical computations arising from mathematical models, plots, graphs, and movies comprised of such plots and graphs are often preferred over tables of values. However, to develop such models and study numerical algorithms, it is necessary to examine individual numbers. Because humans are trained to comprehend decimal numbers more easily than binary numbers, the binary format used in the machine is usually converted to a decimal format for display or printing. In many programming languages and environments (such as all versions of Fortran, C, C++, and matlab), the format is of a form similar to ±d1.d2d3 · · · dm e±δ1δ2δ3 or ±d1.d2d3 · · · dm E±δ1δ2δ3, where the "e" or "E" denotes the "exponent" of 10. For example, -1.00e+003 denotes −1 × 10³ = −1000. Numbers are usually also input either in a standard decimal form (such as 0.001) or in this exponential format (such as 1.0e-3). (This notation originates from the earliest computers, where the only output was a printer, and the printer could only print numerical digits and the 26 upper-case letters in the Roman alphabet.)

Thus, for input, a decimal fraction needs to be converted to a binary floating point number, while, for output, a binary floating point number needs to be converted to a decimal fraction. This conversion is necessarily inexact. For example, the exact decimal fraction 0.1 converts to the infinitely repeating binary expansion (0.00011001100110011 . . .)₂, which needs to be rounded into the binary floating point system. The IEEE 754 standard specifies that the result of a decimal-to-binary conversion, within a specified range of input formats, be the nearest floating point number to the exact result, and that, within a specified range of formats, a binary-to-decimal conversion be the nearest number in the specified format (which depends on the number m of decimal digits requested to be printed).

Thus, the number that one sees as output is usually not exactly the number that is represented in the computer. Furthermore, while the floating point operations on binary numbers are usually implemented in hardware or "firmware" independently of the software system, the decimal-to-binary and binary-to-decimal conversions are usually implemented separately, as part of the programming language (such as Fortran, C, C++, Java, etc.) or software system (such as matlab). The individual standards for these languages, if there are any, may not specify accuracy for such conversions, and the languages sometimes do not conform to the IEEE standard. That is, the number that one sees printed may not even be the closest number in that format to the actual number.

This inexactness in conversion usually does not cause a problem, but it may cause much confusion in certain instances. In those instances (such as in "debugging," or finding programming blunders), one may need to examine the binary numbers directly. One way of doing this is in an "octal," or base-8, format, in which each digit (between 0 and 7) is interpreted as a group of three binary digits, or in hexadecimal format (where the digits are 0-9, A, B, C, D, E, F), in which each digit corresponds to a group of four binary digits.
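In Python, for example, one can examine the bits of a double directly; the hexadecimal float format groups the mantissa into 4-bit hex digits (a sketch):

```python
from fractions import Fraction

x = 0.1
# Hexadecimal view of the stored double: one hex digit = four bits.
print(x.hex())                           # 0x1.999999999999ap-4

# The stored number is not exactly 1/10: it is the rounded value of the
# infinitely repeating binary expansion of 0.1.
exact_stored = Fraction(x)               # exact rational value of the bits
print(exact_stored == Fraction(1, 10))   # False
```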

1.2.2.2 Standard Functions

To enable accurate computation of elementary functions such as sin, cos, and exp, IEEE 754 specifies that a "long" 80-bit register (with "guard digits") be available for intermediate computations. Furthermore, IEEE 754-2008, an official update to IEEE 754-1985, provides a list of functions it recommends be implemented, and specifies accuracy requirements (in terms of correct rounding) for those functions a programming language elects to implement.

REMARK 1.4 Alternative number systems, such as variable precision arithmetic, multiple precision arithmetic, rational arithmetic, and combinations of approximate and symbolic arithmetic, have been investigated and implemented. These have various advantages over the traditional floating point arithmetic we have been discussing, but also have disadvantages, and usually require more time, more circuitry, or both. Eventually, with the advance of computer hardware and better understanding of these alternative systems, their use may become more ubiquitous. However, for the foreseeable future, traditional floating point number systems will be the primary tool in numerical computations.

1.3 Interval Computations

Interval computations are useful for two main purposes:

• to use floating point computations to compute mathematically rigorous bounds on an exact result (and hence to rigorously bound roundoff error);

• to use floating point computations to compute mathematically rigorous bounds on the ranges of functions over boxes.

In complicated traditional floating point algorithms, naive arrangement of interval computations usually gives bounds that are too wide to be of practical use. For this reason, interval computations have been ignored by many. However, used cleverly and where appropriate, interval computations are powerful, and provide rigor and validation when other techniques cannot.

Interval computations are based on interval arithmetic.


1.3.1 Interval Arithmetic

In interval arithmetic, we define operations on intervals, which can be considered as ordered pairs of real numbers. We can think of each interval as representing the range of possible values of a quantity. The result of an operation is then an interval that represents the range of possible results of the operation: the range of all possible values as the first argument ranges over all points in the first interval and the second argument ranges over all values in the second interval. To state this symbolically, let x = [x̲, x̄] and y = [y̲, ȳ], and define the four elementary operations by

x ⊙ y = {x ⊙ y | x ∈ x and y ∈ y} for ⊙ ∈ {+, −, ×, ÷}.   (1.1)

Interval arithmetic's usefulness derives from the fact that the mathematical characterization in Equation (1.1) is equivalent to the following operational definitions:

x + y = [x̲ + y̲, x̄ + ȳ],
x − y = [x̲ − ȳ, x̄ − y̲],
x × y = [min{x̲y̲, x̲ȳ, x̄y̲, x̄ȳ}, max{x̲y̲, x̲ȳ, x̄y̲, x̄ȳ}],
1/x = [1/x̄, 1/x̲] if x̲ > 0 or x̄ < 0,
x ÷ y = x × (1/y).   (1.2)

The ranges of the four elementary interval arithmetic operations are exactly the ranges of the corresponding real operations. Furthermore, if such operations are composed, bounds on the ranges of real functions can be obtained. For example, if

f(x) = (x + 1)(x − 1),   (1.3)

then

f([−2, 2]) = ([−2, 2] + 1)([−2, 2] − 1) = [−1, 3] × [−3, 1] = [−9, 3],

which contains the exact range [−1, 3].
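The operational definitions (1.2) for +, −, and × can be sketched as a small class (a sketch without the outward rounding a rigorous implementation would add; the class name is ours):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def _coerce(self, v):
        return v if isinstance(v, Interval) else Interval(v, v)

    def __add__(self, other):
        o = self._coerce(other)
        return Interval(self.lo + o.lo, self.hi + o.hi)

    def __sub__(self, other):
        o = self._coerce(other)
        return Interval(self.lo - o.hi, self.hi - o.lo)

    def __mul__(self, other):
        o = self._coerce(other)
        p = (self.lo * o.lo, self.lo * o.hi, self.hi * o.lo, self.hi * o.hi)
        return Interval(min(p), max(p))

x = Interval(-2.0, 2.0)
enclosure = (x + 1.0) * (x - 1.0)    # Interval(lo=-9.0, hi=3.0), as in the text
```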

REMARK 1.5 In some definitions of interval arithmetic, division by intervals containing 0 is defined, consistent with (1.1). For example,

[1, 2] / [−3, 4] = (−∞, −1/3] ∪ [1/4, ∞) = R* \ (−1/3, 1/4),

where R* is the extended real number system⁴ consisting of the real numbers with the two additional numbers −∞ and ∞. This extended interval arithmetic⁵ was originally invented by William Kahan⁶ for computations with continued fractions, but has wider use than that. Although a closed system can be defined for the sets arising from this extended arithmetic, typically the complements of intervals (i.e., the unions of two semi-infinite intervals) are immediately intersected with intervals, to obtain zero, one, or two intervals. Interval arithmetic can then proceed using (1.2).

⁴also known as the two-point compactification of the real numbers.

The power of interval arithmetic lies in its implementation on computers. In particular, outwardly rounded interval arithmetic allows rigorous enclosures for the ranges of operations and functions. This makes a qualitative difference in scientific computations, since the results are now intervals in which the exact result must lie. It also enables the use of floating point computations for automated theorem proving.

Outward rounding can be implemented on any machine that has downward rounding and upward rounding, such as any machine that complies with the IEEE 754 standard. For example, take x + y = [x̲ + y̲, x̄ + ȳ]. If x̲ + y̲ is computed with downward rounding, and x̄ + ȳ is computed with upward rounding, then the resulting interval z = [z̲, z̄] that is represented in the machine must contain the exact range of x + y for x ∈ x and y ∈ y. We call the expansion of the interval from rounding the lower end point down and the upper end point up roundout error.

Interval arithmetic is only subdistributive. That is, if x, y, and z are intervals, then

x(y + z) ⊆ xy + xz, but x(y + z) ≠ xy + xz in general.   (1.4)

As a result, algebraic expressions that would be equivalent if real values were substituted for the variables are not equivalent if interval values are used. For example, suppose that, instead of writing (x + 1)(x − 1) for f(x) in (1.3), we write

f(x) = x² − 1,   (1.5)

and suppose we provide a routine that computes an enclosure for the range of x² that is the exact range to within roundoff error. Such a routine could be as follows.

ALGORITHM 1.1
(Computing an interval whose end points are machine numbers and which encloses the range of x²)

⁵There are small differences in current definitions of extended interval arithmetic. For example, in some systems, −∞ and ∞ are not considered numbers, but just descriptive symbols. In those systems, [1, 2]/[−3, 4] = (−∞, −1/3] ∪ [1/4, ∞) = R \ (−1/3, 1/4). See [31] for a theoretical analysis of extended arithmetic.
⁶who also was a major contributor to the IEEE 754 standard.


INPUT: x = [x̲, x̄].

OUTPUT: a machine-representable interval that contains the range of x² for x ∈ x.

IF x̲ ≥ 0 THEN
   RETURN [x̲², x̄²], where x̲² is computed with downward rounding and x̄² is computed with upward rounding.
ELSE IF x̄ ≤ 0 THEN
   RETURN [x̄², x̲²], where x̄² is computed with downward rounding and x̲² is computed with upward rounding.
ELSE
   1. Compute machine-representable upper bounds s̲ and s̄ for x̲² and x̄²; that is, compute x̲² and x̄² with upward rounding, so that x̲² ≤ s̲ and x̄² ≤ s̄.
   2. RETURN [0, max{s̲, s̄}].
END IF

END ALGORITHM 1.1
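Algorithm 1.1's case analysis can be sketched as follows (ignoring the directed rounding a rigorous implementation would add; the function name is ours):

```python
def interval_square(lo, hi):
    """Enclosure of { x*x : lo <= x <= hi }, following Algorithm 1.1."""
    if lo >= 0.0:                          # interval entirely nonnegative
        return (lo * lo, hi * hi)
    if hi <= 0.0:                          # interval entirely nonpositive
        return (hi * hi, lo * lo)
    return (0.0, max(lo * lo, hi * hi))    # interval straddles zero

# x**2 - 1 over [-2, 2], with x occurring only once:
sq = interval_square(-2.0, 2.0)            # (0.0, 4.0)
enclosure = (sq[0] - 1.0, sq[1] - 1.0)     # (-1.0, 3.0): the exact range
```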

With Algorithm 1.1, and rewriting f(x) from (1.3) as in (1.5), we obtain

f([−2, 2]) = [−2, 2]² − 1 = [0, 4] − 1 = [−1, 3],

which in this case is equal to the exact range of f over [−2, 2].

In fact, this illustrates a general principle: if each variable in the expression occurs only once, then interval arithmetic gives the exact range, to within roundout error. We state this formally as follows.

THEOREM 1.8
(Fundamental theorem of interval arithmetic) Suppose f(x1, x2, . . . , xn) is an algebraic expression in the variables x1 through xn (or a computer program with inputs x1 through xn), and suppose that this expression is evaluated with interval arithmetic. The algebraic expression or computer program can contain the four elementary operations and operations such as xⁿ, sin(x), exp(x), and log(x), etc., as long as the interval values of these functions contain their ranges over the input intervals. Then:

1. The interval value f(x1, . . . , xn) contains the range of f over the interval vector (or box) (x1, . . . , xn).


2. If the single functions (the elementary operations and functions xⁿ, etc.) have interval values that represent their exact ranges, and if each variable xi, 1 ≤ i ≤ n, occurs only once in the expression for f, then the values of f obtained by interval arithmetic represent the exact ranges of f over the input intervals.

If the expression for f contains one or more variables more than once, then overestimation of the range can occur due to interval dependency. For example, when we evaluate our example function f([−2, 2]) according to (1.3), the first factor, [−1, 3], is the exact range of x + 1 for x ∈ [−2, 2], while the second factor, [−3, 1], is the exact range of x − 1 for x ∈ [−2, 2]. Thus, [−9, 3] is the exact range of f(x1, x2) = (x1 + 1)(x2 − 1) for x1 and x2 independent, x1 ∈ [−2, 2], x2 ∈ [−2, 2].

We now present some definitions and theorems to clarify the practical consequences of interval dependency.

DEFINITION 1.6 An expression for f(x1, . . . , xn) which is written so that each variable occurs only once is called a single use expression, or SUE.

Fortunately, we do not need to transform every expression into a single use expression for interval computations to be of value. In particular, the interval dependency becomes less as the widths of the input intervals become smaller. The following formal definition will help us to describe this precisely.

DEFINITION 1.7 Suppose an interval evaluation f(x1, . . . , xn) gives [a, b] as a result interval, but the exact range {f(x1, . . . , xn) : xi ∈ xi, 1 ≤ i ≤ n} is [c, d] ⊆ [a, b]. We define the excess width E(f; x1, . . . , xn) in the interval evaluation f(x1, . . . , xn) by E(f; x1, . . . , xn) = (c − a) + (b − d).

For example, the excess width in evaluating f(x), represented as (x + 1)(x − 1), over x = [−2, 2] is (−1 − (−9)) + (3 − 3) = 8. In general, we have:

THEOREM 1.9
Suppose f(x1, x2, . . . , xn) is an algebraic expression in the variables x1 through xn (or a computer program with inputs x1 through xn), and suppose that this expression is evaluated with interval arithmetic, as in Theorem 1.8, to obtain an interval enclosure f(x1, . . . , xn) of the range of f for xi ∈ xi, 1 ≤ i ≤ n. Then, if E(f; x1, . . . , xn) is as in Definition 1.7, we have

E(f; x1, . . . , xn) = O( max_{1 ≤ i ≤ n} w(xi) ),

where w(x) denotes the width of the interval x.


That is, the overestimation becomes smaller as the uncertainty in the arguments to the function becomes smaller.

Interval evaluations as in Theorem 1.9 are termed first-order interval extensions. It is not difficult to obtain second-order extensions, where required. (See the exercises below.)

1.3.2 Application of Interval Arithmetic: Examples

We give one such example here.

Example 1.22
Using 4-digit decimal floating point arithmetic, compute an interval enclosure for the first two digits of e, and prove that these two digits are correct.

Solution: The fifth-degree Taylor polynomial representation for e is

e = 1 + 1 + 1/2! + 1/3! + 1/4! + 1/5! + (1/6!) e^ξ,

for some ξ ∈ [0, 1]. If we assume we know e < 3, and we assume we know that e^x is an increasing function of x, then the error term is bounded by

|(1/6!) e^ξ| ≤ 3/720 < 0.005,

so this fifth-degree polynomial representation should be adequate. We will evaluate each term with interval arithmetic, and we will replace e^ξ by [1, 3]. We obtain the following computation:

[1.000, 1.000] + [1.000, 1.000] → [2.000, 2.000]
[1.000, 1.000] / [2.000, 2.000] → [0.5000, 0.5000]
[2.000, 2.000] + [0.5000, 0.5000] → [2.500, 2.500]
[1.000, 1.000] / [6.000, 6.000] → [0.1666, 0.1667]
[2.500, 2.500] + [0.1666, 0.1667] → [2.666, 2.667]
[1.000, 1.000] / [24.00, 24.00] → [0.04166, 0.04167]
[2.666, 2.667] + [0.04166, 0.04167] → [2.707, 2.709]
[1.000, 1.000] / [120.0, 120.0] → [0.008333, 0.008334]
[2.707, 2.709] + [0.008333, 0.008334] → [2.715, 2.718]
[1.000, 1.000] / [720.0, 720.0] → [0.001388, 0.001389]
[0.001388, 0.001389] × [1, 3] → [0.001388, 0.004167]
[2.715, 2.718] + [0.001388, 0.004167] → [2.716, 2.723]

Since we used outward rounding in these computations, this constitutes a mathematical proof that e ∈ [2.716, 2.723].

Note:


1. These computations can be done automatically on a computer, as simply as evaluating the function in floating point arithmetic. We will explain some programming techniques for this in Chapter 6, Section 6.2.

2. The solution is illustrative. More sophisticated methods, such as argument reduction, would be used in practice to bound values of e^x more accurately and with fewer operations.

Proofs of the theorems, as well as greater detail, appear in various texts on interval arithmetic. A good book on interval arithmetic is R. E. Moore's classic text [27], although numerous more recent monographs and reviews are available. A World Wide Web search on the term "interval computations" will lead to some of these.

A general introduction to interval computations is [26]. That work gives not only a complete introduction, with numerous examples and explanations of pitfalls, but also provides examples with intlab, a free matlab toolbox for interval computations, and reference material for intlab. If you have matlab available, we recommend intlab for the exercises in this book involving interval computations.

1.4 Programming Environments

Modern scientific computing (with floating point numbers) is usually done with high-level "imperative" (as opposed to "functional") programming languages. Common programming environments in use for general scientific computing today are Fortran (or FORmula TRANslation), C/C++, and matlab. Fortran is the original such language, with its origins in the late 1950s. There is a large body of high-quality, publicly available software in Fortran for common computations in numerical analysis and scientific computing. Such software can be found, for example, on NETLIB, at

http://www.netlib.org

Fortran has evolved over the years, becoming a modern, multi-faceted language with the Fortran 2003 standard. Throughout, the emphasis by both the standardization committee and suppliers of compilers for Fortran has been on features that simplify programming of solutions to large problems in numerical analysis and scientific computing, and on features that enable high performance,

especially on computers that can process vectors and matrices efficientlyThe ldquoCrdquo language originally developed in conjunction with the Unix oper-

ating system was originally meant to be a higher-level language for designingand accessing the operating system but has become more ubiquitous sincethen C++ appearing in the late 1980rsquos was the first widely available lan-

8132019 num chap 1

Mathematical Review and Computer Arithmetic 31

guage7 to allow the object-oriented programming paradigm In recent years

computer science departments have favored teaching C++ over teaching For-tran and Fortran has fallen out of favor in relative terms However Fortran isstill favored in certain large-scale applications such as fluid dynamics (eg inweather prediction and similar simulations) and some courses are still offeredin it in engineering schools However some people still think of Fortran asthe now somewhat rudimentary language known as FORTRAN 77

Reasonably high-quality compilers for both Fortran and C/C++ are available free of charge with Linux operating systems. Fortran 2003, largely implemented in these compilers, has a standardized interface to C, so functions written in C can be called from Fortran programs, and vice versa. These compilers include interactive graphical-user-interface-oriented debuggers, such as "insight," available with the Linux operating system. Commercially available compilation and debugging systems are also available under Windows.

The matlab system has become increasingly popular over the last two decades or so. matlab (or MATrix LABoratory) began in the early 1980's as a National Science Foundation project, written by Cleve Moler in FORTRAN 66, to provide an interactive environment for computing with matrices and vectors; it has since evolved to be both an interactive environment and a full-featured programming language. matlab is highly favored in courses such as this, because of the ease of programming, debugging, and general use (such as graphing), and because of the numerous toolboxes, supplied by both Mathworks (Cleve Moler's company) and others, for many basic computing tasks and applications. The main drawback to use of matlab in all scientific computations is that the language is interpretive; that is, matlab translates each line of a program to machine language each time that it executes the line. This makes complicated programs that involve nested iterations much slower (a factor of 60 or more) than comparable programs written in a compiled language such as C or Fortran. However, functions compiled from Fortran or C/C++ can be called from matlab. A common strategy has been to initially develop algorithms in matlab, then translate all or part of the program to a compilable language, as necessary for efficiency.

One perceived disadvantage of matlab is that it is proprietary. Undesirable possible consequences are that it is not free, and that there is no official guarantee that it will be available forever, unchanged. However, its use has become so widespread in recent years that these concerns do not seem to be major. Several projects, including "Octave" and "Scilab," have produced free products that partially support the matlab programming language. The most widely distributed of these, "Octave," is integrated into Linux systems. However, the object-oriented features of Octave are rudimentary compared to those of matlab, and some toolboxes, such as intlab (which we will mention later), will not function with Octave.

⁷with others, including Fortran, to follow


Alternative systems sometimes used for scientific computing are computer algebra systems. Perhaps the most common of these are Mathematica and Maple, while a free such system, under development, is "SAGE." These systems admit a different way of thinking about programming, termed functional programming, in which rules are defined and available at all times, to automatically simplify expressions that are presented to the system. (In contrast, in imperative programming, a sequence of commands is executed one after the other.) Although these systems have become comprehensive, they are based in computations of a different character, rather than in the floating point computations and linear algebra involved in numerical analysis and scientific computing.

We will use matlab in this book to illustrate the concepts, techniques, and applications. With newer versions of matlab, a student can study how to use the system and make programs largely by using the matlab help system. The first place to turn will be the "Getting started" demos, which in newer versions are presented as videos. There are also many books devoted to the use of matlab. Furthermore, we will be giving examples throughout this book.

matlab programs can be written as matlab scripts and matlab functions.

Example 1.23

The matlab script we used to produce the table following Example 1.7 (on page 8) is:

a = 2
x = 2
xold = x
err_old = 1
for k=0:10
    k
    x
    err = x - sqrt(2)
    err
    ratio = err/err_old^2
    err_old = err
    x = x/2 + 1/x
end

Example 1.24

The matlab script we used to produce the table in Example 1.8 (on page 9) is:

format long
a = 2
x = 2
xold = x
err_old = 1
for k=0:25
    k
    x
    err = x - sqrt(2)
    err
    ratio = err/err_old
    err_old = err
    x = x - x^2/3.5 + 2/3.5
end

An excellent alternative textbook that focuses on matlab functions is Cleve Moler's Numerical Computing with Matlab [25]. An on-line version, along with ".m" files, etc., is currently available at http://www.mathworks.com/moler/chapters.html

1.5 Applications

The purpose of the methods and techniques in this book, ultimately, is to provide both accurate predictions and insight into practical problems. This includes understanding, predicting, and managing or controlling the evolution of ecological systems and epidemics; designing and constructing durable but inexpensive bridges, buildings, roads, and water control structures; understanding chemical and physical processes; designing chemical plants and electronic components and systems; minimizing costs or maximizing delivery of products or services within companies and governments; etc. To achieve these goals, the numerical methods are a small part of the overall modeling process, which can be viewed as consisting of the following steps:

Identify the problem: This is the first step in translating an often vague situation into a mathematical problem to be solved. What questions must be answered, and how can they be quantified?

Assumptions: Which factors are to be ignored, and which are important? The real world is usually significantly more complicated than mathematical models of it, and simplifications must be made, because some factors are poorly understood, because there isn't enough data to determine some minor factors, or because it is not practical to accurately solve the resulting equations unless the model is simplified. For example, the theory of relativity and variations in the acceleration of gravity, due to the fact that the earth is not exactly round and due to the fact that the density differs from point to point on the surface of the earth, in principle will affect the trajectory of a baseball as it leaves the bat. However, such effects can be ignored when we write down a model of the trajectory of the baseball. On the other hand, we need to include such effects if we are measuring the change in distance between two satellites in a tandem orbit, to detect the location of mineral deposits on the surface of the earth.

Construction: In this step, we actually translate the problem into mathematical language.

Analysis: We solve the mathematical problem. Here is where the numerical techniques in this book come into play. With more complicated models, there is an interplay between the previous three steps and this solution process: we may need to simplify the model to enable a practical solution. Presentation of the result is also important here, to maximize the usefulness of the results. In the early days of scientific computing, printouts of numbers were used, but increasingly, results are presented as two- and three-dimensional graphs and as movies.

Interpretation: The numerical solution is compared to the original problem. If it does not make sense, go back and reformulate the assumptions.

Validation: Compare the model to real data. For example, in climate models, the model might be used to predict climate changes in past years, before it is used to predict future climate changes.

Note that there is an intrinsic error introduced in the modeling process (such as when certain phenomena are ignored) that is outside the scope of our study of numerical methods. Such error can only be measured indirectly, through the interpretation and validation steps. In the model solution process (the "analysis" step), errors are also introduced due to roundoff error and the approximation process. We have seen that such error consists of approximation error and roundoff error. In a study of numerical methods and numerical analysis, we quantify and find bounds on such errors. Although this may not be the major source of error, it is important to know. Consequences of this type of error might be that a good model is rejected, that incorrect conclusions are deduced about the process being modeled, etc. The authors of this book have personal experience with these events.

Errors in the modeling process can sometimes be quantified in the solution process. If the model depends on parameters that are not known precisely, but bounds on those parameters are known, knowledge of these bounds can sometimes be incorporated into the mathematical equations, and the set of possible solutions can sometimes be computed or bounded. One tool that sometimes works is interval arithmetic. Other tools, less mathematically definite but applicable in different situations, are statistical methods and computing solutions to the model for many different values of the parameters.

Throughout this book, we introduce applications from many areas.

Example 1.25

The formula for the net capacitance when two capacitors, of values x and y, are connected in series, is

    z = xy / (x + y).

Suppose the measured values of x and y are x = 1 and y = 2, respectively. Estimate the range of possible values of z, given that the true values of x and y are known to be within ±10% of the measured values.

In this example, the identification, assumptions, and construction have already been done. (It is well known how capacitances in a linear electrical circuit behave.) We are asked to analyze the error in the output of the computation, due to errors in the data. We may proceed using interval arithmetic, relying on the accuracy assumptions for the measured values. In particular, these assumptions imply that x ∈ [0.9, 1.1] and y ∈ [1.8, 2.2]. We will plug these intervals into the expression for z, but we first use Theorem 1.8, part (2), as a guide to rewrite the expression for z so that x and y only occur once. (We do this to obtain sharp bounds on the range, without overestimation.) Dividing the numerator and denominator of z by xy, we obtain

    z = 1 / (1/x + 1/y).

We use the intlab toolbox⁸ for matlab to evaluate z. We have the following dialog in matlab's command window:

>> intvalinit('DisplayInfsup')
===> Default display of intervals by infimum/supremum
>> x = intval('[0.9,1.1]')
intval x =
[    0.8999,    1.1001]
>> y = intval('[1.8,2.2]')
intval y =
[    1.7999,    2.2001]
>> z = 1/(1/x + 1/y)
intval z =
[    0.5999,    0.7334]
>> format long
>> z
intval z =

⁸If one has matlab, intlab is available free of charge for non-commercial use from http://www.ti3.tu-harburg.de/~rump/intlab/


[   0.59999999999999,   0.73333333333334]

>>

Thus, the capacitance must lie between 0.5999 and 0.7334. Note that x and y are input as strings. This is to assure that roundoff errors in converting the decimal expressions 0.9, 1.1, 1.8, and 2.2 into internal binary format are taken into account. See [26] for more examples of the use of intlab.
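The overestimation that motivated rewriting z can be seen with a toy interval class. The sketch below is plain Python for illustration (the helper names `i_add`, `i_mul`, and `i_div` are ours, and unlike intlab, this sketch does not round the endpoints outward):

```python
# A toy interval arithmetic: an interval is a (lo, hi) pair.

def i_add(a, b):
    return (a[0] + b[0], a[1] + b[1])

def i_mul(a, b):
    p = (a[0]*b[0], a[0]*b[1], a[1]*b[0], a[1]*b[1])
    return (min(p), max(p))

def i_div(a, b):
    assert b[0] > 0 or b[1] < 0        # the divisor must not contain zero
    p = (a[0]/b[0], a[0]/b[1], a[1]/b[0], a[1]/b[1])
    return (min(p), max(p))

x, y = (0.9, 1.1), (1.8, 2.2)
one = (1.0, 1.0)

# Naive form z = xy/(x + y): x and y each appear twice -> overestimation.
z_naive = i_div(i_mul(x, y), i_add(x, y))
# Rewritten form z = 1/(1/x + 1/y): each variable appears once -> sharp.
z_sharp = i_div(one, i_add(i_div(one, x), i_div(one, y)))

print(z_naive)   # roughly (0.49, 0.90): too wide
print(z_sharp)   # roughly (0.60, 0.7333): the sharp range
```

The rewritten form reproduces the sharp bounds from the intlab dialog, while the naive form is noticeably wider, even though both enclosures are mathematically correct.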

1.6 Exercises

1. Write down a polynomial p(x) such that |S(x) - p(x)| ≤ 10^-10 for -0.2 ≤ x ≤ 0.2, where

       S(x) = { sin(x)/x  if x ≠ 0,
              { 1         if x = 0.

   Note: sinc(x) = S(πx) = sin(πx)/(πx) is the "sinc" function (well known in signal processing, etc.).

   (a) Show that your polynomial p satisfies the condition |sinc(x) - p(x)| ≤ 10^-10 for x ∈ [-0.2, 0.2].
       Hint: You can obtain polynomial approximations with error terms for sinc(x) by writing down Taylor polynomials and corresponding error terms for sin(x), then dividing these by x. This can be easier than trying to differentiate sinc(x). For the proof part, you can use, for example, the Taylor polynomial remainder formula or the alternating series test, and you can use interval arithmetic to obtain bounds.

   (b) Plot your polynomial approximation and sinc(x) on the same graph,
       (i) over the interval [-0.2, 0.2];
       (ii) over the interval [-3, 3];
       (iii) over the interval [-10, 10].

2. Suppose f has a continuous third derivative. Show that

       | (f(x + h) - f(x - h)) / (2h) - f'(x) | = O(h^2).

3. Suppose f has a continuous fourth derivative. Show that

       | (f(x + h) - 2f(x) + f(x - h)) / h^2 - f''(x) | = O(h^2).


4. Let a = 0.41, b = 0.36, and c = 0.70. Assuming a 2-digit decimal computer arithmetic with rounding, show that (a - b)/c ≠ a/c - b/c when using this arithmetic.

5. Write down a formula relating the unit roundoff δ of Definition 1.3 (page 14) and the machine epsilon ε_m defined on page 20.

6. Store and run the following matlab script. What are your results? What does the script compute? Can you say anything about the computer arithmetic underlying matlab?

       eps = 1
       x = 1+eps
       while(x~=1)
           eps = eps/2; x = 1+eps;
       end
       eps = eps+(2*eps)^2
       y = 1+eps
       y-1

7. Suppose, for illustration, we have a system with base β = 10, t = 3 decimal digits in the mantissa, and L = -9, U = 9 for the exponent. For example, 0.123 × 10^4, that is, 1230, is a machine number in this system. Suppose also that "round to nearest" is used in this system.

   (a) What is HUGE for this system?
   (b) What is TINY for this system?
   (c) What is the machine epsilon ε_m for this system?
   (d) Let f(x) = sin(x) + 1.
        i. Write down fl(f(0)) and fl(f(0.0008)) in normalized format for this toy system.
       ii. Compute fl(fl(f(0.0008)) - fl(f(0))). On the other hand, what is the nearest machine number to the exact value of f(0.0008) - f(0)?
      iii. Compute fl(fl(f(0.0008)) - fl(f(0))) / fl(0.0008). Compare this to the nearest machine number to the exact value of (f(0.0008) - f(0))/0.0008, and to f'(0).

8. Let f(x) = (ln(x + 1) - ln(x)) / 2.

   (a) Use four-digit decimal arithmetic with rounding to evaluate f(100,000).


   (b) Use the Mean Value Theorem to approximate f(x) in a form that avoids the loss of significant digits, and use this form to evaluate f(x) for x = 100,000 once again.

   (c) Compare the relative errors for the answers obtained in (a) and (b).

9. Compute the condition number of f(x) = e^(√(x^2 - 1)), x > 1, and discuss any possible ill-conditioning.

10. Let f(x) = (sin(x))^2 + x/2. Use interval arithmetic to prove that there are no solutions to f(x) = 0 for x ∈ [-1, -0.8].


Example 1.8

As an example of linear convergence, consider the iteration

    x_{k+1} = x_k - x_k^2/3.5 + 2/3.5.

We obtain the following table:

k    x_k                  |x_k - √2|         |x_k - √2| / |x_{k-1} - √2|
0    2                    0.5858 × 10^0      —
1    1.428571428571429    0.1436 × 10^-1     0.2451 × 10^-1
2    1.416909620991254    0.2696 × 10^-2     0.1878
3    1.414728799831946    0.5152 × 10^-3     0.1911
4    1.414312349239392    0.9879 × 10^-4     0.1917
5    1.414232514607664    0.1895 × 10^-4     0.1918
6    1.414217198786659    0.3636 × 10^-5     0.1919
7    1.414214260116949    0.6955 × 10^-6     0.1919
8    1.414213696254626    0.1339 × 10^-6     0.1919
...
19   1.414213562373097    0.1554 × 10^-14    —

Here, the constant C in the linear convergence, to four significant digits, appears to be 0.1919 ≈ 1/5. That is, the error is reduced by approximately a factor of 5 each iteration. We can think of this as obtaining one more correct base-5 digit on each iteration.
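The constant C can also be predicted and checked directly: for a fixed point iteration x_{k+1} = g(x_k) with g(x) = x - x^2/3.5 + 2/3.5, linear convergence theory gives the asymptotic error ratio |g'(√2)| = |1 - 2√2/3.5| ≈ 0.1919. A short check, sketched here in Python (the book's own scripts use matlab):

```python
import math

# The iteration x_{k+1} = g(x_k), with g(x) = x - x^2/3.5 + 2/3.5.
# For linear convergence, the error ratio should settle near |g'(sqrt(2))|.

x = 2.0
err_old = abs(x - math.sqrt(2))
ratios = []
for k in range(8):
    x = x - x**2 / 3.5 + 2 / 3.5
    err = abs(x - math.sqrt(2))
    ratios.append(err / err_old)   # |x_k - sqrt(2)| / |x_{k-1} - sqrt(2)|
    err_old = err

C = abs(1 - 2 * math.sqrt(2) / 3.5)   # |g'(sqrt(2))|, about 0.1919
print(ratios[-1], C)
```

After a handful of iterations the observed ratio agrees with the predicted constant to several digits, matching the table's value of 0.1919.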

1.2 Computer Arithmetic

In numerical solution of mathematical problems, two common types of error are:

1. Method (algorithm or truncation) error. This is the error due to approximations made in the numerical method.

2. Rounding error. This is the error made due to the finite number of digits available on a computer.


Example 1.9

By the mean value theorem for integrals (Theorem 1.2, as in Example 1.6 on page 7), if f ∈ C^2[a, b], then

    f'(x) = (f(x + h) - f(x))/h + (1/h) ∫ from x to x+h of f''(t)(x + h - t) dt,

and

    | (1/h) ∫ from x to x+h of f''(t)(x + h - t) dt | ≤ ch.

Thus, f'(x) ≈ (f(x + h) - f(x))/h, and the error is O(h). We will call this the method error or truncation error, as opposed to roundoff errors due to using machine approximations.

Now consider f(x) = ln x, and approximate f'(3) ≈ (ln(3 + h) - ln 3)/h for h small, using a calculator having 11 digits. The following results were obtained:

h        (ln(3 + h) - ln(3))/h    Error = 1/3 - (ln(3 + h) - ln(3))/h = O(h)
10^-1    0.3278982                5.44 × 10^-3
10^-2    0.332779                 5.54 × 10^-4
10^-3    0.3332778                5.55 × 10^-5
10^-4    0.333328                 5.33 × 10^-6
10^-5    0.333330                 3.33 × 10^-6
10^-6    0.333300                 3.33 × 10^-5
10^-7    0.333                    3.33 × 10^-4
10^-8    0.33                     3.33 × 10^-3
10^-9    0.3                      3.33 × 10^-2
10^-10   0.0                      3.33 × 10^-1

One sees that in the first four steps, the error decreases by a factor of 10 as h is decreased by a factor of 10. (That is, the method error dominates.) However, starting with h = 0.00001, the error increases. (The error due to a finite number of digits, i.e., roundoff error, dominates.)

There are two possible ways to reduce rounding error:

1. The method error can be reduced by using a more accurate method. This allows a larger h to be used, thus avoiding roundoff error. Consider

       f'(x) = (f(x + h) - f(x - h)) / (2h) + error, where the error is O(h^2).

   h       (ln(3 + h) - ln(3 - h))/(2h)    error
   0.1     0.3334568                       1.24 × 10^-4
   0.01    0.3333345                       1.23 × 10^-6
   0.001   0.3333333                       1.91 × 10^-8


   The error decreases by a factor of 100 as h is decreased by a factor of 10.

2. Rounding error can be reduced by using more digits of accuracy, such as by using double precision (or multiple precision) arithmetic.
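Both effects, method error shrinking like O(h) or O(h^2) and roundoff error growing as h shrinks, are easy to reproduce in IEEE double precision (about 16 digits instead of the calculator's 11, so the crossover happens at smaller h). A sketch in Python:

```python
import math

# Forward vs. central differences for f(x) = ln x at x = 3 (f'(3) = 1/3)
# in double precision: the method error falls like O(h) or O(h^2), but for
# very small h the roundoff error, roughly machine-epsilon/h, takes over.

def fwd(h):
    return (math.log(3 + h) - math.log(3)) / h

def ctr(h):
    return (math.log(3 + h) - math.log(3 - h)) / (2 * h)

hs = (1e-1, 1e-4, 1e-8, 1e-13)
errs_fwd = {h: abs(fwd(h) - 1/3) for h in hs}
errs_ctr = {h: abs(ctr(h) - 1/3) for h in hs}

for h in hs:
    print(h, errs_fwd[h], errs_ctr[h])
```

The forward-difference error improves from h = 0.1 to h = 10^-4, but by h = 10^-13 it is far worse than at h = 10^-8, exactly the behavior shown in the 11-digit calculator table above. The central difference is more accurate at any given h.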

To fully understand and avoid roundoff error, we should study some details of how computers and calculators represent and work with approximate numbers.

1.2.1 Floating Point Arithmetic and Rounding Error

Let β be a positive integer, the base of the computer system. (Usually, β = 2 (binary) or β = 16 (hexadecimal).) Suppose a number x has the exact base-β representation

    x = ±(0.α_1 α_2 α_3 ··· α_t α_{t+1} ···) × β^m = ±q β^m,

where q is the mantissa, β is the base, m is the exponent, 1 ≤ α_1 ≤ β - 1, and 0 ≤ α_i ≤ β - 1 for i > 1.

On a computer, we are restricted to a finite set of floating point numbers F = F(β, t, L, U), of the form

    x* = ±(0.a_1 a_2 ··· a_t) × β^m,

where 1 ≤ a_1 ≤ β - 1, 0 ≤ a_i ≤ β - 1 for 2 ≤ i ≤ t, L ≤ m ≤ U, and t is the number of digits. (In most floating point systems, L is about -64 to -1000, and U is about 64 to 1000.)

Example 1.10

(binary) β = 2:

    x* = (0.1011)_2 × 2^3 = (1 × 1/2 + 0 × 1/4 + 1 × 1/8 + 1 × 1/16) × 8 = 11/2 = 5.5 (decimal).

REMARK 1.1  Most numbers cannot be exactly represented on a computer. Consider x = 10.1 = (1010.0001100110011001...)_2. If L = -127, U = 127, t = 24, and β = 2, then x ≈ x* = (0.101000011001100110011001)_2 × 2^4.
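This representation error is visible in any binary floating point system. In Python, for example, the `decimal` module can display the exact value of the binary number actually stored (a quick illustration, outside the text's toy system):

```python
from decimal import Decimal

# 10.1 has no finite binary expansion, so the stored double only
# approximates it.  Decimal(x) shows the exact decimal value of the
# binary number that is actually stored.

stored = Decimal(10.1)
print(stored)                       # 10.0999999999999996...
assert stored != Decimal("10.1")    # the representation error of the remark

# The same effect with other non-representable decimals:
print(0.1 + 0.2 == 0.3)             # False: both sides are rounded
```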

Question: Given a real number x, how do we define a floating point number fl(x) in F, such that fl(x) is close to x? On modern machines, one of the following four ways is used to approximate a real number x by a machine-representable number fl(x):

round down: fl(x) = x↓, the nearest machine-representable number to the real number x that is less than or equal to x;


round up: fl(x) = x↑, the nearest machine number to the real number x that is greater than or equal to x;

round to nearest: fl(x) is the nearest machine number to the real number x;

round to zero, or "chopping": fl(x) is the nearest machine number to the real number x that is closer to 0 than x. The term "chopping" is used because we simply "chop" the expansion of the real number, that is, we simply ignore the digits in the expansion of x beyond the t-th one.

The default on modern systems is usually round to nearest, although chopping is faster, or requires less circuitry. Round down and round up may be used, but with care, to produce results from a string of computations that are guaranteed to be less than or greater than the exact result.

Example 1.11

Let β = 10, t = 5, and x = 0.12345666··· × 10^7. Then

    fl(x) = 0.12345 × 10^7 (chopping),
    fl(x) = 0.12346 × 10^7 (round to nearest).

(In this case, round down corresponds to chopping, and round up corresponds to round to nearest.)
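Rounding in a toy base-10, t-digit system can be emulated with a fixed-precision decimal library. Below is a sketch of the example above using Python's `decimal` contexts; for positive numbers, ROUND_DOWN plays the role of chopping:

```python
from decimal import Decimal, Context, ROUND_DOWN, ROUND_HALF_EVEN

# A base-10, t = 5 system: Context.plus() rounds its argument to the
# context's precision using the context's rounding mode.

x = Decimal("1234566.6")                             # 0.12345666 x 10^7
chop    = Context(prec=5, rounding=ROUND_DOWN)       # chopping (for x > 0)
nearest = Context(prec=5, rounding=ROUND_HALF_EVEN)  # round to nearest

print(chop.plus(x))      # 1.2345E+6, i.e. 0.12345 x 10^7
print(nearest.plus(x))   # 1.2346E+6, i.e. 0.12346 x 10^7
```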

See Figure 1.2 for an example with β = 10 and t = 1. In that figure, the exhibited floating point numbers are (0.1) × 10^1, (0.2) × 10^1, ..., (0.9) × 10^1, 0.1 × 10^2.

[FIGURE 1.2: An example floating point system: β = 10, t = 1, and m = 1. Successive floating point numbers are separated by β^(m-t) = 10^0 = 1, starting at β^(m-1) = 1 and ending at β^m = 10.]

Example 1.12

Let a = 0.410, b = 0.000135, and c = 0.000431. Assuming 3-digit decimal computer arithmetic with rounding to nearest, does a + (b + c) = (a + b) + c when using this arithmetic?


Following the "rounding to nearest" definition of fl, we emulate the operations a machine would do, as follows:

    a ← 0.410 × 10^0,  b ← 0.135 × 10^-3,  c ← 0.431 × 10^-3,

and

    fl(b + c) = fl(0.135 × 10^-3 + 0.431 × 10^-3)
              = fl(0.566 × 10^-3)
              = 0.566 × 10^-3,

so

    fl(a + 0.566 × 10^-3) = fl(0.410 × 10^0 + 0.566 × 10^-3)
                          = fl(0.410 × 10^0 + 0.000566 × 10^0)
                          = fl(0.410566 × 10^0)
                          = 0.411 × 10^0.

On the other hand,

    fl(a + b) = fl(0.410 × 10^0 + 0.135 × 10^-3)
              = fl(0.410000 × 10^0 + 0.000135 × 10^0)
              = fl(0.410135 × 10^0)
              = 0.410 × 10^0,

so

    fl(0.410 × 10^0 + c) = fl(0.410 × 10^0 + 0.431 × 10^-3)
                         = fl(0.410 × 10^0 + 0.000431 × 10^0)
                         = fl(0.410431 × 10^0)
                         = 0.410 × 10^0 ≠ 0.411 × 10^0.

Thus, the associative law does not hold for floating point arithmetic with "round to nearest." Furthermore, this illustrates that accuracy is improved if numbers of like magnitude are added first in a sum.
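The computation above can be replayed mechanically with a 3-significant-digit decimal context, which rounds every intermediate result just as the hand emulation does (a Python sketch):

```python
from decimal import Decimal, Context, ROUND_HALF_UP

# Replay the 3-digit computation: c3.add() rounds each intermediate
# result to 3 significant digits before continuing.

c3 = Context(prec=3, rounding=ROUND_HALF_UP)
a, b, c = Decimal("0.410"), Decimal("0.000135"), Decimal("0.000431")

left  = c3.add(a, c3.add(b, c))    # a + (b + c)  ->  0.411
right = c3.add(c3.add(a, b), c)    # (a + b) + c  ->  0.410

print(left, right)
```

The two groupings give different answers, confirming the failure of associativity under rounding.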

The following error bound is useful in some analyses.

THEOREM 1.5

    |x - fl(x)| ≤ (p/2) |x| β^(1-t),

where p = 1 for rounding and p = 2 for chopping.


DEFINITION 1.3  δ = (p/2) β^(1-t) is called the unit roundoff error.

Let ε = (fl(x) - x)/x. Then fl(x) = (1 + ε)x, where |ε| ≤ δ. With this, we have the following:

THEOREM 1.6

Let ⊙ denote the operation +, -, ×, or ÷, and let x and y be machine numbers. Then

    fl(x ⊙ y) = (x ⊙ y)(1 + ε), where |ε| ≤ δ = (p/2) β^(1-t).

Roundoff error that accumulates as a result of a sequence of arithmetic operations can be analyzed using this theorem. Such an analysis is called forward error analysis.

Because of the properties of floating point arithmetic, it is unreasonable to demand strict absolute tolerances when the exact result is large.

Example 1.13

Suppose β = 10 and t = 3 (3-digit decimal arithmetic), and suppose we wish to compute 10^4 π with a computed value x such that |10^4 π - x| < 10^-2. The closest floating point number in our system to 10^4 π is x = 0.314 × 10^5 = 31400. However, |10^4 π - x| = 15.926···. Hence, it is impossible to find a number x in the system with |10^4 π - x| < 10^-2.

The error |10^4 π - x| in this example is called the absolute error in approximating 10^4 π. We see that absolute error is not an appropriate measure of error when using floating point arithmetic. For this reason, we use relative error.

DEFINITION 1.4  Let x* be an approximation to x. Then |x - x*| is called the absolute error, and |(x - x*)/x| is called the relative error.

For example,

    |(x - fl(x))/x| ≤ δ = (p/2) β^(1-t)  (the unit roundoff error).

1.2.1.1 Expression Evaluation and Condition Numbers

We now examine some common situations in which roundoff error can become large, and explain how to avoid many of these situations.


Example 1.14

Let β = 10, t = 4, and p = 1. (Thus, δ = (1/2) × 10^-3 = 0.0005.) Let x = 0.5795 × 10^5 and y = 0.6399 × 10^5. Then

    fl(x + y) = 0.1219 × 10^6 = (x + y)(1 + ε_1),  ε_1 ≈ -3.28 × 10^-4,  |ε_1| < δ,

and

    fl(xy) = 0.3708 × 10^10 = (xy)(1 + ε_2),  ε_2 ≈ -5.95 × 10^-5,  |ε_2| < δ.

(Note: x + y = 0.12194 × 10^6 and xy = 0.37082205 × 10^10.)

Example 1.15

Suppose β = 10 and t = 4 (4-digit arithmetic), and suppose x_1 = 10000 and x_2 = x_3 = ··· = x_1001 = 1. Then

    fl(x_1 + x_2) = 10000,
    fl(x_1 + x_2 + x_3) = 10000,
    ...
    fl(x_1 + x_2 + ··· + x_1001) = 10000,

when we sum forward, from x_1. But, going backwards,

    fl(x_1001 + x_1000) = 2,
    fl(x_1001 + x_1000 + x_999) = 3,
    ...
    fl(x_1001 + x_1000 + ··· + x_1) = 11000,

which is the correct sum.

This example illustrates the point that large relative errors can occur when many small numbers are added to a large number, or when a very large number of small, almost-equal numbers is summed. To avoid such large relative errors, one can sum from the smallest number to the largest number. However, this will not work if the numbers are all approximately equal. In such cases, one possibility is to group the numbers into sets of two adjacent numbers, summing two almost equal numbers together. One then groups those results into sets of two and sums these together, continuing until the total sum is reached. In this scheme, two almost equal numbers are always being summed, and the large relative error from repeatedly summing a small number to a large number is avoided.
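The pairwise grouping scheme just described can be sketched in a few lines. The demonstration below (Python, double precision) uses one huge value and many copies of 0.9; since 10^16 + 0.9 rounds back to 10^16, ordinary forward summation loses every small term, while pairwise summation recovers essentially the correct total:

```python
# Sketch of the pairwise (tree) summation scheme described above: adjacent
# values are summed in pairs, so nearly equal partial sums are combined and
# no single small term is ever added directly to a huge running total.

def pairwise_sum(values):
    v = list(values)
    while len(v) > 1:
        pairs = [v[i] + v[i + 1] for i in range(0, len(v) - 1, 2)]
        if len(v) % 2:            # odd element carries over to the next round
            pairs.append(v[-1])
        v = pairs
    return v[0]

# One value of 10^16 followed by 1000 copies of 0.9 (exact total 1e16 + 900).
data = [1e16] + [0.9] * 1000

forward = 0.0
for t in data:
    forward += t                  # each 0.9 is rounded away against 1e16

print(forward)                    # 1e16 exactly: the small terms vanished
print(pairwise_sum(data))         # ~1e16 + 900: the small terms survive
```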


Example 1.16

Let x_1 = 15.314768 and x_2 = 15.314899, with β = 10 and t = 6 (6-digit decimal accuracy). Then x_2 - x_1 ≈ fl(x_2) - fl(x_1) = 15.3149 - 15.3148 = 0.0001. Thus,

    |(x_2 - x_1) - (fl(x_2) - fl(x_1))| / |x_2 - x_1| = (0.000131 - 0.0001) / 0.000131 ≈ 0.237,

that is, 23.7% relative accuracy.

This example illustrates that large relative errors can occur when two nearly equal numbers are subtracted on a computer. Sometimes, an algorithm can be modified to reduce rounding error occurring from this source, as the following example illustrates.

Example 1.17

Consider finding the roots of ax^2 + bx + c = 0, where b^2 is large compared with |4ac|. The most common formula for the roots is

    x = (-b ± √(b^2 - 4ac)) / (2a).

Consider x^2 + 100x + 1 = 0, with β = 10, t = 4, p = 2, and 4-digit chopped arithmetic. Then

    x_1 = (-100 + √(100^2 - 4)) / 2,  x_2 = (-100 - √(100^2 - 4)) / 2,

and √9996 ≈ 99.97 (4-digit arithmetic, chopped). Thus,

    x_1 ≈ (-100 + 99.97) / 2,  x_2 ≈ (-100 - 99.97) / 2.

Hence, x_1 ≈ -0.015 and x_2 ≈ -99.98, but x_1 = -0.010001··· and x_2 = -99.989999···, so the relative errors in x_1 and x_2 are 50% and 0.01%, respectively.

Let's change the algorithm. Assume b ≥ 0 (we can always arrange b ≥ 0). Then

    x_1 = (-b + √(b^2 - 4ac)) / (2a)
        = ((-b + √(b^2 - 4ac)) / (2a)) × ((-b - √(b^2 - 4ac)) / (-b - √(b^2 - 4ac)))
        = -2c / (b + √(b^2 - 4ac)),

and

    x_2 = (-b - √(b^2 - 4ac)) / (2a)  (the same as before).


Then, for the above values,

    x_1 = -2(1) / (100 + √9996) ≈ -2 / (100 + 99.97) = -0.01000.

Now, the relative error in x_1 is also 0.01%.
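The same rewriting pays off in double precision when b^2 is much larger than |4ac|. Below is a sketch (in Python; the function names are ours) contrasting the naive formula with the conjugate form derived above, for x^2 + 10^8 x + 1 = 0, whose small root is approximately -10^-8:

```python
import math

# Naive quadratic formula vs. the conjugate (rewritten) form, in double
# precision, for x^2 + 1e8*x + 1 = 0 (small root approximately -1e-8).

def roots_naive(a, b, c):
    d = math.sqrt(b * b - 4 * a * c)
    return (-b + d) / (2 * a), (-b - d) / (2 * a)

def roots_stable(a, b, c):          # assumes b >= 0, as arranged in the text
    d = math.sqrt(b * b - 4 * a * c)
    return -2 * c / (b + d), (-b - d) / (2 * a)

x1n, _ = roots_naive(1.0, 1e8, 1.0)
x1s, _ = roots_stable(1.0, 1e8, 1.0)

print(x1n)   # cancellation in -b + d destroys most digits
print(x1s)   # close to -1.0e-08, nearly full accuracy
```

As in the 4-digit example, the naive small root is off by tens of percent, while the conjugate form is accurate to nearly machine precision.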

Let us now consider error in function evaluation. Consider a single-valued function f(x), and let x* = fl(x) be the floating point approximation of x. Therefore, the machine evaluates f(x*) = f(fl(x)), which is an approximate value of f(x) at x = x*. The perturbation in f(x) for small perturbations in x can then be computed via Taylor's formula. This is illustrated in the next theorem.

THEOREM 1.7  The relative error in functional evaluation satisfies

    |(f(x) - f(x*)) / f(x)| ≈ |x f'(x) / f(x)| · |(x - x*) / x|.

PROOF  The linear Taylor approximation of f(x*) about f(x), for small values of |x - x*|, is given by f(x*) ≈ f(x) + f'(x)(x* - x). Rearranging the terms immediately yields the result.

This leads us to the following definition.

DEFINITION 1.5  The condition number of a function f(x) is

    κ_f(x) = |x f'(x) / f(x)|.

The condition number describes how large the relative error in function evaluation is, with respect to the relative error in the machine representation of x. In other words, κ_f(x) is a measure of the degree of sensitivity of the function at x.

Example 1.18

Let f(x) = √x. The condition number of f(x) about x is

    κ_f(x) = |x (1/(2√x)) / √x| = 1/2.

This suggests that f(x) is well-conditioned.


Example 1.19

Let f(x) = √(x - 2). The condition number of f(x) about x is

    κ_f(x) = |x / (2(x - 2))|.

This is not defined at x* = 2. Hence, the function f(x) is numerically unstable and ill-conditioned for values of x close to 2.
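Condition numbers can also be checked numerically: perturb x by a small relative amount ε and compare the resulting relative change in f(x) against κ_f(x)·ε. A sketch (Python; `rel_change` is our own illustrative helper):

```python
import math

# Numerically estimate kappa_f(x): perturb x by a relative eps and divide
# the relative change in f by eps.

def rel_change(f, x, eps=1e-8):
    return abs((f(x * (1 + eps)) - f(x)) / f(x)) / eps

k1 = rel_change(math.sqrt, 9.0)                    # f(x) = sqrt(x): kappa = 1/2
k2 = rel_change(lambda x: math.sqrt(x - 2), 2.001)
# f(x) = sqrt(x - 2): kappa = |x/(2(x - 2))| = 2.001/0.002, about 1000

print(k1)   # close to 0.5: well conditioned
print(k2)   # close to 1000: ill conditioned near x = 2
```

The estimates match the formulas: roughly 0.5 for √x anywhere, and roughly 1000 for √(x - 2) at x = 2.001, confirming the blow-up near x = 2.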

REMARK 1.2  If x = f(x) = 0, then the condition number is simply |f'(x)|. If x = 0 and f(x) ≠ 0 (or f(x) = 0 and x ≠ 0), then it is more useful to consider the relation between absolute errors than relative errors; the condition number then becomes |f'(x)/f(x)|.

REMARK 1.3  Generally, if a numerical approximation z̃ to a quantity z is computed, the relative error is related to the number of significant digits that are correct. For example, if z = 0.0000123453 and z̃ = 0.00001234543, we say that z̃ is correct to 5 significant digits. Expressing z as 0.123453 × 10^-4 and z̃ as 0.1234543 × 10^-4, we see that if we round z̃ to the nearest number with five digits in its mantissa, all of those digits are correct, whereas, if we do the same with six digits, the sixth digit is not correct. Significant digits is the more logical way to talk about accuracy in a floating point computation, where we are interested in relative error, rather than "number of digits after the decimal point," which can have a different meaning. (Here, one might say that z̃ is correct to 9 digits after the decimal point.)

1.2.2 Practicalities and the IEEE Floating Point Standard

Prior to 1985, different machines used different word lengths and different bases, and different machines rounded, chopped, or did something else to form the internal representation fl(x) for real numbers x. For example, IBM mainframes generally used hexadecimal arithmetic (β = 16), with 8 hexadecimal digits total (for the sign, exponent, and mantissa) in "single precision" numbers, and 16 hexadecimal digits total in "double precision" numbers. Machines such as the Univac 1108 and Honeywell Multics systems used base β = 2, with 36 binary digits (or "bits") total in single precision numbers, and 72 binary digits total in double precision numbers. An unusual machine designed at Moscow State University from 1955 to 1965, the "Setun," even used base-3 (β = 3, or "ternary") numbers. Some computers had 32 bits total in single precision numbers and 64 bits total in double precision numbers, while some "supercomputers" (such as the Cray-1) had 64 bits total in single precision numbers and 128 bits total in double precision numbers.

Mathematical Review and Computer Arithmetic 19

Some hand-held calculators in existence today (such as some Texas Instruments calculators) can be viewed as implementing decimal (base 10, β = 10) arithmetic, say, with L = -999 and U = 999 and t = 14 digits in the mantissa.

Except for the Setun (the value of whose ternary digits corresponded to "positive," "negative," and "neutral" in circuit elements or switches), digital computers are mostly based on binary switches or circuit elements (that is, "on" or "off"), so the base β is usually 2 or a power of 2. For example, the IBM hexadecimal digit could be viewed as a group of 4 binary digits.²

Older floating point implementations did not even always fit exactly into the model we have previously described. For example, if x was a number in the system, then -x may not have been a number in the system, or, if x were a number in the system, then 1/x may have been too large to be representable in the system.

To promote predictability, portability, reliability, and rigorous error bounding in floating point computations, the Institute of Electrical and Electronics Engineers (IEEE) and American National Standards Institute (ANSI) published a standard for binary floating point arithmetic in 1985: IEEE/ANSI 754-1985, Standard for Binary Floating Point Arithmetic, often referenced as "IEEE-754," or simply "the IEEE standard."³ Almost all computers in existence today, including personal computers and workstations based on Intel, AMD, Motorola, etc. chips, implement most of the IEEE standard.

In this standard, β = 2; 32 bits total are used in a single precision number (an "IEEE single"), and 64 bits total are used for a double precision number ("IEEE double"). In a single precision number, 1 bit is used for the sign, 8 bits are used for the exponent, and t = 23 bits are used for the mantissa. In double precision numbers, 1 bit is used for the sign, 11 bits are used for the exponent, and 52 bits are used for the mantissa. Thus, for single precision numbers, the stored exponent is between 0 and (11111111)_2 = 255, and 127 is subtracted from this to get an exponent between -127 and 128. In IEEE numbers, the minimum and maximum exponents are used to denote special symbols (such as infinity and "unnormalized" numbers), so the exponent in single precision represents magnitudes between 2^-126 ≈ 10^-38 and 2^127 ≈ 10^38. The mantissa for single precision numbers represents numbers between 2^0 = 1 and Σ_{i=0}^{23} 2^-i = 2(1 - 2^-24) ≈ 2. Similarly, the exponent for double precision numbers is effectively between 2^-1022 ≈ 10^-308 and 2^1023 ≈ 10^308, while the mantissa for double precision numbers represents numbers between 2^0 = 1 and Σ_{i=0}^{52} 2^-i ≈ 2.

Summarizing, the parameters for IEEE arithmetic appear in Table 1.1.

In many numerical computations, such as solving the large linear systems arising from partial differential equation models, more digits or a larger exponent range is required than is available with IEEE single precision.

²An exception is in some systems for business calculations, where base 10 is implemented.
³An update to the 1985 standard was made in 2008. This update gives clarifications of certain ambiguous points, provides certain extensions, and specifies a standard for decimal arithmetic.


20 Applied Numerical Methods

TABLE 1.1: Parameters for IEEE arithmetic

    precision   β     L       U      t
    single      2    -126     127    24
    double      2    -1022    1023   53

For this reason, many numerical analysts at present have adopted IEEE double precision as the default precision. For example, underlying computations in the popular computational environment matlab are done in IEEE double precision.

IEEE arithmetic provides four ways of defining fl(x), that is, four "rounding modes," namely, "round down," "round up," "round to nearest," and "round to zero." The four elementary operations +, -, ×, and ÷ must be such that fl(x ⊙ y) is implemented for all four rounding modes, for ⊙ ∈ {+, -, ×, ÷}; the same holds for √·.

The default mode (if the rounding mode is not explicitly set) is normally "round to nearest," to give an approximation after a long string of computations that is hopefully near the exact value. If the mode is set to "round down" and a string of computations is done, then the result is less than or equal to the exact result. Similarly, if the mode is set to "round up," then the result of a string of computations is greater than or equal to the exact result. In this way, mathematically rigorous bounds on an exact result can be obtained. (This technique must be used astutely, since naive use could result in bounds that are too large to be meaningful.)

Several parameters more directly related to numerical computations than L, U, and t are associated with any floating point number system. These are:

HUGE: the largest representable number in the floating point system;

TINY: the smallest positive representable number in the floating point system;

ε_m: the machine epsilon, the smallest positive number which, when added to 1, gives something other than 1 when using the rounding mode "round to nearest."

These so-called "machine constants" appear in Table 1.2 for the IEEE single and IEEE double precision number systems.

For IEEE arithmetic, 1/TINY < HUGE, but 1/HUGE < TINY. This brings up the question of what happens when the result of a computation has absolute value less than the smallest number representable in the system, or has absolute value greater than the largest number representable in the system. In the first case, an underflow occurs, while in the second case, an overflow occurs. In floating point computations, it is usually (but not always) reasonable to replace the result of an underflow by 0, but it is usually more problematical


TABLE 1.2: Machine constants for IEEE arithmetic

    Precision   HUGE                       TINY                        ε_m
    single      2^127 ≈ 3.40 · 10^38       2^-126 ≈ 1.18 · 10^-38      2^-24 + 2^-45 ≈ 5.96 · 10^-8
    double      2^1023 ≈ 1.79 · 10^308     2^-1022 ≈ 2.23 · 10^-308    2^-53 + 2^-105 ≈ 1.11 · 10^-16

when an overflow occurs. Many systems prior to the IEEE standard replaced an underflow by 0, but stopped when an overflow occurred.

The IEEE standard specifies representations for special numbers ∞, -∞, +0, -0, and NaN, where the latter represents "not a number." The standard specifies that computations do not stop when an overflow or underflow occurs, or when quantities such as 1/0, -1/0, etc. are encountered (although many programming languages, by default or optionally, do stop). For example, the result of an overflow is set to ∞, whereas the result of √-1 is set to NaN, and computation continues. The standard also specifies "gradual underflow," that is, setting the result to a "denormalized" number, or a number in the floating point format whose first digit in the mantissa is equal to 0. Computation rules for these special numbers, such as NaN × (any number) = NaN and ∞ × (any positive normalized number) = ∞, allow such "nonstop" arithmetic.

Although the IEEE nonstop arithmetic is useful in many contexts, the numerical analyst should be aware of it and be cautious in interpreting results. In particular, algorithms may not behave as expected if many intermediate results contain ∞ or NaN, and the accuracy is less than expected when denormalized numbers are used. In fact, many programming languages, by default or with a controllable option, stop if ∞ or NaN occurs, but implement IEEE nonstop arithmetic with an option.

Example 1.20

IEEE double precision floating point arithmetic underlies most computations in matlab. (This is true even if only four decimal digits are displayed.) One obtains the machine epsilon with the function eps, one obtains TINY with the function realmin, and one obtains HUGE with the function realmax. Observe the following matlab dialog:

>> epsm = eps(1d0)
epsm = 2.2204e-016
>> TINY = realmin
TINY = 2.2251e-308
>> HUGE = realmax
HUGE = 1.7977e+308
>> 1/TINY
ans = 4.4942e+307
>> 1/HUGE
ans = 5.5627e-309
>> HUGE^2
ans = Inf
>> TINY^2
ans = 0
>> new_val = 1+epsm
new_val = 1.0000
>> new_val - 1
ans = 2.2204e-016
>> too_small = epsm/2
too_small = 1.1102e-016
>> not_new = 1+too_small
not_new = 1
>> not_new - 1
ans = 0
>>

Example 1.21

(Illustration of underflow and overflow) Suppose, for the purposes of illustration, we have a system with β = 10, t = 2, and one digit in the exponent, so that the positive numbers in the system range from 0.10 × 10^-9 to 0.99 × 10^9, and suppose we wish to compute N = √(x1^2 + x2^2), where x1 = x2 = 10^6. Then both x1 and x2 are exactly represented in the system, and the nearest floating point number in the system to N is 0.14 × 10^7, well within range. However, x1^2 = 10^12, larger than the maximum floating point number in the system.

In older systems, an overflow usually would result in stopping the computation, while in IEEE arithmetic, the result would be assigned the symbol "Infinity." The result of adding "Infinity" to "Infinity," then taking the square root, would be "Infinity," so that N would be assigned "Infinity." Similarly, if x1 = x2 = 10^-6, then x1^2 = 10^-12, smaller than the smallest representable machine number, causing an "underflow." On older systems, the result was usually set to 0. On IEEE systems, if "gradual underflow" is switched on, the result either becomes a denormalized number, with less than full accuracy, or is set to 0; without gradual underflow on IEEE systems, the result is set to 0. When the result is set to 0, a value of 0 is stored in N, whereas the closest floating point number in the system is 0.14 × 10^-5, well within range. To avoid this type of catastrophic underflow and overflow in the computation of N, we may use the following scheme:

1. s ← max{|x1|, |x2|};

2. η1 ← x1/s; η2 ← x2/s;

3. N ← s√(η1^2 + η2^2).
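The three-step scheme above can be sketched in Python (the book's examples are in MATLAB; this translation is for illustration only — in practice one would simply call a library routine such as math.hypot, which performs similar scaling internally):

```python
import math

def scaled_norm(x1, x2):
    """Compute sqrt(x1**2 + x2**2) without intermediate overflow or
    underflow, by first scaling by the component of largest magnitude."""
    s = max(abs(x1), abs(x2))
    if s == 0.0:
        return 0.0                     # both components are zero
    eta1, eta2 = x1 / s, x2 / s        # now |eta_i| <= 1, so squaring is safe
    return s * math.sqrt(eta1**2 + eta2**2)

# Naive evaluation of x1**2 would overflow here; the scaled version does not:
x1 = x2 = 1.0e200
print(scaled_norm(x1, x2))             # about 1.4142e200
```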


1.2.2.1 Input and Output

For examining the output of large numerical computations arising from mathematical models, plots, graphs, and movies comprised of such plots and graphs are often preferred over tables of values. However, to develop such models and study numerical algorithms, it is necessary to examine individual numbers. Because humans are trained to comprehend decimal numbers more easily than binary numbers, the binary format used in the machine is usually converted to a decimal format for display or printing. In many programming languages and environments (such as all versions of Fortran, C, C++, and in matlab), the format is of a form similar to ±d1.d2d3...dm e±δ1δ2δ3 or ±d1.d2d3...dm E±δ1δ2δ3, where the "e" or "E" denotes the "exponent" of 10. For example, -1.00e+003 denotes -1 × 10^3 = -1000. Numbers are usually also input either in a standard decimal form (such as 0.001) or in this exponential format (such as 1.0e-3). (This notation originates from the earliest computers, where the only output was a printer, and the printer could only print numerical digits and the 26 upper case letters in the Roman alphabet.)

Thus, for input, a decimal fraction needs to be converted to a binary floating point number, while, for output, a binary floating point number needs to be converted to a decimal fraction. This conversion necessarily is inexact. For example, the exact decimal fraction 0.1 converts to the infinitely repeating binary expansion (0.0001100110011...)_2, which needs to be rounded into the binary floating point system. The IEEE 754 standard specifies that the result of a decimal to binary conversion, within a specified range of input formats, be the nearest floating point number to the exact result, over a specified range, and that, within a specified range of formats, a binary to decimal conversion be the nearest number in the specified format (which depends on the number m of decimal digits requested to be printed).

Thus, the number that one sees as output is usually not exactly the number that is represented in the computer. Furthermore, while the floating point operations on binary numbers are usually implemented in hardware or "firmware," independently of the software system, the decimal to binary and binary to decimal conversions are usually implemented separately as part of the programming language (such as Fortran, C, C++, Java, etc.) or software system (such as matlab). The individual standards for these languages, if there are any, may not specify accuracy for such conversions, and the languages sometimes do not conform to the IEEE standard. That is, the number that one sees printed may not even be the closest number in that format to the actual number.

This inexactness in conversion usually does not cause a problem, but may cause much confusion in certain instances. In those instances (such as in "debugging," or finding programming blunders), one may need to examine the binary numbers directly. One way of doing this is in an "octal," or base-8, format, in which each digit (between 0 and 7) is interpreted as a group of three binary digits, or in hexadecimal format (where the digits are 0-9, A, B, C, D, E, F), in which each digit corresponds to a group of four binary digits.
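Python (used here for illustration; the book's examples are in MATLAB) makes both the conversion error and the underlying binary representation easy to inspect: Decimal(x) displays the exact binary value actually stored for x, written as a decimal, and float.hex displays the same value in the hexadecimal style just described.

```python
from decimal import Decimal

x = 0.1
# The decimal fraction 0.1 is rounded when converted to binary;
# Decimal(x) shows the exact value stored, written in decimal:
print(Decimal(x))
# The same value in hexadecimal (each hex digit is a group of 4 bits):
print(x.hex())
# The shortest decimal string that converts back to the same double:
print(repr(x))
# A visible consequence of inexact decimal-to-binary conversion:
print(0.1 + 0.2 == 0.3)
```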

1.2.2.2 Standard Functions

To enable accurate computation of elementary functions such as sin, cos, and exp, IEEE 754 specifies that a "long" 80-bit register (with "guard digits") be available for intermediate computations. Furthermore, IEEE 754-2008, an official update to IEEE 754-1985, provides a list of functions it recommends be implemented, and specifies accuracy requirements (in terms of correct rounding) for those functions a programming language elects to implement.

REMARK 1.4 Alternative number systems, such as variable precision arithmetic, multiple precision arithmetic, rational arithmetic, and combinations of approximate and symbolic arithmetic, have been investigated and implemented. These have various advantages over the traditional floating point arithmetic we have been discussing, but also have disadvantages, and usually require more time, more circuitry, or both. Eventually, with the advance of computer hardware and better understanding of these alternative systems, their use may become more ubiquitous. However, for the foreseeable future, traditional floating point number systems will be the primary tool in numerical computations.

1.3 Interval Computations

Interval computations are useful for two main purposes:

• to use floating point computations to compute mathematically rigorous bounds on an exact result (and hence to rigorously bound roundoff error);

• to use floating point computations to compute mathematically rigorous bounds on the ranges of functions over boxes.

In complicated traditional floating point algorithms, naive arrangement of interval computations usually gives bounds that are too wide to be of practical use. For this reason, interval computations have been ignored by many. However, used cleverly and where appropriate, interval computations are powerful, and provide rigor and validation when other techniques cannot.

Interval computations are based on interval arithmetic.


1.3.1 Interval Arithmetic

In interval arithmetic, we define operations on intervals, which can be considered as ordered pairs of real numbers. We can think of each interval as representing the range of possible values of a quantity. The result of an operation is then an interval that represents the range of possible results of the operation, as the first argument ranges over all points in the first interval and the second argument ranges over all values in the second interval. To state this symbolically, let x = [x̲, x̄] and y = [y̲, ȳ], and define the four elementary operations by

    x ⊙ y = {x ⊙ y | x ∈ x and y ∈ y} for ⊙ ∈ {+, -, ×, ÷}.   (1.1)

Interval arithmetic's usefulness derives from the fact that the mathematical characterization in Equation (1.1) is equivalent to the following operational definitions:

    x + y = [x̲ + y̲, x̄ + ȳ],
    x - y = [x̲ - ȳ, x̄ - y̲],
    x × y = [min{x̲y̲, x̲ȳ, x̄y̲, x̄ȳ}, max{x̲y̲, x̲ȳ, x̄y̲, x̄ȳ}],
    1/x = [1/x̄, 1/x̲] if x̲ > 0 or x̄ < 0,
    x ÷ y = x × (1/y).   (1.2)

The ranges of the four elementary interval arithmetic operations are exactly the ranges of the corresponding real operations, but, if such operations are composed, bounds on the ranges of real functions can be obtained. For example, if

    f(x) = (x + 1)(x - 1),   (1.3)

then

    f([-2, 2]) = ([-2, 2] + 1)([-2, 2] - 1) = [-1, 3][-3, 1] = [-9, 3],

which contains the exact range [-1, 3].
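The operations in (1.2) and the example above can be reproduced with a small Python class (illustrative only: it uses ordinary floating point arithmetic with no outward rounding, so, unlike a real interval package, it is not rigorous):

```python
class Interval:
    """A toy interval type implementing +, -, and × from (1.2),
    plus an exact-range square, without any outward rounding."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
    def __add__(self, other):
        other = _as_interval(other)
        return Interval(self.lo + other.lo, self.hi + other.hi)
    def __sub__(self, other):
        other = _as_interval(other)
        return Interval(self.lo - other.hi, self.hi - other.lo)
    def __mul__(self, other):
        other = _as_interval(other)
        p = [self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi]
        return Interval(min(p), max(p))
    def sqr(self):
        # exact range of x**2 over the interval
        if self.lo >= 0:
            return Interval(self.lo**2, self.hi**2)
        if self.hi <= 0:
            return Interval(self.hi**2, self.lo**2)
        return Interval(0, max(self.lo**2, self.hi**2))
    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"

def _as_interval(v):
    return v if isinstance(v, Interval) else Interval(v, v)

x = Interval(-2, 2)
print((x + 1) * (x - 1))          # [-9, 3]: overestimate, x occurs twice
print(x.sqr() - Interval(1, 1))   # [-1, 3]: exact, x occurs once
```

The two print statements preview the dependency phenomenon discussed below: the factored form overestimates the range, while the single-occurrence form x^2 - 1 does not.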

REMARK 1.5 In some definitions of interval arithmetic, division by intervals containing 0 is defined, consistent with (1.1). For example,

    [1, 2] / [-3, 4] = (-∞, -1/3] ∪ [1/4, ∞) = R* \ (-1/3, 1/4),

where R* is the extended real number system,⁴ consisting of the real numbers with the two additional numbers -∞ and ∞. This extended interval arithmetic⁵ was originally invented by William Kahan⁶ for computations with continued fractions, but has wider use than that. Although a closed system can be defined for the sets arising from this extended arithmetic, typically the complements of intervals (i.e., the unions of two semi-infinite intervals) are immediately intersected with intervals, to obtain zero, one, or two intervals. Interval arithmetic can then proceed using (1.2).

⁴also known as the two-point compactification of the real numbers

The power of interval arithmetic lies in its implementation on computers. In particular, outwardly rounded interval arithmetic allows rigorous enclosures for the ranges of operations and functions. This makes a qualitative difference in scientific computations, since the results are now intervals in which the exact result must lie. It also enables the use of floating point computations for automated theorem proving.

Outward rounding can be implemented on any machine that has downward rounding and upward rounding, such as any machine that complies with the IEEE 754 standard. For example, take x + y = [x̲ + y̲, x̄ + ȳ]. If x̲ + y̲ is computed with downward rounding, and x̄ + ȳ is computed with upward rounding, then the resulting interval z = [z̲, z̄] that is represented in the machine must contain the exact range of x + y for x ∈ x and y ∈ y. We call the expansion of the interval from rounding the lower end point down and the upper end point up roundout error.
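In a low-level implementation, the lower endpoint is computed with the hardware rounding mode set to "round down" and the upper endpoint with it set to "round up." Pure Python cannot change the hardware rounding mode, but a slightly wider (and still rigorous) enclosure can be sketched by widening each computed endpoint one unit in the last place with math.nextafter (Python 3.9+):

```python
import math

def outward_add(x, y):
    """Enclosure of {a + b : a in x, b in y} for intervals given as
    (lo, hi) pairs.  Each endpoint is widened by one ulp as a portable
    stand-in for directed rounding; the result is slightly wider than
    necessary, but still contains the exact range."""
    xlo, xhi = x
    ylo, yhi = y
    lo = math.nextafter(xlo + ylo, -math.inf)   # push the lower endpoint down
    hi = math.nextafter(xhi + yhi,  math.inf)   # push the upper endpoint up
    return (lo, hi)

interval = outward_add((0.1, 0.1), (0.2, 0.2))
print(interval)   # a tiny interval guaranteed to contain the exact sum
```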

Interval arithmetic is only subdistributive. That is, if x, y, and z are intervals, then

    x(y + z) ⊆ xy + xz, but x(y + z) ≠ xy + xz in general.   (1.4)

As a result, algebraic expressions that would be equivalent if real values were substituted for the variables are not equivalent if interval values are used. For example, instead of writing (x - 1)(x + 1) for f(x) in (1.3), suppose we write

    f(x) = x^2 - 1,   (1.5)

and suppose we provide a routine that computes an enclosure for the range of x^2 that is the exact range to within roundoff error. Such a routine could be as follows.

ALGORITHM 1.1

(Computing an interval whose end points are machine numbers and which encloses the range of x^2)

INPUT: x = [x̲, x̄].

OUTPUT: a machine-representable interval that contains the range of x^2 over x.

IF x̲ ≥ 0 THEN
    RETURN [x̲^2, x̄^2], where x̲^2 is computed with downward rounding and x̄^2 is computed with upward rounding.
ELSE IF x̄ ≤ 0 THEN
    RETURN [x̄^2, x̲^2], where x̄^2 is computed with downward rounding and x̲^2 is computed with upward rounding.
ELSE
    1. Compute x̲^2 and x̄^2 with both downward and upward rounding; that is, compute (x̲^2)_l and (x̲^2)_u such that (x̲^2)_l and (x̲^2)_u are machine representable numbers and x̲^2 ∈ [(x̲^2)_l, (x̲^2)_u], and compute (x̄^2)_l and (x̄^2)_u such that (x̄^2)_l and (x̄^2)_u are machine representable numbers and x̄^2 ∈ [(x̄^2)_l, (x̄^2)_u].
    2. RETURN [0, max{(x̲^2)_u, (x̄^2)_u}].
END IF

END ALGORITHM 1.1

⁵There are small differences in current definitions of extended interval arithmetic. For example, in some systems, -∞ and ∞ are not considered numbers, but just descriptive symbols. In those systems, [1, 2]/[-3, 4] = (-∞, -1/3] ∪ [1/4, ∞) = R \ (-1/3, 1/4). See [31] for a theoretical analysis of extended arithmetic.
⁶who also was a major contributor to the IEEE 754 standard
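Algorithm 1.1 can be sketched in Python; as before, one-ulp widening with math.nextafter (Python 3.9+) stands in for the directed rounding that an IEEE-conforming implementation would use:

```python
import math

def enclose_sqr(xlo, xhi):
    """Machine-interval enclosure of the range of x**2 over [xlo, xhi],
    following Algorithm 1.1.  One-ulp widening replaces true directed
    rounding, which pure Python does not expose."""
    down = lambda v: math.nextafter(v, -math.inf)
    up   = lambda v: math.nextafter(v,  math.inf)
    if xlo >= 0:
        return (down(xlo * xlo), up(xhi * xhi))
    if xhi <= 0:
        return (down(xhi * xhi), up(xlo * xlo))
    # 0 is interior to the interval: the minimum of x**2 is exactly 0
    return (0.0, up(max(xlo * xlo, xhi * xhi)))

print(enclose_sqr(-2.0, 2.0))   # lower end 0.0, upper end just above 4.0
```

Evaluating x^2 - 1 with such a routine over [-2, 2] then gives an interval only marginally wider than the exact range [-1, 3], as the text computes next.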

With Algorithm 1.1, and rewriting f(x) from (1.3) as in (1.5), we obtain

    f([-2, 2]) = [-2, 2]^2 - 1 = [0, 4] - 1 = [-1, 3],

which in this case is equal to the exact range of f over [-2, 2].

In fact, this illustrates a general principle: if each variable in the expression occurs only once, then interval arithmetic gives the exact range, to within roundout error. We state this formally as follows.

THEOREM 1.8

(Fundamental theorem of interval arithmetic) Suppose f(x1, x2, ..., xn) is an algebraic expression in the variables x1 through xn (or a computer program with inputs x1 through xn), and suppose that this expression is evaluated with interval arithmetic. The algebraic expression or computer program can contain the four elementary operations and operations such as x^n, sin(x), exp(x), and log(x), etc., as long as the interval values of these functions contain their ranges over the input intervals. Then:

1. The interval value f(x1, ..., xn) contains the range of f over the interval vector (or box) (x1, ..., xn).


2. If the single functions (the elementary operations and functions x^n, etc.) have interval values that represent their exact ranges, and if each variable xi, 1 ≤ i ≤ n, occurs only once in the expression for f, then the values of f obtained by interval arithmetic represent the exact ranges of f over the input intervals.

If the expression for f contains one or more variables more than once, then overestimation of the range can occur, due to interval dependency. For example, when we evaluate our example function f([-2, 2]) according to (1.3), the first factor, [-1, 3], is the exact range of x + 1 for x ∈ [-2, 2], while the second factor, [-3, 1], is the exact range of x - 1 for x ∈ [-2, 2]. Thus, [-9, 3] is the exact range of f(x1, x2) = (x1 + 1)(x2 - 1) for x1 and x2 independent, x1 ∈ [-2, 2], x2 ∈ [-2, 2].

We now present some definitions and theorems to clarify the practical consequences of interval dependency.

DEFINITION 1.6 An expression for f(x1, ..., xn) which is written so that each variable occurs only once is called a single use expression, or SUE.

Fortunately, we do not need to transform every expression into a single use expression for interval computations to be of value. In particular, the interval dependency becomes less as the widths of the input intervals become smaller. The following formal definition will help us to describe this precisely.

DEFINITION 1.7 Suppose an interval evaluation f(x1, ..., xn) gives [a, b] as a result interval, but the exact range {f(x1, ..., xn) : xi ∈ xi, 1 ≤ i ≤ n} is [c, d] ⊆ [a, b]. We define the excess width E(f; x1, ..., xn) in the interval evaluation f(x1, ..., xn) by E(f; x1, ..., xn) = (c - a) + (b - d).

For example, the excess width in evaluating f(x), represented as (x + 1)(x - 1), over x = [-2, 2] is (-1 - (-9)) + (3 - 3) = 8. In general, we have the following theorem.

THEOREM 1.9

Suppose f(x1, x2, ..., xn) is an algebraic expression in the variables x1 through xn (or a computer program with inputs x1 through xn), and suppose that this expression is evaluated with interval arithmetic, as in Theorem 1.8, to obtain an interval enclosure f(x1, ..., xn) of the range of f for xi ∈ xi, 1 ≤ i ≤ n. Then, if E(f; x1, ..., xn) is as in Definition 1.7, we have

    E(f; x1, ..., xn) = O( max_{1≤i≤n} w(xi) ),

where w(x) denotes the width of the interval x.


That is, the overestimation becomes less as the uncertainty in the arguments to the function becomes smaller.

Interval evaluations as in Theorem 1.9 are termed first-order interval extensions. It is not difficult to obtain second-order extensions, where required. (See the exercises below.)
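The behavior described in Theorem 1.9 is easy to observe numerically. In the Python sketch below (illustrative only; ordinary floating point, no outward rounding), f(x) = (x + 1)(x - 1) is evaluated over intervals of shrinking width w centered at 1 and compared with the exact range of x^2 - 1; the excess width shrinks at least linearly in w:

```python
def interval_mul(a, b):
    """Product of intervals a = (lo, hi) and b = (lo, hi), as in (1.2)."""
    p = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(p), max(p))

def excess_width(w, c=1.0):
    """Excess width of the interval evaluation of (x + 1)(x - 1) over
    [c - w/2, c + w/2], versus the exact range of x**2 - 1.
    Assumes c - w/2 >= 0, so the exact range is easy to write down."""
    h = w / 2.0
    xlo, xhi = c - h, c + h
    a, b = interval_mul((xlo + 1, xhi + 1), (xlo - 1, xhi - 1))
    lo, hi = xlo**2 - 1, xhi**2 - 1      # exact range for 0 <= xlo <= xhi
    return (lo - a) + (b - hi)

for w in (1.0, 0.1, 0.01):
    print(w, excess_width(w))            # the excess width shrinks with w
```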

1.3.2 Application of Interval Arithmetic: Examples

We give one such example here.

Example 1.22

Using 4-digit decimal floating point arithmetic, compute an interval enclosure for the first two digits of e, and prove that these two digits are correct.

Solution: The fifth degree Taylor polynomial representation for e is

    e = 1 + 1 + 1/2! + 1/3! + 1/4! + 1/5! + (1/6!) e^ξ

for some ξ ∈ [0, 1]. If we assume we know e < 3, and we assume we know e^x is an increasing function of x, then the error term is bounded by

    |(1/6!) e^ξ| ≤ 3/6! < 0.005,

so this fifth-degree polynomial representation should be adequate. We will evaluate each term with interval arithmetic, and we will replace e^ξ with [1, 3]. We obtain the following computation:

    [1.000, 1.000] + [1.000, 1.000] → [2.000, 2.000]
    [1.000, 1.000]/[2.000, 2.000] → [0.5000, 0.5000]
    [2.000, 2.000] + [0.5000, 0.5000] → [2.500, 2.500]
    [1.000, 1.000]/[6.000, 6.000] → [0.1666, 0.1667]
    [2.500, 2.500] + [0.1666, 0.1667] → [2.666, 2.667]
    [1.000, 1.000]/[24.00, 24.00] → [0.04166, 0.04167]
    [2.666, 2.667] + [0.04166, 0.04167] → [2.707, 2.709]
    [1.000, 1.000]/[120.0, 120.0] → [0.008333, 0.008334]
    [2.707, 2.709] + [0.008333, 0.008334] → [2.715, 2.718]
    [1.000, 1.000]/[720.0, 720.0] → [0.001388, 0.001389]
    [0.001388, 0.001389] × [1, 3] → [0.001388, 0.004167]
    [2.715, 2.718] + [0.001388, 0.004167] → [2.716, 2.723]

Since we used outward rounding in these computations, this constitutes a mathematical proof that e ∈ [2.716, 2.723].

Note:


1. These computations can be done automatically on a computer, as simply as evaluating the function in floating point arithmetic. We will explain some programming techniques for this in Chapter 6, Section 6.2.

2. The solution is illustrative. More sophisticated methods, such as argument reduction, would be used in practice to bound values of e^x more accurately and with fewer operations.
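The computation of Example 1.22 can be mimicked with Python's decimal module, which supports a chosen working precision and directed rounding. This is a sketch: it divides each term by the next integer rather than forming 1/k! directly, so individual intermediate intervals differ slightly from those in the example, but it arrives at the same final enclosure:

```python
from decimal import Decimal, Context, ROUND_FLOOR, ROUND_CEILING

# Two 4-digit decimal contexts, one rounding down and one rounding up,
# mimic the outwardly rounded 4-digit arithmetic of the example.
down = Context(prec=4, rounding=ROUND_FLOOR)
up = Context(prec=4, rounding=ROUND_CEILING)

def iadd(x, y):
    """Outwardly rounded interval sum."""
    return (down.add(x[0], y[0]), up.add(x[1], y[1]))

def idiv(x, k):
    """Outwardly rounded division of interval x by a positive integer k."""
    k = Decimal(k)
    return (down.divide(x[0], k), up.divide(x[1], k))

one = (Decimal(1), Decimal(1))
total = iadd(one, one)               # 1 + 1
term = one
for k in range(2, 6):                # add enclosures of 1/2!, ..., 1/5!
    term = idiv(term, k)
    total = iadd(total, term)
term = idiv(term, 6)                 # an enclosure of 1/6!
# remainder term (1/6!) * e^xi, with e^xi replaced by [1, 3]:
remainder = (down.multiply(term[0], Decimal(1)),
             up.multiply(term[1], Decimal(3)))
total = iadd(total, remainder)
print(total[0], total[1])            # 2.716 2.723
```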

Proofs of the theorems, as well as greater detail, appear in various texts on interval arithmetic. A good book on interval arithmetic is R. E. Moore's classic text [27], although numerous more recent monographs and reviews are available. A World Wide Web search on the term "interval computations" will lead to some of these.

A general introduction to interval computations is [26]. That work gives not only a complete introduction, with numerous examples and explanation of pitfalls, but also provides examples with intlab, a free matlab toolbox for interval computations, and reference material for intlab. If you have matlab available, we recommend intlab for the exercises in this book involving interval computations.

1.4 Programming Environments

Modern scientific computing (with floating point numbers) is usually done with high-level "imperative" (as opposed to "functional") programming languages. Common programming environments in use for general scientific computing today are Fortran (for "FORmula TRANslation"), C/C++, and matlab. Fortran is the original such language, with its origins in the late 1950s. There is a large body of high-quality, publicly available software in Fortran for common computations in numerical analysis and scientific computing. Such software can be found, for example, on NETLIB, at

http://www.netlib.org

Fortran has evolved over the years, becoming a modern, multi-faceted language with the Fortran 2003 standard. Throughout, the emphasis by both the standardization committee and suppliers of compilers for Fortran has been on features that simplify programming of solutions to large problems in numerical analysis and scientific computing, and on features that enable high performance, especially on computers that can process vectors and matrices efficiently.

The "C" language, originally developed in conjunction with the Unix operating system, was meant to be a higher-level language for designing and accessing the operating system, but has become more ubiquitous since then. C++, appearing in the late 1980s, was the first widely available language⁷ to allow the object-oriented programming paradigm. In recent years, computer science departments have favored teaching C++ over teaching Fortran, and Fortran has fallen out of favor in relative terms. However, Fortran is still favored in certain large-scale applications, such as fluid dynamics (e.g., in weather prediction and similar simulations), and some courses are still offered in it in engineering schools. Some people, though, still think of Fortran as the now somewhat rudimentary language known as FORTRAN 77.

Reasonably high-quality compilers for both Fortran and C/C++ are available free of charge with Linux operating systems. Fortran 2003, largely implemented in these compilers, has a standardized interface to C, so functions written in C can be called from Fortran programs, and vice versa. These compilers include interactive, graphical-user-interface-oriented debuggers, such as "insight," available with the Linux operating system. Commercially available compilation and debugging systems are also available under Windows.

The matlab system has become increasingly popular over the last two decades or so. matlab (for "MATrix LABoratory") began in the early 1980s as a National Science Foundation project, written by Cleve Moler in FORTRAN 66, to provide an interactive environment for computing with matrices and vectors; it has since evolved to be both an interactive environment and a full-featured programming language. matlab is highly favored in courses such as this because of the ease of programming, debugging, and general use (such as graphing), and because of the numerous toolboxes, supplied by both Mathworks (Cleve Moler's company) and others, for many basic computing tasks and applications. The main drawback to use of matlab in all scientific computations is that the language is interpretive; that is, matlab translates each line of a program to machine language each time that it executes the line. This makes complicated programs that involve nested iterations much slower (by a factor of 60 or more) than comparable programs written in a compiled language such as C or Fortran. However, functions compiled from Fortran or C/C++ can be called from matlab. A common strategy has been to initially develop algorithms in matlab, then translate all or part of the program to a compilable language, as necessary for efficiency.

One perceived disadvantage of matlab is that it is proprietary. Undesirable possible consequences are that it is not free, and that there is no official guarantee that it will be available, forever unchanged. However, its use has become so widespread in recent years that these concerns do not seem to be major. Several projects, including "Octave" and "Scilab," have produced free products that partially support the matlab programming language. The most widely distributed of these, "Octave," is integrated into Linux systems. However, the object-oriented features of Octave are rudimentary compared to those of matlab, and some toolboxes, such as intlab (which we will mention later), will not function with Octave.

⁷with others, including Fortran, to follow


Alternative systems sometimes used for scientific computing are computer algebra systems. Perhaps the most common of these are Mathematica and Maple, while a free such system under development is "SAGE." These systems admit a different way of thinking about programming, termed functional programming, in which rules are defined and available at all times to automatically simplify expressions that are presented to the system. (In contrast, in imperative programming, a sequence of commands is executed one after the other.) Although these systems have become comprehensive, they are based in computations of a different character, rather than in the floating point computations and linear algebra involved in numerical analysis and scientific computing.

We will use matlab in this book to illustrate the concepts, techniques, and applications. With newer versions of matlab, a student can study how to use the system and make programs largely by using the matlab help system. The first place to turn will be the "Getting started" demos, which in newer versions are presented as videos. There are also many books devoted to use of matlab. Furthermore, we will be giving examples throughout this book.

matlab programs can be written as matlab scripts and matlab functions.

Example 1.23

The matlab script we used to produce the table following Example 1.7 (on page 8) is:

a = 2;
x = 2;
xold = x;
err_old = 1;
for k=0:10
    k
    x
    err = x - sqrt(2);
    err
    ratio = err/err_old^2
    err_old = err;
    x = x/2 + 1/x;
end
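For readers following along outside matlab: once the extraction-mangled slashes are restored, the update in this script is Newton's method for x² = 2, x ← x/2 + 1/x. The following Python sketch (our illustration, not part of the book's code) shows the quadratic-convergence ratio the table monitors:

```python
import math

# Newton's method for x^2 = 2: the error is roughly squared each step,
# so err/err_old^2 approaches the constant 1/(2*sqrt(2)) ~ 0.3536.
SQRT2 = math.sqrt(2)
x = 2.0
errs = []
for k in range(6):
    errs.append(x - SQRT2)
    x = x / 2 + 1 / x  # Newton step

ratio = errs[4] / errs[3] ** 2
print(ratio)  # close to 1/(2*sqrt(2))
```

Past the sixth step the error reaches machine precision, so later ratios are dominated by roundoff and are no longer meaningful.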

Example 1.24

The matlab script we used to produce the table in Example 1.8 (on page 9) is:

format long
a = 2;
x = 2;
xold = x;
err_old = 1;
for k=0:25
    k
    x
    err = x - sqrt(2);
    err
    ratio = err/err_old
    err_old = err;
    x = x - x^2/35 + 2/35;
end
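The update in this second script is garbled in our copy; reading the stripped slashes as x ← x − (x² − 2)/35 (an assumption on our part) gives a fixed-point iteration whose fixed point is √2 but whose convergence is only linear, which is what the ratio column tracks. A Python sketch (ours):

```python
import math

# Assumed reconstruction of the update: x <- x - (x^2 - 2)/35.
# Linear convergence: err/err_old tends to g'(sqrt(2)) = 1 - 2*sqrt(2)/35 ~ 0.92.
SQRT2 = math.sqrt(2)
x = 2.0
ratio = None
prev_err = x - SQRT2
for k in range(25):
    x = x - (x * x - 2) / 35
    err = x - SQRT2
    ratio = err / prev_err
    prev_err = err
print(ratio)  # roughly 0.92: the error shrinks by a fixed factor per step
```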

An excellent alternative text book that focuses on matlab functions is Cleve Moler's Numerical Computing with Matlab [25]. An on-line version, along with ".m" files, etc., is currently available at http://www.mathworks.com/moler/chapters.html

1.5 Applications

The purpose of the methods and techniques in this book, ultimately, is to provide both accurate predictions and insight into practical problems. This includes understanding, predicting, and managing or controlling the evolution of ecological systems and epidemics; designing and constructing durable but inexpensive bridges, buildings, roads, and water control structures; understanding chemical and physical processes; designing chemical plants and electronic components and systems; minimizing costs or maximizing delivery of products or services within companies and governments; etc. To achieve these goals, the numerical methods are a small part of the overall modeling process, which can be viewed as consisting of the following steps:

Identify the problem. This is the first step in translating an often vague situation into a mathematical problem to be solved. What questions must be answered, and how can they be quantified?

Assumptions. Which factors are to be ignored, and which are important? The real world is usually significantly more complicated than mathematical models of it, and simplifications must be made, because some factors are poorly understood, because there isn't enough data to determine some minor factors, or because it is not practical to accurately solve the resulting equations unless the model is simplified. For example, the theory of relativity and variations in the acceleration of gravity


due to the fact that the earth is not exactly round, and due to the fact that the density differs from point to point on the surface of the earth, in principle will affect the trajectory of a baseball as it leaves the bat. However, such effects can be ignored when we write down a model of the trajectory of the baseball. On the other hand, we need to include such effects if we are measuring the change in distance between two satellites in a tandem orbit to detect the location of mineral deposits on the surface of the earth.

Construction. In this step, we actually translate the problem into mathematical language.

Analysis. We solve the mathematical problem. Here is where the numerical techniques in this book come into play. With more complicated models, there is an interplay between the previous three steps and this solution process: we may need to simplify the process to enable practical solution. Also, presentation of the result is important here, to maximize the usefulness of the results. In the early days of scientific computing, printouts of numbers were used, but increasingly, results are presented as two- and three-dimensional graphs and movies.

Interpretation. The numerical solution is compared to the original problem. If it does not make sense, go back and reformulate the assumptions.

Validation. Compare the model to real data. For example, in climate models, the model might be used to predict climate changes in past years, before it is used to predict future climate changes.

Note that there is an intrinsic error introduced in the modeling process (such as when certain phenomena are ignored) that is outside the scope of our study of numerical methods. Such error can only be measured indirectly through the interpretation and validation steps. In the model solution process (the "analysis" step), errors are also introduced due to roundoff error and the approximation process. We have seen that such error consists of approximation error and roundoff error. In a study of numerical methods and numerical analysis, we quantify and find bounds on such errors. Although this may not be the major source of error, it is important to know. Consequences of this type of error might be that a good model is rejected, that incorrect conclusions are deduced about the process being modeled, etc. The authors of this book have personal experience with these events.

Errors in the modeling process can sometimes be quantified in the solution process. If the model depends on parameters that are not known precisely, but bounds on those parameters are known, knowledge of these bounds can sometimes be incorporated into the mathematical equations, and the set of possible solutions can sometimes be computed or bounded. One tool that sometimes works is interval arithmetic. Other tools, less mathematically definite but


applicable in different situations, are statistical methods and computing solutions to the model for many different values of the parameter. Throughout this book, we introduce applications from many areas.

Example 1.25

The formula for the net capacitance when two capacitors of values x and y are connected in series is

z = xy/(x + y).

Suppose the measured values of x and y are x = 1 and y = 2, respectively. Estimate the range of possible values of z, given that the true values of x and y are known to be within ±10% of the measured value.

In this example, the identification, assumptions, and construction have already been done. (It is well known how capacitances in a linear electrical circuit behave.) We are asked to analyze the error in the output of the computation, due to errors in the data. We may proceed using interval arithmetic, relying on the accuracy assumptions for the measured values. In particular, these assumptions imply that x ∈ [0.9, 1.1] and y ∈ [1.8, 2.2]. We will plug these intervals into the expression for z, but we first use Theorem 1.8, part (2), as a guide to rewrite the expression for z so x and y only occur once. (We do this so we obtain sharp bounds on the range, without overestimation.) Dividing the numerator and denominator for z by xy, we obtain

z = 1/(1/x + 1/y).

We use the intlab toolbox⁸ for matlab to evaluate z. We have the following dialog in matlab's command window:

>> intvalinit('DisplayInfsup')
===> Default display of intervals by infimum/supremum
>> x = intval('[0.9,1.1]')
intval x =
[    0.8999,    1.1001]
>> y = intval('[1.8,2.2]')
intval y =
[    1.7999,    2.2001]
>> z = 1/(1/x + 1/y)
intval z =
[    0.5999,    0.7334]
>> format long
>> z
intval z =
[   0.59999999999999,   0.73333333333334]
>>

⁸If one has matlab, intlab is available free of charge for non-commercial use from http://www.ti3.tu-harburg.de/~rump/intlab/

Thus, the capacitance must lie between 0.5999 and 0.7334. Note that x and y are input as strings. This is to assure that roundoff errors in converting the decimal expressions 0.9, 1.1, 1.8, and 2.2 into internal binary format are taken into account. See [26] for more examples of the use of intlab.
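Readers without intlab can mimic the computation with a small hand-rolled interval type. The Python sketch below is our illustration; unlike intlab, it does not round the endpoints outward, so its bounds are mathematically tight for the formula but not rigorously guaranteed against roundoff:

```python
# Minimal interval sketch (illustration only): propagate endpoints of
# x in [0.9, 1.1] and y in [1.8, 2.2] through z = 1/(1/x + 1/y).
class Interval:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def reciprocal(self):
        # valid only when the interval does not contain 0
        assert self.lo > 0 or self.hi < 0
        return Interval(1 / self.hi, 1 / self.lo)

x = Interval(0.9, 1.1)
y = Interval(1.8, 2.2)
z = (x.reciprocal() + y.reciprocal()).reciprocal()
print(z.lo, z.hi)  # ~0.6 and ~0.7333, matching the intlab bounds
```

Because x and y each occur only once in this form, the endpoint computation gives the exact range of z over the two intervals.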

1.6 Exercises

1. Write down a polynomial p(x) such that |S(x) − p(x)| ≤ 10⁻¹⁰ for −0.2 ≤ x ≤ 0.2, where

   S(x) = sin(x)/x if x ≠ 0, and S(x) = 1 if x = 0.

   Note: sinc(x) = S(πx) = sin(πx)/(πx) is the "sinc" function (well-known in signal processing, etc.).

   (a) Show that your polynomial p satisfies the condition |sinc(x) − p(x)| ≤ 10⁻¹⁰ for x ∈ [−0.2, 0.2]. Hint: You can obtain polynomial approximations with error terms for sinc(x) by writing down Taylor polynomials and corresponding error terms for sin(x), then dividing these by x. This can be easier than trying to differentiate sinc(x). For the proof part, you can use, for example, the Taylor polynomial remainder formula or the alternating series test, and you can use interval arithmetic to obtain bounds.

   (b) Plot your polynomial approximation and sinc(x) on the same graph,

       (i) over the interval [−0.2, 0.2];

       (ii) over the interval [−3, 3];

       (iii) over the interval [−10, 10].

2. Suppose f has a continuous third derivative. Show that

   |(f(x + h) − f(x − h))/(2h) − f′(x)| = O(h²).

3. Suppose f has a continuous fourth derivative. Show that

   |(f(x + h) − 2f(x) + f(x − h))/h² − f″(x)| = O(h²).


4. Let a = 0.41, b = 0.36, and c = 0.7. Assuming a 2-digit decimal computer arithmetic with rounding, show that (a − b)/c ≠ a/c − b/c when using this arithmetic.

5. Write down a formula relating the unit roundoff δ of Definition 1.3 (page 14) and the machine epsilon εₘ defined on page 20.

6. Store and run the following matlab script. What are your results? What does the script compute? Can you say anything about the computer arithmetic underlying matlab?

   eps = 1;
   x = 1 + eps;
   while (x ~= 1)
       eps = eps/2; x = 1 + eps;
   end
   eps = eps + (2*eps)^2;
   y = 1 + eps;
   y - 1

7. Suppose, for illustration, we have a system with base β = 10, t = 3 decimal digits in the mantissa, and L = −9, U = 9 for the exponent. For example, 0.123 × 10⁴, that is, 1230, is a machine number in this system. Suppose also that "round to nearest" is used in this system.

   (a) What is HUGE for this system?

   (b) What is TINY for this system?

   (c) What is the machine epsilon εₘ for this system?

   (d) Let f(x) = sin(x) + 1.

       i. Write down fl(f(0)) and fl(f(0.0008)) in normalized format for this toy system.

       ii. Compute fl(fl(f(0.0008)) − fl(f(0))). On the other hand, what is the nearest machine number to the exact value of f(0.0008) − f(0)?

       iii. Compute fl(fl(f(0.0008)) − fl(f(0)))/fl(0.0008). Compare this to the nearest machine number to the exact value of (f(0.0008) − f(0))/0.0008, and to f′(0).

8. Let f(x) = ln(x + 1) − ln(x).

   (a) Use four-digit decimal arithmetic with rounding to evaluate f(100,000).


   (b) Use the Mean Value Theorem to approximate f(x) in a form that avoids the loss of significant digits. Use this form to evaluate f(x) for x = 100,000 once again.

   (c) Compare the relative errors for the answers obtained in (a) and (b).

9. Compute the condition number of f(x) = e^(√(x² − 1)), x > 1, and discuss any possible ill-conditioning.

10. Let f(x) = (sin(x))² + x². Use interval arithmetic to prove that there are no solutions to f(x) = 0 for x ∈ [−1, −0.8].


Example 1.9

By the mean value theorem for integrals (Theorem 1.2, as in Example 1.6 on page 7), if f ∈ C²[a, b], then

f′(x) = (f(x + h) − f(x))/h + (1/h) ∫ₓ^(x+h) f″(t)(x + h − t) dt,

and

|(1/h) ∫ₓ^(x+h) f″(t)(x + h − t) dt| ≤ ch.

Thus, f′(x) ≈ (f(x + h) − f(x))/h, and the error is O(h). We will call this the method error or truncation error, as opposed to roundoff errors due to using machine approximations.

Now consider f(x) = ln x, and approximate f′(3) ≈ (ln(3 + h) − ln 3)/h for h small, using a calculator having 11 digits. The following results were obtained:

h        (ln(3 + h) − ln(3))/h    Error = 1/3 − (ln(3 + h) − ln(3))/h = O(h)
10⁻¹     0.3278982                5.44 × 10⁻³
10⁻²     0.332779                 5.54 × 10⁻⁴
10⁻³     0.3332778                5.55 × 10⁻⁵
10⁻⁴     0.333328                 5.33 × 10⁻⁶
10⁻⁵     0.333330                 3.33 × 10⁻⁶
10⁻⁶     0.333300                 3.33 × 10⁻⁵
10⁻⁷     0.333                    3.33 × 10⁻⁴
10⁻⁸     0.33                     3.33 × 10⁻³
10⁻⁹     0.3                      3.33 × 10⁻²
10⁻¹⁰    0.0                      3.33 × 10⁻¹

One sees that, in the first four steps, the error decreases by a factor of 10 as h is decreased by a factor of 10. (That is, the method error dominates.) However, starting with h = 0.00001, the error increases. (The error due to a finite number of digits, i.e., roundoff error, dominates.)
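The same trade-off between method error and roundoff error appears in IEEE double precision (about 16 digits instead of the calculator's 11); in the Python sketch below (our illustration), the turnaround simply happens at smaller h, near 10⁻⁸:

```python
import math

def fwd_err(h):
    # error of the forward difference approximation to (ln x)' = 1/3 at x = 3
    return abs((math.log(3 + h) - math.log(3)) / h - 1 / 3)

large = fwd_err(1e-1)   # method error dominates: roughly h/18
best = fwd_err(1e-8)    # near the optimal h for double precision
tiny = fwd_err(1e-16)   # 3 + 1e-16 rounds back to 3, so the quotient is 0
print(large, best, tiny)
```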

There are two possible ways to reduce the total error:

1. The method error can be reduced by using a more accurate method. This allows larger h to be used, thus avoiding roundoff error. Consider

   f′(x) = (f(x + h) − f(x − h))/(2h) + error, where the error term is O(h²):

   h       (ln(3 + h) − ln(3 − h))/(2h)    error
   0.1     0.3334568                       1.24 × 10⁻⁴
   0.01    0.3333345                       1.23 × 10⁻⁶
   0.001   0.3333333                       1.91 × 10⁻⁸


   The error decreases by a factor of 100 as h is decreased by a factor of 10.

2. Rounding error can be reduced by using more digits of accuracy, such as using double precision (or multiple precision) arithmetic.
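The first remedy is easy to check numerically. Here is a Python sketch (our illustration) of the centered difference for (ln x)′ at x = 3 in double precision; decreasing h tenfold decreases the error by about a factor of 100, consistent with O(h²):

```python
import math

def centered_err(h):
    # error of the centered difference approximation to (ln x)' = 1/3 at x = 3
    return abs((math.log(3 + h) - math.log(3 - h)) / (2 * h) - 1 / 3)

e1, e2 = centered_err(1e-1), centered_err(1e-2)
print(e1, e2, e2 / e1)  # ratio near 1/100, consistent with O(h^2)
```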

To fully understand and avoid roundoff error, we should study some details of how computers and calculators represent and work with approximate numbers.

1.2.1 Floating Point Arithmetic and Rounding Error

Let β be a positive integer, the base of the computer system. (Usually β = 2 (binary) or β = 16 (hexadecimal).) Suppose a number x has the exact base representation

x = ±(0.α₁α₂α₃ ⋯ αₜαₜ₊₁ ⋯)β^m = ±q · β^m,

where q is the mantissa, β is the base, m is the exponent, 1 ≤ α₁ ≤ β − 1, and 0 ≤ αᵢ ≤ β − 1 for i > 1.

On a computer, we are restricted to a finite set of floating-point numbers F = F(β, t, L, U) of the form x* = ±(0.a₁a₂ ⋯ aₜ)β^m, where 1 ≤ a₁ ≤ β − 1, 0 ≤ aᵢ ≤ β − 1 for 2 ≤ i ≤ t, L ≤ m ≤ U, and t is the number of digits. (In most floating point systems, L is about −64 to −1000 and U is about 64 to 1000.)

Example 1.10

(binary) β = 2:

x* = (0.1011)₂ · 2³ = (1 × 1/2 + 0 × 1/4 + 1 × 1/8 + 1 × 1/16) × 8 = 11/2 = 5.5 (decimal).

REMARK 1.1 Most numbers cannot be exactly represented on a computer. Consider x = 10.1 = (1010.0001 1001 1001 . . .)₂ (β = 2). If L = −127, U = 127, t = 24, and β = 2, then x ≈ x* = (0.1010 0001 1001 1001 1001 1001)₂ × 2⁴.

Question: Given a real number x, how do we define a floating point number fl(x) in F such that fl(x) is close to x? On modern machines, one of the following four ways is used to approximate a real number x by a machine-representable number fl(x):

round down: fl(x) = x↓, the nearest machine representable number to the real number x that is less than or equal to x;


round up: fl(x) = x↑, the nearest machine number to the real number x that is greater than or equal to x;

round to nearest: fl(x) is the nearest machine number to the real number x;

round to zero, or "chopping": fl(x) is the nearest machine number to the real number x that is closer to 0 than x. The term "chopping" is used because we simply "chop" the expansion of the real number, that is, we simply ignore the digits in the expansion of x beyond the t-th one.

The default on modern systems is usually round to nearest, although chopping is faster or requires less circuitry. Round down and round up may be used, but with care, to produce results from a string of computations that are guaranteed to be less than or greater than the exact result.

Example 1.11

β = 10, t = 5, x = 0.12345666 ⋯ × 10⁷. Then

fl(x) = 0.12345 × 10⁷ (chopping),
fl(x) = 0.12346 × 10⁷ (rounded to nearest).

(In this case, round down corresponds to chopping and round up corresponds to round to nearest.)
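The two values of fl(x) in this example can be checked with a small emulator of t-digit decimal conversion. The helper below is our illustration (it ignores signs, exponent limits, and exact halfway ties):

```python
import math

def fl(x, t, mode):
    """Emulate t-digit decimal floating point conversion of x > 0.

    mode = "chop" (round toward zero) or "near" (round to nearest).
    Illustration only: no sign handling, exponent range, or tie rules.
    """
    e = math.floor(math.log10(x)) + 1      # exponent so the mantissa is in [0.1, 1)
    scaled = x / 10 ** e * 10 ** t         # mantissa as a t-digit number
    digits = math.floor(scaled) if mode == "chop" else round(scaled)
    return digits * 10.0 ** (e - t)

x = 0.12345666e7
print(fl(x, 5, "chop"), fl(x, 5, "near"))  # 0.12345e7 versus 0.12346e7
```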

See Figure 1.2 for an example with β = 10 and t = 1. In that figure, the exhibited floating point numbers are (0.1) × 10¹, (0.2) × 10¹, . . . , (0.9) × 10¹, 0.1 × 10².

FIGURE 1.2: An example floating point system: β = 10, t = 1, and m = 1. (The figure marks the successive floating point numbers from β^(m−1) = 1 to β^m = 10¹, spaced β^(m−t) = 10⁰ = 1 apart.)

Example 1.12

Let a = 0.410, b = 0.000135, and c = 0.000431. Assuming 3-digit decimal computer arithmetic with rounding to nearest, does a + (b + c) = (a + b) + c when using this arithmetic?


Following the "rounding to nearest" definition of fl, we emulate the operations a machine would do, as follows:

a ← 0.410 × 10⁰, b ← 0.135 × 10⁻³, c ← 0.431 × 10⁻³,

and

fl(b + c) = fl(0.135 × 10⁻³ + 0.431 × 10⁻³) = fl(0.566 × 10⁻³) = 0.566 × 10⁻³,

so

fl(a + 0.566 × 10⁻³) = fl(0.410 × 10⁰ + 0.566 × 10⁻³)
                     = fl(0.410 × 10⁰ + 0.000566 × 10⁰)
                     = fl(0.410566 × 10⁰)
                     = 0.411 × 10⁰.

On the other hand,

fl(a + b) = fl(0.410 × 10⁰ + 0.135 × 10⁻³)
          = fl(0.410000 × 10⁰ + 0.000135 × 10⁰)
          = fl(0.410135 × 10⁰)
          = 0.410 × 10⁰,

so

fl(0.410 × 10⁰ + c) = fl(0.410 × 10⁰ + 0.431 × 10⁻³)
                    = fl(0.410 × 10⁰ + 0.000431 × 10⁰)
                    = fl(0.410431 × 10⁰)
                    = 0.410 × 10⁰ ≠ 0.411 × 10⁰.

Thus, the associative law does not hold for floating point arithmetic with "round to nearest." Furthermore, this illustrates that accuracy is improved if numbers of like magnitude are added first in a sum.
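This 3-digit experiment can be reproduced exactly with Python's decimal module (our illustration, not the book's computation), whose context precision counts significant digits just like t here:

```python
from decimal import Decimal, getcontext

getcontext().prec = 3  # 3 significant decimal digits; default rounding is to nearest

a, b, c = Decimal("0.410"), Decimal("0.000135"), Decimal("0.000431")

left = a + (b + c)   # emulates fl(a + fl(b + c))
right = (a + b) + c  # emulates fl(fl(a + b) + c)
print(left, right)   # 0.411 versus 0.410
```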

The following error bound is useful in some analyses.

THEOREM 1.5

|x − fl(x)| ≤ (p/2) |x| β^(1−t),

where p = 1 for rounding and p = 2 for chopping.


DEFINITION 1.3 δ = (p/2)β^(1−t) is called the unit roundoff error.

Let ε = (fl(x) − x)/x. Then fl(x) = (1 + ε)x, where |ε| ≤ δ. With this, we have the following.

THEOREM 1.6

Let ⊙ denote the operation +, −, ×, or ÷, and let x and y be machine numbers. Then

fl(x ⊙ y) = (x ⊙ y)(1 + ε), where |ε| ≤ δ = (p/2)β^(1−t).

Roundoff error that accumulates as a result of a sequence of arithmetic operations can be analyzed using this theorem. Such an analysis is called forward error analysis.
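In IEEE double precision (β = 2, t = 53, round to nearest, so p = 1 and δ = 2⁻⁵³), Theorem 1.6 can be spot-checked exactly, because machine numbers are rationals and Python's Fraction computes the exact product (this check is our illustration):

```python
from fractions import Fraction

# Machine numbers are exact rationals, so Fraction gives the exact product,
# and we can measure the relative error of the rounded product x*y.
x, y = 0.1, 0.3  # really the doubles nearest 0.1 and 0.3, not 1/10 and 3/10
exact = Fraction(x) * Fraction(y)
computed = Fraction(x * y)
rel_err = abs(computed - exact) / exact
print(rel_err <= Fraction(1, 2 ** 53))  # |eps| <= delta = 2^-53
```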

Because of the properties of floating point arithmetic, it is unreasonable to demand strict absolute tolerances when the exact result is large.

Example 1.13

Suppose β = 10 and t = 3 (3-digit decimal arithmetic), and suppose we wish to compute 10⁴π with a computed value x such that |10⁴π − x| < 10⁻². The closest floating point number in our system to 10⁴π is x = 0.314 × 10⁵ = 31400. However, |10⁴π − x| = 15.926. Hence, it is impossible to find a number x in the system with |10⁴π − x| < 10⁻².

The error |10⁴π − x| in this example is called the absolute error in approximating 10⁴π. We see that absolute error is not an appropriate measure of error when using floating point arithmetic. For this reason, we use relative error.

DEFINITION 1.4 Let x* be an approximation to x. Then |x − x*| is called the absolute error, and |(x − x*)/x| is called the relative error.

For example,

|(x − fl(x))/x| ≤ δ = (p/2)β^(1−t) (the unit roundoff error).

1.2.1.1 Expression Evaluation and Condition Numbers

We now examine some common situations in which roundoff error can become large, and explain how to avoid many of these situations.


Example 1.14

Let β = 10, t = 4, p = 1. (Thus, δ = (1/2)10⁻³ = 0.0005.) Let x = 0.5795 × 10⁵ and y = 0.6399 × 10⁵. Then

fl(x + y) = 0.1219 × 10⁶ = (x + y)(1 + ε₁), ε₁ ≈ −3.28 × 10⁻⁴, |ε₁| < δ, and
fl(xy) = 0.3708 × 10¹⁰ = (xy)(1 + ε₂), ε₂ ≈ −5.95 × 10⁻⁵, |ε₂| < δ.

(Note: x + y = 0.12194 × 10⁶ and xy = 0.37082205 × 10¹⁰.)

Example 1.15

Suppose β = 10 and t = 4 (4-digit arithmetic), and suppose x₁ = 10000, x₂ = x₃ = ⋯ = x₁₀₀₁ = 1. Then

fl(x₁ + x₂) = 10000,
fl(x₁ + x₂ + x₃) = 10000,
⋮
fl(∑_{i=1}^{1001} x_i) = 10000,

when we sum forward from x₁. But, going backwards,

fl(x₁₀₀₁ + x₁₀₀₀) = 2,
fl(x₁₀₀₁ + x₁₀₀₀ + x₉₉₉) = 3,
⋮
fl(∑_{i=1001}^{1} x_i) = 11000,

which is the correct sum.

This example illustrates the point that large relative errors can occur when many small numbers are added to a large number, or when a very large number of small, almost-equal numbers is summed. To avoid such large relative errors, one can sum from the smallest number to the largest number. However, this will not work if the numbers are all approximately equal. In such cases, one possibility is to group the numbers into sets of two adjacent numbers, summing two almost-equal numbers together. One then groups those results into sets of two and sums these together, continuing until the total sum is reached. In this scheme, two almost-equal numbers are always being summed, and the large relative error from repeatedly summing a small number to a large number is avoided.
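A double precision analogue of Example 1.15 (our illustration): here 10¹⁶ plays the role of 10000, since adjacent doubles near 10¹⁶ are spaced 2 apart, so an added 1 is rounded away:

```python
# Summing forward, each "+ 1" rounds back down to 1e16; summing
# backward, the 1's accumulate to 1000 before the large term is added.
values = [1e16] + [1.0] * 1000

forward = 0.0
for v in values:
    forward += v

backward = 0.0
for v in reversed(values):
    backward += v

print(forward, backward)  # 1e16 versus 1e16 + 1000
```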


Example 1.16

Let x₁ = 15.314768 and x₂ = 15.314899, with β = 10, t = 6 (6-digit decimal accuracy). Then x₂ − x₁ ≈ fl(x₂) − fl(x₁) = 15.3149 − 15.3148 = 0.0001. Thus,

|(x₂ − x₁) − (fl(x₂) − fl(x₁))| / |x₂ − x₁| = (0.000131 − 0.0001)/0.000131 ≈ 0.237,

that is, 23.7% relative accuracy.

This example illustrates that large relative errors can occur when two nearly equal numbers are subtracted on a computer. Sometimes an algorithm can be modified to reduce rounding error occurring from this source, as the following example illustrates.

Example 1.17

Consider finding the roots of ax² + bx + c = 0, where b² is large compared with |4ac|. The most common formula for the roots is

x₁,₂ = (−b ± √(b² − 4ac)) / (2a).

Consider x² + 100x + 1 = 0, with β = 10, t = 4, p = 2, and 4-digit chopped arithmetic. Then

x₁ = (−100 + √9996)/2,  x₂ = (−100 − √9996)/2,

and √9996 ≈ 99.97 (4-digit arithmetic, chopped). Thus,

x₁ ≈ (−100 + 99.97)/2,  x₂ ≈ (−100 − 99.97)/2.

Hence, x₁ ≈ −0.015 and x₂ ≈ −99.98, but x₁ = −0.010001 and x₂ = −99.989999, so the relative errors in x₁ and x₂ are 50% and 0.01%, respectively.

Let's change the algorithm. Assume b ≥ 0 (we can always arrange b ≥ 0). Then

x₁ = (−b + √(b² − 4ac)) / (2a)
   = (−b + √(b² − 4ac))(−b − √(b² − 4ac)) / (2a(−b − √(b² − 4ac)))
   = −2c / (b + √(b² − 4ac)),

and

x₂ = (−b − √(b² − 4ac)) / (2a) (the same as before).


Then, for the above values,

x₁ = −2(1)/(100 + √9996) ≈ −2/(100 + 99.97) = −0.0100.

Now the relative error in x₁ is also 0.01%.
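The rewritten formula helps in double precision, too. The Python sketch below (ours, not the book's) compares both formulas for the root of small magnitude of x² + 10⁸x + 1 = 0, where b² overwhelms 4ac:

```python
import math

# Small-magnitude root of x^2 + b*x + c = 0 with b^2 >> 4ac.
a, b, c = 1.0, 1e8, 1.0
d = math.sqrt(b * b - 4 * a * c)

naive = (-b + d) / (2 * a)   # catastrophic cancellation in -b + d
stable = -2 * c / (b + d)    # rewritten form from Example 1.17

print(naive, stable)  # stable is close to the true root ~ -1e-8;
                      # naive is off by roughly 25%
```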

Let us now consider error in function evaluation. Consider a single-valued function f(x), and let x* = fl(x) be the floating point approximation of x. Therefore, the machine evaluates f(x*) = f(fl(x)), which is an approximate value of f(x) at x = x*. The perturbation in f(x) for small perturbations in x can be computed via Taylor's formula. This is illustrated in the next theorem.

THEOREM 1.7 The relative error in functional evaluation is

|(f(x) − f(x*)) / f(x)| ≈ |x f′(x)/f(x)| · |(x − x*)/x|.

PROOF The linear Taylor approximation of f(x*) about f(x), for small values of |x − x*|, is given by f(x*) ≈ f(x) + f′(x)(x* − x). Rearranging the terms immediately yields the result.

This leads us to the following definition.

DEFINITION 1.5 The condition number of a function f(x) is

κ_f(x) = |x f′(x)/f(x)|.

The condition number describes how large the relative error in function evaluation is with respect to the relative error in the machine representation of x. In other words, κ_f(x) is a measure of the degree of sensitivity of the function at x.

Example 1.18

Let f(x) = √x. The condition number of f(x) about x is

κ_f(x) = |x · (1/(2√x)) / √x| = 1/2.

This suggests that f(x) is well-conditioned.


Example 1.19

Let f(x) = √(x − 2). The condition number of f(x) about x is

κ_f(x) = |x / (2(x − 2))|.

This is not defined at x* = 2. Hence, the function f(x) is numerically unstable and ill-conditioned for values of x close to 2.
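Both condition numbers follow directly from Definition 1.5; a quick Python check (our illustration):

```python
import math

def kappa(f, df, x):
    # condition number from Definition 1.5: |x * f'(x) / f(x)|
    return abs(x * df(x) / f(x))

# f(x) = sqrt(x): kappa = 1/2 for every x > 0 (well-conditioned).
k1 = kappa(math.sqrt, lambda x: 0.5 / math.sqrt(x), 4.0)

# f(x) = sqrt(x - 2): kappa = |x / (2(x - 2))| blows up as x -> 2.
k2 = kappa(lambda x: math.sqrt(x - 2), lambda x: 0.5 / math.sqrt(x - 2), 2.0001)

print(k1, k2)  # 0.5 and a large value (about 10000)
```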

REMARK 1.2 If x = f(x) = 0, then the condition number is simply |f′(x)|. If x = 0 and f(x) ≠ 0 (or f(x) = 0 and x ≠ 0), then it is more useful to consider the relation between absolute errors than relative errors. The condition number then becomes |f′(x)/f(x)|.

REMARK 1.3 Generally, if a numerical approximation z* to a quantity z is computed, the relative error is related to the number of significant digits that are correct, rather than to the number of digits after the decimal point that are correct. For example, if z* = 0.0000123453 and z = 0.00001234543, we say that z* is correct to 5 significant digits. Expressing z* as 0.123453 × 10⁻⁴ and z as 0.1234543 × 10⁻⁴, we see that if we round z to the nearest number with five digits in its mantissa, all of those digits are correct, whereas if we do the same with six digits, the sixth digit is not correct. Significant digits is the more logical way to talk about accuracy in a floating point computation, where we are interested in relative error, rather than "number of digits after the decimal point," which can have a different meaning. (Here, one might say that z* is correct to 9 digits after the decimal point.)

1.2.2 Practicalities and the IEEE Floating Point Standard

Prior to 1985, different machines used different word lengths and different bases, and different machines rounded, chopped, or did something else to form the internal representation fl(x) for real numbers x. For example, IBM mainframes generally used hexadecimal arithmetic (β = 16), with 8 hexadecimal digits total (for the mantissa, sign, and exponent) in "single precision" numbers and 16 hexadecimal digits total in "double precision" numbers. Machines such as the Univac 1108 and Honeywell Multics systems used base β = 2 and 36 binary digits (or "bits") total in single precision numbers and 72 binary digits total in double precision numbers. An unusual machine designed at Moscow State University from 1955–1965, the "Setun," even used base-3 (β = 3, or "ternary") numbers. Some computers had 32 bits total in single precision numbers and 64 bits total in double precision numbers, while some "supercomputers" (such as the Cray-1) had 64 bits total in single precision numbers and 128 bits total in double precision numbers.

Some hand-held calculators in existence today (such as some Texas Instruments calculators) can be viewed as implementing decimal (base 10, β = 10)


arithmetic, say, with L = −999 and U = 999 and t = 14 digits in the mantissa.

Except for the Setun (the value of whose ternary digits corresponded to "positive," "negative," and "neutral" in circuit elements or switches), digital computers are mostly based on binary switches or circuit elements (that is, "on" or "off"), so the base β is usually 2 or a power of 2. For example, the IBM hexadecimal digit can be viewed as a group of 4 binary digits.²

Older floating point implementations did not even always fit exactly into the model we have previously described. For example, if x is a number in the system, then −x may not have been a number in the system, or, if x were a number in the system, then 1/x may have been too large to be representable in the system.

To promote predictability, portability, reliability, and rigorous error bounding in floating point computations, the Institute of Electrical and Electronics Engineers (IEEE) and American National Standards Institute (ANSI) published a standard for binary floating point arithmetic in 1985: IEEE/ANSI 754-1985: Standard for Binary Floating Point Arithmetic, often referenced as "IEEE-754," or simply "the IEEE standard."³ Almost all computers in existence today, including personal computers and workstations based on Intel, AMD, Motorola, etc. chips, implement most of the IEEE standard.

In this standard, β = 2; 32 bits total are used in a single precision number (an "IEEE single"), and 64 bits total are used for a double precision number ("IEEE double"). In a single precision number, 1 bit is used for the sign, 8 bits are used for the exponent, and t = 23 bits are used for the mantissa. In double precision numbers, 1 bit is used for the sign, 11 bits are used for the exponent, and 52 bits are used for the mantissa. Thus, for single precision numbers, the exponent is between 0 and (11111111)₂ = 255, and 127 is subtracted from this, to get an exponent between −127 and 128. In IEEE numbers, the minimum and maximum exponent are used to denote special symbols (such as infinity and "unnormalized" numbers), so the exponent in single precision represents magnitudes between 2⁻¹²⁶ ≈ 10⁻³⁸ and 2¹²⁷ ≈ 10³⁸. The mantissa for single precision numbers represents numbers between 2⁰ = 1 and ∑_{i=0}^{23} 2⁻ⁱ = 2(1 − 2⁻²⁴) ≈ 2. Similarly, the exponent for double precision numbers is effectively between 2⁻¹⁰²² ≈ 10⁻³⁰⁸ and 2¹⁰²³ ≈ 10³⁰⁸, while the mantissa for double precision numbers represents numbers between 2⁰ = 1 and ∑_{i=0}^{52} 2⁻ⁱ ≈ 2.

Summarizing, the parameters for IEEE arithmetic appear in Table 1.1.

In many numerical computations, such as solving the large linear systems arising from partial differential equation models, more digits or a larger exponent range is required than is available with IEEE single precision. For

²An exception is in some systems for business calculations, where base 10 is implemented.
³An update to the 1985 standard was made in 2008. This update gives clarifications of certain ambiguous points, provides certain extensions, and specifies a standard for decimal arithmetic.


TABLE 1.1: Parameters for IEEE arithmetic

precision   β   L      U      t
single      2   −126   127    24
double      2   −1022  1023   53

this reason, many numerical analysts at present have adopted IEEE double precision as the default precision. For example, underlying computations in the popular computational environment matlab are done in IEEE double precision.

IEEE arithmetic provides four ways of defining fl(x), that is, four "rounding modes": namely, "round down," "round up," "round to nearest," and "round to zero" are specified as described above. The four elementary operations +, −, ×, and ÷ must be such that fl(x ⊙ y) is implemented for all four rounding modes, for ⊙ ∈ {−, +, ×, ÷, √·}.

The default mode (if the rounding mode is not explicitly set) is normally "round to nearest," to give an approximation after a long string of computations that is hopefully near the exact value. If the mode is set to "round down" and a string of computations is done, then the result is less than or equal to the exact result. Similarly, if the mode is set to "round up," then the result of a string of computations is greater than or equal to the exact result. In this way, mathematically rigorous bounds on an exact result can be obtained. (This technique must be used astutely, since naive use could result in bounds that are too large to be meaningful.)

Several parameters more directly related to numerical computations than L, U, and t are associated with any floating point number system. These are:

HUGE: the largest representable number in the floating point system;

TINY: the smallest positive representable number in the floating point system;

εₘ: the machine epsilon, the smallest positive number which, when added to 1, gives something other than 1 when using the rounding mode "round to nearest."

These so-called "machine constants" appear in Table 1.2 for the IEEE single and IEEE double precision number systems.

For IEEE arithmetic, 1/TINY < HUGE, but 1/HUGE < TINY. This brings up the question of what happens when the result of a computation has absolute value less than the smallest number representable in the system, or has absolute value greater than the largest number representable in the system. In the first case, an underflow occurs, while in the second case, an overflow occurs. In floating point computations, it is usually (but not always) reasonable to replace the result of an underflow by 0, but it is usually more problematical


Mathematical Review and Computer Arithmetic 21

TABLE 1.2: Machine constants for IEEE arithmetic

Precision   HUGE                       TINY                         ϵm
single      2^127 ≈ 3.40·10^38         2^−126 ≈ 1.18·10^−38         2^−24 + 2^−45 ≈ 5.96·10^−8
double      2^1023 ≈ 1.79·10^308       2^−1022 ≈ 2.23·10^−308       2^−53 + 2^−105 ≈ 1.11·10^−16

when an overflow occurs. Many systems prior to the IEEE standard replaced an underflow by 0, but stopped when an overflow occurred.

The IEEE standard specifies representations for special numbers ∞, −∞, +0, −0, and NaN, where the latter represents "not a number." The standard specifies that computations do not stop when an overflow or underflow occurs, or when quantities such as 1/0, −1/0, etc. are encountered (although many programming languages, by default or optionally, do stop). For example, the result of an overflow is set to ∞, whereas the result of √−1 is set to NaN, and computation continues. The standard also specifies "gradual underflow," that is, setting the result to a "denormalized" number, or a number in the floating point format whose first digit in the mantissa is equal to 0. Computation rules for these special numbers, such as NaN × any number = NaN and ∞ × any positive normalized number = ∞, allow such "nonstop" arithmetic.

Although the IEEE nonstop arithmetic is useful in many contexts, the numerical analyst should be aware of it and be cautious in interpreting results. In particular, algorithms may not behave as expected if many intermediate results contain ∞ or NaN, and the accuracy is less than expected when denormalized numbers are used. In fact, many programming languages, by default or with a controllable option, stop if ∞ or NaN occurs, but implement IEEE nonstop arithmetic with an option.

Example 1.20

IEEE double precision floating point arithmetic underlies most computations in matlab. (This is true even if only four decimal digits are displayed.) One obtains the machine epsilon with the function eps, TINY with the function realmin, and HUGE with the function realmax. Observe the following matlab dialog:

>> epsm = eps(1d0)
epsm = 2.2204e-016
>> TINY = realmin
TINY = 2.2251e-308
>> HUGE = realmax
HUGE = 1.7977e+308
>> 1/TINY
ans = 4.4942e+307
>> 1/HUGE
ans = 5.5627e-309
>> HUGE^2
ans = Inf
>> TINY^2
ans = 0
>> new_val = 1+epsm
new_val = 1.0000
>> new_val - 1
ans = 2.2204e-016
>> too_small = epsm/2
too_small = 1.1102e-016
>> not_new = 1+too_small
not_new = 1
>> not_new - 1
ans = 0
>>

Example 1.21

(Illustration of underflow and overflow) Suppose, for the purposes of illustration, we have a system with β = 10, t = 2, and one digit in the exponent, so that the positive numbers in the system range from 0.10 × 10^−9 to 0.99 × 10^9, and suppose we wish to compute N = √(x1² + x2²), where x1 = x2 = 10^6. Then both x1 and x2 are exactly represented in the system, and the nearest floating point number in the system to N is 0.14 × 10^7, well within range. However, x1² = 10^12, larger than the maximum floating point number in the system.

In older systems, an overflow usually would result in stopping the computation, while in IEEE arithmetic, the result would be assigned the symbol "Infinity." The result of adding "Infinity" to "Infinity," then taking the square root, would be "Infinity," so N would be assigned "Infinity." Similarly, if x1 = x2 = 10^−6, then x1² = 10^−12, smaller than the smallest representable machine number, causing an "underflow." On older systems, the result is usually set to 0. On IEEE systems, if "gradual underflow" is switched on, the result either becomes a denormalized number, with less than full accuracy, or is set to 0; without gradual underflow on IEEE systems, the result is set to 0. When the result is set to 0, a value of 0 is stored in N, whereas the closest floating point number in the system is 0.14 × 10^−5, well within range. To avoid this type of catastrophic underflow and overflow in the computation of N, we may use the following scheme:

1. s ← max{|x1|, |x2|},

2. η1 ← x1/s, η2 ← x2/s,

3. N ← s·√(η1² + η2²).
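The three-step scheme can be sketched in Python (IEEE double precision rather than the toy 2-digit decimal system, but exhibiting the same failure and the same cure; the function names are ours):

```python
import math

def naive_norm2(x1, x2):
    # Direct evaluation: x1*x1 can overflow to inf or underflow to 0.
    return math.sqrt(x1 * x1 + x2 * x2)

def scaled_norm2(x1, x2):
    # The scaling scheme from the text.
    s = max(abs(x1), abs(x2))           # 1. s <- max{|x1|, |x2|}
    if s == 0.0:
        return 0.0
    eta1, eta2 = x1 / s, x2 / s         # 2. eta_i <- x_i / s, so |eta_i| <= 1
    return s * math.sqrt(eta1 * eta1 + eta2 * eta2)   # 3. N <- s*sqrt(eta1^2 + eta2^2)

print(naive_norm2(1e200, 1e200))    # inf: the squares overflow
print(scaled_norm2(1e200, 1e200))   # about 1.4142e200
print(naive_norm2(1e-200, 1e-200))  # 0.0: the squares underflow
print(scaled_norm2(1e-200, 1e-200)) # about 1.4142e-200
```

This is the same idea used by library functions such as C's hypot.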


1.2.2.1 Input and Output

For examining the output of large numerical computations arising from mathematical models, plots, graphs, and movies comprised of such plots and graphs are often preferred over tables of values. However, to develop such models and study numerical algorithms, it is necessary to examine individual numbers. Because humans are trained to comprehend decimal numbers more easily than binary numbers, the binary format used in the machine is usually converted to a decimal format for display or printing. In many programming languages and environments (such as all versions of Fortran, C, C++, and matlab), the format is of a form similar to ±d1.d2d3···dm e±δ1δ2δ3 or ±d1.d2d3···dm E±δ1δ2δ3, where the "e" or "E" denotes the "exponent" of 10. For example, -1.00e+003 denotes −1 × 10^3 = −1000. Numbers are usually also input either in a standard decimal form (such as 0.001) or in this exponential format (such as 1.0e-3). (This notation originates from the earliest computers, where the only output was a printer, and the printer could only print numerical digits and the 26 upper case letters in the Roman alphabet.)

Thus, for input, a decimal fraction needs to be converted to a binary floating point number, while for output, a binary floating point number needs to be converted to a decimal fraction. This conversion necessarily is inexact. For example, the exact decimal fraction 0.1 converts to the infinitely repeating binary expansion (0.00011001100110011…)₂, which needs to be rounded into the binary floating point system. The IEEE 754 standard specifies that the result of a decimal to binary conversion, within a specified range of input formats, be the nearest floating point number to the exact result, and that, within a specified range of formats, a binary to decimal conversion be the nearest number in the specified format (which depends on the number m of decimal digits requested to be printed).

Thus, the number that one sees as output is usually not exactly the number that is represented in the computer. Furthermore, while the floating point operations on binary numbers are usually implemented in hardware or "firmware" independently of the software system, the decimal to binary and binary to decimal conversions are usually implemented separately, as part of the programming language (such as Fortran, C, C++, Java, etc.) or software system (such as matlab). The individual standards for these languages, if there are any, may not specify accuracy for such conversions, and the languages sometimes do not conform to the IEEE standard. That is, the number that one sees printed may not even be the closest number in that format to the actual number.

This inexactness in conversion usually does not cause a problem, but it may cause much confusion in certain instances. In those instances (such as in "debugging," or finding programming blunders), one may need to examine the binary numbers directly. One way of doing this is in an "octal," or base-8, format, in which each digit (between 0 and 7) is interpreted as a group of


three binary digits, or in hexadecimal format (where the digits are 0-9, A, B, C, D, E, and F), in which each digit corresponds to a group of four binary digits.
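Both the inexactness of decimal-to-binary conversion and the hexadecimal view of the stored bits can be seen from Python, whose floats are IEEE doubles (a sketch):

```python
from decimal import Decimal

x = 0.1
# Decimal(float) shows the exact value of the stored double:
print(Decimal(x))        # 0.1000000000000000055511151231257827021181583404541015625
# float.hex() shows the bits in base 16; the repeating 9s are the
# rounded tail of the repeating binary expansion of 1/10:
print(x.hex())           # 0x1.999999999999ap-4
# Three separate conversions, three separate rounding errors:
print(0.1 + 0.2 == 0.3)  # False
```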

1.2.2.2 Standard Functions

To enable accurate computation of elementary functions such as sin, cos, and exp, IEEE 754 specifies that a "long" 80-bit register (with "guard digits") be available for intermediate computations. Furthermore, IEEE 754-2008, an official update to IEEE 754-1985, provides a list of functions it recommends be implemented, and specifies accuracy requirements (in terms of correct rounding) for those functions a programming language elects to implement.

REMARK 1.4 Alternative number systems, such as variable precision arithmetic, multiple precision arithmetic, rational arithmetic, and combinations of approximate and symbolic arithmetic, have been investigated and implemented. These have various advantages over the traditional floating point arithmetic we have been discussing, but they also have disadvantages, and usually require more time, more circuitry, or both. Eventually, with the advance of computer hardware and better understanding of these alternative systems, their use may become more ubiquitous. However, for the foreseeable future, traditional floating point number systems will be the primary tool in numerical computations.

1.3 Interval Computations

Interval computations are useful for two main purposes:

• to use floating point computations to compute mathematically rigorous bounds on an exact result (and hence to rigorously bound roundoff error);

• to use floating point computations to compute mathematically rigorous bounds on the ranges of functions over boxes.

In complicated traditional floating point algorithms, naive arrangement of interval computations usually gives bounds that are too wide to be of practical use. For this reason, interval computations have been ignored by many. However, used cleverly and where appropriate, interval computations are powerful, and provide rigor and validation when other techniques cannot.

Interval computations are based on interval arithmetic.


1.3.1 Interval Arithmetic

In interval arithmetic, we define operations on intervals, which can be considered as ordered pairs of real numbers. We can think of each interval as representing the range of possible values of a quantity. The result of an operation is then an interval that represents the range of possible results of the operation, as the first argument ranges over all points in the first interval and the second argument ranges over all values in the second interval. To state this symbolically, let x = [x̲, x̄] and y = [y̲, ȳ], and define the four elementary operations by

x ⊙ y = { x ⊙ y | x ∈ x and y ∈ y } for ⊙ ∈ {+, −, ×, ÷}.   (1.1)

Interval arithmetic's usefulness derives from the fact that the mathematical characterization in Equation (1.1) is equivalent to the following operational definitions:

x + y = [x̲ + y̲, x̄ + ȳ],
x − y = [x̲ − ȳ, x̄ − y̲],
x × y = [min{x̲y̲, x̲ȳ, x̄y̲, x̄ȳ}, max{x̲y̲, x̲ȳ, x̄y̲, x̄ȳ}],
1/x = [1/x̄, 1/x̲] if x̲ > 0 or x̄ < 0,
x ÷ y = x × (1/y).
(1.2)

The ranges of the four elementary interval arithmetic operations are exactly the ranges of the corresponding real operations, but if such operations are composed, bounds on the ranges of real functions can be obtained. For example, if

f(x) = (x + 1)(x − 1),   (1.3)

then

f([−2, 2]) = ([−2, 2] + 1)([−2, 2] − 1) = [−1, 3] × [−3, 1] = [−9, 3],

which contains the exact range [−1, 3].

REMARK 1.5 In some definitions of interval arithmetic, division by intervals containing 0 is defined, consistent with (1.1). For example,

[1, 2] / [−3, 4] = (−∞, −1/3] ∪ [1/4, ∞) = R* \ (−1/3, 1/4),

where R* is the extended real number system⁴, consisting of the real numbers with the two additional numbers −∞ and ∞. This extended interval arithmetic⁵ was originally invented by William Kahan⁶ for computations with continued fractions, but has wider use than that. Although a closed system can be defined for the sets arising from this extended arithmetic, typically the complements of intervals (i.e., the unions of two semi-infinite intervals) are immediately intersected with intervals, to obtain zero, one, or two intervals. Interval arithmetic can then proceed using (1.2).

⁴Also known as the two-point compactification of the real numbers.

The power of interval arithmetic lies in its implementation on computers. In particular, outwardly rounded interval arithmetic allows rigorous enclosures for the ranges of operations and functions. This makes a qualitative difference in scientific computations, since the results are now intervals in which the exact result must lie. It also enables the use of floating point computations for automated theorem proving.

Outward rounding can be implemented on any machine that has downward rounding and upward rounding, such as any machine that complies with the IEEE 754 standard. For example, take x + y = [x̲ + y̲, x̄ + ȳ]. If x̲ + y̲ is computed with downward rounding and x̄ + ȳ is computed with upward rounding, then the resulting interval z = [z̲, z̄] that is represented in the machine must contain the exact range of x + y for x ∈ x and y ∈ y. We call the expansion of the interval from rounding the lower end point down and the upper end point up roundout error.
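An outwardly rounded interval sum can be sketched in Python (3.9+ for math.nextafter). Pure Python cannot set the hardware rounding direction, so instead we widen each computed end point by one unit in the last place; this over-rounds slightly compared to true directed rounding, but still guarantees enclosure:

```python
import math

def interval_add(x, y):
    """Outwardly rounded sum of intervals x = (lo, hi) and y = (lo, hi)."""
    lo = math.nextafter(x[0] + y[0], -math.inf)  # push lower end point down
    hi = math.nextafter(x[1] + y[1], math.inf)   # push upper end point up
    return (lo, hi)

z = interval_add((0.1, 0.1), (0.2, 0.2))
print(z)                         # a thin interval around 0.3
print(z[0] < 0.1 + 0.2 < z[1])   # True: the enclosure is rigorous
```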

Interval arithmetic is only subdistributive. That is, if x, y, and z are intervals, then

x(y + z) ⊆ xy + xz, but x(y + z) ≠ xy + xz in general.   (1.4)

As a result, algebraic expressions that would be equivalent if real values were substituted for the variables are not equivalent if interval values are used. For example, instead of writing (x − 1)(x + 1) for f(x) in (1.3), suppose we write

f(x) = x² − 1,   (1.5)

and suppose we provide a routine that computes an enclosure for the range of x² that is the exact range to within roundoff error. Such a routine could be as follows:

ALGORITHM 1.1

(Computing an interval whose end points are machine numbers and which encloses the range of x²)

⁵There are small differences in current definitions of extended interval arithmetic. For example, in some systems, −∞ and ∞ are not considered numbers, but just descriptive symbols. In those systems, [1, 2]/[−3, 4] = (−∞, −1/3] ∪ [1/4, ∞) = R \ (−1/3, 1/4). See [31] for a theoretical analysis of extended arithmetic.
⁶Who also was a major contributor to the IEEE 754 standard.


INPUT: x = [x̲, x̄].

OUTPUT: a machine-representable interval that contains the range of x² over x.

IF x̲ ≥ 0 THEN
  RETURN [x̲², x̄²], where x̲² is computed with downward rounding and x̄² is computed with upward rounding.
ELSE IF x̄ ≤ 0 THEN
  RETURN [x̄², x̲²], where x̄² is computed with downward rounding and x̲² is computed with upward rounding.
ELSE
  1. Compute x̲² and x̄² with upward rounding; that is, compute machine-representable numbers (x̲²)ᵤ and (x̄²)ᵤ such that x̲² ≤ (x̲²)ᵤ and x̄² ≤ (x̄²)ᵤ.
  2. RETURN [0, max{(x̲²)ᵤ, (x̄²)ᵤ}].
END IF

END ALGORITHM 1.1
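A Python sketch of Algorithm 1.1, again substituting one-ulp widening via math.nextafter (Python 3.9+) for true directed rounding:

```python
import math

def rdown(v): return math.nextafter(v, -math.inf)  # surrogate for downward rounding
def rup(v):   return math.nextafter(v, math.inf)   # surrogate for upward rounding

def interval_sqr(x):
    """Machine interval enclosing { t**2 : t in x }, per Algorithm 1.1."""
    lo, hi = x
    if lo >= 0.0:                 # x >= 0: range is [lo^2, hi^2]
        return (rdown(lo * lo), rup(hi * hi))
    elif hi <= 0.0:               # x <= 0: range is [hi^2, lo^2]
        return (rdown(hi * hi), rup(lo * lo))
    else:                         # 0 is interior: range is [0, max of squares]
        return (0.0, max(rup(lo * lo), rup(hi * hi)))

print(interval_sqr((-2.0, 2.0)))  # (0.0, slightly above 4.0): encloses [0, 4]
```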

With Algorithm 1.1 and rewriting f(x) from (1.3) as in (1.5), we obtain

f([−2, 2]) = [−2, 2]² − 1 = [0, 4] − 1 = [−1, 3],

which in this case is equal to the exact range of f over [−2, 2].

In fact, this illustrates a general principle: if each variable in the expression occurs only once, then interval arithmetic gives the exact range, to within roundout error. We state this formally as follows.

THEOREM 1.8

(Fundamental theorem of interval arithmetic) Suppose f(x1, x2, …, xn) is an algebraic expression in the variables x1 through xn (or a computer program with inputs x1 through xn), and suppose that this expression is evaluated with interval arithmetic. The algebraic expression or computer program can contain the four elementary operations and operations such as xⁿ, sin(x), exp(x), and log(x), etc., as long as the interval values of these functions contain their ranges over the input intervals. Then:

1. The interval value f(x1, …, xn) contains the range of f over the interval vector (or box) (x1, …, xn).


2. If the single functions (the elementary operations and functions xⁿ, etc.) have interval values that represent their exact ranges, and if each variable xi, 1 ≤ i ≤ n, occurs only once in the expression for f, then the values of f obtained by interval arithmetic represent the exact ranges of f over the input intervals.

If the expression for f contains one or more variables more than once, then overestimation of the range can occur due to interval dependency. For example, when we evaluate our example function f([−2, 2]) according to (1.3), the first factor, [−1, 3], is the exact range of x + 1 for x ∈ [−2, 2], while the second factor, [−3, 1], is the exact range of x − 1 for x ∈ [−2, 2]. Thus, [−9, 3] is the exact range of f(x1, x2) = (x1 + 1)(x2 − 1) for x1 and x2 independent, x1 ∈ [−2, 2], x2 ∈ [−2, 2].
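The dependency effect is easy to reproduce with a toy interval evaluator (a sketch that ignores rounding; all names are ours):

```python
def iadd_scalar(x, c):
    # Interval plus a point value.
    return (x[0] + c, x[1] + c)

def imul(x, y):
    # Interval product, per the min/max formula in (1.2).
    p = [x[0]*y[0], x[0]*y[1], x[1]*y[0], x[1]*y[1]]
    return (min(p), max(p))

def isqr(x):
    # Exact range of t**2 over x: a single-use operation.
    lo, hi = x
    if lo >= 0:  return (lo*lo, hi*hi)
    if hi <= 0:  return (hi*hi, lo*lo)
    return (0.0, max(lo*lo, hi*hi))

x = (-2.0, 2.0)
print(imul(iadd_scalar(x, 1.0), iadd_scalar(x, -1.0)))  # (-9.0, 3.0): x used twice
print(iadd_scalar(isqr(x), -1.0))                       # (-1.0, 3.0): exact range
```

The first form uses x twice and overestimates; the single-use form x² − 1 returns the exact range, as Theorem 1.8 predicts.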

We now present some definitions and theorems to clarify the practical consequences of interval dependency.

DEFINITION 1.6 An expression for f(x1, …, xn) which is written so that each variable occurs only once is called a single use expression, or SUE.

Fortunately, we do not need to transform every expression into a single use expression for interval computations to be of value. In particular, the interval dependency becomes less as the widths of the input intervals become smaller. The following formal definition will help us to describe this precisely.

DEFINITION 1.7 Suppose an interval evaluation f(x1, …, xn) gives [a, b] as a result interval, but the exact range {f(x1, …, xn) : xi ∈ xi, 1 ≤ i ≤ n} is [c, d] ⊆ [a, b]. We define the excess width E(f; x1, …, xn) in the interval evaluation f(x1, …, xn) by E(f; x1, …, xn) = (c − a) + (b − d).

For example, the excess width in evaluating f(x), represented as (x + 1)(x − 1), over x = [−2, 2] is (−1 − (−9)) + (3 − 3) = 8. In general, we have:

THEOREM 1.9

Suppose f(x1, x2, …, xn) is an algebraic expression in the variables x1 through xn (or a computer program with inputs x1 through xn), and suppose that this expression is evaluated with interval arithmetic, as in Theorem 1.8, to obtain an interval enclosure f(x1, …, xn) of the range of f for xi ∈ xi, 1 ≤ i ≤ n. Then, if E(f; x1, …, xn) is as in Definition 1.7, we have

E(f; x1, …, xn) = O( max_{1≤i≤n} w(xi) ),

where w(x) denotes the width of the interval x.


That is, the overestimation becomes less as the uncertainty in the arguments to the function becomes smaller.

Interval evaluations as in Theorem 1.9 are termed first-order interval extensions. It is not difficult to obtain second-order extensions, where required. (See the exercises below.)

1.3.2 Application of Interval Arithmetic: Examples

We give one such example here.

Example 1.22

Using 4-digit decimal floating point arithmetic, compute an interval enclosure for the first two digits of e, and prove that these two digits are correct.

Solution: The fifth degree Taylor polynomial representation for e is

e = 1 + 1 + 1/2! + 1/3! + 1/4! + 1/5! + (1/6!)·e^ξ,

for some ξ ∈ [0, 1]. If we assume we know e < 3, and we assume we know e^x is an increasing function of x, then the error term is bounded by

| (1/6!)·e^ξ | ≤ 3/6! < 0.005,

so this fifth-degree polynomial representation should be adequate. We will evaluate each term with interval arithmetic, and we will replace e^ξ with [1, 3]. We obtain the following computation:

[1.000, 1.000] + [1.000, 1.000] → [2.000, 2.000]
[1.000, 1.000] / [2.000, 2.000] → [0.5000, 0.5000]
[2.000, 2.000] + [0.5000, 0.5000] → [2.500, 2.500]
[1.000, 1.000] / [6.000, 6.000] → [0.1666, 0.1667]
[2.500, 2.500] + [0.1666, 0.1667] → [2.666, 2.667]
[1.000, 1.000] / [24.00, 24.00] → [0.04166, 0.04167]
[2.666, 2.667] + [0.04166, 0.04167] → [2.707, 2.709]
[1.000, 1.000] / [120.0, 120.0] → [0.008333, 0.008334]
[2.707, 2.709] + [0.008333, 0.008334] → [2.715, 2.718]
[1.000, 1.000] / [720.0, 720.0] → [0.001388, 0.001389]
[0.001388, 0.001389] × [1, 3] → [0.001388, 0.004167]
[2.715, 2.718] + [0.001388, 0.004167] → [2.716, 2.723]

Since we used outward rounding in these computations, this constitutes a mathematical proof that e ∈ [2.716, 2.723].

Note:


1. These computations can be done automatically on a computer, as simply as evaluating the function in floating point arithmetic. We will explain some programming techniques for this in Chapter 6, Section 6.2.

2. The solution is illustrative. More sophisticated methods, such as argument reduction, would be used in practice to bound values of e^x more accurately and with fewer operations.
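As a sketch of the first note, the same style of enclosure of e can be produced in a few lines of Python, using one-ulp outward widening (math.nextafter, Python 3.9+) in place of the 4-digit decimal system:

```python
import math

def rdown(v): return math.nextafter(v, -math.inf)
def rup(v):   return math.nextafter(v, math.inf)

def iadd(x, y):
    # Outwardly rounded interval addition.
    return (rdown(x[0] + y[0]), rup(x[1] + y[1]))

def e_enclosure():
    """Enclose e via 1 + 1 + 1/2! + ... + 1/5! + (1/6!)e^xi, e^xi in [1, 3]."""
    enc = (0.0, 0.0)
    for k in range(6):                          # the six polynomial terms 1/k!
        t = 1.0 / math.factorial(k)
        enc = iadd(enc, (rdown(t), rup(t)))
    r = 1.0 / math.factorial(6)                 # remainder coefficient 1/720
    enc = iadd(enc, (rdown(r), rup(3.0 * r)))   # remainder lies in [1/720, 3/720]
    return enc

lo, hi = e_enclosure()
print(lo, hi)               # an enclosure of e, roughly [2.718, 2.721]
print(lo <= math.e <= hi)   # True
```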

Proofs of the theorems, as well as greater detail, appear in various texts on interval arithmetic. A good book on interval arithmetic is R. E. Moore's classic text [27], although numerous more recent monographs and reviews are available. A World Wide Web search on the term "interval computations" will lead to some of these.

A general introduction to interval computations is [26]. That work gives not only a complete introduction, with numerous examples and explanation of pitfalls, but also provides examples with intlab, a free matlab toolbox for interval computations, and reference material for intlab. If you have matlab available, we recommend intlab for the exercises in this book involving interval computations.

1.4 Programming Environments

Modern scientific computing (with floating point numbers) is usually done with high-level "imperative" (as opposed to "functional") programming languages. Common programming environments in use for general scientific computing today are Fortran (for "FORmula TRANslation"), C/C++, and matlab. Fortran is the original such language, with its origins in the late 1950's. There is a large body of high-quality, publicly available software in Fortran for common computations in numerical analysis and scientific computing. Such software can be found, for example, on NETLIB, at

http://www.netlib.org

Fortran has evolved over the years, becoming a modern, multi-faceted language with the Fortran 2003 standard. Throughout, the emphasis of both the standardization committee and suppliers of compilers for Fortran has been on features that simplify programming of solutions to large problems in numerical analysis and scientific computing, and on features that enable high performance, especially on computers that can process vectors and matrices efficiently.

The "C" language, originally developed in conjunction with the Unix operating system, was originally meant to be a higher-level language for designing and accessing the operating system, but has become more ubiquitous since then. C++, appearing in the late 1980's, was the first widely available language⁷ to allow the object-oriented programming paradigm. In recent years,

computer science departments have favored teaching C++ over teaching Fortran, and Fortran has fallen out of favor, in relative terms. However, Fortran is still favored in certain large-scale applications, such as fluid dynamics (e.g., in weather prediction and similar simulations), and some courses are still offered in it in engineering schools. However, some people still think of Fortran as the now somewhat rudimentary language known as FORTRAN 77.

Reasonably high-quality compilers for both Fortran and C/C++ are available free of charge with Linux operating systems. Fortran 2003, largely implemented in these compilers, has a standardized interface to C, so functions written in C can be called from Fortran programs, and vice versa. These compilers include interactive, graphical-user-interface-oriented debuggers, such as "insight," available with the Linux operating system. Commercially available compilation and debugging systems are also available under Windows.

The matlab system has become increasingly popular over the last two decades or so. matlab (for "MATrix LABoratory") began in the early 1980's as a National Science Foundation project, written by Cleve Moler in FORTRAN 66, to provide an interactive environment for computing with matrices and vectors; it has since evolved to be both an interactive environment and a full-featured programming language. matlab is highly favored in courses such as this because of the ease of programming, debugging, and general use (such as graphing), and because of the numerous toolboxes, supplied by both Mathworks (Cleve Moler's company) and others, for many basic computing tasks and applications. The main drawback to the use of matlab in all scientific computations is that the language is interpretive; that is, matlab translates each line of a program to machine language each time that it executes the line. This makes complicated programs that involve nested iterations much slower (a factor of 60 or more) than comparable programs written in a compiled language such as C or Fortran. However, functions compiled from Fortran or C/C++ can be called from matlab. A common strategy has been to initially develop algorithms in matlab, then translate all or part of the program to a compilable language, as necessary for efficiency.

One perceived disadvantage of matlab is that it is proprietary. Undesirable possible consequences are that it is not free, and that there is no official guarantee that it will be available forever, unchanged. However, its use has become so widespread in recent years that these concerns do not seem to be major. Several projects, including "Octave" and "Scilab," have produced free products that partially support the matlab programming language. The most widely distributed of these, "Octave," is integrated into Linux systems. However, the object-oriented features of Octave are rudimentary compared to those of matlab, and some toolboxes, such as intlab (which we will mention later), will not function with Octave.

⁷With others, including Fortran, to follow.


Alternative systems sometimes used for scientific computing are computer algebra systems. Perhaps the most common of these are Mathematica and Maple, while a free such system under development is "SAGE." These systems admit a different way of thinking about programming, termed functional programming, in which rules are defined and available at all times to automatically simplify expressions that are presented to the system. (In contrast, in imperative programming, a sequence of commands is executed one after the other.) Although these systems have become comprehensive, they are based in computations of a different character, rather than in the floating point computations and linear algebra involved in numerical analysis and scientific computing.

We will use matlab in this book to illustrate the concepts, techniques, and applications. With newer versions of matlab, a student can study how to use the system and make programs largely by using the matlab help system. The first place to turn will be the "Getting started" demos, which in newer versions are presented as videos. There are also many books devoted to the use of matlab. Furthermore, we will be giving examples throughout this book.

matlab programs can be written as matlab scripts and matlab functions.

Example 1.23
The matlab script we used to produce the table following Example 1.7 (on page 8) is:

a = 2
x = 2
xold = x
err_old = 1
for k=0:10
    k
    x
    err = x - sqrt(2)
    err
    ratio = err/err_old^2
    err_old = err
    x = x/2 + 1/x
end

Example 1.24

The matlab script we used to produce the table in Example 1.8 (on page 9) is:

format long
a = 2
x = 2
xold = x
err_old = 1
for k=0:25
    k
    x
    err = x - sqrt(2)
    err
    ratio = err/err_old
    err_old = err
    x = x - x^2/3.5 + 2/3.5
end

An excellent alternative textbook that focuses on matlab functions is Cleve Moler's Numerical Computing with Matlab [25]. An on-line version, along with ".m" files, etc., is currently available at http://www.mathworks.com/moler/chapters.html

1.5 Applications

The purpose of the methods and techniques in this book, ultimately, is to provide both accurate predictions and insight into practical problems. This includes understanding, predicting, and managing or controlling the evolution of ecological systems and epidemics; designing and constructing durable but inexpensive bridges, buildings, roads, and water control structures; understanding chemical and physical processes; designing chemical plants and electronic components and systems; and minimizing costs or maximizing delivery of products or services within companies and governments. To achieve these goals, the numerical methods are a small part of the overall modeling process, which can be viewed as consisting of the following steps:

Identify the problem: This is the first step in translating an often vague situation into a mathematical problem to be solved. What questions must be answered, and how can they be quantified?

Assumptions: Which factors are to be ignored, and which are important? The real world is usually significantly more complicated than mathematical models of it, and simplifications must be made, because some factors are poorly understood, because there isn't enough data to determine some minor factors, or because it is not practical to accurately solve the resulting equations unless the model is simplified. For example, the theory of relativity, and variations in the acceleration of gravity due to the fact that the earth is not exactly round and the fact that the density differs from point to point on the surface of the earth, in principle will affect the trajectory of a baseball as it leaves the bat. However, such effects can be ignored when we write down a model of the trajectory of the baseball. On the other hand, we need to include such effects if we are measuring the change in distance between two satellites in a tandem orbit, to detect the location of mineral deposits on the surface of the earth.

Construction: In this step, we actually translate the problem into mathematical language.

Analysis: We solve the mathematical problem. Here is where the numerical techniques in this book come into play. With more complicated models, there is an interplay between the previous three steps and this solution process: we may need to simplify the process to enable practical solution. Also, presentation of the result is important here, to maximize the usefulness of the results. In the early days of scientific computing, printouts of numbers were used, but increasingly, results are presented as two- and three-dimensional graphs and movies.

Interpretation: The numerical solution is compared to the original problem. If it does not make sense, go back and reformulate the assumptions.

Validation: Compare the model to real data. For example, in climate models, the model might be used to predict climate changes in past years, before it is used to predict future climate changes.

Note that there is an intrinsic error introduced in the modeling process (such as when certain phenomena are ignored) that is outside the scope of our study of numerical methods. Such error can only be measured indirectly, through the interpretation and validation steps. In the model solution process (the "analysis" step), errors are also introduced due to roundoff error and the approximation process. We have seen that such error consists of approximation error and roundoff error. In a study of numerical methods and numerical analysis, we quantify and find bounds on such errors. Although this may not be the major source of error, it is important to know. Consequences of this type of error might be that a good model is rejected, or that incorrect conclusions are deduced about the process being modeled. The authors of this book have personal experience with these events.

Errors in the modeling process can sometimes be quantified in the solution process. If the model depends on parameters that are not known precisely, but bounds on those parameters are known, knowledge of these bounds can sometimes be incorporated into the mathematical equations, and the set of possible solutions can sometimes be computed or bounded. One tool that sometimes works is interval arithmetic. Other tools, less mathematically definite but

8132019 num chap 1

Mathematical Review and Computer Arithmetic 35

applicable in different situations, are statistical methods and computing solutions to the model for many different values of the parameter. Throughout this book, we introduce applications from many areas.

Example 1.25

The formula for the net capacitance when two capacitors of values x and y are connected in series is

    z = xy / (x + y).

Suppose the measured values of x and y are x = 1 and y = 2, respectively. Estimate the range of possible values of z, given that the true values of x and y are known to be within ±10% of the measured values.

In this example, the identification, assumptions, and construction have already been done. (It is well known how capacitances in a linear electrical circuit behave.) We are asked to analyze the error in the output of the computation, due to errors in the data. We may proceed using interval arithmetic, relying on the accuracy assumptions for the measured values. In particular, these assumptions imply that x ∈ [0.9, 1.1] and y ∈ [1.8, 2.2]. We will plug these intervals into the expression for z, but we first use Theorem 1.8, part (2), as a guide to rewrite the expression for z so x and y only occur once. (We do this so we obtain sharp bounds on the range, without overestimation.) Dividing the numerator and denominator for z by xy, we obtain

    z = 1 / (1/x + 1/y).

We use the intlab toolbox⁸ for matlab to evaluate z. We have the following dialog in matlab's command window:

>> intvalinit('DisplayInfsup')
===> Default display of intervals by infimum/supremum
>> x = intval('[0.9,1.1]')
intval x =
[    0.8999,    1.1001]
>> y = intval('[1.8,2.2]')
intval y =
[    1.7999,    2.2001]
>> z = 1/(1/x + 1/y)
intval z =
[    0.5999,    0.7334]
>> format long
>> z
intval z =

⁸If one has matlab, intlab is available free of charge for non-commercial use from http://www.ti3.tu-harburg.de/~rump/intlab/.


36 Applied Numerical Methods

[   0.59999999999999,   0.73333333333334]
>>

Thus, the capacitance must lie between 0.5999 and 0.7334. Note that x and y are input as strings. This is to assure that roundoff errors in converting the decimal expressions 0.9, 1.1, 1.8, and 2.2 into internal binary format are taken into account. See [26] for more examples of the use of intlab.
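The same outward-rounded computation can be imitated without intlab. The following Python sketch (ours, not part of the text) widens each computed endpoint by one unit in the last place, a crude but safe substitute for the directed rounding a real interval package uses:

```python
import math

def widen(lo, hi):
    # Push the lower endpoint down and the upper endpoint up by one ulp,
    # so the returned interval encloses the exact result.
    return math.nextafter(lo, -math.inf), math.nextafter(hi, math.inf)

def inv(iv):
    lo, hi = iv              # assumes 0 is not in [lo, hi]
    return widen(1.0 / hi, 1.0 / lo)

def add(a, b):
    return widen(a[0] + b[0], a[1] + b[1])

x = (0.9, 1.1)
y = (1.8, 2.2)
zlo, zhi = inv(add(inv(x), inv(y)))   # z = 1/(1/x + 1/y)
print(zlo, zhi)                       # an enclosure of all possible capacitances
```

The enclosure agrees with the intlab output: roughly [0.6, 0.7333], slightly widened outward.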

1.6 Exercises

1. Write down a polynomial p(x) such that |S(x) − p(x)| ≤ 10⁻¹⁰ for −0.2 ≤ x ≤ 0.2, where

       S(x) = { sin(x)/x  if x ≠ 0,
              { 1         if x = 0.

   Note: sinc(x) = S(πx) = sin(πx)/(πx) is the "sinc" function (well known in signal processing, etc.).

   (a) Show that your polynomial p satisfies the condition |sinc(x) − p(x)| ≤ 10⁻¹⁰ for x ∈ [−0.2, 0.2].
       Hint: You can obtain polynomial approximations with error terms for sinc(x) by writing down Taylor polynomials and corresponding error terms for sin(x), then dividing these by x. This can be easier than trying to differentiate sinc(x). For the proof part, you can use, for example, the Taylor polynomial remainder formula or the alternating series test, and you can use interval arithmetic to obtain bounds.

   (b) Plot your polynomial approximation and sinc(x) on the same graph,
       (i) over the interval [−0.2, 0.2],
       (ii) over the interval [−3, 3],
       (iii) over the interval [−10, 10].

2. Suppose f has a continuous third derivative. Show that

       | (f(x + h) − f(x − h)) / (2h) − f′(x) | = O(h²).

3. Suppose f has a continuous fourth derivative. Show that

       | (f(x + h) − 2f(x) + f(x − h)) / h² − f″(x) | = O(h²).


4. Let a = 0.41, b = 0.36, and c = 0.7. Assuming a 2-digit decimal computer arithmetic with rounding, show that (a − b)/c ≠ a/c − b/c when using this arithmetic.

5. Write down a formula relating the unit roundoff δ of Definition 1.3 (page 14) and the machine epsilon ϵ_m defined on page 20.

6. Store and run the following matlab script. What are your results? What does the script compute? Can you say anything about the computer arithmetic underlying matlab?

       eps = 1
       x = 1+eps
       while (x ~= 1)
           eps = eps/2;
           x = 1+eps;
       end
       eps = eps+(2*eps)^2
       y = 1+eps
       y-1

7. Suppose, for illustration, we have a system with base β = 10, t = 3 decimal digits in the mantissa, and L = −9, U = 9 for the exponent. For example, 0.123 × 10⁴, that is, 1230, is a machine number in this system. Suppose also that "round to nearest" is used in this system.

   (a) What is HUGE for this system?

   (b) What is TINY for this system?

   (c) What is the machine epsilon ϵ_m for this system?

   (d) Let f(x) = sin(x) + 1.

       i. Write down fl(f(0)) and fl(f(0.0008)) in normalized format for this toy system.

       ii. Compute fl(fl(f(0.0008)) − fl(f(0))). On the other hand, what is the nearest machine number to the exact value of f(0.0008) − f(0)?

       iii. Compute fl(fl(f(0.0008)) − fl(f(0))) / fl(0.0008). Compare this to the nearest machine number to the exact value of (f(0.0008) − f(0))/0.0008, and to f′(0).

8. Let f(x) = (ln(x + 1) − ln(x))/2.

   (a) Use four-digit decimal arithmetic with rounding to evaluate f(100,000).


   (b) Use the Mean Value Theorem to approximate f(x) in a form that avoids the loss of significant digits. Use this form to evaluate f(x) for x = 100,000 once again.

   (c) Compare the relative errors for the answers obtained in (a) and (b).

9. Compute the condition number of f(x) = e^√(x²−1), x > 1, and discuss any possible ill-conditioning.

10. Let f(x) = (sin(x))² + x/2. Use interval arithmetic to prove that there are no solutions to f(x) = 0 for x ∈ [−1, −0.8].


The error decreases by a factor of 100 as h is decreased by a factor of 10.

2. Rounding error can be reduced by using more digits of accuracy, such as using double precision (or multiple precision) arithmetic.

To fully understand and avoid roundoff error, we should study some details of how computers and calculators represent and work with approximate numbers.

1.2.1 Floating Point Arithmetic and Rounding Error

Let β be a positive integer, the base of the computer system. (Usually β = 2 (binary) or β = 16 (hexadecimal).) Suppose a number x has the exact base-β representation

    x = ±(0.α₁α₂α₃ ⋯ αₜαₜ₊₁ ⋯)_β × βᵐ = ±q × βᵐ,

where q is the mantissa, β is the base, m is the exponent, 1 ≤ α₁ ≤ β − 1, and 0 ≤ αᵢ ≤ β − 1 for i > 1.

On a computer, we are restricted to a finite set of floating-point numbers F = F(β, t, L, U) of the form x* = ±(0.a₁a₂ ⋯ aₜ)_β × βᵐ, where 1 ≤ a₁ ≤ β − 1, 0 ≤ aᵢ ≤ β − 1 for 2 ≤ i ≤ t, L ≤ m ≤ U, and t is the number of digits. (In most floating point systems, L is about −64 to −1000, and U is about 64 to 1000.)

Example 1.10
(binary) β = 2:

    x* = (0.1011)₂ × 2³ = (1 × 1/2 + 0 × 1/4 + 1 × 1/8 + 1 × 1/16) × 8 = 11/2 = 5.5 (decimal).

REMARK 1.1  Most numbers cannot be exactly represented on a computer. Consider x = 10.1 = (1010.000110011001…)₂ (β = 2). If L = −127, U = 127, t = 24, and β = 2, then x ≈ x* = (0.101000011001100110011001)₂ × 2⁴.

Question: Given a real number x, how do we define a floating point number fl(x) in F, such that fl(x) is close to x? On modern machines, one of the following four ways is used to approximate a real number x by a machine-representable number fl(x):

round down: fl(x) = x↓, the nearest machine representable number to the real number x that is less than or equal to x;


round up: fl(x) = x↑, the nearest machine number to the real number x that is greater than or equal to x;

round to nearest: fl(x) is the nearest machine number to the real number x;

round to zero, or "chopping": fl(x) is the nearest machine number to the real number x that is closer to 0 than x. The term "chopping" is because we simply "chop" the expansion of the real number, that is, we simply ignore the digits in the expansion of x beyond the t-th one.

The default on modern systems is usually round to nearest, although chopping is faster or requires less circuitry. Round down and round up may be used, with care, to produce results from a string of computations that are guaranteed to be less than or greater than the exact result.

Example 1.11
β = 10, t = 5, x = 0.12345666⋯ × 10⁷. Then

    fl(x) = 0.12345 × 10⁷ (chopping),
    fl(x) = 0.12346 × 10⁷ (rounded to nearest).

(In this case, round down corresponds to chopping, and round up corresponds to round to nearest.)
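These rounding modes are easy to experiment with. Python's decimal module (our illustration, not part of the text) can reproduce Example 1.11, with a context holding t = 5 significant digits:

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN, ROUND_DOWN, ROUND_FLOOR, ROUND_CEILING

x = Decimal("0.12345666") * Decimal(10) ** 7   # the number from Example 1.11

for name, mode in [("chop (round toward zero)", ROUND_DOWN),
                   ("round to nearest", ROUND_HALF_EVEN),
                   ("round down", ROUND_FLOOR),
                   ("round up", ROUND_CEILING)]:
    ctx = Context(prec=5, rounding=mode)       # t = 5 significant digits
    print(name, ctx.plus(x))                   # ctx.plus rounds x into the context
```

Chopping yields 0.12345 × 10⁷ and round-to-nearest yields 0.12346 × 10⁷, as in the example.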

See Figure 1.2 for an example with β = 10 and t = 1. In that figure, the exhibited floating point numbers are 0.1 × 10¹, 0.2 × 10¹, …, 0.9 × 10¹, 0.1 × 10².

FIGURE 1.2: An example floating point system: β = 10, t = 1, and m = 1. (The figure marks βᵐ⁻¹ = 1, the spacing βᵐ⁻ᵗ = 10⁰ = 1 between successive floating point numbers, and βᵐ = 10¹.)

Example 1.12
Let a = 0.410, b = 0.000135, and c = 0.000431. Assuming 3-digit decimal computer arithmetic with rounding to nearest, does a + (b + c) = (a + b) + c when using this arithmetic?


Following the "rounding to nearest" definition of fl, we emulate the operations a machine would do, as follows:

    a ← 0.410 × 10⁰,  b ← 0.135 × 10⁻³,  c ← 0.431 × 10⁻³,

and

    fl(b + c) = fl(0.135 × 10⁻³ + 0.431 × 10⁻³)
              = fl(0.566 × 10⁻³)
              = 0.566 × 10⁻³,

so

    fl(a + 0.566 × 10⁻³) = fl(0.410 × 10⁰ + 0.566 × 10⁻³)
                         = fl(0.410 × 10⁰ + 0.000566 × 10⁰)
                         = fl(0.410566 × 10⁰)
                         = 0.411 × 10⁰.

On the other hand,

    fl(a + b) = fl(0.410 × 10⁰ + 0.135 × 10⁻³)
              = fl(0.410000 × 10⁰ + 0.000135 × 10⁰)
              = fl(0.410135 × 10⁰)
              = 0.410 × 10⁰,

so

    fl(0.410 × 10⁰ + c) = fl(0.410 × 10⁰ + 0.431 × 10⁻³)
                        = fl(0.410 × 10⁰ + 0.000431 × 10⁰)
                        = fl(0.410431 × 10⁰)
                        = 0.410 × 10⁰ ≠ 0.411 × 10⁰.

Thus, the associative law does not hold for floating point arithmetic with "round to nearest." Furthermore, this example illustrates that accuracy is improved if numbers of like magnitude are added first in a sum.
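The computation in Example 1.12 can be replayed mechanically. In the Python sketch below (ours, not from the text), a decimal context with 3 significant digits emulates the toy machine:

```python
from decimal import Decimal, Context, ROUND_HALF_UP

ctx = Context(prec=3, rounding=ROUND_HALF_UP)  # 3-digit decimal, round to nearest
a, b, c = Decimal("0.410"), Decimal("0.000135"), Decimal("0.000431")

left = ctx.add(a, ctx.add(b, c))     # a + (b + c): each add is rounded to 3 digits
right = ctx.add(ctx.add(a, b), c)    # (a + b) + c
print(left, right)                   # 0.411 versus 0.410
```

The two groupings disagree in the last digit, exactly as in the hand computation.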

The following error bound is useful in some analyses.

THEOREM 1.5

    |x − fl(x)| ≤ (p/2) β¹⁻ᵗ |x|,

where p = 1 for rounding and p = 2 for chopping.


DEFINITION 1.3  δ = (p/2) β¹⁻ᵗ is called the unit roundoff error.

Let ϵ = (fl(x) − x)/x. Then fl(x) = (1 + ϵ)x, where |ϵ| ≤ δ. With this, we have the following.

THEOREM 1.6
Let ⊙ denote the operation +, −, ×, or ÷, and let x and y be machine numbers. Then

    fl(x ⊙ y) = (x ⊙ y)(1 + ϵ),  where |ϵ| ≤ δ = (p/2) β¹⁻ᵗ.

Roundoff error that accumulates as a result of a sequence of arithmetic operations can be analyzed using this theorem. Such an analysis is called forward error analysis.

Because of the properties of floating point arithmetic, it is unreasonable to demand strict tolerances on the absolute error when the exact result is large.

Example 1.13
Suppose β = 10 and t = 3 (3-digit decimal arithmetic), and suppose we wish to compute 10⁴π with a computed value x such that |10⁴π − x| < 10⁻². The closest floating point number in our system to 10⁴π is x = 0.314 × 10⁵ = 31400. However, |10⁴π − x| = 15.926⋯. Hence, it is impossible to find a number x in the system with |10⁴π − x| < 10⁻².

The error |10⁴π − x| in this example is called the absolute error in approximating 10⁴π. We see that absolute error is not an appropriate measure of error when using floating point arithmetic. For this reason, we use relative error.

DEFINITION 1.4  Let x* be an approximation to x. Then |x − x*| is called the absolute error, and |(x − x*)/x| is called the relative error.

For example,

    |(x − fl(x))/x| ≤ δ = (p/2) β¹⁻ᵗ (unit roundoff error).

1.2.1.1 Expression Evaluation and Condition Numbers

We now examine some common situations in which roundoff error can become large, and explain how to avoid many of these situations.


Example 1.14
β = 10, t = 4, p = 1 (thus, δ = (1/2) 10⁻³ = 0.0005). Let x = 0.5795 × 10⁵ and y = 0.6399 × 10⁵. Then

    fl(x + y) = 0.1219 × 10⁶ = (x + y)(1 + ϵ₁),  ϵ₁ ≈ −3.28 × 10⁻⁴,  |ϵ₁| < δ,

and

    fl(xy) = 0.3708 × 10¹⁰ = (xy)(1 + ϵ₂),  ϵ₂ ≈ −5.95 × 10⁻⁵,  |ϵ₂| < δ.

(Note: x + y = 0.12194 × 10⁶, xy = 0.37082205 × 10¹⁰.)

Example 1.15
Suppose β = 10 and t = 4 (4-digit arithmetic), and suppose x₁ = 10000 and x₂ = x₃ = ⋯ = x₁₀₀₁ = 1. Then

    fl(x₁ + x₂) = 10000,
    fl(x₁ + x₂ + x₃) = 10000,
    ⋮
    fl(x₁ + x₂ + ⋯ + x₁₀₀₁) = 10000,

when we sum forward, from x₁. But, going backwards,

    fl(x₁₀₀₁ + x₁₀₀₀) = 2,
    fl(x₁₀₀₁ + x₁₀₀₀ + x₉₉₉) = 3,
    ⋮
    fl(x₁₀₀₁ + x₁₀₀₀ + ⋯ + x₁) = 11000,

which is the correct sum.

This example illustrates the point that large relative errors occur when a large number of small numbers is added to a large number, or when a very large number of small, almost-equal numbers is added. To avoid such large relative errors, one can sum from the smallest number to the largest number. However, this will not work if the numbers are all approximately equal. In such cases, one possibility is to group the numbers into sets of two adjacent numbers, summing two almost equal numbers together. One then groups those results into sets of two and sums these together, continuing until the total sum is reached. In this scheme, two almost equal numbers are always being summed, and the large relative error from repeatedly summing a small number to a large number is avoided.


Example 1.16
x₁ = 1.5314768, x₂ = 1.5314899, β = 10, t = 6 (6-digit decimal accuracy). Then x₂ − x₁ ≈ fl(x₂) − fl(x₁) = 1.53149 − 1.53148 = 0.00001. Thus,

    |(x₂ − x₁ − (fl(x₂) − fl(x₁))) / (x₂ − x₁)| = (0.0000131 − 0.00001)/0.0000131
                                                ≈ 0.237 = 23.7% relative error.

This example illustrates that large relative errors can occur when two nearly equal numbers are subtracted on a computer. Sometimes, an algorithm can be modified to reduce rounding error occurring from this source, as the following example illustrates.

Example 1.17
Consider finding the roots of ax² + bx + c = 0, where b² is large compared with |4ac|. The most common formula for the roots is

    x = (−b ± √(b² − 4ac)) / (2a).

Consider x² + 100x + 1 = 0, β = 10, t = 4, p = 2, and 4-digit chopped arithmetic. Then

    x₁ = (−100 + √9996)/2,  x₂ = (−100 − √9996)/2,

and √9996 ≈ 99.97 (4-digit arithmetic, chopped). Thus,

    x₁ ≈ (−100 + 99.97)/2,  x₂ ≈ (−100 − 99.97)/2.

Hence, x₁ ≈ −0.015 and x₂ ≈ −99.98, but x₁ = −0.010001⋯ and x₂ = −99.989999⋯, so the relative errors in x₁ and x₂ are 50% and 0.01%, respectively.

Let's change the algorithm. Assume b ≥ 0 (we can always arrange b ≥ 0). Then

    x₁ = (−b + √(b² − 4ac)) / (2a)
       = (−b + √(b² − 4ac))(−b − √(b² − 4ac)) / (2a(−b − √(b² − 4ac)))
       = −2c / (b + √(b² − 4ac)),

and

    x₂ = (−b − √(b² − 4ac)) / (2a)  (the same as before).


Then, for the above values,

    x₁ = −2(1) / (100 + √9996) ≈ −2/(100 + 99.97) = −0.0100.

Now the relative error in x₁ is also 0.01%.
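The same cure applies in IEEE double precision. The sketch below (ours, not from the text; it assumes b ≥ 0 and real roots) contrasts the naive and rewritten formulas for the hypothetical test equation x² + 10⁸x + 1 = 0, whose small root is about −10⁻⁸:

```python
import math

def roots_naive(a, b, c):
    d = math.sqrt(b*b - 4*a*c)
    return (-b + d) / (2*a), (-b - d) / (2*a)

def roots_stable(a, b, c):
    # assumes b >= 0 and real roots; avoids subtracting nearly equal numbers
    d = math.sqrt(b*b - 4*a*c)
    x2 = (-b - d) / (2*a)
    x1 = -2*c / (b + d)      # algebraically equal to (-b + d)/(2a)
    return x1, x2

a, b, c = 1.0, 1.0e8, 1.0
x1n, _ = roots_naive(a, b, c)
x1s, _ = roots_stable(a, b, c)
print(x1n, x1s)              # naive value is visibly wrong; stable value is near -1.0e-8
```

The naive small root loses most of its significant digits to cancellation, while the rewritten formula is accurate to full precision.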

Let us now consider error in function evaluation. Consider a single-valued function f(x), and let x* = fl(x) be the floating point approximation of x. Therefore, the machine evaluates f(x*) = f(fl(x)), which is an approximate value of f(x) at x = x*. Then the perturbation in f(x) for small perturbations in x can be computed via Taylor's formula. This is illustrated in the next theorem.

THEOREM 1.7  The relative error in function evaluation is

    |(f(x) − f(x*)) / f(x)| ≈ |x f′(x) / f(x)| · |(x − x*) / x|.

PROOF  The linear Taylor approximation of f(x*) about f(x), for small values of |x − x*|, is given by f(x*) ≈ f(x) + f′(x)(x* − x). Rearranging the terms immediately yields the result.

This leads us to the following definition.

DEFINITION 1.5  The condition number of a function f(x) is

    κ_f(x) = |x f′(x) / f(x)|.

The condition number describes how large the relative error in function evaluation is with respect to the relative error in the machine representation of x. In other words, κ_f(x) is a measure of the degree of sensitivity of the function at x.

Example 1.18
Let f(x) = √x. The condition number of f(x) about x is

    κ_f(x) = |x · (1/(2√x)) / √x| = 1/2.

This suggests that f(x) is well-conditioned.


Example 1.19
Let f(x) = √(x − 2). The condition number of f(x) about x is

    κ_f(x) = |x / (2(x − 2))|.

This is not defined at x* = 2. Hence, the function f(x) is numerically unstable and ill-conditioned for values of x close to 2.

REMARK 1.2  If x = f(x) = 0, then the condition number is simply |f′(x)|. If x = 0 and f(x) ≠ 0 (or f(x) = 0 and x ≠ 0), then it is more useful to consider the relation between absolute errors than relative errors. The condition number then becomes |f′(x)/f(x)|.

REMARK 1.3  Generally, if a numerical approximation z̃ to a quantity z is computed, the relative error is related to the number of digits after the decimal point that are correct. For example, if z = 0.0000123453 and z̃ = 0.00001234543, we say that z̃ is correct to 5 significant digits. Expressing z as 0.123453 × 10⁻⁴ and z̃ as 0.1234543 × 10⁻⁴, we see that if we round z̃ to the nearest number with five digits in its mantissa, all of those digits are correct, whereas if we do the same with six digits, the sixth digit is not correct. Significant digits is the more logical way to talk about accuracy in a floating point computation, where we are interested in relative error, rather than "number of digits after the decimal point," which can have a different meaning. (Here, one might say that z̃ is correct to 9 digits after the decimal point.)

1.2.2 Practicalities and the IEEE Floating Point Standard

Prior to 1985, different machines used different word lengths and different bases, and different machines rounded, chopped, or did something else to form the internal representation fl(x) for real numbers x. For example, IBM mainframes generally used hexadecimal arithmetic (β = 16), with 8 hexadecimal digits total (for the mantissa, sign, and exponent) in "single precision" numbers, and 16 hexadecimal digits total in "double precision" numbers. Machines such as the Univac 1108 and Honeywell Multics systems used base β = 2, with 36 binary digits (or "bits") total in single precision numbers and 72 binary digits total in double precision numbers. An unusual machine designed at Moscow State University from 1955–1965, the "Setun," even used base-3 (β = 3, or "ternary") numbers. Some computers had 32 bits total in single precision numbers and 64 bits total in double precision numbers, while some "supercomputers" (such as the Cray-1) had 64 bits total in single precision numbers and 128 bits total in double precision numbers.

Some hand-held calculators in existence today (such as some Texas Instruments calculators) can be viewed as implementing decimal (base 10, β = 10)


arithmetic, say, with L = −999, U = 999, and t = 14 digits in the mantissa.

Except for the Setun (the value of whose ternary digits corresponded to "positive," "negative," and "neutral" in circuit elements or switches), digital computers are mostly based on binary switches or circuit elements (that is, "on" or "off"), so the base β is usually 2 or a power of 2. For example, the IBM hexadecimal digit can be viewed as a group of 4 binary digits.²

Older floating point implementations did not even always fit exactly into the model we have previously described. For example, if x is a number in the system, then −x may not have been a number in the system, or, if x were a number in the system, then 1/x may have been too large to be representable in the system.

To promote predictability, portability, reliability, and rigorous error bounding in floating point computations, the Institute of Electrical and Electronics Engineers (IEEE) and American National Standards Institute (ANSI) published a standard for binary floating point arithmetic in 1985: IEEE/ANSI 754-1985, Standard for Binary Floating Point Arithmetic, often referenced as "IEEE-754," or simply "the IEEE standard."³ Almost all computers in existence today, including personal computers and workstations based on Intel, AMD, Motorola, etc., chips, implement most of the IEEE standard.

In this standard, β = 2; 32 bits total are used in a single precision number (an "IEEE single"), and 64 bits total are used for a double precision number ("IEEE double"). In a single precision number, 1 bit is used for the sign, 8 bits are used for the exponent, and t = 23 bits are used for the mantissa. In double precision numbers, 1 bit is used for the sign, 11 bits are used for the exponent, and 52 bits are used for the mantissa. Thus, for single precision numbers, the stored exponent is between 0 and (11111111)₂ = 255, and 127 is subtracted from this, to get an exponent between −127 and 128. In IEEE numbers, the minimum and maximum exponent are used to denote special symbols (such as infinity and "unnormalized" numbers), so the exponent in single precision represents magnitudes between 2⁻¹²⁶ ≈ 10⁻³⁸ and 2¹²⁷ ≈ 10³⁸. The mantissa for single precision numbers represents numbers between 2⁰ = 1 and 1 + 2⁻¹ + ⋯ + 2⁻²³ = 2(1 − 2⁻²⁴) ≈ 2. Similarly, the exponent for double precision numbers is effectively between 2⁻¹⁰²² ≈ 10⁻³⁰⁸ and 2¹⁰²³ ≈ 10³⁰⁸, while the mantissa for double precision numbers represents numbers between 2⁰ = 1 and 1 + 2⁻¹ + ⋯ + 2⁻⁵² ≈ 2.

Summarizing, the parameters for IEEE arithmetic appear in Table 1.1.

In many numerical computations, such as solving the large linear systems arising from partial differential equation models, more digits or a larger exponent range is required than is available with IEEE single precision. For

²An exception is in some systems for business calculations, where base 10 is implemented.
³An update to the 1985 standard was made in 2008. This update gives clarifications of certain ambiguous points, provides certain extensions, and specifies a standard for decimal arithmetic.


TABLE 1.1: Parameters for IEEE arithmetic

    precision   β      L      U    t
    single      2   -126    127   24
    double      2  -1022   1023   53

this reason, many numerical analysts at present have adopted IEEE double precision as the default precision. For example, underlying computations in the popular computational environment matlab are done in IEEE double precision.

IEEE arithmetic provides four ways of defining fl(x), that is, four "rounding modes": "round down," "round up," "round to nearest," and "round to zero," specified as above. The four elementary operations +, −, ×, and ÷ must be such that fl(x ⊙ y) is implemented for all four rounding modes, for ⊙ ∈ {+, −, ×, ÷}, and similarly for √·.

The default mode (if the rounding mode is not explicitly set) is normally "round to nearest," to give an approximation, after a long string of computations, that is hopefully near the exact value. If the mode is set to "round down" and a string of computations is done, then the result is less than or equal to the exact result. Similarly, if the mode is set to "round up," then the result of a string of computations is greater than or equal to the exact result. In this way, mathematically rigorous bounds on an exact result can be obtained. (This technique must be used astutely, since naive use could result in bounds that are too large to be meaningful.)

Several parameters more directly related to numerical computations than L, U, and t are associated with any floating point number system. These are:

HUGE: the largest representable number in the floating point system;

TINY: the smallest positive representable number in the floating point system;

ϵ_m: the machine epsilon, the smallest positive number which, when added to 1, gives something other than 1 when using the rounding mode "round to nearest."

These so-called "machine constants" appear in Table 1.2 for the IEEE single and IEEE double precision number systems.

For IEEE arithmetic, 1/TINY < HUGE, but 1/HUGE < TINY. This brings up the question of what happens when the result of a computation has absolute value less than the smallest number representable in the system, or has absolute value greater than the largest number representable in the system. In the first case, an underflow occurs, while in the second case, an overflow occurs. In floating point computations, it is usually (but not always) reasonable to replace the result of an underflow by 0, but it is usually more problematical


TABLE 1.2: Machine constants for IEEE arithmetic

    Precision   HUGE                       TINY                         ϵ_m
    single      2¹²⁷ ≈ 3.40 · 10³⁸         2⁻¹²⁶ ≈ 1.18 · 10⁻³⁸         2⁻²⁴ + 2⁻⁴⁵ ≈ 5.96 · 10⁻⁸
    double      2¹⁰²³ ≈ 1.79 · 10³⁰⁸       2⁻¹⁰²² ≈ 2.23 · 10⁻³⁰⁸       2⁻⁵³ + 2⁻¹⁰⁵ ≈ 1.11 · 10⁻¹⁶

when an overflow occurs. Many systems prior to the IEEE standard replaced an underflow by 0, but stopped when an overflow occurred.

The IEEE standard specifies representations for the special numbers ∞, −∞, +0, −0, and NaN, where the latter represents "not a number." The standard specifies that computations do not stop when an overflow or underflow occurs, or when quantities such as √−1, 1/0, or −1/0 are encountered (although many programming languages, by default or optionally, do stop). For example, the result of an overflow is set to ∞, whereas the result of √−1 is set to NaN, and computation continues. The standard also specifies "gradual underflow," that is, setting the result to a "denormalized" number, or a number in the floating point format whose first digit in the mantissa is equal to 0. Computation rules for these special numbers, such as NaN × any number = NaN, ∞ × any positive normalized number = ∞, allow such "nonstop" arithmetic.

Although the IEEE nonstop arithmetic is useful in many contexts, the numerical analyst should be aware of it and be cautious in interpreting results. In particular, algorithms may not behave as expected if many intermediate results contain ∞ or NaN, and the accuracy is less than expected when denormalized numbers are used. In fact, many programming languages, by default or with a controllable option, stop if ∞ or NaN occurs, but implement IEEE nonstop arithmetic with an option.
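Since most languages' floating point types are IEEE doubles, this nonstop behavior is easy to observe; a Python illustration (ours, not from the text; the huge and tiny constants below are the double precision HUGE and TINY values from Table 1.2):

```python
import math

huge = 1.7976931348623157e308      # approximately HUGE for IEEE double
tiny = 2.2250738585072014e-308     # approximately TINY (smallest normalized double)

overflow = huge * 2.0              # -> inf; computation continues, nothing stops
underflow = tiny * tiny            # -> 0.0 (below even the denormalized range)
denormal = tiny / 2.0              # gradual underflow: a denormalized number
not_a_number = float("inf") - float("inf")   # -> nan

print(overflow, underflow, denormal, not_a_number)
```

Each operation returns a special value (∞, 0, a denormal, NaN) instead of raising an error, exactly the "nonstop" behavior described above.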

Example 1.20
IEEE double precision floating point arithmetic underlies most computations in matlab. (This is true even if only four decimal digits are displayed.) One obtains the machine epsilon with the function eps, TINY with the function realmin, and HUGE with the function realmax. Observe the following matlab dialog:

>> epsm = eps(1d0)
epsm = 2.2204e-016
>> TINY = realmin
TINY = 2.2251e-308
>> HUGE = realmax
HUGE = 1.7977e+308
>> 1/TINY
ans = 4.4942e+307
>> 1/HUGE


ans = 5.5627e-309
>> HUGE^2
ans = Inf
>> TINY^2
ans = 0
>> new_val = 1+epsm
new_val = 1.0000
>> new_val - 1
ans = 2.2204e-016
>> too_small = epsm/2
too_small = 1.1102e-016
>> not_new = 1+too_small
not_new = 1
>> not_new - 1
ans = 0
>>

Example 1.21
(Illustration of underflow and overflow.) Suppose, for the purposes of illustration, we have a system with β = 10, t = 2, and one digit in the exponent, so that the positive numbers in the system range from 0.10 × 10⁻⁹ to 0.99 × 10⁹, and suppose we wish to compute N = √(x₁² + x₂²), where x₁ = x₂ = 10⁶. Then both x₁ and x₂ are exactly represented in the system, and the nearest floating point number in the system to N is 0.14 × 10⁷, well within range. However, x₁² = 10¹², larger than the maximum floating point number in the system. In older systems, an overflow usually would result in stopping the computation, while, in IEEE arithmetic, the result would be assigned the symbol "Infinity." The result of adding "Infinity" to "Infinity," then taking the square root, would be "Infinity," so N would be assigned "Infinity." Similarly, if x₁ = x₂ = 10⁻⁶, then x₁² = 10⁻¹², smaller than the smallest representable machine number, causing an "underflow." On older systems, the result is usually set to 0. On IEEE systems, if "gradual underflow" is switched on, the result either becomes a denormalized number, with less than full accuracy, or is set to 0; without gradual underflow on IEEE systems, the result is set to 0. When the result is set to 0, a value of 0 is stored in N, whereas the closest floating point number in the system is 0.14 × 10⁻⁵, well within range. To avoid this type of catastrophic underflow and overflow in the computation of N, we may use the following scheme:

1. s ← max{|x₁|, |x₂|};

2. η₁ ← x₁/s, η₂ ← x₂/s;

3. N ← s √(η₁² + η₂²).
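The three-step scaling scheme can be sketched directly. In the Python illustration below (ours, not from the text), the naive formula overflows in double precision while the scaled one does not; in practice, library functions such as math.hypot perform this scaling for you:

```python
import math

def scaled_norm(x1, x2):
    # N = sqrt(x1^2 + x2^2) without overflow/underflow in the squares
    s = max(abs(x1), abs(x2))        # step 1
    if s == 0.0:
        return 0.0
    eta1, eta2 = x1 / s, x2 / s      # step 2: both now have magnitude <= 1
    return s * math.sqrt(eta1**2 + eta2**2)   # step 3

big = 1.0e200                        # big*big would overflow to inf
print(scaled_norm(big, big))         # about 1.4142e+200, computed safely
print(math.sqrt(big*big + big*big))  # inf: the naive formula overflows
```

The same scaling protects against underflow when both arguments are tiny.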


1.2.2.1 Input and Output

For examining the output of large numerical computations arising from mathematical models, plots, graphs, and movies comprised of such plots and graphs are often preferred over tables of values. However, to develop such models and study numerical algorithms, it is necessary to examine individual numbers. Because humans are trained to comprehend decimal numbers more easily than binary numbers, the binary format used in the machine is usually converted to a decimal format for display or printing. In many programming languages and environments (such as all versions of Fortran, C, C++, and matlab), the format is of a form similar to ±d₁.d₂d₃⋯dₘ e ±δ₁δ₂δ₃ or ±d₁.d₂d₃⋯dₘ E ±δ₁δ₂δ₃, where the "e" or "E" denotes the "exponent" of 10. For example, -1.00e+003 denotes −1 × 10³ = −1000. Numbers are usually also input either in a standard decimal form (such as 0.001) or in this exponential format (such as 1.0e-3). (This notation originates from the earliest computers, where the only output was a printer, and the printer could only print numerical digits and the 26 upper case letters in the Roman alphabet.)

Thus, for input, a decimal fraction needs to be converted to a binary floating point number, while, for output, a binary floating point number needs to be converted to a decimal fraction. This conversion necessarily is inexact. For example, the exact decimal fraction 0.1 converts to the infinitely repeating binary expansion (0.000110011001⋯)₂, which needs to be rounded into the binary floating point system. The IEEE 754 standard specifies that the result of a decimal-to-binary conversion, within a specified range of input formats, be the nearest floating point number to the exact result, over a specified range, and that, within a specified range of formats, a binary-to-decimal conversion be the nearest number in the specified format (which depends on the number m of decimal digits requested to be printed).
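The inexactness of decimal-to-binary conversion can be exhibited exactly; Python's decimal module (our illustration, not from the text) prints the precise value of the IEEE double nearest to 0.1:

```python
from decimal import Decimal

# Decimal(0.1) takes the stored binary double and shows its exact decimal value:
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625
```

The stored number differs from the exact decimal fraction 0.1 in the 18th significant digit, yet the default printed form "0.1" hides this, which is precisely the confusion discussed above.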

Thus, the number that one sees as output is usually not exactly the number that is represented in the computer. Furthermore, while the floating point operations on binary numbers are usually implemented in hardware or "firmware," independently of the software system, the decimal-to-binary and binary-to-decimal conversions are usually implemented separately, as part of the programming language (such as Fortran, C, C++, Java, etc.) or software system (such as matlab). The individual standards for these languages, if there are any, may not specify accuracy for such conversions, and the languages sometimes do not conform to the IEEE standard. That is, the number that one sees printed may not even be the closest number in that format to the actual number.

This inexactness in conversion usually does not cause a problem, but may cause much confusion in certain instances. In those instances (such as in "debugging," or finding programming blunders), one may need to examine the binary numbers directly. One way of doing this is in an "octal," or base-8, format, in which each digit (between 0 and 7) is interpreted as a group of three binary digits, or in hexadecimal format (where the digits are 0-9, A, B, C, D, E, F), in which each digit corresponds to a group of four binary digits.
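These display issues can be observed in any language whose double precision type is the IEEE 754 binary64 format. The sketch below uses Python (for illustration only; the book's examples use MATLAB) to show that the stored value of 0.1 is not exact, and that a hexadecimal view of the bits removes the ambiguity of decimal display:

```python
# 0.1 has no finite binary expansion, so the stored double is only the
# nearest representable number; the default decimal printout hides this.
x = 0.1
print(x)                  # prints the short decimal "0.1" that round-trips
print(0.1 + 0.2 == 0.3)   # False: the separately rounded sums differ

# Examining the bits directly (here in hexadecimal) is unambiguous:
print(x.hex())            # 0x1.999999999999ap-4 (the repeating pattern)
print((0.5).hex())        # 0x1.0000000000000p-1 (0.5 is exact in binary)
```

The repeating hexadecimal digits 9, 9, 9, ... correspond to the repeating binary expansion of 1/10 described above.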

1.2.2.2 Standard Functions

To enable accurate computation of elementary functions such as sin, cos, and exp, IEEE 754 specifies that a "long" 80-bit register (with "guard digits") be available for intermediate computations. Furthermore, IEEE 754-2008, an official update to IEEE 754-1985, provides a list of functions it recommends be implemented, and specifies accuracy requirements (in terms of correct rounding) for those functions a programming language elects to implement.

REMARK 1.4 Alternative number systems, such as variable precision arithmetic, multiple precision arithmetic, rational arithmetic, and combinations of approximate and symbolic arithmetic, have been investigated and implemented. These have various advantages over the traditional floating point arithmetic we have been discussing, but also have disadvantages, and usually require more time, more circuitry, or both. Eventually, with the advance of computer hardware and better understanding of these alternative systems, their use may become more ubiquitous. However, for the foreseeable future, traditional floating point number systems will be the primary tool in numerical computations.

1.3 Interval Computations

Interval computations are useful for two main purposes:

• to use floating point computations to compute mathematically rigorous bounds on an exact result (and hence to rigorously bound roundoff error);

• to use floating point computations to compute mathematically rigorous bounds on the ranges of functions over boxes.

In complicated traditional floating point algorithms, naive arrangement of interval computations usually gives bounds that are too wide to be of practical use. For this reason, interval computations have been ignored by many. However, used cleverly and where appropriate, interval computations are powerful, and provide rigor and validation when other techniques cannot.

Interval computations are based on interval arithmetic.


1.3.1 Interval Arithmetic

In interval arithmetic, we define operations on intervals, which can be considered as ordered pairs of real numbers. We can think of each interval as representing the range of possible values of a quantity. The result of an operation is then an interval that represents the range of possible results of the operation, as the first argument ranges over all points in the first interval and the second argument ranges over all points in the second interval. To state this symbolically, let x = [x̲, x̄] and y = [y̲, ȳ], and define the four elementary operations by

x ⊙ y = { x ⊙ y : x ∈ x, y ∈ y } for ⊙ ∈ {+, −, ×, ÷}.    (1.1)

Interval arithmetic's usefulness derives from the fact that the mathematical characterization in Equation (1.1) is equivalent to the following operational definitions:

x + y = [x̲ + y̲, x̄ + ȳ],
x − y = [x̲ − ȳ, x̄ − y̲],
x × y = [min{x̲y̲, x̲ȳ, x̄y̲, x̄ȳ}, max{x̲y̲, x̲ȳ, x̄y̲, x̄ȳ}],
1/x = [1/x̄, 1/x̲] if x̲ > 0 or x̄ < 0,
x ÷ y = x × (1/y).    (1.2)

The ranges of the four elementary interval arithmetic operations are exactly the ranges of the corresponding real operations, but if such operations are composed, bounds on the ranges of real functions can be obtained. For example, if

f(x) = (x + 1)(x − 1),    (1.3)

then

f([−2, 2]) = ([−2, 2] + 1)([−2, 2] − 1) = [−1, 3] × [−3, 1] = [−9, 3],

which contains the exact range [−1, 3].
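The operational definitions in (1.2) are simple enough to sketch in a few lines. The following Python fragment (illustration only; the helper names iadd, isub, and imul are ours, intervals are modeled as tuples, and the outward rounding a real implementation would perform is omitted) reproduces the enclosure just computed:

```python
# Minimal, unrounded model of three of the operations in (1.2);
# production interval libraries also round each endpoint outward.
def iadd(x, y):
    return (x[0] + y[0], x[1] + y[1])

def isub(x, y):
    return (x[0] - y[1], x[1] - y[0])

def imul(x, y):
    p = [x[0]*y[0], x[0]*y[1], x[1]*y[0], x[1]*y[1]]
    return (min(p), max(p))

x = (-2.0, 2.0)
one = (1.0, 1.0)
fx = imul(iadd(x, one), isub(x, one))   # (x + 1)(x - 1) over [-2, 2]
print(fx)   # (-9.0, 3.0): encloses, but overestimates, the exact range [-1, 3]
```

The overestimate here is not a bug; it is the interval dependency phenomenon discussed later in this section.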

REMARK 1.5 In some definitions of interval arithmetic, division by intervals containing 0 is defined, consistent with (1.1). For example,

[1, 2] / [−3, 4] = (−∞, −1/3] ∪ [1/4, ∞) = R* \ (−1/3, 1/4),

where R* is the extended real number system⁴, consisting of the real numbers with the two additional numbers −∞ and ∞. This extended interval arithmetic⁵ was originally invented by William Kahan⁶ for computations with continued fractions, but has wider use than that. Although a closed system can be defined for the sets arising from this extended arithmetic, typically the complements of intervals (i.e., the unions of two semi-infinite intervals) are immediately intersected with intervals to obtain zero, one, or two intervals. Interval arithmetic can then proceed using (1.2).

⁴also known as the two-point compactification of the real numbers

The power of interval arithmetic lies in its implementation on computers. In particular, outwardly rounded interval arithmetic allows rigorous enclosures for the ranges of operations and functions. This makes a qualitative difference in scientific computations, since the results are now intervals in which the exact result must lie. It also enables use of floating point computations for automated theorem proving.

Outward rounding can be implemented on any machine that has downward rounding and upward rounding, such as any machine that complies with the IEEE 754 standard. For example, take x + y = [x̲ + y̲, x̄ + ȳ]. If x̲ + y̲ is computed with downward rounding, and x̄ + ȳ is computed with upward rounding, then the resulting interval z = [z̲, z̄] that is represented in the machine must contain the exact range of x + y for x ∈ x and y ∈ y. We call the expansion of the interval from rounding the lower end point down and the upper end point up roundout error.
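Directed rounding modes are awkward to reach from some high-level languages, but the same effect can be approximated by stepping each computed endpoint outward by one unit in the last place. The sketch below (Python for illustration; iadd_out is a hypothetical helper, and one-ulp stepping is sufficient though slightly wider than true directed rounding) verifies that the resulting enclosure contains the mathematically exact sum:

```python
import math
from fractions import Fraction

def iadd_out(x, y):
    # Round the computed lower endpoint down one ulp and the computed
    # upper endpoint up one ulp; since round-to-nearest is within half
    # an ulp of the exact sum, the result rigorously encloses it.
    lo = math.nextafter(x[0] + y[0], -math.inf)
    hi = math.nextafter(x[1] + y[1], math.inf)
    return (lo, hi)

z = iadd_out((0.1, 0.1), (0.2, 0.2))
exact = Fraction(1, 10) + Fraction(2, 10)           # the real number 0.3
print(Fraction(z[0]) <= exact <= Fraction(z[1]))    # True: rigorous enclosure
```

Here Fraction is used only to compare against the exact rational value 3/10, which no double represents exactly.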

Interval arithmetic is only subdistributive. That is, if x, y, and z are intervals, then

x(y + z) ⊆ xy + xz, but x(y + z) ≠ xy + xz in general.    (1.4)

As a result, algebraic expressions that would be equivalent if real values were substituted for the variables are not equivalent if interval values are used. For example, instead of writing (x − 1)(x + 1) for f(x) as in (1.3), suppose we write

f(x) = x² − 1,    (1.5)

and suppose we provide a routine that computes an enclosure for the range of x² that is the exact range to within roundoff error. Such a routine could be as follows.

ALGORITHM 1.1

(Computing an interval whose end points are machine numbers and which encloses the range of x²)

INPUT: x = [x̲, x̄].

OUTPUT: a machine-representable interval that contains the range of x² over x.

IF x̲ ≥ 0 THEN
  RETURN [x̲², x̄²], where x̲² is computed with downward rounding and x̄² is computed with upward rounding.
ELSE IF x̄ ≤ 0 THEN
  RETURN [x̄², x̲²], where x̄² is computed with downward rounding and x̲² is computed with upward rounding.
ELSE
  1. Compute x̲² and x̄² with both downward and upward rounding; that is, compute machine representable numbers (x̲²)ℓ and (x̲²)u such that x̲² ∈ [(x̲²)ℓ, (x̲²)u], and machine representable numbers (x̄²)ℓ and (x̄²)u such that x̄² ∈ [(x̄²)ℓ, (x̄²)u].
  2. RETURN [0, max{(x̲²)u, (x̄²)u}].
END IF

END ALGORITHM 1.1.

⁵There are small differences in current definitions of extended interval arithmetic. For example, in some systems, −∞ and ∞ are not considered numbers, but just descriptive symbols. In those systems, [1, 2]/[−3, 4] = (−∞, −1/3] ∪ [1/4, ∞) = R \ (−1/3, 1/4). See [31] for a theoretical analysis of extended arithmetic.
⁶who also was a major contributor to the IEEE 754 standard

With Algorithm 1.1, and rewriting f(x) from (1.3) as in (1.5), we obtain

f([−2, 2]) = [−2, 2]² − 1 = [0, 4] − 1 = [−1, 3],

which in this case is equal to the exact range of f over [−2, 2].

In fact, this illustrates a general principle: if each variable in the expression occurs only once, then interval arithmetic gives the exact range, to within roundout error. We state this formally as follows.

THEOREM 1.8

(Fundamental theorem of interval arithmetic) Suppose f(x1, x2, ..., xn) is an algebraic expression in the variables x1 through xn (or a computer program with inputs x1 through xn), and suppose that this expression is evaluated with interval arithmetic. The algebraic expression or computer program can contain the four elementary operations and operations such as xⁿ, sin(x), exp(x), and log(x), etc., as long as the interval values of these functions contain their ranges over the input intervals. Then:

1. The interval value f(x1, ..., xn) contains the range of f over the interval vector (or box) (x1, ..., xn).

2. If the individual functions (the elementary operations and functions xⁿ, etc.) have interval values that represent their exact ranges, and if each variable xi, 1 ≤ i ≤ n, occurs only once in the expression for f, then the values of f obtained by interval arithmetic represent the exact ranges of f over the input intervals.
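Part (2) of Theorem 1.8 is what makes the single-occurrence form x² − 1 preferable to (x + 1)(x − 1). A sketch in Python (illustration only; isq is our hypothetical stand-in for the exact-range square of Algorithm 1.1, without the directed rounding a real implementation would add) shows the exact range being recovered:

```python
def isq(x):
    # Range of x**2 over an interval, with the three sign cases of
    # Algorithm 1.1 so that intervals straddling 0 are handled correctly.
    lo, hi = x
    if lo >= 0:
        return (lo*lo, hi*hi)
    if hi <= 0:
        return (hi*hi, lo*lo)
    return (0.0, max(lo*lo, hi*hi))

def isub_const(x, c):
    return (x[0] - c, x[1] - c)

fx = isub_const(isq((-2.0, 2.0)), 1.0)   # x**2 - 1 over [-2, 2]
print(fx)   # (-1.0, 3.0): the exact range, since x occurs only once
```

Contrast this with the factored form, whose interval evaluation gave the wider enclosure [−9, 3].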

If the expression for f contains one or more variables more than once, then overestimation of the range can occur, due to interval dependency. For example, when we evaluate our example function f([−2, 2]) according to (1.3), the first factor, [−1, 3], is the exact range of x + 1 for x ∈ [−2, 2], while the second factor, [−3, 1], is the exact range of x − 1 for x ∈ [−2, 2]. Thus, [−9, 3] is the exact range of f(x1, x2) = (x1 + 1)(x2 − 1) for x1 and x2 independent, x1 ∈ [−2, 2], x2 ∈ [−2, 2].

We now present some definitions and theorems to clarify the practical consequences of interval dependency.

DEFINITION 1.6 An expression for f(x1, ..., xn) which is written so that each variable occurs only once is called a single use expression, or SUE.

Fortunately, we do not need to transform every expression into a single use expression for interval computations to be of value. In particular, the interval dependency becomes less as the widths of the input intervals become smaller. The following formal definition will help us to describe this precisely.

DEFINITION 1.7 Suppose an interval evaluation f(x1, ..., xn) gives [a, b] as a result interval, but the exact range {f(x1, ..., xn) : xi ∈ xi, 1 ≤ i ≤ n} is [c, d] ⊆ [a, b]. We define the excess width E(f; x1, ..., xn) in the interval evaluation f(x1, ..., xn) by E(f; x1, ..., xn) = (c − a) + (b − d).

For example, the excess width in evaluating f(x) represented as (x + 1)(x − 1) over x = [−2, 2] is (−1 − (−9)) + (3 − 3) = 8. In general, we have the following.

THEOREM 1.9

Suppose f(x1, x2, ..., xn) is an algebraic expression in the variables x1 through xn (or a computer program with inputs x1 through xn), and suppose that this expression is evaluated with interval arithmetic, as in Theorem 1.8, to obtain an interval enclosure f(x1, ..., xn) to the range of f for xi ∈ xi, 1 ≤ i ≤ n. Then, if E(f; x1, ..., xn) is as in Definition 1.7, we have

E(f; x1, ..., xn) = O( max_{1≤i≤n} w(xi) ),

where w(x) denotes the width of the interval x.


That is, the overestimation becomes less as the uncertainty in the arguments to the function becomes smaller.

Interval evaluations as in Theorem 1.9 are termed first-order interval extensions. It is not difficult to obtain second-order extensions, where required. (See the exercises below.)

1.3.2 Application of Interval Arithmetic: Examples

We give one such example here.

Example 1.22

Using 4-digit decimal floating point arithmetic, compute an interval enclosure for the first two digits of e, and prove that these two digits are correct.

Solution: The fifth degree Taylor polynomial representation for e is

e = 1 + 1 + 1/2! + 1/3! + 1/4! + 1/5! + (1/6!) e^ξ

for some ξ ∈ [0, 1]. If we assume we know e < 3, and we assume we know eˣ is an increasing function of x, then the error term is bounded by

|(1/6!) e^ξ| ≤ 3/6! < 0.005,

so this fifth-degree polynomial representation should be adequate. We will evaluate each term with interval arithmetic, and we will replace e^ξ with [1, 3]. We obtain the following computation:

[1.000, 1.000] + [1.000, 1.000] → [2.000, 2.000]
[1.000, 1.000] / [2.000, 2.000] → [0.5000, 0.5000]
[2.000, 2.000] + [0.5000, 0.5000] → [2.500, 2.500]
[1.000, 1.000] / [6.000, 6.000] → [0.1666, 0.1667]
[2.500, 2.500] + [0.1666, 0.1667] → [2.666, 2.667]
[1.000, 1.000] / [24.00, 24.00] → [0.04166, 0.04167]
[2.666, 2.667] + [0.04166, 0.04167] → [2.707, 2.709]
[1.000, 1.000] / [120.0, 120.0] → [0.008333, 0.008334]
[2.707, 2.709] + [0.008333, 0.008334] → [2.715, 2.718]
[1.000, 1.000] / [720.0, 720.0] → [0.001388, 0.001389]
[0.001388, 0.001389] × [1, 3] → [0.001388, 0.004167]
[2.715, 2.718] + [0.001388, 0.004167] → [2.716, 2.723]

Since we used outward rounding in these computations, this constitutes a mathematical proof that e ∈ [2.716, 2.723].

Note:


1. These computations can be done automatically on a computer, as simply as evaluating the function in floating point arithmetic. We will explain some programming techniques for this in Chapter 6, Section 6.2.

2. The solution is illustrative. More sophisticated methods, such as argument reduction, would be used in practice to bound values of eˣ more accurately and with fewer operations.
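The hand computation in Example 1.22 can be mechanized. The sketch below redoes it in Python (illustration only; the helpers down, up, iadd, and idiv_n are ours, and one-ulp outward stepping in double precision replaces the 4-digit outward rounding of the worked example), producing a rigorous, much tighter enclosure of e:

```python
import math

def down(v): return math.nextafter(v, -math.inf)
def up(v):   return math.nextafter(v, math.inf)

def iadd(x, y):
    return (down(x[0] + y[0]), up(x[1] + y[1]))

def idiv_n(x, n):
    # divide a nonnegative interval by a positive integer, outwardly rounded
    return (down(x[0] / n), up(x[1] / n))

# Sum 1 + 1 + 1/2! + ... + 1/5!, then add the remainder (1/6!) e^xi
# with e^xi enclosed by [1, 3], all with outward rounding.
s = (1.0, 1.0)
term = (1.0, 1.0)
for n in range(1, 6):
    term = idiv_n(term, n)       # term encloses 1/n!
    s = iadd(s, term)
remainder = (down(1.0 / 720.0), up(3.0 / 720.0))
s = iadd(s, remainder)
print(s[0] < math.e < s[1])      # True: a rigorous enclosure of e
```

The enclosure's width is dominated by the [1, 3] bound on e^ξ, just as in the hand computation, not by roundout error.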

Proofs of the theorems, as well as greater detail, appear in various texts on interval arithmetic. A good book on interval arithmetic is R. E. Moore's classic text [27], although numerous more recent monographs and reviews are available. A World Wide Web search on the term "interval computations" will lead to some of these.

A general introduction to interval computations is [26]. That work gives not only a complete introduction, with numerous examples and explanation of pitfalls, but also provides examples with INTLAB, a free MATLAB toolbox for interval computations, and reference material for INTLAB. If you have MATLAB available, we recommend INTLAB for the exercises in this book involving interval computations.

1.4 Programming Environments

Modern scientific computing (with floating point numbers) is usually done with high-level "imperative" (as opposed to "functional") programming languages. Common programming environments in use for general scientific computing today are Fortran (for FORmula TRANslation), C/C++, and MATLAB. Fortran is the original such language, with its origins in the late 1950's. There is a large body of high-quality, publicly available software in Fortran for common computations in numerical analysis and scientific computing. Such software can be found, for example, on NETLIB, at

http://www.netlib.org

Fortran has evolved over the years, becoming a modern, multi-faceted language with the Fortran 2003 standard. Throughout, the emphasis by both the standardization committee and suppliers of compilers for Fortran has been on features that simplify programming of solutions to large problems in numerical analysis and scientific computing, and on features that enable high performance, especially on computers that can process vectors and matrices efficiently.

The "C" language, originally developed in conjunction with the Unix operating system, was originally meant to be a higher-level language for designing and accessing the operating system, but has become more ubiquitous since then. C++, appearing in the late 1980's, was the first widely available language⁷ to allow the object-oriented programming paradigm. In recent years, computer science departments have favored teaching C++ over teaching Fortran, and Fortran has fallen out of favor in relative terms. However, Fortran is still favored in certain large-scale applications, such as fluid dynamics (e.g., in weather prediction and similar simulations), and some courses are still offered in it in engineering schools. At the same time, some people still think of Fortran as the now somewhat rudimentary language known as FORTRAN 77.

Reasonably high-quality compilers for both Fortran and C/C++ are available free of charge with Linux operating systems. Fortran 2003, largely implemented in these compilers, has a standardized interface to C, so functions written in C can be called from Fortran programs, and vice versa. These compilers include interactive, graphical-user-interface-oriented debuggers, such as "insight," available with the Linux operating system. Commercially available compilation and debugging systems are also available under Windows.

The MATLAB system has become increasingly popular over the last two decades or so. MATLAB (for MATrix LABoratory) began in the early 1980's as a National Science Foundation project, written by Cleve Moler in FORTRAN 66, to provide an interactive environment for computing with matrices and vectors, but has since evolved to be both an interactive environment and a full-featured programming language. MATLAB is highly favored in courses such as this, because of the ease of programming, debugging, and general use (such as graphing), and because of the numerous toolboxes, supplied by both MathWorks (Cleve Moler's company) and others, for many basic computing tasks and applications. The main drawback to use of MATLAB in all scientific computations is that the language is interpretive; that is, MATLAB translates each line of a program to machine language each time that it executes the line. This makes complicated programs that involve nested iterations much slower (a factor of 60 or more) than comparable programs written in a compiled language such as C or Fortran. However, functions compiled from Fortran or C/C++ can be called from MATLAB. A common strategy has been to initially develop algorithms in MATLAB, then translate all or part of the program to a compilable language, as necessary for efficiency.

One perceived disadvantage of MATLAB is that it is proprietary. Undesirable possible consequences are that it is not free, and there is no official guarantee that it will be available forever, unchanged. However, its use has become so widespread in recent years that these concerns do not seem to be major. Several projects, including "Octave" and "Scilab," have produced free products that partially support the MATLAB programming language. The most widely distributed of these, "Octave," is integrated into Linux systems. However, the object-oriented features of Octave are rudimentary compared to those of MATLAB, and some toolboxes, such as INTLAB (which we will mention later), will not function with Octave.

⁷with others, including Fortran, to follow


Alternative systems sometimes used for scientific computing are computer algebra systems. Perhaps the most common of these are Mathematica and Maple, while a free such system under development is "SAGE." These systems admit a different way of thinking about programming, termed functional programming, in which rules are defined and available at all times to automatically simplify expressions that are presented to the system. (In contrast, in imperative programming, a sequence of commands is executed one after the other.) Although these systems have become comprehensive, they are based in computations of a different character, rather than in the floating point computations and linear algebra involved in numerical analysis and scientific computing.

We will use MATLAB in this book to illustrate the concepts, techniques, and applications. With newer versions of MATLAB, a student can study how to use the system and make programs largely by using the MATLAB help system. The first place to turn will be the "Getting Started" demos, which in newer versions are presented as videos. There are also many books devoted to the use of MATLAB. Furthermore, we will be giving examples throughout this book.

MATLAB programs can be written as MATLAB scripts and MATLAB functions.

Example 1.23

The MATLAB script we used to produce the table following Example 1.7 (on page 8) is:

a = 2;
x = 2;
xold = x;
err_old = 1;
for k = 0:10
    k
    x
    err = x - sqrt(2);
    err
    ratio = err/err_old^2
    err_old = err;
    x = x/2 + 1/x;
end
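The iteration in the script above is Newton's method for x² = 2, and the tabulated ratio err/err_old² settling near a constant exhibits its quadratic convergence. A Python rendering of the same loop (illustration only; Python is used here so the behavior can be checked outside MATLAB) is:

```python
import math

# Newton's method for x**2 = 2, written as x <- x/2 + 1/x, printing the
# same quantities the MATLAB script tabulates.
x = 2.0
err_old = 1.0
for k in range(6):
    err = x - math.sqrt(2)
    print(k, x, err, err / err_old**2)   # ratio ~ constant: quadratic convergence
    err_old = err
    x = x/2 + 1/x
print(abs(x - math.sqrt(2)))             # error at roundoff level after 6 steps
```

Six iterations suffice to reduce the error to the roundoff level of double precision, which is the behavior the printed table is meant to display.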

Example 1.24

The MATLAB script we used to produce the table in Example 1.8 (on page 9) is:

format long
a = 2;
x = 2;
xold = x;
err_old = 1;
for k = 0:25
    k
    x
    err = x - sqrt(2);
    err
    ratio = err/err_old
    err_old = err;
    x = x - x^2/3.5 + 2/3.5;
end

An excellent alternative textbook that focuses on MATLAB functions is Cleve Moler's Numerical Computing with MATLAB [25]. An on-line version, along with ".m" files, etc., is currently available at http://www.mathworks.com/moler/chapters.html

1.5 Applications

The purpose of the methods and techniques in this book, ultimately, is to provide both accurate predictions and insight into practical problems. This includes understanding, predicting, and managing or controlling the evolution of ecological systems and epidemics; designing and constructing durable but inexpensive bridges, buildings, roads, and water control structures; understanding chemical and physical processes; designing chemical plants and electronic components and systems; minimizing costs or maximizing delivery of products or services within companies and governments; etc. To achieve these goals, the numerical methods are a small part of the overall modeling process, which can be viewed as consisting of the following steps.

Identify the problem. This is the first step in translating an often vague situation into a mathematical problem to be solved. What questions must be answered, and how can they be quantified?

Assumptions. Which factors are to be ignored, and which are important? The real world is usually significantly more complicated than mathematical models of it, and simplifications must be made: because some factors are poorly understood, because there isn't enough data to determine some minor factors, or because it is not practical to accurately solve the resulting equations unless the model is simplified. For example, the theory of relativity, and variations in the acceleration of gravity due to the fact that the earth is not exactly round and due to the fact that the density differs from point to point on the surface of the earth, in principle will affect the trajectory of a baseball as it leaves the bat. However, such effects can be ignored when we write down a model of the trajectory of the baseball. On the other hand, we need to include such effects if we are measuring the change in distance between two satellites in a tandem orbit to detect the location of mineral deposits on the surface of the earth.

Construction. In this step, we actually translate the problem into mathematical language.

Analysis. We solve the mathematical problem. Here is where the numerical techniques in this book come into play. With more complicated models, there is an interplay between the previous three steps and this solution process: we may need to simplify the model to enable practical solution. Presentation of the result is also important here, to maximize the usefulness of the results. In the early days of scientific computing, printouts of numbers were used, but increasingly, results are presented as two- and three-dimensional graphs and movies.

Interpretation. The numerical solution is compared to the original problem. If it does not make sense, go back and reformulate the assumptions.

Validation. Compare the model to real data. For example, in climate models, the model might be used to "predict" climate changes in past years, before it is used to predict future climate changes.

Note that there is an intrinsic error introduced in the modeling process (such as when certain phenomena are ignored) that is outside the scope of our study of numerical methods. Such error can only be measured indirectly, through the interpretation and validation steps. In the model solution process (the "analysis" step), errors are also introduced due to roundoff error and the approximation process; we have seen that such error consists of approximation error and roundoff error. In a study of numerical methods and numerical analysis, we quantify and find bounds on such errors. Although this may not be the major source of error, it is important to know. Consequences of this type of error might be that a good model is rejected, that incorrect conclusions are deduced about the process being modeled, etc. The authors of this book have personal experience with these events.

Errors in the modeling process can sometimes be quantified in the solution process. If the model depends on parameters that are not known precisely, but bounds on those parameters are known, knowledge of these bounds can sometimes be incorporated into the mathematical equations, and the set of possible solutions can sometimes be computed or bounded. One tool that sometimes works is interval arithmetic. Other tools, less mathematically definite but applicable in different situations, are statistical methods and computing solutions to the model for many different values of the parameters.

Throughout this book, we introduce applications from many areas.

Example 1.25

The formula for the net capacitance when two capacitors of values x and y are connected in series is

z = xy / (x + y).

Suppose the measured values of x and y are x = 1 and y = 2, respectively. Estimate the range of possible values of z, given that the true values of x and y are known to be within ±10% of the measured values.

In this example, the identification, assumptions, and construction have already been done. (It is well known how capacitances in a linear electrical circuit behave.) We are asked to analyze the error in the output of the computation, due to errors in the data. We may proceed using interval arithmetic, relying on the accuracy assumptions for the measured values. In particular, these assumptions imply that x ∈ [0.9, 1.1] and y ∈ [1.8, 2.2]. We will plug these intervals into the expression for z, but we first use Theorem 1.8, part (2), as a guide to rewrite the expression for z so that x and y only occur once. (We do this so we obtain sharp bounds on the range, without overestimation.) Dividing the numerator and denominator for z by xy, we obtain

z = 1 / (1/x + 1/y).

We use the INTLAB toolbox⁸ for MATLAB to evaluate z. We have the following dialog in MATLAB's command window:

>> intvalinit('DisplayInfsup')
===> Default display of intervals by infimum/supremum
>> x = intval('[0.9,1.1]')
intval x =
[    0.8999,    1.1001]
>> y = intval('[1.8,2.2]')
intval y =
[    1.7999,    2.2001]
>> z = 1/(1/x + 1/y)
intval z =
[    0.5999,    0.7334]
>> format long
>> z
intval z =
[   0.59999999999999,   0.73333333333334]
>>

Thus, the capacitance must lie between 0.5999 and 0.7334. Note that x and y are input as strings. This is to assure that roundoff errors in converting the decimal expressions 0.9, 1.1, 1.8, and 2.2 into internal binary format are taken into account. See [26] for more examples of the use of INTLAB.

⁸If one has MATLAB, INTLAB is available free of charge for non-commercial use from http://www.ti3.tu-harburg.de/~rump/intlab
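Since z = 1/(1/x + 1/y) is a single use expression, the interval result should be the exact range to within roundout error. A brute-force check in Python (illustration only; sampling on a grid cannot prove a bound, but it can corroborate one) confirms that every sampled value of z lies inside the computed interval:

```python
# Sample x in [0.9, 1.1] and y in [1.8, 2.2] on 201-point grids and
# evaluate z = 1/(1/x + 1/y) at every pair.
vals = [1.0 / (1.0/(0.9 + 0.2*i/200) + 1.0/(1.8 + 0.4*j/200))
        for i in range(201) for j in range(201)]

lo, hi = 0.5999, 0.7334            # the enclosure reported above
print(lo < min(vals), max(vals) < hi)   # True True
print(min(vals), max(vals))             # near 0.6 and 0.7333, the exact range
```

The extreme values occur at the corners (0.9, 1.8) and (1.1, 2.2), since z is increasing in each argument; the exact range is [0.6, 11/15].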

1.6 Exercises

1. Write down a polynomial p(x) such that |S(x) − p(x)| ≤ 10⁻¹⁰ for −0.2 ≤ x ≤ 0.2, where

S(x) = sin(x)/x if x ≠ 0, and S(x) = 1 if x = 0.

Note: sinc(x) = S(πx) = sin(πx)/(πx) is the "sinc" function (well-known in signal processing, etc.).

(a) Show that your polynomial p satisfies the condition |S(x) − p(x)| ≤ 10⁻¹⁰ for x ∈ [−0.2, 0.2].
Hint: You can obtain polynomial approximations with error terms for S(x) by writing down Taylor polynomials and corresponding error terms for sin(x), then dividing these by x. This can be easier than trying to differentiate S(x). For the proof part, you can use, for example, the Taylor polynomial remainder formula or the alternating series test, and you can use interval arithmetic to obtain bounds.

(b) Plot your polynomial approximation and S(x) on the same graph,
(i) over the interval [−0.2, 0.2];
(ii) over the interval [−3, 3];
(iii) over the interval [−10, 10].

2. Suppose f has a continuous third derivative. Show that

| (f(x + h) − f(x − h)) / (2h) − f′(x) | = O(h²).

3. Suppose f has a continuous fourth derivative. Show that

| (f(x + h) − 2f(x) + f(x − h)) / h² − f″(x) | = O(h²).


4. Let a = 0.41, b = 0.36, and c = 0.7. Assuming a 2-digit decimal computer arithmetic with rounding, show that (a − b)/c ≠ a/c − b/c when using this arithmetic.

5. Write down a formula relating the unit roundoff δ of Definition 1.3 (page 14) and the machine epsilon εₘ defined on page 20.

6. Store and run the following MATLAB script. What are your results? What does the script compute? Can you say anything about the computer arithmetic underlying MATLAB?

eps = 1;
x = 1 + eps;
while (x ~= 1)
    eps = eps/2;
    x = 1 + eps;
end
eps = eps + (2*eps)^2;
y = 1 + eps;
y - 1

7. Suppose, for illustration, we have a system with base β = 10, t = 3 decimal digits in the mantissa, and L = −9, U = 9 for the exponent. For example, 0.123 × 10⁴, that is, 1230, is a machine number in this system. Suppose also that "round to nearest" is used in this system.

(a) What is HUGE for this system?

(b) What is TINY for this system?

(c) What is the machine epsilon εₘ for this system?

(d) Let f(x) = sin(x) + 1.

i. Write down fl(f(0)) and fl(f(0.0008)) in normalized format for this toy system.

ii. Compute fl(fl(f(0.0008)) − fl(f(0))). On the other hand, what is the nearest machine number to the exact value of f(0.0008) − f(0)?

iii. Compute fl(fl(f(0.0008)) − fl(f(0))) / fl(0.0008). Compare this to the nearest machine number to the exact value of (f(0.0008) − f(0))/0.0008, and to f′(0).

8. Let f(x) = ln(x + 1) − ln(x).

(a) Use four-digit decimal arithmetic with rounding to evaluate f(100,000).

(b) Use the Mean Value Theorem to approximate f(x) in a form that avoids the loss of significant digits. Use this form to evaluate f(x) for x = 100,000 once again.

(c) Compare the relative errors for the answers obtained in (a) and (b).

9. Compute the condition number of f(x) = e^√(x²−1), x > 1, and discuss any possible ill-conditioning.

10. Let f(x) = (sin(x))² + x². Use interval arithmetic to prove that there are no solutions to f(x) = 0 for x ∈ [−1, −0.8].


round up: fl(x) = x↑, the nearest machine number to the real number x that is greater than or equal to x;

round to nearest: fl(x) is the nearest machine number to the real number x;

round to zero, or "chopping": fl(x) is the nearest machine number to the real number x that is closer to 0 than x. The term "chopping" is because we simply "chop" the expansion of the real number; that is, we simply ignore the digits in the expansion of x beyond the t-th one.

The default on modern systems is usually round to nearest, although chopping is faster or requires less circuitry. Round down and round up may be used, but with care, to produce results from a string of computations that are guaranteed to be less than or greater than the exact result.

Example 1.11

Let β = 10, t = 5, and x = 0.12345666··· × 10⁷. Then

fl(x) = 0.12345 × 10⁷ (chopping),
fl(x) = 0.12346 × 10⁷ (round to nearest).

(In this case, round down corresponds to chopping, and round up corresponds to round to nearest.)
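Both rounding rules of Example 1.11 are easy to model. The Python sketch below (illustration only; fl is a hypothetical helper, and the intermediate arithmetic is double precision, which is accurate enough for this example) reproduces the two results:

```python
import math

def fl(x, t, mode):
    # Toy model of fl(x) for a base-10, t-digit system: write x as
    # m * 10**e with 0.1 <= |m| < 1, then keep t digits of m by
    # truncation ("chop") or by rounding to nearest ("round").
    e = math.floor(math.log10(abs(x))) + 1
    scaled = x / 10**e * 10**t
    d = math.trunc(scaled) if mode == "chop" else round(scaled)
    return d * 10.0**(e - t)

x = 0.12345666e7
print(fl(x, 5, "chop"))    # 1234500.0, i.e., 0.12345 x 10**7
print(fl(x, 5, "round"))   # 1234600.0, i.e., 0.12346 x 10**7
```

The same helper can be used to experiment with the toy systems in the exercises, such as the β = 10, t = 3 system of Exercise 7.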

See Figure 1.2 for an example with β = 10 and t = 1. In that figure, the exhibited floating point numbers are (0.1) × 10¹, (0.2) × 10¹, ..., (0.9) × 10¹, 0.1 × 10².

[FIGURE 1.2: An example floating point system: β = 10, t = 1, and m = 1. The figure marks the successive floating point numbers between β^(m−1) = 1 and β^m = 10¹, which are spaced β^(m−t) = 10⁰ = 1 apart.]

Example 1.12

Let a = 0.410, b = 0.000135, and c = 0.000431. Assuming 3-digit decimal computer arithmetic with rounding to nearest, does a + (b + c) = (a + b) + c when using this arithmetic?


Following the "rounding to nearest" definition of fl, we emulate the operations a machine would do, as follows:

a ← 0.410 × 10⁰, b ← 0.135 × 10⁻³, c ← 0.431 × 10⁻³,

and

fl(b + c) = fl(0.135 × 10⁻³ + 0.431 × 10⁻³) = fl(0.566 × 10⁻³) = 0.566 × 10⁻³,

so

fl(a + 0.566 × 10⁻³) = fl(0.410 × 10⁰ + 0.566 × 10⁻³)
                     = fl(0.410 × 10⁰ + 0.000566 × 10⁰)
                     = fl(0.410566 × 10⁰)
                     = 0.411 × 10⁰.

On the other hand,

fl(a + b) = fl(0.410 × 10⁰ + 0.135 × 10⁻³)
          = fl(0.410000 × 10⁰ + 0.000135 × 10⁰)
          = fl(0.410135 × 10⁰)
          = 0.410 × 10⁰,

so

fl(0.410 × 10⁰ + c) = fl(0.410 × 10⁰ + 0.431 × 10⁻³)
                    = fl(0.410 × 10⁰ + 0.000431 × 10⁰)
                    = fl(0.410431 × 10⁰)
                    = 0.410 × 10⁰ ≠ 0.411 × 10⁰.

Thus, the associative law of addition does not hold for floating point arithmetic with "round to nearest." Furthermore, this illustrates that accuracy is improved if numbers of like magnitude are added first in a sum.
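This computation is easy to reproduce. The sketch below (in Python rather than the book's matlab) rounds every intermediate result to 3 significant decimal digits with a small helper, fl3 (a name chosen here for illustration; string formatting rounds the decimal representation to nearest, which suffices for this example), and compares the two orders of summation:

```python
def fl3(x):
    """Round x to 3 significant decimal digits (round to nearest)."""
    return float(f"{x:.2e}")   # .2e => one digit before the point + two after

a, b, c = 0.410, 0.000135, 0.000431

left  = fl3(a + fl3(b + c))    # a + (b + c) in 3-digit arithmetic
right = fl3(fl3(a + b) + c)    # (a + b) + c in 3-digit arithmetic

print(left)    # 0.411
print(right)   # 0.41
```

Adding the two small, like-magnitude numbers b and c first preserves their contribution; adding either one to the large a first loses it entirely.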

The following error bound is useful in some analyses.

THEOREM 1.5

|x − fl(x)| ≤ (p/2) |x| β^{1−t},

where p = 1 for rounding and p = 2 for chopping.


DEFINITION 1.3  δ = (p/2) β^{1−t} is called the unit roundoff error.

Let ε = (fl(x) − x)/x. Then fl(x) = (1 + ε)x, where |ε| ≤ δ. With this, we have the following.

THEOREM 1.6

Let ⊙ denote one of the operations +, −, ×, or ÷, and let x and y be machine numbers. Then

fl(x ⊙ y) = (x ⊙ y)(1 + ε), where |ε| ≤ δ = (p/2) β^{1−t}.

Roundoff error that accumulates as a result of a sequence of arithmetic operations can be analyzed using this theorem. Such an analysis is called forward error analysis.

Because of the limited precision of floating point arithmetic, it is unreasonable to demand that a computed result satisfy a strict absolute-error tolerance when the exact result is large in magnitude.

Example 1.13

Suppose β = 10 and t = 3 (3-digit decimal arithmetic), and suppose we wish to compute 10⁴π with a computed value x such that |10⁴π − x| < 10⁻². The closest floating point number in our system to 10⁴π is x = 0.314 × 10⁵ = 31400. However, |10⁴π − x| = 15.926··· . Hence, it is impossible to find a number x in the system with

|10⁴π − x| < 10⁻².

The error |10⁴π − x| in this example is called the absolute error in approximating 10⁴π. We see that absolute error is not an appropriate measure of error when using floating point arithmetic. For this reason, we use relative error.

DEFINITION 1.4  Let x* be an approximation to x. Then |x − x*| is called the absolute error, and |(x − x*)/x| is called the relative error.

For example,

|(x − fl(x))/x| ≤ δ = (p/2) β^{1−t}  (the unit roundoff error).

1.2.1.1 Expression Evaluation and Condition Numbers

We now examine some common situations in which roundoff error can become large, and explain how to avoid many of these situations.


Example 1.14

Let β = 10, t = 4, and p = 1. (Thus, δ = (1/2)10⁻³ = 0.0005.) Let x = 0.5795 × 10⁵ and y = 0.6399 × 10⁵. Then

fl(x + y) = 0.1219 × 10⁶ = (x + y)(1 + ε₁), ε₁ ≈ −3.28 × 10⁻⁴, |ε₁| < δ,

and

fl(xy) = 0.3708 × 10¹⁰ = (xy)(1 + ε₂), ε₂ ≈ −5.95 × 10⁻⁵, |ε₂| < δ.

(Note: x + y = 0.12194 × 10⁶ and xy = 0.37082205 × 10¹⁰.)

Example 1.15

Suppose β = 10 and t = 4 (4-digit arithmetic), and suppose x₁ = 10000 and x₂ = x₃ = ··· = x₁₀₀₁ = 1. Then

fl(x₁ + x₂) = 10000,
fl(x₁ + x₂ + x₃) = 10000,
⋮
fl(x₁ + x₂ + ··· + x₁₀₀₁) = 10000,

when we sum forward, from x₁. But, going backwards,

fl(x₁₀₀₁ + x₁₀₀₀) = 2,
fl(x₁₀₀₁ + x₁₀₀₀ + x₉₉₉) = 3,
⋮
fl(x₁₀₀₁ + x₁₀₀₀ + ··· + x₁) = 11000,

which is the correct sum.

This example illustrates the point that large relative errors can occur when many small numbers are added to a large number, or when a very large number of small, almost-equal numbers is added together. To avoid such large relative errors, one can sum from the smallest number to the largest number. However, this will not work if the numbers are all approximately equal. In such cases, one possibility is to group the numbers into sets of two adjacent numbers, summing two almost equal numbers together. One then groups those results into sets of two and sums these together, continuing until the total sum is reached. In this scheme, two almost equal numbers are always being summed, and the large relative error from repeatedly summing a small number to a large number is avoided.
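The grouping scheme just described is commonly known as pairwise summation. A minimal recursive sketch in Python (rather than the book's matlab):

```python
def pairwise_sum(xs):
    """Sum a list by recursively splitting it in half, so that numbers of
    comparable magnitude tend to be added together."""
    n = len(xs)
    if n == 0:
        return 0.0
    if n == 1:
        return xs[0]
    mid = n // 2
    return pairwise_sum(xs[:mid]) + pairwise_sum(xs[mid:])

# With exact arithmetic the answer is 11000; naive forward summation in the
# 4-digit decimal arithmetic of Example 1.15 would return 10000.
values = [10000.0] + [1.0] * 1000
print(pairwise_sum(values))   # 11000.0 (exact here, since all values are integers)
```

In IEEE double precision this particular sum is exact either way; the pairwise order matters when the working precision is small relative to the number of terms, as in the 4-digit example.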


Example 1.16

Let x₁ = 15.314768 and x₂ = 15.314899, with β = 10 and t = 6 (6-digit decimal accuracy). Then x₂ − x₁ ≈ fl(x₂) − fl(x₁) = 15.3149 − 15.3148 = 0.0001. Thus,

|(x₂ − x₁) − (fl(x₂) − fl(x₁))| / |x₂ − x₁| = (0.000131 − 0.0001)/0.000131 ≈ 0.237,

that is, a relative error of 23.7%.

This example illustrates that large relative errors can occur when two nearly equal numbers are subtracted on a computer. Sometimes, an algorithm can be modified to reduce rounding error occurring from this source, as the following example illustrates.

Example 1.17

Consider finding the roots of ax² + bx + c = 0, where b² is large compared with |4ac|. The most common formula for the roots is

x_{1,2} = (−b ± √(b² − 4ac)) / (2a).

Consider x² + 100x + 1 = 0, with β = 10, t = 4, and p = 2 (4-digit chopped arithmetic). Then

x₁ = (−100 + √(10⁴ − 4))/2,  x₂ = (−100 − √(10⁴ − 4))/2,

and √9996 ≈ 99.97 (4-digit chopped arithmetic). Thus,

x₁ ≈ (−100 + 99.97)/2 = −0.015,  x₂ ≈ (−100 − 99.97)/2 ≈ −99.98.

However, x₁ = −0.010001··· and x₂ = −99.989999···, so the relative errors in the computed x₁ and x₂ are about 50% and 0.01%, respectively.

Let's change the algorithm. Assume b ≥ 0 (we can always arrange the equation so that b ≥ 0). Then

x₁ = (−b + √(b² − 4ac))/(2a)
   = ((−b + √(b² − 4ac))/(2a)) · ((−b − √(b² − 4ac))/(−b − √(b² − 4ac)))
   = −2c / (b + √(b² − 4ac)),

and

x₂ = (−b − √(b² − 4ac))/(2a)  (the same as before).

Then, for the above values,

x₁ = −2(1)/(100 + √9996) ≈ −2/(100 + 99.97) = −0.01000.

Now the relative error in x₁ is also about 0.01%.
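The rearrangement in Example 1.17 is the standard remedy for cancellation in the quadratic formula. A sketch in Python (using double precision rather than the 4-digit arithmetic of the example; stable_roots is a name chosen here):

```python
import math

def stable_roots(a, b, c):
    """Real roots of a*x**2 + b*x + c = 0 (real roots assumed), computed
    without subtracting nearly equal quantities when b*b >> |4*a*c|."""
    d = math.sqrt(b * b - 4.0 * a * c)
    if b >= 0.0:
        x2 = (-b - d) / (2.0 * a)      # both terms have the same sign
        x1 = (-2.0 * c) / (b + d)      # the root the naive formula ruins
    else:
        x1 = (-b + d) / (2.0 * a)
        x2 = (2.0 * c) / (-b + d)      # same rationalization, mirrored
    return x1, x2

x1, x2 = stable_roots(1.0, 100.0, 1.0)
print(x1, x2)    # x1 ~ -0.0100010002, x2 ~ -99.9899989998
```

A quick sanity check is that the computed roots satisfy the exact identities x₁x₂ = c/a and x₁ + x₂ = −b/a to nearly full precision.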

Let us now consider error in function evaluation. Consider a single-valued function f(x), and let x* = fl(x) be the floating point approximation of x. The machine therefore evaluates f(x*) = f(fl(x)), which is an approximation to f(x). The perturbation in f(x) for small perturbations in x can be computed via Taylor's formula. This is illustrated in the next theorem.

THEOREM 1.7

The relative error in function evaluation satisfies

|(f(x) − f(x*))/f(x)| ≈ |x f′(x)/f(x)| · |(x − x*)/x|.

PROOF  The linear Taylor approximation of f(x*) about x, for small values of |x − x*|, is given by f(x*) ≈ f(x) + f′(x)(x* − x). Rearranging the terms immediately yields the result.

This leads us to the following definition.

DEFINITION 1.5  The condition number of a function f(x) is

κ_f(x) = |x f′(x) / f(x)|.

The condition number describes how large the relative error in function evaluation is with respect to the relative error in the machine representation of x. In other words, κ_f(x) is a measure of the degree of sensitivity of the function at x.

Example 1.18

Let f(x) = √x. The condition number of f(x) about x is

κ_f(x) = |x · (1/(2√x)) / √x| = 1/2.

This suggests that f(x) is well-conditioned.


Example 1.19

Let f(x) = √(x − 2). The condition number of f(x) about x is

κ_f(x) = |x / (2(x − 2))|.

This is not defined at x* = 2. Hence, the function f(x) is numerically unstable and ill-conditioned for values of x close to 2.
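The two condition numbers above are easy to check numerically. The sketch below (Python; cond_number is a hypothetical helper using a central-difference approximation to f′) evaluates κ_f for f(x) = √x and f(x) = √(x − 2):

```python
import math

def cond_number(f, x, h=1e-6):
    """Approximate condition number |x * f'(x) / f(x)| via central differences."""
    fprime = (f(x + h) - f(x - h)) / (2.0 * h)
    return abs(x * fprime / f(x))

print(cond_number(math.sqrt, 100.0))                    # ~0.5 for any x > 0
print(cond_number(lambda x: math.sqrt(x - 2.0), 2.001)) # ~1000, blows up near x = 2
```

The second value matches |x/(2(x − 2))| ≈ 1000.5 at x = 2.001, confirming the ill-conditioning near x = 2.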

REMARK 1.2  If x = f(x) = 0, then the condition number is simply |f′(x)|. If x = 0, f(x) ≠ 0 (or f(x) = 0, x ≠ 0), then it is more useful to consider the relation between the absolute errors than the relative errors; the condition number then becomes |f′(x)/f(x)| in the first case and |x f′(x)| in the second.

REMARK 1.3  Generally, if a numerical approximation z̃ to a quantity z is computed, the relative error is related to the number of significant digits that are correct. For example, if z = 0.0000123453 and z̃ = 0.00001234543, we say that z̃ is correct to 5 significant digits. Expressing z as 0.123453 × 10⁻⁴ and z̃ as 0.1234543 × 10⁻⁴, we see that if we round z̃ to the nearest number with five digits in its mantissa, all of those digits are correct, whereas if we do the same with six digits, the sixth digit is not correct. Significant digits is the more logical way to talk about accuracy in a floating point computation, where we are interested in relative error, rather than "number of digits after the decimal point," which can have a different meaning. (Here, one might say that z̃ is correct to 9 digits after the decimal point.)

1.2.2 Practicalities and the IEEE Floating Point Standard

Prior to 1985, different machines used different word lengths and different bases, and different machines rounded, chopped, or did something else to form the internal representation fl(x) for real numbers x. For example, IBM mainframes generally used hexadecimal arithmetic (β = 16), with 8 hexadecimal digits total (for the sign, exponent, and mantissa) in "single precision" numbers and 16 hexadecimal digits total in "double precision" numbers. Machines such as the Univac 1108 and Honeywell Multics systems used base β = 2, with 36 binary digits (or "bits") total in single precision numbers and 72 binary digits total in double precision numbers. An unusual machine designed at Moscow State University from 1955 to 1965, the "Setun," even used base-3 (β = 3, or "ternary") numbers. Some computers had 32 bits total in single precision numbers and 64 bits total in double precision numbers, while some "supercomputers" (such as the Cray-1) had 64 bits total in single precision numbers and 128 bits total in double precision numbers.

Some hand-held calculators in existence today (such as some Texas Instruments calculators) can be viewed as implementing decimal (base 10, β = 10) arithmetic, say, with L = −999, U = 999, and t = 14 digits in the mantissa.

Except for the Setun (the value of whose ternary digits corresponded to "positive," "negative," and "neutral" in circuit elements or switches), digital computers are mostly based on binary switches or circuit elements (that is, "on" or "off"), so the base β is usually 2 or a power of 2. For example, the IBM hexadecimal digit could be viewed as a group of 4 binary digits.²

Older floating point implementations did not even always fit exactly into the model we have previously described. For example, if x was a number in the system, then −x may not have been a number in the system; or, if x was a number in the system, then 1/x may have been too large to be representable in the system.

To promote predictability, portability, reliability, and rigorous error bounding in floating point computations, the Institute of Electrical and Electronics Engineers (IEEE) and American National Standards Institute (ANSI) published a standard for binary floating point arithmetic in 1985: IEEE/ANSI 754-1985, Standard for Binary Floating Point Arithmetic, often referenced as "IEEE-754" or simply "the IEEE standard.³" Almost all computers in existence today, including personal computers and workstations based on Intel, AMD, Motorola, etc. chips, implement most of the IEEE standard.

In this standard, β = 2; 32 bits total are used in a single precision number (an "IEEE single"), and 64 bits total are used for a double precision number (an "IEEE double"). In a single precision number, 1 bit is used for the sign, 8 bits are used for the exponent, and t = 23 bits are used for the mantissa. In double precision numbers, 1 bit is used for the sign, 11 bits are used for the exponent, and 52 bits are used for the mantissa. Thus, for single precision numbers, the stored exponent is between 0 and (11111111)₂ = 255, and 127 is subtracted from this to get an exponent between −127 and 128. In IEEE numbers, the minimum and maximum exponents are used to denote special symbols (such as infinity and "unnormalized" numbers), so the exponent in single precision represents magnitudes between 2⁻¹²⁶ ≈ 10⁻³⁸ and 2¹²⁷ ≈ 10³⁸. The mantissa for single precision numbers represents numbers between 2⁰ = 1 and Σ_{i=0}^{23} 2⁻ⁱ = 2(1 − 2⁻²⁴) ≈ 2. Similarly, the exponent for double precision numbers effectively gives magnitudes between 2⁻¹⁰²² ≈ 10⁻³⁰⁸ and 2¹⁰²³ ≈ 10³⁰⁸, while the mantissa for double precision numbers represents numbers between 2⁰ = 1 and Σ_{i=0}^{52} 2⁻ⁱ ≈ 2.
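The bit layout just described can be inspected directly. A Python sketch (double_fields is a name chosen here) unpacks the sign, stored exponent, and mantissa fields of an IEEE double, whose exponent bias is 1023, analogous to the bias of 127 for singles:

```python
import struct

def double_fields(x):
    """Return (sign, stored exponent, mantissa) bit fields of an IEEE double."""
    bits, = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF      # 11-bit stored exponent, bias 1023
    mantissa = bits & ((1 << 52) - 1)    # 52-bit fraction field
    return sign, exponent, mantissa

s, e, m = double_fields(1.0)
print(s, e - 1023, m)    # 0 0 0 : 1.0 = +1.0 * 2**0
```

The stored exponents 0 and 2047 are the reserved values mentioned above, used for denormalized numbers, zeros, infinities, and NaN.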

Summarizing, the parameters for IEEE arithmetic appear in Table 1.1.

In many numerical computations, such as solving the large linear systems arising from partial differential equation models, more digits or a larger exponent range is required than is available with IEEE single precision. For

²An exception is in some systems for business calculations, where base 10 is implemented.
³An update to the 1985 standard was made in 2008. This update gives clarifications of certain ambiguous points, provides certain extensions, and specifies a standard for decimal arithmetic.


TABLE 1.1: Parameters for IEEE arithmetic

precision   β    L       U      t
single      2    −126    127    24
double      2    −1022   1023   53

this reason, many numerical analysts at present have adopted IEEE double precision as the default precision. For example, underlying computations in the popular computational environment matlab are done in IEEE double precision.

IEEE arithmetic provides four ways of defining fl(x), that is, four "rounding modes": "round down," "round up," "round to nearest," and "round to zero," specified as above. The four elementary operations +, −, ×, and ÷ must be such that fl(x ⊙ y) is implemented for all four rounding modes, for ⊙ ∈ {+, −, ×, ÷}; the same is required of √· .

The default mode (if the rounding mode is not explicitly set) is normally "round to nearest," to give an approximation, after a long string of computations, that is hopefully near the exact value. If the mode is set to "round down" and a string of computations is done, then the result is less than or equal to the exact result. Similarly, if the mode is set to "round up," then the result of a string of computations is greater than or equal to the exact result. In this way, mathematically rigorous bounds on an exact result can be obtained. (This technique must be used astutely, since naive use could result in bounds that are too large to be meaningful.)

Several parameters more directly related to numerical computations than L, U, and t are associated with any floating point number system. These are:

HUGE: the largest representable number in the floating point system;

TINY: the smallest positive representable number in the floating point system;

ε_m: the machine epsilon, the smallest positive number which, when added to 1, gives something other than 1 when using the rounding mode "round to nearest."

These so-called "machine constants" appear in Table 1.2 for the IEEE single and IEEE double precision number systems.

For IEEE arithmetic, 1/TINY < HUGE, but 1/HUGE < TINY. This brings up the question of what happens when the result of a computation has absolute value less than the smallest number representable in the system, or has absolute value greater than the largest number representable in the system. In the first case, an underflow occurs, while in the second case, an overflow occurs. In floating point computations, it is usually (but not always) reasonable to replace the result of an underflow by 0, but it is usually more problematical


TABLE 1.2: Machine constants for IEEE arithmetic

Precision   HUGE                      TINY                        ε_m
single      2¹²⁷ ≈ 3.40 · 10³⁸        2⁻¹²⁶ ≈ 1.18 · 10⁻³⁸        2⁻²⁴ + 2⁻⁴⁷ ≈ 5.96 · 10⁻⁸
double      2¹⁰²³ ≈ 1.79 · 10³⁰⁸      2⁻¹⁰²² ≈ 2.23 · 10⁻³⁰⁸      2⁻⁵³ + 2⁻¹⁰⁵ ≈ 1.11 · 10⁻¹⁶

when an overflow occurs. Many systems prior to the IEEE standard replaced the result of an underflow by 0, but stopped when an overflow occurred.

The IEEE standard specifies representations for the special quantities ∞, −∞, +0, −0, and NaN, where the latter represents "not a number." The standard specifies that computations do not stop when an overflow or underflow occurs, or when quantities such as √−1, 1/0, or −1/0 are encountered (although many programming languages, by default or optionally, do stop). For example, the result of an overflow is set to ∞, whereas the result of √−1 is set to NaN, and computation continues. The standard also specifies "gradual underflow," that is, setting the result to a "denormalized" number, a number in the floating point format whose first digit in the mantissa is equal to 0. Computation rules for these special numbers, such as NaN × (any number) = NaN and ∞ × (any positive normalized number) = ∞, allow such "nonstop" arithmetic.

Although IEEE nonstop arithmetic is useful in many contexts, the numerical analyst should be aware of it and be cautious in interpreting results. In particular, algorithms may not behave as expected if many intermediate results contain ∞ or NaN, and the accuracy is less than expected when denormalized numbers are used. In fact, many programming languages, by default or with a controllable option, stop if ∞ or NaN occurs, but implement IEEE nonstop arithmetic as an option.

Example 1.20

IEEE double precision floating point arithmetic underlies most computations in matlab. (This is true even if only four decimal digits are displayed.) One obtains the machine epsilon with the function eps, TINY with the function realmin, and HUGE with the function realmax. Observe the following matlab dialog:

>> epsm = eps(1d0)
epsm = 2.2204e-016
>> TINY = realmin
TINY = 2.2251e-308
>> HUGE = realmax
HUGE = 1.7977e+308
>> 1/TINY
ans = 4.4942e+307
>> 1/HUGE
ans = 5.5627e-309
>> HUGE^2
ans = Inf
>> TINY^2
ans = 0
>> new_val = 1+epsm
new_val = 1.0000
>> new_val - 1
ans = 2.2204e-016
>> too_small = epsm/2
too_small = 1.1102e-016
>> not_new = 1+too_small
not_new = 1
>> not_new - 1
ans = 0
>>

Example 1.21

(Illustration of underflow and overflow.) Suppose, for the purposes of illustration, we have a system with β = 10, t = 2, and one digit in the exponent, so that the positive numbers in the system range from 0.10 × 10⁻⁹ to 0.99 × 10⁹, and suppose we wish to compute N = √(x₁² + x₂²), where x₁ = x₂ = 10⁶. Then both x₁ and x₂ are exactly represented in the system, and the nearest floating point number in the system to N is 0.14 × 10⁷, well within range. However, x₁² = 10¹², larger than the maximum floating point number in the system.

In older systems, an overflow usually would result in stopping the computation, while in IEEE arithmetic, the result would be assigned the symbol "Infinity." The result of adding "Infinity" to "Infinity," then taking the square root, would be "Infinity," so N would be assigned "Infinity." Similarly, if x₁ = x₂ = 10⁻⁶, then x₁² = 10⁻¹², smaller than the smallest representable machine number, causing an "underflow." On older systems, the result was usually set to 0. On IEEE systems, if "gradual underflow" is switched on, the result either becomes a denormalized number, with less than full accuracy, or is set to 0; without gradual underflow, the result is set to 0. When the result is set to 0, a value of 0 is stored in N, whereas the closest floating point number in the system to the true value is 0.14 × 10⁻⁵, well within range. To avoid this type of catastrophic underflow and overflow in the computation of N, we may use the following scheme:

1. s ← max{|x₁|, |x₂|}.
2. η₁ ← x₁/s; η₂ ← x₂/s.
3. N ← s√(η₁² + η₂²).
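The three-step scaling scheme translates directly into code. A Python sketch (scaled_norm is a name chosen here; Python's own math.hypot applies similar safeguards internally):

```python
import math

def scaled_norm(x1, x2):
    """Compute sqrt(x1**2 + x2**2) without spurious overflow or underflow,
    by factoring out the larger magnitude before squaring."""
    s = max(abs(x1), abs(x2))
    if s == 0.0:
        return 0.0
    eta1, eta2 = x1 / s, x2 / s          # both now have magnitude <= 1
    return s * math.sqrt(eta1 * eta1 + eta2 * eta2)

print(scaled_norm(1e200, 1e200))   # ~1.414e200; naive x1**2 + x2**2 overflows to inf
```

Since |η₁|, |η₂| ≤ 1, the squares stay in range, and the single multiplication by s at the end cannot overflow unless the true answer itself is out of range.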


1.2.2.1 Input and Output

For examining the output of large numerical computations arising from mathematical models, plots, graphs, and movies comprised of such plots and graphs are often preferred over tables of values. However, to develop such models and study numerical algorithms, it is necessary to examine individual numbers. Because humans are trained to comprehend decimal numbers more easily than binary numbers, the binary format used in the machine is usually converted to a decimal format for display or printing. In many programming languages and environments (such as all versions of Fortran, C, C++, and in matlab), the format is of a form similar to ±d₁.d₂d₃···d_m e±δ₁δ₂δ₃ or ±d₁.d₂d₃···d_m E±δ₁δ₂δ₃, where the "e" or "E" denotes the "exponent" of 10. For example, -1.00e+003 denotes −1 × 10³ = −1000. Numbers are usually also input either in a standard decimal form (such as 0.001) or in this exponential format (such as 1.0e-3). (This notation originates from the earliest computers, where the only output was a printer, and the printer could only print numerical digits and the 26 upper-case letters of the Roman alphabet.)

Thus, for input, a decimal fraction needs to be converted to a binary floating point number, while, for output, a binary floating point number needs to be converted to a decimal fraction. This conversion necessarily is inexact. For example, the exact decimal fraction 0.1 converts to the infinitely repeating binary expansion (0.000110011 0011 ···)₂, which needs to be rounded into the binary floating point system. The IEEE 754 standard specifies that the result of a decimal-to-binary conversion, within a specified range of input formats, be the nearest floating point number to the exact result, and that, within a specified range of formats, a binary-to-decimal conversion be the nearest number in the specified format (which depends on the number m of decimal digits requested to be printed).
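The inexactness of the conversion of 0.1 is easy to exhibit in Python, since Decimal(x) displays the exact value of the IEEE double nearest a given decimal literal:

```python
from decimal import Decimal

# The stored double is the rounded binary value, not exactly one tenth:
print(Decimal(0.1))      # 0.1000000000000000055511151231257827021181583404541015625
print(0.1 + 0.2 == 0.3)  # False: each constant is separately rounded to binary
print(f"{0.1:.17g}")     # 0.10000000000000001
```

Printing with the default precision hides this: repr(0.1) shows "0.1" because that is the shortest decimal string that converts back to the same double.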

Thus, the number that one sees as output is usually not exactly the number that is represented in the computer. Furthermore, while the floating point operations on binary numbers are usually implemented in hardware or "firmware," independently of the software system, the decimal-to-binary and binary-to-decimal conversions are usually implemented separately, as part of the programming language (such as Fortran, C, C++, Java, etc.) or software system (such as matlab). The individual standards for these languages, if there are any, may not specify accuracy for such conversions, and the languages sometimes do not conform to the IEEE standard. That is, the number that one sees printed may not even be the closest number in that format to the actual number.

This inexactness in conversion usually does not cause a problem, but it may cause much confusion in certain instances. In those instances (such as in "debugging," or finding programming blunders), one may need to examine the binary numbers directly. One way of doing this is in an "octal," or base-8, format, in which each digit (between 0 and 7) is interpreted as a group of three binary digits, or in hexadecimal format (where the digits are 0–9, A, B, C, D, E, F), in which each digit corresponds to a group of four binary digits.

1.2.2.2 Standard Functions

To enable accurate computation of elementary functions such as sin, cos, and exp, IEEE 754 specifies that a "long" 80-bit register (with "guard digits") be available for intermediate computations. Furthermore, IEEE 754-2008, an official update to IEEE 754-1985, provides a list of functions it recommends be implemented, and specifies accuracy requirements (in terms of correct rounding) for those functions a programming language elects to implement.

REMARK 1.4  Alternative number systems, such as variable precision arithmetic, multiple precision arithmetic, rational arithmetic, and combinations of approximate and symbolic arithmetic, have been investigated and implemented. These have various advantages over the traditional floating point arithmetic we have been discussing, but also have disadvantages, and usually require more time, more circuitry, or both. Eventually, with the advance of computer hardware and better understanding of these alternative systems, their use may become more ubiquitous. However, for the foreseeable future, traditional floating point number systems will be the primary tool in numerical computations.

1.3 Interval Computations

Interval computations are useful for two main purposes:

• to use floating point computations to compute mathematically rigorous bounds on an exact result (and hence to rigorously bound roundoff error);

• to use floating point computations to compute mathematically rigorous bounds on the ranges of functions over boxes.

In complicated traditional floating point algorithms, naive arrangement of interval computations usually gives bounds that are too wide to be of practical use. For this reason, interval computations have been ignored by many. However, used cleverly and where appropriate, interval computations are powerful, and provide rigor and validation when other techniques cannot.

Interval computations are based on interval arithmetic.


1.3.1 Interval Arithmetic

In interval arithmetic, we define operations on intervals, which can be considered as ordered pairs of real numbers. We can think of each interval as representing the range of possible values of a quantity. The result of an operation is then an interval representing the range of all possible results of the operation, as the first argument ranges over all points in the first interval and the second argument ranges over all points in the second interval. To state this symbolically, let x = [x̲, x̄] and y = [y̲, ȳ], and define the four elementary operations by

x ⊙ y = {x ⊙ y | x ∈ x and y ∈ y} for ⊙ ∈ {+, −, ×, ÷}.   (1.1)

Interval arithmetic's usefulness derives from the fact that the mathematical characterization in Equation (1.1) is equivalent to the following operational definitions:

x + y = [x̲ + y̲, x̄ + ȳ],
x − y = [x̲ − ȳ, x̄ − y̲],
x × y = [min{x̲y̲, x̲ȳ, x̄y̲, x̄ȳ}, max{x̲y̲, x̲ȳ, x̄y̲, x̄ȳ}],
1/x = [1/x̄, 1/x̲] if x̲ > 0 or x̄ < 0,
x ÷ y = x × (1/y).   (1.2)

The ranges of the four elementary interval arithmetic operations are exactly the ranges of the corresponding real operations, but if such operations are composed, bounds on the ranges of real functions can be obtained. For example, if

f(x) = (x + 1)(x − 1),   (1.3)

then

f([−2, 2]) = ([−2, 2] + 1)([−2, 2] − 1) = [−1, 3] × [−3, 1] = [−9, 3],

which contains the exact range [−1, 3].
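The operational definitions in (1.2) are only a few lines of code. A minimal Python sketch (a hypothetical Interval class, without the outward rounding discussed below) reproduces the computation for f(x) = (x + 1)(x − 1):

```python
class Interval:
    """Closed interval [lo, hi] with elementary operations as in (1.2)."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        other = _as_interval(other)
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        other = _as_interval(other)
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        other = _as_interval(other)
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"

def _as_interval(x):
    """Treat a bare number as the degenerate interval [x, x]."""
    return x if isinstance(x, Interval) else Interval(x, x)

x = Interval(-2, 2)
print((x + 1) * (x - 1))    # [-9, 3], which contains the exact range [-1, 3]
```

The overestimation here comes from the two occurrences of x in the expression, as discussed under interval dependency below.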

REMARK 1.5  In some definitions of interval arithmetic, division by intervals containing 0 is defined, consistent with (1.1). For example,

[1, 2] / [−3, 4] = (−∞, −1/3] ∪ [1/4, ∞) = R* \ (−1/3, 1/4),

where R* is the extended real number system,⁴ consisting of the real numbers with the two additional elements −∞ and ∞. This extended interval arithmetic⁵ was originally invented by William Kahan⁶ for computations with continued fractions, but has wider use than that. Although a closed system can be defined for the sets arising from this extended arithmetic, typically the complements of intervals (i.e., the unions of two semi-infinite intervals) are immediately intersected with intervals to obtain zero, one, or two intervals. Interval arithmetic can then proceed using (1.2).

⁴also known as the two-point compactification of the real numbers

The power of interval arithmetic lies in its implementation on computers. In particular, outwardly rounded interval arithmetic allows rigorous enclosures for the ranges of operations and functions. This makes a qualitative difference in scientific computations, since the results are now intervals in which the exact result must lie. It also enables the use of floating point computations for automated theorem proving.

Outward rounding can be implemented on any machine that has downward rounding and upward rounding, such as any machine that complies with the IEEE 754 standard. For example, take x + y = [x̲ + y̲, x̄ + ȳ]. If x̲ + y̲ is computed with downward rounding, and x̄ + ȳ is computed with upward rounding, then the resulting interval z = [z̲, z̄] that is represented in the machine must contain the exact range of x + y for x ∈ x and y ∈ y. We call the expansion of the interval from rounding the lower end point down and the upper end point up roundout error.
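Where the hardware rounding mode is not conveniently accessible, a simple conservative substitute for directed rounding is to widen each computed end point by one unit in the last place, since a round-to-nearest result lies within one ulp of the exact value. A Python sketch of interval addition with this outward widening (math.nextafter requires Python 3.9 or later):

```python
import math

def add_outward(x, y):
    """Enclosure of the sum of intervals x and y (given as (lo, hi) pairs):
    round the lower end point down and the upper end point up by one ulp,
    so the exact sum is guaranteed to be contained."""
    lo = math.nextafter(x[0] + y[0], -math.inf)   # next double toward -inf
    hi = math.nextafter(x[1] + y[1], math.inf)    # next double toward +inf
    return (lo, hi)

z = add_outward((0.1, 0.2), (0.3, 0.4))
print(z)   # one ulp wider than the naively computed end points
```

This is wider than true directed rounding by up to half an ulp per end point, but it preserves the essential guarantee: the exact result lies inside the machine interval.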

Interval arithmetic is only subdistributive. That is, if x, y, and z are intervals, then

x(y + z) ⊆ xy + xz, but x(y + z) ≠ xy + xz in general.   (1.4)

As a result, algebraic expressions that would be equivalent if real values were substituted for the variables are not equivalent if interval values are used. For example, suppose that, instead of writing (x + 1)(x − 1) for f(x) as in (1.3), we write

f(x) = x² − 1,   (1.5)

and suppose we provide a routine that computes an enclosure for the range of x² that is the exact range to within roundoff error. Such a routine could be as follows.

ALGORITHM 1.1

(Computing an interval whose end points are machine numbers and which encloses the range of x²)

INPUT: x = [x̲, x̄].

OUTPUT: a machine-representable interval that contains the range of x² for x ∈ x.

IF x̲ ≥ 0 THEN
  RETURN [x̲², x̄²], where x̲² is computed with downward rounding and x̄² is computed with upward rounding.
ELSE IF x̄ ≤ 0 THEN
  RETURN [x̄², x̲²], where x̄² is computed with downward rounding and x̲² is computed with upward rounding.
ELSE
  1. Compute x̲² and x̄² with both downward and upward rounding; that is, compute (x̲²)_l and (x̲²)_u such that (x̲²)_l and (x̲²)_u are machine-representable numbers and x̲² ∈ [(x̲²)_l, (x̲²)_u], and compute (x̄²)_l and (x̄²)_u such that (x̄²)_l and (x̄²)_u are machine-representable numbers and x̄² ∈ [(x̄²)_l, (x̄²)_u].
  2. RETURN [0, max{(x̲²)_u, (x̄²)_u}].
END IF

END ALGORITHM 1.1
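Algorithm 1.1 translates directly into Python, again using the one-ulp outward widening via math.nextafter (Python 3.9+) as a stand-in for hardware directed rounding:

```python
import math

def down(v):
    return math.nextafter(v, -math.inf)   # widen result downward one ulp

def up(v):
    return math.nextafter(v, math.inf)    # widen result upward one ulp

def interval_square(lo, hi):
    """Machine-representable enclosure of the range of x**2 over [lo, hi]."""
    if lo >= 0.0:
        return (down(lo * lo), up(hi * hi))
    if hi <= 0.0:
        return (down(hi * hi), up(lo * lo))
    # 0 is interior to the interval, so the minimum of x**2 is exactly 0
    return (0.0, max(up(lo * lo), up(hi * hi)))

print(interval_square(-2.0, 2.0))   # about (0.0, 4.0...), enclosing [0, 4]
```

Note the middle case: because the square reverses the order of negative end points, the enclosure for an interval entirely below zero is built from [x̄², x̲²], exactly as in the ELSE IF branch of the algorithm.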

With Algorithm 1.1, and rewriting f(x) from (1.3) as in (1.5), we obtain

f([−2, 2]) = [−2, 2]² − 1 = [0, 4] − 1 = [−1, 3],

which, in this case, is equal to the exact range of f over [−2, 2].

In fact, this illustrates a general principle: if each variable in the expression occurs only once, then interval arithmetic gives the exact range, to within roundout error. We state this formally as follows.

THEOREM 1.8

(Fundamental theorem of interval arithmetic) Suppose f(x₁, x₂, ..., xₙ) is an algebraic expression in the variables x₁ through xₙ (or a computer program with inputs x₁ through xₙ), and suppose that this expression is evaluated with interval arithmetic. The algebraic expression or computer program can contain the four elementary operations and operations such as xⁿ, sin(x), exp(x), and log(x), etc., as long as the interval values of these functions contain their ranges over the input intervals. Then:

1. The interval value f(x₁, ..., xₙ) contains the range of f over the interval vector (or box) (x₁, ..., xₙ).

2. If the single functions (the elementary operations and functions xⁿ, etc.) have interval values that represent their exact ranges, and if each variable xᵢ, 1 ≤ i ≤ n, occurs only once in the expression for f, then the values of f obtained by interval arithmetic represent the exact ranges of f over the input intervals.

If the expression for f contains one or more variables more than oncethen overestimation of the range can occur due to interval dependency Forexample when we evaluate our example function f ([minus2 2]) according to (13)the first factor [minus1 3] is the exact range of x + 1 for x isin [minus2 2] while thesecond factor [minus3 1] is the exact range of x minus 1 for x isin [minus2 2] Thus [minus9 3]is the exact range of f (x1 x2) = (x1 + 1)(x2 minus 1) for x1 and x2 independentx1 isin [minus2 2] x2 isin [minus2 2]
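The dependency effect is easy to reproduce with a few lines of code. The following Python sketch (our illustration, not from the text; rounding is ignored) evaluates f over [−2, 2] both as (x + 1)(x − 1) and as the single-use form x² − 1:

```python
def iadd(a, b):  # [a] + [b]
    return (a[0] + b[0], a[1] + b[1])

def isub(a, b):  # [a] - [b]
    return (a[0] - b[1], a[1] - b[0])

def imul(a, b):  # [a] * [b]: min and max over all endpoint products
    p = (a[0]*b[0], a[0]*b[1], a[1]*b[0], a[1]*b[1])
    return (min(p), max(p))

def isqr(a):     # exact range of x^2 over [a], as in Algorithm 1.1
    if a[0] >= 0:
        return (a[0]**2, a[1]**2)
    if a[1] <= 0:
        return (a[1]**2, a[0]**2)
    return (0.0, max(a[0]**2, a[1]**2))

x = (-2.0, 2.0)
one = (1.0, 1.0)
dependent = imul(iadd(x, one), isub(x, one))   # (x+1)(x-1): x occurs twice
sue = isub(isqr(x), one)                       # x^2 - 1:    x occurs once
print(dependent)   # (-9.0, 3.0) -- overestimates the range
print(sue)         # (-1.0, 3.0) -- the exact range
```

The dependent form treats the two occurrences of x as independent, exactly as described above, and so returns [−9, 3] instead of [−1, 3].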

We now present some definitions and theorems to clarify the practical consequences of interval dependency.

DEFINITION 1.6 An expression for f(x₁, ..., xₙ) which is written so that each variable occurs only once is called a single use expression, or SUE.

Fortunately, we do not need to transform every expression into a single use expression for interval computations to be of value. In particular, the interval dependency becomes less as the widths of the input intervals become smaller. The following formal definition will help us to describe this precisely.

DEFINITION 1.7 Suppose an interval evaluation f(x₁, ..., xₙ) gives [a, b] as a result interval, but the exact range {f(x₁, ..., xₙ) : xᵢ ∈ xᵢ, 1 ≤ i ≤ n} is [c, d] ⊆ [a, b]. We define the excess width E(f; x₁, ..., xₙ) in the interval evaluation f(x₁, ..., xₙ) by E(f; x₁, ..., xₙ) = (c − a) + (b − d).

For example, the excess width in evaluating f(x) represented as (x + 1)(x − 1) over x = [−2, 2] is (−1 − (−9)) + (3 − 3) = 8. In general, we have

THEOREM 1.9

Suppose f(x₁, x₂, ..., xₙ) is an algebraic expression in the variables x₁ through xₙ (or a computer program with inputs x₁ through xₙ), and suppose that this expression is evaluated with interval arithmetic, as in Theorem 1.8, to obtain an interval enclosure f(x₁, ..., xₙ) to the range of f for xᵢ ∈ xᵢ, 1 ≤ i ≤ n. Then, if E(f; x₁, ..., xₙ) is as in Definition 1.7, we have

E(f; x₁, ..., xₙ) = O( max_{1≤i≤n} w(xᵢ) ),

where w(x) denotes the width of the interval x.

That is, the overestimation becomes less as the uncertainty in the arguments to the function becomes smaller.

Interval evaluations as in Theorem 1.9 are termed first-order interval extensions. It is not difficult to obtain second-order extensions, where required. (See the exercises below.)
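Theorem 1.9 can be observed numerically. In this Python sketch (our illustration; rounding is ignored), we evaluate the dependent form (x + 1)(x − 1) over intervals of shrinking width w centered at 0 and compare against the exact range of x² − 1 over the same interval; the excess width behaves like 2w, that is, O(w):

```python
def imul(a, b):
    """Interval product: min and max over all endpoint products."""
    p = (a[0]*b[0], a[0]*b[1], a[1]*b[0], a[1]*b[1])
    return (min(p), max(p))

for w in (1.0, 0.1, 0.01, 0.001):
    x = (-w / 2, w / 2)
    # dependent evaluation of (x+1)(x-1)
    a, b = imul((x[0] + 1, x[1] + 1), (x[0] - 1, x[1] - 1))
    # exact range of x^2 - 1 over [-w/2, w/2]
    c, d = (-1.0, -1.0 + (w / 2) ** 2)
    excess = (c - a) + (b - d)
    print(w, excess, excess / w)      # excess/w approaches 2 as w shrinks
```

Here the excess width works out to 2w − w²/4, so the ratio excess/w tends to the constant 2, consistent with the first-order behavior asserted by the theorem.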

1.3.2 Application of Interval Arithmetic: Examples

We give one such example here.

Example 1.22

Using 4-digit decimal floating point arithmetic, compute an interval enclosure for the first two digits of e, and prove that these two digits are correct.

Solution: The fifth degree Taylor polynomial representation for e is

e = 1 + 1 + 1/2! + 1/3! + 1/4! + 1/5! + (1/6!) e^ξ

for some ξ ∈ [0, 1]. If we assume we know e < 3, and we assume we know eˣ is an increasing function of x, then the error term is bounded by

|(1/6!) e^ξ| ≤ 3/6! < 0.005,

so this fifth-degree polynomial representation should be adequate. We will evaluate each term with interval arithmetic, and we will replace e^ξ with [1, 3]. We obtain the following computation:

[1.000, 1.000] + [1.000, 1.000] → [2.000, 2.000]
[1.000, 1.000]/[2.000, 2.000] → [0.5000, 0.5000]
[2.000, 2.000] + [0.5000, 0.5000] → [2.500, 2.500]
[1.000, 1.000]/[6.000, 6.000] → [0.1666, 0.1667]
[2.500, 2.500] + [0.1666, 0.1667] → [2.666, 2.667]
[1.000, 1.000]/[24.00, 24.00] → [0.04166, 0.04167]
[2.666, 2.667] + [0.04166, 0.04167] → [2.707, 2.709]
[1.000, 1.000]/[120.0, 120.0] → [0.008333, 0.008334]
[2.707, 2.709] + [0.008333, 0.008334] → [2.715, 2.718]
[1.000, 1.000]/[720.0, 720.0] → [0.001388, 0.001389]
[0.001388, 0.001389] × [1, 3] → [0.001388, 0.004167]
[2.715, 2.718] + [0.001388, 0.004167] → [2.716, 2.723]

Since we used outward rounding in these computations, this constitutes a mathematical proof that e ∈ [2.716, 2.723].

Note:

1. These computations can be done automatically on a computer, as simply as evaluating the function in floating point arithmetic. We will explain some programming techniques for this in Chapter 6, Section 6.2.

2. The solution is illustrative. More sophisticated methods, such as argument reduction, would be used in practice to bound values of eˣ more accurately and with fewer operations.
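The computation in Example 1.22 can be mimicked with Python's `decimal` module, which supports a 4-digit precision context with a selectable rounding direction (this is our illustration, not the book's code; in practice a package such as intlab would be used):

```python
import math
from decimal import Decimal, getcontext, ROUND_FLOOR, ROUND_CEILING

getcontext().prec = 4            # 4-digit decimal arithmetic, as in the example

def outward(op, a, b):
    """Apply op to endpoint pairs a, b with outward rounding."""
    getcontext().rounding = ROUND_FLOOR
    lo = op(a[0], b[0])
    getcontext().rounding = ROUND_CEILING
    hi = op(a[1], b[1])
    return (lo, hi)

iadd = lambda a, b: outward(lambda p, q: p + q, a, b)
imul = lambda a, b: outward(lambda p, q: p * q, a, b)   # nonnegative intervals only
idiv_pt = lambda a, n: outward(lambda p, q: p / q,
                               a, (Decimal(n), Decimal(n)))  # divide by a point n > 0

one = (Decimal(1), Decimal(1))
s = iadd(one, one)                                  # 1 + 1
for k in range(2, 6):
    s = iadd(s, idiv_pt(one, math.factorial(k)))    # + 1/k!
rem = imul(idiv_pt(one, math.factorial(6)), (Decimal(1), Decimal(3)))
s = iadd(s, rem)                                    # + (1/6!) e^xi, e^xi in [1, 3]
print(s)   # (Decimal('2.716'), Decimal('2.723')): a rigorous enclosure of e
```

Each intermediate interval reproduces the hand computation above, because the context performs exactly the 4-digit downward/upward roundings used in the example.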

Proofs of the theorems, as well as greater detail, appear in various texts on interval arithmetic. A good book on interval arithmetic is R. E. Moore's classic text [27], although numerous more recent monographs and reviews are available. A World Wide Web search on the term "interval computations" will lead to some of these.

A general introduction to interval computations is [26]. That work gives not only a complete introduction, with numerous examples and explanation of pitfalls, but also provides examples with intlab, a free matlab toolbox for interval computations, and reference material for intlab. If you have matlab available, we recommend intlab for the exercises in this book involving interval computations.

1.4 Programming Environments

Modern scientific computing (with floating point numbers) is usually done with high-level "imperative" (as opposed to "functional") programming languages. Common programming environments in use for general scientific computing today are Fortran (or FORmula TRANslation), C/C++, and matlab. Fortran is the original such language, with its origins in the late 1950's. There is a large body of high-quality, publicly available software in Fortran for common computations in numerical analysis and scientific computing. Such software can be found, for example, on NETLIB, at

http://www.netlib.org

Fortran has evolved over the years, becoming a modern, multi-faceted language with the Fortran 2003 standard. Throughout, the emphasis by both the standardization committee and suppliers of compilers for Fortran has been on features that simplify programming of solutions to large problems in numerical analysis and scientific computing, and on features that enable high performance, especially on computers that can process vectors and matrices efficiently.

The "C" language, developed in conjunction with the Unix operating system, was originally meant to be a higher-level language for designing and accessing the operating system, but it has become more ubiquitous since then. C++, appearing in the late 1980's, was the first widely available language⁷ to allow the object-oriented programming paradigm. In recent years, computer science departments have favored teaching C++ over teaching Fortran, and Fortran has fallen out of favor in relative terms. However, Fortran is still favored in certain large-scale applications, such as fluid dynamics (e.g., in weather prediction and similar simulations), and some courses are still offered in it in engineering schools. Some people still think of Fortran as the now somewhat rudimentary language known as FORTRAN 77.

Reasonably high-quality compilers for both Fortran and C/C++ are available free of charge with Linux operating systems. Fortran 2003, largely implemented in these compilers, has a standardized interface to C, so functions written in C can be called from Fortran programs, and vice versa. These compilers include interactive, graphical-user-interface-oriented debuggers, such as "insight," available with the Linux operating system. Commercially available compilation and debugging systems are also available under Windows.

The matlab system has become increasingly popular over the last two decades or so. matlab (or MATrix LABoratory) began in the early 1980's as a National Science Foundation project, written by Cleve Moler in FORTRAN 66, to provide an interactive environment for computing with matrices and vectors; it has since evolved to be both an interactive environment and a full-featured programming language. matlab is highly favored in courses such as this one because of the ease of programming, debugging, and general use (such as graphing), and because of the numerous toolboxes, supplied by both Mathworks (Cleve Moler's company) and others, for many basic computing tasks and applications. The main drawback to use of matlab in all scientific computations is that the language is interpretive; that is, matlab translates each line of a program to machine language each time that it executes the line. This makes complicated programs that involve nested iterations much slower (a factor of 60 or more) than comparable programs written in a compiled language such as C or Fortran. However, functions compiled from Fortran or C/C++ can be called from matlab. A common strategy has been to initially develop algorithms in matlab, then translate all or part of the program to a compilable language, as necessary for efficiency.

One perceived disadvantage of matlab is that it is proprietary. Possible undesirable consequences are that it is not free, and that there is no official guarantee that it will be available forever, unchanged. However, its use has become so widespread in recent years that these concerns do not seem to be major. Several projects, including "Octave" and "Scilab," have produced free products that partially support the matlab programming language. The most widely distributed of these, "Octave," is integrated into Linux systems. However, the object-oriented features of Octave are rudimentary compared to those of matlab, and some toolboxes, such as intlab (which we will mention later), will not function with Octave.

⁷with others, including Fortran, to follow.

Alternative systems sometimes used for scientific computing are computer algebra systems. Perhaps the most common of these are Mathematica and Maple, while a free such system under development is "SAGE." These systems admit a different way of thinking about programming, termed functional programming, in which rules are defined and available at all times to automatically simplify expressions that are presented to the system. (In contrast, in imperative programming, a sequence of commands is executed one after the other.) Although these systems have become comprehensive, they are based in computations of a different character, rather than in the floating point computations and linear algebra involved in numerical analysis and scientific computing.

We will use matlab in this book to illustrate the concepts, techniques, and applications. With newer versions of matlab, a student can study how to use the system and write programs largely by using the matlab help system. The first place to turn will be the "Getting started" demos, which in newer versions are presented as videos. There are also many books devoted to the use of matlab. Furthermore, we will be giving examples throughout this book.

matlab programs can be written as matlab scripts and matlab functions.

Example 1.23

The matlab script we used to produce the table following Example 1.7 (on page 8) is:

a = 2
x = 2
xold = x
err_old = 1;
for k=0:10
    k
    x
    err = x - sqrt(2);
    err
    ratio = err/err_old^2
    err_old = err;
    x = x/2 + 1/x;
end
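For readers without matlab, here is a direct Python transcription of this script (our translation; the printed columns mirror the matlab display, and the ratio is guarded against division by zero once the error underflows to exactly 0):

```python
import math

x = 2.0
err_old = 1.0
for k in range(11):
    err = x - math.sqrt(2)
    # err/err_old^2 is roughly constant: the signature of quadratic convergence
    ratio = err / err_old**2 if err_old != 0 else float('nan')
    print(k, x, err, ratio)
    err_old = err
    x = x / 2 + 1 / x                 # Newton iteration for x^2 = 2
```

After only a handful of iterations, x agrees with √2 to full double precision.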

Example 1.24

The matlab script we used to produce the table in Example 1.8 (on page 9) is:

format long
a = 2
x = 2
xold = x
err_old = 1;
for k=0:25
    k
    x
    err = x - sqrt(2);
    err
    ratio = err/err_old
    err_old = err;
    x = x - x^2/3.5 + 2/3.5;
end

An excellent alternative textbook that focuses on matlab functions is Cleve Moler's Numerical Computing with Matlab [25]. An on-line version, along with ".m" files, etc., is currently available at http://www.mathworks.com/moler/chapters.html

1.5 Applications

The purpose of the methods and techniques in this book, ultimately, is to provide both accurate predictions and insight into practical problems. This includes understanding, predicting, and managing or controlling the evolution of ecological systems and epidemics; designing and constructing durable but inexpensive bridges, buildings, roads, and water control structures; understanding chemical and physical processes; designing chemical plants and electronic components and systems; minimizing costs or maximizing delivery of products or services within companies and governments; etc. To achieve these goals, the numerical methods are a small part of the overall modeling process, which can be viewed as consisting of the following steps:

Identify the problem. This is the first step in translating an often vague situation into a mathematical problem to be solved. What questions must be answered, and how can they be quantified?

Assumptions. Which factors are to be ignored, and which are important? The real world is usually significantly more complicated than mathematical models of it, and simplifications must be made, because some factors are poorly understood, because there isn't enough data to determine some minor factors, or because it is not practical to accurately solve the resulting equations unless the model is simplified. For example, the theory of relativity and variations in the acceleration of gravity, due to the fact that the earth is not exactly round and due to the fact that the density differs from point to point on the surface of the earth, will in principle affect the trajectory of a baseball as it leaves the bat. However, such effects can be ignored when we write down a model of the trajectory of the baseball. On the other hand, we need to include such effects if we are measuring the change in distance between two satellites in a tandem orbit to detect the location of mineral deposits on the surface of the earth.

Construction. In this step, we actually translate the problem into mathematical language.

Analysis. We solve the mathematical problem. Here is where the numerical techniques in this book come into play. With more complicated models, there is an interplay between the previous three steps and this solution process: we may need to simplify the model to enable a practical solution. Presentation of the results is also important here, to maximize their usefulness. In the early days of scientific computing, printouts of numbers were used, but increasingly, results are presented as two- and three-dimensional graphs and movies.

Interpretation. The numerical solution is compared to the original problem. If it does not make sense, go back and reformulate the assumptions.

Validation. Compare the model to real data. For example, in climate models, the model might be used to predict climate changes in past years before it is used to predict future climate changes.

Note that there is an intrinsic error introduced in the modeling process (such as when certain phenomena are ignored) that is outside the scope of our study of numerical methods. Such error can only be measured indirectly, through the interpretation and validation steps. In the model solution process (the "analysis" step), errors are also introduced due to roundoff error and the approximation process. We have seen that such error consists of approximation error and roundoff error. In a study of numerical methods and numerical analysis, we quantify and find bounds on such errors. Although this may not be the major source of error, it is important to know. Consequences of this type of error might be that a good model is rejected, or that incorrect conclusions are deduced about the process being modeled, etc. The authors of this book have personal experience with these events.

Errors in the modeling process can sometimes be quantified in the solution process. If the model depends on parameters that are not known precisely, but bounds on those parameters are known, knowledge of these bounds can sometimes be incorporated into the mathematical equations, and the set of possible solutions can sometimes be computed or bounded. One tool that sometimes works is interval arithmetic. Other tools, less mathematically definite but applicable in different situations, are statistical methods and computing solutions to the model for many different values of the parameters.

Throughout this book, we introduce applications from many areas.

Example 1.25

The formula for the net capacitance when two capacitors of values x and y are connected in series is

z = xy/(x + y).

Suppose the measured values of x and y are x = 1 and y = 2, respectively. Estimate the range of possible values of z, given that the true values of x and y are known to be within ±10% of the measured values.

In this example, the identification, assumptions, and construction have already been done. (It is well known how capacitances in a linear electrical circuit behave.) We are asked to analyze the error in the output of the computation, due to errors in the data. We may proceed using interval arithmetic, relying on the accuracy assumptions for the measured values. In particular, these assumptions imply that x ∈ [0.9, 1.1] and y ∈ [1.8, 2.2]. We will plug these intervals into the expression for z, but we first use Theorem 1.8, part (2), as a guide to rewrite the expression for z so that x and y only occur once. (We do this so we obtain sharp bounds on the range, without overestimation.) Dividing the numerator and denominator for z by xy, we obtain

z = 1/(1/x + 1/y).

We use the intlab toolbox⁸ for matlab to evaluate z. We have the following dialog in matlab's command window:

>> intvalinit('DisplayInfsup')
===> Default display of intervals by infimum/supremum
>> x = intval('[0.9,1.1]')
intval x =
[    0.8999,    1.1001]
>> y = intval('[1.8,2.2]')
intval y =
[    1.7999,    2.2001]
>> z = 1/(1/x + 1/y)
intval z =
[    0.5999,    0.7334]
>> format long
>> z
intval z =
[   0.59999999999999,   0.73333333333334]
>>

⁸If one has matlab, intlab is available free of charge for non-commercial use from http://www.ti3.tu-harburg.de/~rump/intlab

Thus, the capacitance must lie between 0.5999 and 0.7334.

Note that x and y are input as strings. This is to assure that roundoff errors in converting the decimal expressions 0.9, 1.1, 1.8, and 2.2 into internal binary format are taken into account. See [26] for more examples of the use of intlab.
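Without intlab, the same enclosure can be sketched in plain Python (our illustration, not the book's code), again using the single-use form z = 1/(1/x + 1/y) and widening each computed endpoint by one ulp to account for rounding:

```python
import math

def widen(lo, hi):
    """Outward rounding, simulated one ulp at a time."""
    return math.nextafter(lo, -math.inf), math.nextafter(hi, math.inf)

def iadd(a, b):
    return widen(a[0] + b[0], a[1] + b[1])

def irecip(a):
    """1/[a], assuming 0 is not in [a]."""
    return widen(1.0 / a[1], 1.0 / a[0])

x = (0.9, 1.1)
y = (1.8, 2.2)
z = irecip(iadd(irecip(x), irecip(y)))   # z = 1/(1/x + 1/y): x and y occur once
print(z)   # roughly (0.5999..., 0.7333...): encloses the true range [0.6, 11/15]
```

Because each variable occurs only once, the computed interval is sharp to within the simulated rounding, just as in the intlab dialog above.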

1.6 Exercises

1. Write down a polynomial p(x) such that |S(x) − p(x)| ≤ 10⁻¹⁰ for −0.2 ≤ x ≤ 0.2, where

S(x) = sin(x)/x if x ≠ 0,
S(x) = 1 if x = 0.

Note: sinc(x) = S(πx) = sin(πx)/(πx) is the "sinc" function (well-known in signal processing, etc.).

(a) Show that your polynomial p satisfies the condition |sinc(x) − p(x)| ≤ 10⁻¹⁰ for x ∈ [−0.2, 0.2].
Hint: You can obtain polynomial approximations with error terms for sinc(x) by writing down Taylor polynomials and corresponding error terms for sin(x), then dividing these by x. This can be easier than trying to differentiate sinc(x). For the proof part, you can use, for example, the Taylor polynomial remainder formula or the alternating series test, and you can use interval arithmetic to obtain bounds.

(b) Plot your polynomial approximation and sinc(x) on the same graph,
(i) over the interval [−0.2, 0.2],
(ii) over the interval [−3, 3],
(iii) over the interval [−10, 10].

2. Suppose f has a continuous third derivative. Show that

| (f(x + h) − f(x − h))/(2h) − f′(x) | = O(h²).

3. Suppose f has a continuous fourth derivative. Show that

| (f(x + h) − 2f(x) + f(x − h))/h² − f″(x) | = O(h²).

4. Let a = 0.41, b = 0.36, and c = 0.7. Assuming a 2-digit decimal computer arithmetic with rounding, show that (a − b)/c ≠ a/c − b/c when using this arithmetic.

5. Write down a formula relating the unit roundoff δ of Definition 1.3 (page 14) and the machine epsilon εₘ defined on page 20.

6. Store and run the following matlab script. What are your results? What does the script compute? Can you say anything about the computer arithmetic underlying matlab?

eps = 1
x = 1+eps
while(x~=1)
    eps = eps/2; x = 1+eps;
end
eps = eps + (2*eps)^2
y = 1+eps
y-1

7. Suppose, for illustration, we have a system with base β = 10, t = 3 decimal digits in the mantissa, and L = −9, U = 9 for the exponent. For example, 0.123 × 10⁴, that is, 1230, is a machine number in this system. Suppose also that "round to nearest" is used in this system.

(a) What is HUGE for this system?
(b) What is TINY for this system?
(c) What is the machine epsilon εₘ for this system?
(d) Let f(x) = sin(x) + 1.
   i. Write down fl(f(0)) and fl(f(0.0008)) in normalized format for this toy system.
   ii. Compute fl( fl(f(0.0008)) − fl(f(0)) ). On the other hand, what is the nearest machine number to the exact value of f(0.0008) − f(0)?
   iii. Compute fl( fl(f(0.0008)) − fl(f(0)) ) / fl(0.0008). Compare this to the nearest machine number to the exact value of (f(0.0008) − f(0))/0.0008, and to f′(0).

8. Let f(x) = ln(x + 1) − ln(x).

(a) Use four-digit decimal arithmetic with rounding to evaluate f(100,000).
(b) Use the Mean Value Theorem to approximate f(x) in a form that avoids the loss of significant digits. Use this form to evaluate f(x) for x = 100,000 once again.
(c) Compare the relative errors for the answers obtained in (a) and (b).

9. Compute the condition number of f(x) = e^(√(x²−1)), x > 1, and discuss any possible ill-conditioning.

10. Let f(x) = (sin(x))² + x². Use interval arithmetic to prove that there are no solutions to f(x) = 0 for x ∈ [−1, −0.8].

Following the "rounding to nearest" definition of fl, we emulate the operations a machine would do, as follows:

a ← 0.410 × 10⁰, b ← 0.135 × 10⁻³, c ← 0.431 × 10⁻³,

and

fl(b + c) = fl(0.135 × 10⁻³ + 0.431 × 10⁻³)
          = fl(0.566 × 10⁻³)
          = 0.566 × 10⁻³,

so

fl(a + 0.566 × 10⁻³) = fl(0.410 × 10⁰ + 0.566 × 10⁻³)
                     = fl(0.410 × 10⁰ + 0.000566 × 10⁰)
                     = fl(0.410566 × 10⁰)
                     = 0.411 × 10⁰.

On the other hand,

fl(a + b) = fl(0.410 × 10⁰ + 0.135 × 10⁻³)
          = fl(0.410000 × 10⁰ + 0.000135 × 10⁰)
          = fl(0.410135 × 10⁰)
          = 0.410 × 10⁰,

so

fl(0.410 × 10⁰ + c) = fl(0.410 × 10⁰ + 0.431 × 10⁻³)
                    = fl(0.410 × 10⁰ + 0.000431 × 10⁰)
                    = fl(0.410431 × 10⁰)
                    = 0.410 × 10⁰ ≠ 0.411 × 10⁰.

Thus, the associative law does not hold for floating point arithmetic with "round to nearest." Furthermore, this illustrates that accuracy is improved if numbers of like magnitude are added first in a sum.

The following error bound is useful in some analyses.

THEOREM 1.5

|x − fl(x)| ≤ (p/2) |x| β^(1−t),

where p = 1 for rounding and p = 2 for chopping.

DEFINITION 1.3 δ = (p/2) β^(1−t) is called the unit roundoff error.

Let ε = (fl(x) − x)/x. Then fl(x) = (1 + ε)x, where |ε| ≤ δ. With this, we have the following.

THEOREM 1.6

Let ⊙ denote the operation +, −, ×, or ÷, and let x and y be machine numbers. Then

fl(x ⊙ y) = (x ⊙ y)(1 + ε), where |ε| ≤ δ = (p/2) β^(1−t).

Roundoff error that accumulates as a result of a sequence of arithmetic operations can be analyzed using this theorem. Such an analysis is called forward error analysis.
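Theorem 1.6 can be checked directly in IEEE double precision, where β = 2, t = 53, and, with round to nearest, δ = 2⁻⁵³. This Python sketch (our illustration) compares one floating point sum against the exact rational sum of the same two machine numbers:

```python
import sys
from fractions import Fraction

x, y = 0.1, 0.3                      # machine numbers (as actually stored)
computed = Fraction(x + y)           # fl(x + y), converted exactly to a rational
exact = Fraction(x) + Fraction(y)    # x + y with no rounding at all
rel_err = abs(computed - exact) / exact

# unit roundoff: sys.float_info.epsilon is beta^(1-t) = 2^-52, so delta = 2^-53
delta = Fraction(sys.float_info.epsilon) / 2
print(float(rel_err), float(delta))
assert rel_err <= delta              # |eps| <= delta, as Theorem 1.6 asserts
```

`Fraction` of a float is exact, so the comparison involves no further rounding; the observed relative error is indeed below δ.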

Because of the properties of floating point arithmetic, it is unreasonable to demand strict absolute-error tolerances when the exact result is large in magnitude.

Example 1.13

Suppose β = 10 and t = 3 (3-digit decimal arithmetic), and suppose we wish to compute 10⁴π with a computed value x such that |10⁴π − x| < 10⁻². The closest floating point number in our system to 10⁴π is x = 0.314 × 10⁵ = 31400. However, |10⁴π − x| = 15.926... Hence, it is impossible to find a number x in the system with |10⁴π − x| < 10⁻².

The error |10⁴π − x| in this example is called the absolute error in approximating 10⁴π. We see that absolute error is not an appropriate measure of error when using floating point arithmetic. For this reason, we use relative error.

DEFINITION 1.4 Let x* be an approximation to x. Then |x − x*| is called the absolute error, and |(x − x*)/x| is called the relative error.

For example,

|(x − fl(x))/x| ≤ δ = (p/2) β^(1−t)   (unit roundoff error).

1.2.1.1 Expression Evaluation and Condition Numbers

We now examine some common situations in which roundoff error can become large, and explain how to avoid many of these situations.

Example 1.14

Let β = 10, t = 4, and p = 1. (Thus, δ = (1/2)10⁻³ = 0.0005.) Let x = 0.5795 × 10⁵ and y = 0.6399 × 10⁵. Then

fl(x + y) = 0.1219 × 10⁶ = (x + y)(1 + ε₁), ε₁ ≈ −3.28 × 10⁻⁴, |ε₁| < δ, and
fl(xy) = 0.3708 × 10¹⁰ = (xy)(1 + ε₂), ε₂ ≈ −5.95 × 10⁻⁵, |ε₂| < δ.

(Note: x + y = 0.12194 × 10⁶, xy = 0.37082205 × 10¹⁰.)

Example 1.15

Suppose β = 10 and t = 4 (4-digit arithmetic), and suppose x₁ = 10000 and x₂ = x₃ = ⋯ = x₁₀₀₁ = 1. Then

fl(x₁ + x₂) = 10000,
fl(x₁ + x₂ + x₃) = 10000,
⋮
fl(x₁ + x₂ + ⋯ + x₁₀₀₁) = 10000

when we sum forward from x₁. But, going backwards,

fl(x₁₀₀₁ + x₁₀₀₀) = 2,
fl(x₁₀₀₁ + x₁₀₀₀ + x₉₉₉) = 3,
⋮
fl(x₁₀₀₁ + x₁₀₀₀ + ⋯ + x₁) = 11000,

which is the correct sum.

This example illustrates the point that large relative errors occur when many small numbers are added to a large number, or when a very large number of small, almost-equal numbers is added. To avoid such large relative errors, one can sum from the smallest number to the largest number. However, this will not work if the numbers are all approximately equal. In such cases, one possibility is to group the numbers into sets of two adjacent numbers, summing two almost-equal numbers together. One then groups those results into sets of two and sums these together, continuing until the total sum is reached. In this scheme, two almost-equal numbers are always being summed, and the large relative error from repeatedly summing a small number to a large number is avoided.
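Example 1.15 can be replayed with Python's `decimal` module set to 4-digit precision (our illustration, not the book's code):

```python
from decimal import Decimal, getcontext

getcontext().prec = 4                      # 4-digit decimal arithmetic

xs = [Decimal(10000)] + [Decimal(1)] * 1000

forward = Decimal(0)
for v in xs:                # big number first: each added 1 is rounded away
    forward = forward + v

backward = Decimal(0)
for v in reversed(xs):      # small numbers first: they accumulate exactly
    backward = backward + v

print(forward, backward)    # forward stays at 10000; backward gives 11000
```

Summing forward, each 10000 + 1 = 10001 rounds back to 10000 at 4 digits; summing backward, the thousand ones accumulate exactly to 1000 before the final, exactly representable addition of 10000.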

Example 1.16

Let x₁ = 1.5314768, x₂ = 1.5314899, β = 10, and t = 6 (6-digit decimal arithmetic). Then x₂ − x₁ ≈ fl(x₂) − fl(x₁) = 1.53149 − 1.53148 = 0.00001. Thus,

| (x₂ − x₁ − (fl(x₂) − fl(x₁))) / (x₂ − x₁) | = (0.0000131 − 0.00001)/0.0000131 ≈ 0.237 = 23.7% relative accuracy.

This example illustrates that large relative errors can occur when two nearly equal numbers are subtracted on a computer. Sometimes, an algorithm can be modified to reduce rounding error occurring from this source, as the following example illustrates.

Example 1.17

Consider finding the roots of ax² + bx + c = 0, where b² is large compared with |4ac|. The most common formula for the roots is

x = (−b ± √(b² − 4ac)) / (2a).

Consider x² + 100x + 1 = 0, with β = 10, t = 4, p = 2, and 4-digit chopped arithmetic. Then

x₁ = (−100 + √9996)/2,  x₂ = (−100 − √9996)/2,

and √9996 ≈ 99.97 (4-digit arithmetic, chopped). Thus,

x₁ ≈ (−100 + 99.97)/2,  x₂ ≈ (−100 − 99.97)/2.

Hence, x₁ ≈ −0.015 and x₂ ≈ −99.98, but x₁ = −0.010001 and x₂ = −99.989999, so the relative errors in x₁ and x₂ are 50% and 0.01%, respectively.

Let's change the algorithm. Assume b ≥ 0 (we can always arrange b ≥ 0). Then

x₁ = (−b + √(b² − 4ac)) / (2a)
   = ( (−b + √(b² − 4ac)) (−b − √(b² − 4ac)) ) / ( 2a (−b − √(b² − 4ac)) )
   = −2c / (b + √(b² − 4ac)),

and

x₂ = (−b − √(b² − 4ac)) / (2a)  (the same as before).

Then, for the above values,

x₁ = −2(1) / (100 + √9996) ≈ −2/(100 + 99.97) = −0.01000.

Now the relative error in x₁ is also 0.01%.
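The rearrangement generalizes to a standard trick: compute the root that involves no cancellation directly, and recover the other as −2c/(b + √(b² − 4ac)). A Python sketch follows (our illustration; in double precision the effect is smaller than in 4-digit arithmetic, but the stable form is still the more accurate one):

```python
import math

def roots_naive(a, b, c):
    """Textbook quadratic formula: -b + d suffers cancellation when b^2 >> |4ac|."""
    d = math.sqrt(b * b - 4 * a * c)
    return (-b + d) / (2 * a), (-b - d) / (2 * a)

def roots_stable(a, b, c):
    """Assumes b >= 0: avoids subtracting nearly equal numbers."""
    d = math.sqrt(b * b - 4 * a * c)
    x2 = (-b - d) / (2 * a)        # no cancellation here
    x1 = -2 * c / (b + d)          # algebraically equal to (-b + d)/(2a)
    return x1, x2

print(roots_naive(1.0, 100.0, 1.0))
print(roots_stable(1.0, 100.0, 1.0))
```

The stable roots can be checked against the exact relations x₁x₂ = c/a and x₁ + x₂ = −b/a.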

Let us now consider error in function evaluation. Consider a single-valued function f(x), and let x* = fl(x) be the floating point approximation of x. Therefore, the machine evaluates f(x*) = f(fl(x)), which is an approximate value of f(x) at x = x*. The perturbation in f(x) for small perturbations in x can then be computed via Taylor's formula. This is illustrated in the next theorem.

THEOREM 1.7 The relative error in functional evaluation is

| (f(x) − f(x*)) / f(x) | ≈ | x f′(x) / f(x) | · | (x − x*) / x |.

PROOF The linear Taylor approximation of f(x*) about f(x) for small values of |x − x*| is given by f(x*) ≈ f(x) + f′(x)(x* − x). Rearranging the terms immediately yields the result.

This leads us to the following definition.

DEFINITION 1.5 The condition number of a function f(x) is

κ_f(x) = | x f′(x) / f(x) |.

The condition number describes how large the relative error in function evaluation is with respect to the relative error in the machine representation of x. In other words, κ_f(x) is a measure of the degree of sensitivity of the function at x.

Example 1.18

Let f(x) = √x. The condition number of f(x) about x is

κ_f(x) = | x · (1/(2√x)) / √x | = 1/2.

This suggests that f(x) is well-conditioned.

Example 1.19

Let f(x) = √(x − 2). The condition number of f(x) about x is

κ_f(x) = | x / (2(x − 2)) |.

This is not defined at x* = 2. Hence, the function f(x) is numerically unstable and ill-conditioned for values of x close to 2.
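Definition 1.5 is straightforward to compute. The following Python sketch (ours, not from the text) evaluates κ_f for Examples 1.18 and 1.19:

```python
import math

def kappa(f, fprime, x):
    """Condition number |x f'(x) / f(x)| of f at x (Definition 1.5)."""
    return abs(x * fprime(x) / f(x))

# Example 1.18: f(x) = sqrt(x) has kappa = 1/2 for every x > 0.
k1 = kappa(math.sqrt, lambda x: 0.5 / math.sqrt(x), 100.0)

# Example 1.19: f(x) = sqrt(x - 2) has kappa = |x / (2(x - 2))|,
# which blows up as x approaches 2.
k2 = kappa(lambda x: math.sqrt(x - 2.0),
           lambda x: 0.5 / math.sqrt(x - 2.0), 2.001)

print(k1, k2)   # 0.5, and roughly 1000.5
```

At x = 2.001 the condition number is about 1000: a relative error of 10⁻⁶ in x produces a relative error of about 10⁻³ in f(x).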

REMARK 1.2 If x = f(x) = 0, then the condition number is simply |f′(x)|. If x = 0, f(x) ≠ 0 (or f(x) = 0, x ≠ 0), then it is more useful to consider the relation between absolute errors than relative errors. The condition number then becomes |f′(x)/f(x)|.

REMARK 1.3 Generally, if a numerical approximation z̃ to a quantity z is computed, the relative error is related to the number of significant digits that are correct. For example, if z = 0.0000123453 and z̃ = 0.00001234543, we say that z̃ is correct to 5 significant digits. Expressing z as 0.123453 × 10⁻⁴ and z̃ as 0.123454 × 10⁻⁴, we see that if we round z̃ to the nearest number with five digits in its mantissa, all of those digits are correct, whereas if we do the same with six digits, the sixth digit is not correct. Significant digits is the more logical way to talk about accuracy in a floating point computation, where we are interested in relative error, rather than "number of digits after the decimal point," which can have a different meaning. (Here, one might say that z̃ is correct to 9 digits after the decimal point.)

122 Practicalities and the IEEE Floating Point Standard

Prior to 1985 different machines used different word lengths and differentbases and different machines rounded chopped or did something else to formthe internal representation fl (x) for real numbers x For example IBM main-frames generally used hexadecimal arithmetic (β = 16) with 8 hexadecimaldigits total (for the base sign and exponent) in ldquosingle precisionrdquo numbersand 16 hexadecimal digits total in ldquodouble precisionrdquo numbers Machines suchas the Univac 1108 and Honeywell Multics systems used base β = 2 and 36binary digits (or ldquobitsrdquo) total in single precision numbers and 72 binary digitstotal in double precision numbers An unusual machine designed at MoscowState University from 1955-1965 the ldquoSetunrdquo even used base-3 (β = 3 orldquoternaryrdquo) numbers Some computers had 32 bits total in single precision

numbers and 64 bits total in double precision numbers while some ldquosuper-computersrdquo (such as the Cray-1) had 64 bits total in single precision numbersand 128 bits total in double precision numbers

Some hand-held calculators in existence today (such as some Texas Instru-ments calculators) can be viewed as implementing decimal (base 10 β = 10)

8132019 num chap 1

Mathematical Review and Computer Arithmetic 19

arithmetic, say with L = −999 and U = 999 and t = 14 digits in the mantissa.

Except for the Setun (the value of whose ternary digits corresponded to "positive," "negative," and "neutral" in circuit elements or switches), digital computers are mostly based on binary switches or circuit elements (that is, "on" or "off"), so the base β is usually 2 or a power of 2. For example, the IBM hexadecimal digit can be viewed as a group of 4 binary digits.²

Older floating point implementations did not even always fit exactly into the model we have previously described. For example, if x was a number in the system, then −x may not have been a number in the system, or, if x was a number in the system, then 1/x may have been too large to be representable in the system.

To promote predictability, portability, reliability, and rigorous error bounding in floating point computations, the Institute of Electrical and Electronics Engineers (IEEE) and American National Standards Institute (ANSI) published a standard for binary floating point arithmetic in 1985: IEEE/ANSI 754-1985, Standard for Binary Floating Point Arithmetic, often referenced as "IEEE 754" or simply "the IEEE standard."³ Almost all computers in existence today, including personal computers and workstations based on Intel, AMD, Motorola, etc., chips, implement most of the IEEE standard.

In this standard, β = 2, 32 bits total are used in a single precision number (an "IEEE single"), and 64 bits total are used for a double precision number (an "IEEE double"). In a single precision number, 1 bit is used for the sign, 8 bits are used for the exponent, and t = 23 bits are used for the mantissa. In double precision numbers, 1 bit is used for the sign, 11 bits are used for the exponent, and 52 bits are used for the mantissa. Thus, for single precision numbers, the stored exponent is between 0 and (11111111)₂ = 255, and 127 is subtracted from this to get an exponent between −127 and 128. In IEEE numbers, the minimum and maximum exponents are used to denote special symbols (such as infinity and "unnormalized" numbers), so the exponent in single precision represents magnitudes between 2⁻¹²⁶ ≈ 10⁻³⁸ and 2¹²⁷ ≈ 10³⁸. The mantissa for single precision numbers represents numbers between 2⁰ = 1 and 2⁰ + 2⁻¹ + ⋯ + 2⁻²³ = 2(1 − 2⁻²⁴) ≈ 2. Similarly, the exponent range for double precision numbers effectively gives magnitudes between 2⁻¹⁰²² ≈ 10⁻³⁰⁸ and 2¹⁰²³ ≈ 10³⁰⁸, while the mantissa for double precision numbers represents numbers between 2⁰ = 1 and 2⁰ + 2⁻¹ + ⋯ + 2⁻⁵² ≈ 2.
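The sign/exponent/mantissa layout just described can be inspected directly. The following Python sketch (Python floats are IEEE doubles on essentially all platforms; the helper name fields is ours) uses the standard struct module to split a double into its bit fields:

```python
import struct

def fields(x):
    """Split a Python float (an IEEE double) into its sign bit, unbiased
    exponent, and 52 explicitly stored mantissa bits."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign     = bits >> 63                  # 1 bit
    exponent = (bits >> 52) & 0x7FF        # 11 bits, stored with bias 1023
    mantissa = bits & ((1 << 52) - 1)      # 52 bits (leading 1 is implicit)
    return sign, exponent - 1023, mantissa

print(fields(1.0))    # 1.0  = +1.0  * 2^0: sign 0, exponent 0, mantissa 0
print(fields(-2.5))   # -2.5 = -1.25 * 2^1: sign 1, exponent 1, mantissa 2^50
```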

Summarizing, the parameters for IEEE arithmetic appear in Table 1.1.

In many numerical computations, such as solving the large linear systems arising from partial differential equation models, more digits or a larger exponent range are required than are available with IEEE single precision. For

²An exception is in some systems for business calculations, where base 10 is implemented.
³An update to the 1985 standard was made in 2008. This update gives clarifications of certain ambiguous points, provides certain extensions, and specifies a standard for decimal arithmetic.


20 Applied Numerical Methods

TABLE 1.1: Parameters for IEEE arithmetic

precision  β  L      U     t
single     2  −126   127   24
double     2  −1022  1023  53

this reason, many numerical analysts at present have adopted IEEE double precision as the default precision. For example, underlying computations in the popular computational environment matlab are done in IEEE double precision.

IEEE arithmetic provides four ways of defining fl(x), that is, four "rounding modes," namely "round down," "round up," "round to nearest," and "round to zero." The four elementary operations +, −, ×, and ÷ must be such that fl(x ⊙ y) is the exact result x ⊙ y, rounded according to the rounding mode in effect, for ⊙ ∈ {+, −, ×, ÷}; the same requirement holds for √x.

The default mode (if the rounding mode is not explicitly set) is normally "round to nearest," to give an approximation after a long string of computations that is hopefully near the exact value. If the mode is set to "round down" and a string of computations is done, then the result is less than or equal to the exact result. Similarly, if the mode is set to "round up," then the result of a string of computations is greater than or equal to the exact result. In this way, mathematically rigorous bounds on an exact result can be obtained. (This technique must be used astutely, since naive use could result in bounds that are too large to be meaningful.)

Several parameters more directly related to numerical computations than L, U, and t are associated with any floating point number system. These are:

HUGE: the largest representable number in the floating point system;

TINY: the smallest positive representable number in the floating point system;

ε_m: the machine epsilon, the smallest positive number which, when added to 1, gives something other than 1 when using the rounding mode round-to-nearest.

These so-called "machine constants" appear in Table 1.2 for the IEEE single and IEEE double precision number systems.

For IEEE arithmetic, 1/TINY < HUGE, but 1/HUGE < TINY. This brings up the question of what happens when the result of a computation has absolute value less than the smallest number representable in the system, or has absolute value greater than the largest number representable in the system. In the first case, an underflow occurs, while in the second case, an overflow occurs. In floating point computations, it is usually (but not always) reasonable to replace the result of an underflow by 0, but it is usually more problematical


TABLE 1.2: Machine constants for IEEE arithmetic

Precision  HUGE                                TINY                         ε_m
single     (2 − 2⁻²³)·2¹²⁷ ≈ 3.40 · 10³⁸       2⁻¹²⁶ ≈ 1.18 · 10⁻³⁸         2⁻²⁴ + 2⁻⁴⁵ ≈ 5.96 · 10⁻⁸
double     (2 − 2⁻⁵²)·2¹⁰²³ ≈ 1.79 · 10³⁰⁸     2⁻¹⁰²² ≈ 2.23 · 10⁻³⁰⁸       2⁻⁵³ + 2⁻¹⁰⁵ ≈ 1.11 · 10⁻¹⁶

when an overflow occurs. Many systems prior to the IEEE standard replaced an underflow by 0, but stopped when an overflow occurred.
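In Python (whose floats are IEEE doubles), these machine constants are available from sys.float_info. The following sketch also checks the ε_m of Table 1.2 and the 1/TINY, 1/HUGE asymmetry noted above:

```python
import sys

HUGE = sys.float_info.max       # (2 - 2^-52) * 2^1023, about 1.7977e308
TINY = sys.float_info.min       # 2^-1022, about 2.2251e-308 (smallest positive normalized)
eps  = sys.float_info.epsilon   # 2^-52: the gap between 1.0 and the next double

# Table 1.2's epsilon_m = 2^-53 + 2^-105 is the smallest machine number x with
# fl(1 + x) > 1 under round-to-nearest: a tie at exactly 2^-53 rounds back to 1.
m = 2.0**-53 + 2.0**-105
assert 1.0 + m > 1.0
assert 1.0 + 2.0**-53 == 1.0

# The asymmetry discussed above:
assert 1.0 / TINY < HUGE
assert 1.0 / HUGE < TINY
print(HUGE, TINY, eps)
```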

The IEEE standard specifies representations for special numbers ∞, −∞, +0, −0, and NaN, where the latter represents "not a number." The standard specifies that computations do not stop when an overflow or underflow occurs, or when quantities such as 1/0 or √−1 are encountered (although many programming languages, by default or optionally, do stop). For example, the result of an overflow is set to ∞, whereas the result of √−1 is set to NaN, and computation continues. The standard also specifies "gradual underflow," that is, setting the result to a "denormalized" number, or a number in the floating point format whose first digit in the mantissa is equal to 0. Computation rules for these special numbers, such as NaN × any number = NaN and ∞ × any positive normalized number = ∞, allow such "nonstop" arithmetic.

Although the IEEE nonstop arithmetic is useful in many contexts, the numerical analyst should be aware of it and be cautious in interpreting results. In particular, algorithms may not behave as expected if many intermediate results contain ∞ or NaN, and the accuracy is less than expected when denormalized numbers are used. In fact, many programming languages, by default or with a controllable option, stop if ∞ or NaN occurs, but implement IEEE nonstop arithmetic with an option.

Example 1.20

IEEE double precision floating point arithmetic underlies most computations in matlab. (This is true even if only four decimal digits are displayed.) One obtains the machine epsilon with the function eps, one obtains TINY with the function realmin, and one obtains HUGE with the function realmax. Observe the following matlab dialog:

>> epsm = eps(1d0)
epsm = 2.2204e-016
>> TINY = realmin
TINY = 2.2251e-308
>> HUGE = realmax
HUGE = 1.7977e+308
>> 1/TINY
ans = 4.4942e+307
>> 1/HUGE


ans = 5.5627e-309
>> HUGE^2
ans = Inf
>> TINY^2
ans = 0
>> new_val = 1 + epsm
new_val = 1.0000
>> new_val - 1
ans = 2.2204e-016
>> too_small = epsm/2
too_small = 1.1102e-016
>> not_new = 1 + too_small
not_new = 1
>> not_new - 1
ans = 0
>>

Example 1.21

(Illustration of underflow and overflow) Suppose, for the purposes of illustration, we have a system with β = 10, t = 2, and one digit in the exponent, so that the positive numbers in the system range from 0.10 × 10⁻⁹ to 0.99 × 10⁹, and suppose we wish to compute N = √(x₁² + x₂²), where x₁ = x₂ = 10⁶. Then both x₁ and x₂ are exactly represented in the system, and the nearest floating point number in the system to N is 0.14 × 10⁷, well within range. However, x₁² = 10¹², larger than the maximum floating point number in the system.

In older systems, an overflow usually would result in stopping the computation, while in IEEE arithmetic, the result would be assigned the symbol "Infinity." The result of adding "Infinity" to "Infinity," then taking the square root, would be "Infinity," so that N would be assigned "Infinity." Similarly, if x₁ = x₂ = 10⁻⁶, then x₁² = 10⁻¹², smaller than the smallest representable machine number, causing an "underflow." On older systems, the result is usually set to 0. On IEEE systems, if "gradual underflow" is switched on, the result either becomes a denormalized number with less than full accuracy, or is set to 0; without gradual underflow on IEEE systems, the result is set to 0. When the result is set to 0, a value of 0 is stored in N, whereas the closest floating point number in the system is 0.14 × 10⁻⁵, well within range. To avoid this type of catastrophic underflow and overflow in the computation of N, we may use the following scheme:

1. s ← max{|x₁|, |x₂|};
2. η₁ ← x₁/s, η₂ ← x₂/s;
3. N ← s √(η₁² + η₂²).
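The scaling scheme can be written out as a short Python sketch (the name scaled_norm is ours; in practice one would call math.hypot, which handles scaling internally):

```python
import math

def scaled_norm(x1, x2):
    """sqrt(x1^2 + x2^2) computed by the scaling scheme above: divide by the
    largest magnitude first, so the squares can neither overflow nor both
    underflow, then multiply the scale back in."""
    s = max(abs(x1), abs(x2))
    if s == 0.0:
        return 0.0
    eta1, eta2 = x1 / s, x2 / s          # now |eta1|, |eta2| <= 1
    return s * math.sqrt(eta1**2 + eta2**2)

x1 = x2 = 1e300
naive = math.sqrt(x1 * x1 + x2 * x2)     # x1*x1 overflows to inf, so this is inf
print(naive, scaled_norm(x1, x2))        # inf versus roughly 1.414e300
```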


1.2.2.1 Input and Output

For examining the output of large numerical computations arising from mathematical models, plots, graphs, and movies comprised of such plots and graphs are often preferred over tables of values. However, to develop such models and study numerical algorithms, it is necessary to examine individual numbers. Because humans are trained to comprehend decimal numbers more easily than binary numbers, the binary format used in the machine is usually converted to a decimal format for display or printing. In many programming languages and environments (such as all versions of Fortran, C, C++, and in matlab), the format is of a form similar to ±d₁.d₂d₃⋯dₘe±δ₁δ₂δ₃ or ±d₁.d₂d₃⋯dₘE±δ₁δ₂δ₃, where the "e" or "E" denotes the "exponent" of 10. For example, -1.00e+003 denotes −1 × 10³ = −1000. Numbers are usually also input either in a standard decimal form (such as 0.001) or in this exponential format (such as 1.0e-3). (This notation originates from the earliest computers, where the only output was a printer, and the printer could only print numerical digits and the 26 upper case letters in the Roman alphabet.)

Thus, for input, a decimal fraction needs to be converted to a binary floating point number, while for output, a binary floating point number needs to be converted to a decimal fraction. This conversion necessarily is inexact. For example, the exact decimal fraction 0.1 converts to the infinitely repeating binary expansion (0.000110011001100…)₂, which needs to be rounded into the binary floating point system. The IEEE 754 standard specifies that the result of a decimal-to-binary conversion, within a specified range of input formats, be the nearest floating point number to the exact result, and that, within a specified range of formats, a binary-to-decimal conversion be the nearest number in the specified format (which depends on the number m of decimal digits requested to be printed).

Thus, the number that one sees as output is usually not exactly the number that is represented in the computer. Furthermore, while the floating point operations on binary numbers are usually implemented in hardware or "firmware," independently of the software system, the decimal-to-binary and binary-to-decimal conversions are usually implemented separately, as part of the programming language (such as Fortran, C, C++, Java, etc.) or software system (such as matlab). The individual standards for these languages, if there are any, may not specify accuracy for such conversions, and the languages sometimes do not conform to the IEEE standard. That is, the number that one sees printed may not even be the closest number in that format to the actual number.
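The inexactness of the decimal-to-binary conversion can be observed directly. A small Python illustration (Python's floats are IEEE doubles):

```python
from fractions import Fraction

x = 0.1                       # stored as the nearest IEEE double to 1/10
print(Fraction(x))            # the exact rational value actually stored

assert float(Fraction(1, 10)) == 0.1   # conversion yields the nearest double...
assert Fraction(x) != Fraction(1, 10)  # ...but the stored value is not 1/10
assert 0.1 + 0.2 != 0.3                # a familiar consequence
```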

This inexactness in conversion usually does not cause a problem, but it may cause much confusion in certain instances. In those instances (such as in "debugging," or finding programming blunders), one may need to examine the binary numbers directly. One way of doing this is in an "octal," or base-8, format, in which each digit (between 0 and 7) is interpreted as a group of


three binary digits, or in hexadecimal format (where the digits are 0–9, A, B, C, D, E, F), in which each digit corresponds to a group of four binary digits.

1.2.2.2 Standard Functions

To enable accurate computation of elementary functions such as sin, cos, and exp, IEEE 754 specifies that a "long," 80-bit register (with "guard digits") be available for intermediate computations. Furthermore, IEEE 754-2008, an official update to IEEE 754-1985, provides a list of functions it recommends be implemented, and specifies accuracy requirements (in terms of correct rounding) for those functions a programming language elects to implement.

REMARK 1.4 Alternative number systems, such as variable precision arithmetic, multiple precision arithmetic, rational arithmetic, and combinations of approximate and symbolic arithmetic, have been investigated and implemented. These have various advantages over the traditional floating point arithmetic we have been discussing, but also have disadvantages, and usually require more time, more circuitry, or both. Eventually, with the advance of computer hardware and better understanding of these alternative systems, their use may become more ubiquitous. However, for the foreseeable future, traditional floating point number systems will be the primary tool in numerical computations.

1.3 Interval Computations

Interval computations are useful for two main purposes:

• to use floating point computations to compute mathematically rigorous bounds on an exact result (and hence to rigorously bound roundoff error);

• to use floating point computations to compute mathematically rigorous bounds on the ranges of functions over boxes.

In complicated traditional floating point algorithms, naive arrangement of interval computations usually gives bounds that are too wide to be of practical use. For this reason, interval computations have been ignored by many. However, used cleverly and where appropriate, interval computations are powerful, and provide rigor and validation when other techniques cannot.

Interval computations are based on interval arithmetic.


1.3.1 Interval Arithmetic

In interval arithmetic, we define operations on intervals, which can be considered as ordered pairs of real numbers. We can think of each interval as representing the range of possible values of a quantity. The result of an operation is then an interval that represents the range of possible results of the operation, as the first argument ranges over all points in the first interval and the second argument ranges over all points in the second interval. To state this symbolically, let x = [x̲, x̄] and y = [y̲, ȳ], and define the four elementary operations by

x ⊙ y = { a ⊙ b | a ∈ x and b ∈ y } for ⊙ ∈ {+, −, ×, ÷}.   (1.1)

Interval arithmetic's usefulness derives from the fact that the mathematical characterization in Equation (1.1) is equivalent to the following operational definitions:

x + y = [x̲ + y̲, x̄ + ȳ],
x − y = [x̲ − ȳ, x̄ − y̲],
x × y = [min{x̲y̲, x̲ȳ, x̄y̲, x̄ȳ}, max{x̲y̲, x̲ȳ, x̄y̲, x̄ȳ}],
1/x = [1/x̄, 1/x̲] if x̲ > 0 or x̄ < 0,
x ÷ y = x × (1/y).   (1.2)

The ranges of the four elementary interval arithmetic operations are exactly the ranges of the corresponding real operations, but if such operations are composed, bounds on the ranges of real functions can be obtained. For example, if

f(x) = (x + 1)(x − 1),   (1.3)

then

f([−2, 2]) = ([−2, 2] + 1)([−2, 2] − 1) = [−1, 3] × [−3, 1] = [−9, 3],

which contains the exact range [−1, 3].
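The operational definitions (1.2) and the evaluation just carried out can be reproduced with a minimal Python interval class (a sketch with names of our choosing; the endpoints here are exactly representable, so no outward rounding is included):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float
    def __add__(self, o):
        o = _as_interval(o)
        return Interval(self.lo + o.lo, self.hi + o.hi)
    def __sub__(self, o):
        o = _as_interval(o)
        return Interval(self.lo - o.hi, self.hi - o.lo)
    def __mul__(self, o):
        o = _as_interval(o)
        p = [self.lo*o.lo, self.lo*o.hi, self.hi*o.lo, self.hi*o.hi]
        return Interval(min(p), max(p))
    def __truediv__(self, o):
        o = _as_interval(o)
        if o.lo <= 0.0 <= o.hi:
            raise ZeroDivisionError("divisor interval contains 0")
        return self * Interval(1.0/o.hi, 1.0/o.lo)   # x / y = x * (1/y)

def _as_interval(v):
    return v if isinstance(v, Interval) else Interval(float(v), float(v))

x = Interval(-2.0, 2.0)
f = (x + 1) * (x - 1)      # the dependent form (x+1)(x-1)
print(f)                   # Interval(lo=-9.0, hi=3.0): overestimates [-1, 3]
```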

REMARK 1.5 In some definitions of interval arithmetic, division by intervals containing 0 is defined, consistent with (1.1). For example,

[1, 2] / [−3, 4] = (−∞, −1/3] ∪ [1/4, ∞) = R* \ (−1/3, 1/4),

where R* is the extended real number system⁴ consisting of the real numbers with the two additional numbers −∞ and ∞. This extended interval arithmetic⁵ was originally invented by William Kahan⁶ for computations with continued fractions, but has wider use than that. Although a closed system can be defined for the sets arising from this extended arithmetic, typically the complements of intervals (i.e., the unions of two semi-infinite intervals) are immediately intersected with intervals, to obtain zero, one, or two intervals. Interval arithmetic can then proceed using (1.2).

⁴Also known as the two-point compactification of the real numbers.

The power of interval arithmetic lies in its implementation on computers. In particular, outwardly rounded interval arithmetic allows rigorous enclosures for the ranges of operations and functions. This makes a qualitative difference in scientific computations, since the results are now intervals in which the exact result must lie. It also enables the use of floating point computations for automated theorem proving.

Outward rounding can be implemented on any machine that has downward rounding and upward rounding, such as any machine that complies with the IEEE 754 standard. For example, take x + y = [x̲ + y̲, x̄ + ȳ]. If x̲ + y̲ is computed with downward rounding and x̄ + ȳ is computed with upward rounding, then the resulting interval z = [z̲, z̄] that is represented in the machine must contain the exact range of x + y for x ∈ x and y ∈ y. We call the expansion of the interval from rounding the lower end point down and the upper end point up roundout error.
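Outward rounding can be imitated without access to the hardware rounding mode by nudging each computed endpoint one unit in the last place outward. The following Python sketch (Python 3.9+ for math.nextafter; the function name add_outward is ours) does this for interval addition; it is slightly more conservative than true directed rounding, but preserves the containment guarantee:

```python
import math
from fractions import Fraction

def add_outward(x, y):
    """Interval sum [x_lo + y_lo, x_hi + y_hi], with each endpoint widened
    outward by one ulp to simulate downward/upward rounding."""
    lo = math.nextafter(x[0] + y[0], -math.inf)  # at or below the exact lower sum
    hi = math.nextafter(x[1] + y[1],  math.inf)  # at or above the exact upper sum
    return (lo, hi)

z = add_outward((0.1, 0.1), (0.2, 0.2))
exact = Fraction(0.1) + Fraction(0.2)            # exact real sum of the two doubles
assert Fraction(z[0]) <= exact <= Fraction(z[1]) # rigorous enclosure
print(z)
```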

Interval arithmetic is only subdistributive. That is, if x, y, and z are intervals, then

x(y + z) ⊆ xy + xz, but x(y + z) ≠ xy + xz in general.   (1.4)

As a result, algebraic expressions that would be equivalent if real values were substituted for the variables are not equivalent if interval values are used. For example, instead of writing (x − 1)(x + 1) for f(x) in (1.3), suppose we write

f(x) = x² − 1,   (1.5)

and suppose we provide a routine that computes an enclosure for the range of x² that is the exact range to within roundoff error. Such a routine could be as follows.

ALGORITHM 1.1

(Computing an interval whose end points are machine numbers and which encloses the range of x²)

INPUT: x = [x̲, x̄].

OUTPUT: a machine-representable interval that contains the range of x² over x.

IF x̲ ≥ 0 THEN
RETURN [x̲², x̄²], where x̲² is computed with downward rounding and x̄² is computed with upward rounding.
ELSE IF x̄ ≤ 0 THEN
RETURN [x̄², x̲²], where x̄² is computed with downward rounding and x̲² is computed with upward rounding.
ELSE
1. Compute machine numbers a, b, c, and d, using downward and upward rounding, such that x̲² ∈ [a, b] and x̄² ∈ [c, d].
2. RETURN [0, max{b, d}].
END IF

END ALGORITHM 1.1.

⁵There are small differences in current definitions of extended interval arithmetic. For example, in some systems, −∞ and ∞ are not considered numbers, but just descriptive symbols. In those systems, [1, 2]/[−3, 4] = (−∞, −1/3] ∪ [1/4, ∞) = R \ (−1/3, 1/4). See [31] for a theoretical analysis of extended arithmetic.
⁶Who also was a major contributor to the IEEE 754 standard.
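The case analysis above can be sketched in Python (the function name isqr is ours, and directed rounding is simulated here by one-ulp nudges via math.nextafter rather than by setting the hardware rounding mode):

```python
import math

def isqr(xlo, xhi):
    """Enclosure of the range of x^2 over [xlo, xhi], following the case
    split of Algorithm 1.1, with outward rounding simulated by one-ulp nudges."""
    down = lambda v: math.nextafter(v, -math.inf)
    up   = lambda v: math.nextafter(v,  math.inf)
    if xlo >= 0.0:
        return (down(xlo * xlo), up(xhi * xhi))
    elif xhi <= 0.0:
        return (down(xhi * xhi), up(xlo * xlo))
    else:
        return (0.0, max(up(xlo * xlo), up(xhi * xhi)))  # 0 is attained inside

lo, hi = isqr(-2.0, 2.0)
print(lo, hi)    # encloses [0, 4], so subtracting 1 encloses [-1, 3]
```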

With Algorithm 1.1, and rewriting f(x) from (1.3) as in (1.5), we obtain

f([−2, 2]) = [−2, 2]² − 1 = [0, 4] − 1 = [−1, 3],

which in this case is equal to the exact range of f over [−2, 2].

In fact, this illustrates a general principle: if each variable in the expression occurs only once, then interval arithmetic gives the exact range, to within roundout error. We state this formally as follows.

THEOREM 1.8

(Fundamental theorem of interval arithmetic) Suppose f(x₁, x₂, …, xₙ) is an algebraic expression in the variables x₁ through xₙ (or a computer program with inputs x₁ through xₙ), and suppose that this expression is evaluated with interval arithmetic. The algebraic expression or computer program can contain the four elementary operations and operations such as xⁿ, sin(x), exp(x), log(x), etc., as long as the interval values of these functions contain their ranges over the input intervals. Then:

1. The interval value f(x₁, …, xₙ) contains the range of f over the interval vector (or box) (x₁, …, xₙ).


2. If the single functions (the elementary operations and functions xⁿ, etc.) have interval values that represent their exact ranges, and if each variable xᵢ, 1 ≤ i ≤ n, occurs only once in the expression for f, then the values of f obtained by interval arithmetic represent the exact ranges of f over the input intervals.

If the expression for f contains one or more variables more than once, then overestimation of the range can occur, due to interval dependency. For example, when we evaluate our example function f([−2, 2]) according to (1.3), the first factor, [−1, 3], is the exact range of x + 1 for x ∈ [−2, 2], while the second factor, [−3, 1], is the exact range of x − 1 for x ∈ [−2, 2]. Thus, [−9, 3] is the exact range of f(x₁, x₂) = (x₁ + 1)(x₂ − 1) for x₁ and x₂ independent, x₁ ∈ [−2, 2], x₂ ∈ [−2, 2].

We now present some definitions and theorems to clarify the practical consequences of interval dependency.

DEFINITION 1.6 An expression for f(x₁, …, xₙ) which is written so that each variable occurs only once is called a single use expression, or SUE.

Fortunately, we do not need to transform every expression into a single use expression for interval computations to be of value. In particular, the interval dependency becomes less as the widths of the input intervals become smaller. The following formal definition will help us to describe this precisely.

DEFINITION 1.7 Suppose an interval evaluation f(x₁, …, xₙ) gives [a, b] as a result interval, but the exact range {f(x₁, …, xₙ) : xᵢ ∈ xᵢ, 1 ≤ i ≤ n} is [c, d] ⊆ [a, b]. We define the excess width E(f; x₁, …, xₙ) in the interval evaluation f(x₁, …, xₙ) by E(f; x₁, …, xₙ) = (c − a) + (b − d).

For example, the excess width in evaluating f(x) represented as (x + 1)(x − 1) over x = [−2, 2] is (−1 − (−9)) + (3 − 3) = 8. In general, we have:

THEOREM 1.9

Suppose f(x₁, x₂, …, xₙ) is an algebraic expression in the variables x₁ through xₙ (or a computer program with inputs x₁ through xₙ), and suppose that this expression is evaluated with interval arithmetic, as in Theorem 1.8, to obtain an interval enclosure f(x₁, …, xₙ) of the range of f for xᵢ ∈ xᵢ, 1 ≤ i ≤ n. Then, if E(f; x₁, …, xₙ) is as in Definition 1.7, we have

E(f; x₁, …, xₙ) = O( max{ w(xᵢ) : 1 ≤ i ≤ n } ),

where w(x) denotes the width of the interval x.


That is, the overestimation becomes less as the uncertainty in the arguments to the function becomes smaller.

Interval evaluations as in Theorem 1.9 are termed first-order interval extensions. It is not difficult to obtain second-order extensions, where required. (See Exercise below.)

1.3.2 Application of Interval Arithmetic: Examples

We give one such example here.

Example 1.22

Using 4-digit decimal floating point arithmetic, compute an interval enclosure for the first two digits of e, and prove that these two digits are correct.

Solution: The fifth degree Taylor polynomial representation for e is

e = 1 + 1 + 1/2! + 1/3! + 1/4! + 1/5! + (1/6!)e^ξ

for some ξ ∈ [0, 1]. If we assume we know e < 3, and we assume we know eˣ is an increasing function of x, then the error term is bounded by

|(1/6!)e^ξ| ≤ 3/6! < 0.005,

so this fifth-degree polynomial representation should be adequate. We will evaluate each term with interval arithmetic, and we will replace e^ξ with [1, 3]. We obtain the following computation:

[1.000, 1.000] + [1.000, 1.000] → [2.000, 2.000]
[1.000, 1.000]/[2.000, 2.000] → [0.5000, 0.5000]
[2.000, 2.000] + [0.5000, 0.5000] → [2.500, 2.500]
[1.000, 1.000]/[6.000, 6.000] → [0.1666, 0.1667]
[2.500, 2.500] + [0.1666, 0.1667] → [2.666, 2.667]
[1.000, 1.000]/[24.00, 24.00] → [0.04166, 0.04167]
[2.666, 2.667] + [0.04166, 0.04167] → [2.707, 2.709]
[1.000, 1.000]/[120.0, 120.0] → [0.008333, 0.008334]
[2.707, 2.709] + [0.008333, 0.008334] → [2.715, 2.718]
[1.000, 1.000]/[720.0, 720.0] → [0.001388, 0.001389]
[0.001388, 0.001389] × [1, 3] → [0.001388, 0.004167]
[2.715, 2.718] + [0.001388, 0.004167] → [2.716, 2.723]

Since we used outward rounding in these computations, this constitutes a mathematical proof that e ∈ [2.716, 2.723].

Notes:


1. These computations can be done automatically on a computer, as simply as evaluating the function in floating point arithmetic. We will explain some programming techniques for this in Chapter 6, Section 6.2.

2. The solution is illustrative. More sophisticated methods, such as argument reduction, would be used in practice to bound values of eˣ more accurately and with fewer operations.
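The enclosure derived in Example 1.22 can be double-checked in exact rational arithmetic. A Python verification sketch (not part of the book's solution):

```python
import math
from fractions import Fraction

# Degree-5 Taylor part of e, plus the remainder (1/6!) * [1, 3], all exactly:
taylor = sum(Fraction(1, math.factorial(k)) for k in range(6))
lo = taylor + Fraction(1, 720)     # remainder lower bound: (1/720) * 1
hi = taylor + Fraction(3, 720)     # remainder upper bound: (1/720) * 3

assert Fraction("2.716") <= lo and hi <= Fraction("2.723")  # inside Example 1.22's enclosure
assert lo < Fraction(math.e) < hi  # the nearest double to e lies inside
print(float(lo), float(hi))
```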

Proofs of the theorems, as well as greater detail, appear in various texts on interval arithmetic. A good book on interval arithmetic is R. E. Moore's classic text [27], although numerous more recent monographs and reviews are available. A World Wide Web search on the term "interval computations" will lead to some of these.

A general introduction to interval computations is [26]. That work gives not only a complete introduction, with numerous examples and explanations of pitfalls, but also provides examples with intlab, a free matlab toolbox for interval computations, and reference material for intlab. If you have matlab available, we recommend intlab for the exercises in this book involving interval computations.

1.4 Programming Environments

Modern scientific computing (with floating point numbers) is usually done with high-level "imperative" (as opposed to "functional") programming languages. Common programming environments in use for general scientific computing today are Fortran (for FORmula TRANslation), C/C++, and matlab. Fortran is the original such language, with its origins in the late 1950's. There is a large body of high-quality, publicly available software in Fortran for common computations in numerical analysis and scientific computing. Such software can be found, for example, on NETLIB, at

http://www.netlib.org

Fortran has evolved over the years, becoming a modern, multi-faceted language with the Fortran 2003 standard. Throughout, the emphasis by both the standardization committee and suppliers of compilers for Fortran has been on features that simplify programming of solutions to large problems in numerical analysis and scientific computing, and on features that enable high performance, especially on computers that can process vectors and matrices efficiently.

The "C" language, originally developed in conjunction with the Unix operating system, was originally meant to be a higher-level language for designing and accessing the operating system, but has become more ubiquitous since then. C++, appearing in the late 1980's, was the first widely available language⁷ to allow the object-oriented programming paradigm. In recent years, computer science departments have favored teaching C++ over teaching Fortran, and Fortran has fallen out of favor in relative terms. However, Fortran is still favored in certain large-scale applications, such as fluid dynamics (e.g., in weather prediction and similar simulations), and some courses are still offered in it in engineering schools. However, some people still think of Fortran as the now somewhat rudimentary language known as FORTRAN 77.

Reasonably high-quality compilers for both Fortran and C/C++ are available free of charge with Linux operating systems. Fortran 2003, largely implemented in these compilers, has a standardized interface to C, so functions written in C can be called from Fortran programs, and vice versa. These compilers include interactive, graphical-user-interface-oriented debuggers, such as "insight," available with the Linux operating system. Commercially available compilation and debugging systems are also available under Windows.

The matlab system has become increasingly popular over the last two decades or so. matlab (or MATrix LABoratory) began in the early 1980's as a National Science Foundation project, written by Cleve Moler in FORTRAN 66, to provide an interactive environment for computing with matrices and vectors; it has since evolved to be both an interactive environment and a full-featured programming language. matlab is highly favored in courses such as this because of the ease of programming, debugging, and general use (such as graphing), and because of the numerous toolboxes, supplied by both Mathworks (Cleve Moler's company) and others, for many basic computing tasks and applications. The main drawback to the use of matlab in all scientific computations is that the language is interpretive; that is, matlab translates each line of a program to machine language each time that it executes the line. This makes complicated programs that involve nested iterations much slower (a factor of 60 or more) than comparable programs written in a compiled language such as C or Fortran. However, functions compiled from Fortran or C/C++ can be called from matlab. A common strategy has been to initially develop algorithms in matlab, then translate all or part of the program to a compilable language, as necessary for efficiency.

One perceived disadvantage of matlab is that it is proprietary. Undesirable possible consequences are that it is not free, and that there is no official guarantee that it will be available, forever unchanged. However, its use has become so widespread in recent years that these concerns do not seem to be major. Several projects, including "Octave" and "Scilab," have produced free products that partially support the matlab programming language. The most widely distributed of these, "Octave," is integrated into Linux systems. However, the object-oriented features of Octave are rudimentary compared to those of matlab, and some toolboxes, such as intlab (which we will mention later), will not function with Octave.

⁷With others, including Fortran, to follow.


Alternative systems sometimes used for scientific computing are computer algebra systems. Perhaps the most common of these are Mathematica and Maple, while a free such system under development is "SAGE." These systems admit a different way of thinking about programming, termed functional programming, in which rules are defined and available at all times to automatically simplify expressions that are presented to the system. (In contrast, in imperative programming, a sequence of commands is executed one after the other.) Although these systems have become comprehensive, they are based in computations of a different character, rather than in the floating point computations and linear algebra involved in numerical analysis and scientific computing.

We will use matlab in this book to illustrate the concepts, techniques, and applications. With newer versions of matlab, a student can study how to use the system and make programs largely by using the matlab help system. The first place to turn will be the "Getting started" demos, which in newer versions are presented as videos. There are also many books devoted to the use of matlab. Furthermore, we will be giving examples throughout this book.

matlab programs can be written as matlab scripts and matlab functions.

Example 1.23

The matlab script we used to produce the table following Example 1.7 (on page 8) is:

    a = 2;
    x = 2;
    xold = x;
    err_old = 1;
    for k=0:10
        k
        x
        err = x - sqrt(2);
        err
        ratio = err/err_old^2
        err_old = err;
        x = x/2 + 1/x;
    end

Example 1.24

The matlab script we used to produce the table in Example 1.8 (on page 9) is:

    format long
    a = 2;
    x = 2;
    xold = x;
    err_old = 1;
    for k=0:25
        k
        x
        err = x - sqrt(2);
        err
        ratio = err/err_old
        err_old = err;
        x = x - x^2/35 + 2/35;
    end

An excellent alternative textbook that focuses on matlab functions is Cleve Moler's Numerical Computing with Matlab [25]. An on-line version, along with ".m" files, etc., is currently available at http://www.mathworks.com/moler/chapters.html.

1.5 Applications

The purpose of the methods and techniques in this book, ultimately, is to provide both accurate predictions and insight into practical problems. This includes understanding and predicting, and managing or controlling, the evolution of ecological systems and epidemics; designing and constructing durable but inexpensive bridges, buildings, roads, and water control structures; understanding chemical and physical processes; designing chemical plants and electronic components and systems; minimizing costs or maximizing delivery of products or services within companies and governments; etc. To achieve these goals, the numerical methods are a small part of the overall modeling process, which can be viewed as consisting of the following steps.

Identify the problem. This is the first step in translating an often vague situation into a mathematical problem to be solved. What questions must be answered, and how can they be quantified?

Assumptions. Which factors are to be ignored, and which are important? The real world is usually significantly more complicated than mathematical models of it, and simplifications must be made, because some factors are poorly understood, because there isn't enough data to determine some minor factors, or because it is not practical to accurately solve the resulting equations unless the model is simplified. For example, the theory of relativity, and variations in the acceleration of gravity due to the fact that the earth is not exactly round and due to the fact that the density differs from point to point on the surface of the earth, in principle will affect the trajectory of a baseball as it leaves the bat. However, such effects can be ignored when we write down a model of the trajectory of the baseball. On the other hand, we need to include such effects if we are measuring the change in distance between two satellites in a tandem orbit to detect the location of mineral deposits on the surface of the earth.

Construction. In this step, we actually translate the problem into mathematical language.

Analysis. We solve the mathematical problem. Here is where the numerical techniques in this book come into play. With more complicated models, there is an interplay between the previous three steps and this solution process: we may need to simplify the model to enable practical solution. Presentation of the result is also important here, to maximize the usefulness of the results. In the early days of scientific computing, printouts of numbers were used, but increasingly, results are presented as two- and three-dimensional graphs and movies.

Interpretation. The numerical solution is compared to the original problem. If it does not make sense, go back and reformulate the assumptions.

Validation. Compare the model to real data. For example, in climate models, the model might be used to predict climate changes in past years before it is used to predict future climate changes.

Note that there is an intrinsic error introduced in the modeling process (such as when certain phenomena are ignored) that is outside the scope of our study of numerical methods. Such error can only be measured indirectly, through the interpretation and validation steps. In the model solution process (the "analysis" step), errors are also introduced due to roundoff error and the approximation process. We have seen that such error consists of approximation error and roundoff error. In a study of numerical methods and numerical analysis, we quantify and find bounds on such errors. Although this may not be the major source of error, it is important to know. Consequences of this type of error might be that a good model is rejected, that incorrect conclusions are deduced about the process being modeled, etc. The authors of this book have personal experience with these events.

Errors in the modeling process can sometimes be quantified in the solution process. If the model depends on parameters that are not known precisely, but bounds on those parameters are known, knowledge of these bounds can sometimes be incorporated into the mathematical equations, and the set of possible solutions can sometimes be computed or bounded. One tool that sometimes works is interval arithmetic. Other tools, less mathematically definite but applicable in different situations, are statistical methods and computing solutions to the model for many different values of the parameters.

Throughout this book, we introduce applications from many areas.

Example 1.25

The formula for the net capacitance when two capacitors of values x and y are connected in series is

    z = xy/(x + y).

Suppose the measured values of x and y are x = 1 and y = 2, respectively. Estimate the range of possible values of z, given that the true values of x and y are known to be within ±10% of the measured values.

In this example, the identification, assumptions, and construction have already been done. (It is well known how capacitances in a linear electrical circuit behave.) We are asked to analyze the error in the output of the computation, due to errors in the data. We may proceed using interval arithmetic, relying on the accuracy assumptions for the measured values. In particular, these assumptions imply that x ∈ [0.9, 1.1] and y ∈ [1.8, 2.2]. We will plug these intervals into the expression for z, but we first use Theorem 1.8, part (2), as a guide to rewrite the expression for z so that x and y only occur once. (We do this so we obtain sharp bounds on the range, without overestimation.) Dividing the numerator and denominator for z by xy, we obtain

    z = 1/(1/x + 1/y).

We use the intlab toolbox⁸ for matlab to evaluate z. We have the following dialog in matlab's command window:

    >> intvalinit('DisplayInfsup')
    ===> Default display of intervals by infimum/supremum
    >> x = intval('[0.9,1.1]')
    intval x =
    [    0.8999,    1.1001]
    >> y = intval('[1.8,2.2]')
    intval y =
    [    1.7999,    2.2001]
    >> z = 1/(1/x + 1/y)
    intval z =
    [    0.5999,    0.7334]
    >> format long
    >> z
    intval z =
    [   0.59999999999999,   0.73333333333334]
    >>

Thus, the capacitance must lie between 0.5999 and 0.7334. Note that x and y are input as strings. This is to assure that roundoff errors in converting the decimal expressions 0.9, 1.1, 1.8, and 2.2 into internal binary format are taken into account. See [26] for more examples of the use of intlab.

⁸If one has matlab, intlab is available free of charge for non-commercial use from http://www.ti3.tu-harburg.de/~rump/intlab.
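The effect of rewriting z so that each variable occurs only once can be observed outside of intlab as well. The following is a minimal sketch in Python (an illustration of the idea only: unlike intlab, it does not round endpoints outward, so its bounds are not fully rigorous; the helper functions are ours, not from any library):

```python
from itertools import product

def imul(a, b):
    # Interval product: take min/max over products of endpoints.
    ps = [p * q for p, q in product(a, b)]
    return (min(ps), max(ps))

def iadd(a, b):
    return (a[0] + b[0], a[1] + b[1])

def iinv(a):
    # Interval reciprocal; assumes 0 is not in the interval.
    return (1.0 / a[1], 1.0 / a[0])

x = (0.9, 1.1)
y = (1.8, 2.2)

# Naive evaluation of z = xy/(x+y): x and y each occur twice,
# so the result overestimates the true range (the "dependency problem").
naive = imul(imul(x, y), iinv(iadd(x, y)))

# Rewritten form z = 1/(1/x + 1/y): each variable occurs once -> sharp bounds.
sharp = iinv(iadd(iinv(x), iinv(y)))

print("naive:", naive)
print("sharp:", sharp)   # approximately (0.6, 0.7333...), as in the intlab dialog
```

The sharp enclosure reproduces the interval [0.5999, 0.7334] computed above, while the naive form gives a strictly wider interval, illustrating why the expression was rewritten first.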

1.6 Exercises

1. Write down a polynomial p(x) such that |S(x) − p(x)| ≤ 10⁻¹⁰ for −0.2 ≤ x ≤ 0.2, where

       S(x) = sin(x)/x   if x ≠ 0,
       S(x) = 1          if x = 0.

   Note: sinc(x) = S(πx) = sin(πx)/(πx) is the "sinc" function (well known in signal processing, etc.).

   (a) Show that your polynomial p satisfies the condition |sinc(x) − p(x)| ≤ 10⁻¹⁰ for x ∈ [−0.2, 0.2].
       Hint: You can obtain polynomial approximations with error terms for sinc(x) by writing down Taylor polynomials and corresponding error terms for sin(x), then dividing these by x. This can be easier than trying to differentiate sinc(x). For the proof part, you can use, for example, the Taylor polynomial remainder formula or the alternating series test, and you can use interval arithmetic to obtain bounds.

   (b) Plot your polynomial approximation and sinc(x) on the same graph,

       (i) over the interval [−0.2, 0.2];
       (ii) over the interval [−3, 3];
       (iii) over the interval [−10, 10].

2. Suppose f has a continuous third derivative. Show that

       | [f(x + h) − f(x − h)] / (2h) − f′(x) | = O(h²).

3. Suppose f has a continuous fourth derivative. Show that

       | [f(x + h) − 2f(x) + f(x − h)] / h² − f″(x) | = O(h²).
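Though Exercises 2 and 3 ask for proofs, the O(h²) behavior of both difference quotients is easy to observe numerically: halving h should reduce each error by a factor of about 4. A sketch in Python (the test function exp and the point x = 1 are arbitrary illustrative choices):

```python
import math

def central_first(f, x, h):
    # (f(x+h) - f(x-h)) / (2h): second-order approximation to f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

def central_second(f, x, h):
    # (f(x+h) - 2 f(x) + f(x-h)) / h^2: second-order approximation to f''(x)
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

# For f = exp at x = 1, f'(1) = f''(1) = e, so the errors are easy to track.
for h in (1e-2, 5e-3, 2.5e-3):
    e1 = abs(central_first(math.exp, 1.0, h) - math.e)
    e2 = abs(central_second(math.exp, 1.0, h) - math.e)
    print(f"h = {h:.4g}: first-derivative error {e1:.3e}, second-derivative error {e2:.3e}")
```

For h much smaller than this, roundoff error (the subject of this chapter) eventually dominates the truncation error and the factor-of-4 pattern breaks down.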


4. Let a = 0.41, b = 0.36, and c = 0.7. Assuming a 2-digit decimal computer arithmetic with rounding, show that (a − b)/c ≠ a/c − b/c when using this arithmetic.
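Toy arithmetics like the one in this exercise can be simulated by rounding after every operation. A sketch in Python, rounding each result to 2 significant decimal digits via exponential formatting (the helper fl2 is ours, not part of any library):

```python
def fl2(v):
    # Round v to 2 significant decimal digits ("2-digit decimal arithmetic").
    return float(f"{v:.1e}")

a, b, c = 0.41, 0.36, 0.7

lhs = fl2(fl2(a - b) / c)            # (a - b)/c, rounding after each operation
rhs = fl2(fl2(a / c) - fl2(b / c))   # a/c - b/c, rounding after each operation
print(lhs, rhs)   # the two mathematically equal expressions differ
```

The same formatting trick can simulate any number of significant decimal digits, which is convenient for checking hand computations in this kind of exercise.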

5. Write down a formula relating the unit roundoff δ of Definition 1.3 (page 14) and the machine epsilon ε_m defined on page 20.

6. Store and run the following matlab script. What are your results? What does the script compute? Can you say anything about the computer arithmetic underlying matlab?

       eps = 1;
       x = 1 + eps;
       while (x ~= 1)
           eps = eps/2; x = 1 + eps;
       end
       eps = eps + (2*eps)^2;
       y = 1 + eps;
       y - 1

7. Suppose, for illustration, we have a system with base β = 10, t = 3 decimal digits in the mantissa, and L = −9, U = 9 for the exponent. For example, 0.123 × 10⁴, that is, 1230, is a machine number in this system. Suppose also that "round to nearest" is used in this system.

   (a) What is HUGE for this system?

   (b) What is TINY for this system?

   (c) What is the machine epsilon ε_m for this system?

   (d) Let f(x) = sin(x) + 1.

       i. Write down fl(f(0)) and fl(f(0.0008)) in normalized format for this toy system.

       ii. Compute fl(fl(f(0.0008)) − fl(f(0))). On the other hand, what is the nearest machine number to the exact value of f(0.0008) − f(0)?

       iii. Compute fl(fl(f(0.0008)) − fl(f(0))) / fl(0.0008). Compare this to the nearest machine number to the exact value of (f(0.0008) − f(0))/0.0008, and to f′(0).

8. Let f(x) = ln(x + 1) − ln(x).

   (a) Use four-digit decimal arithmetic with rounding to evaluate f(100,000).


   (b) Use the Mean Value Theorem to approximate f(x) in a form that avoids the loss of significant digits. Use this form to evaluate f(x) for x = 100,000 once again.

   (c) Compare the relative errors for the answers obtained in (a) and (b).

9. Compute the condition number of f(x) = e^(√(x² − 1)), x > 1, and discuss any possible ill-conditioning.

10. Let f(x) = (sin(x))² + x/2. Use interval arithmetic to prove that there are no solutions to f(x) = 0 for x ∈ [−1, −0.8].


DEFINITION 1.3 δ = (p/2) β^(1−t) is called the unit roundoff error.

Let ε = (fl(x) − x)/x. Then fl(x) = (1 + ε)x, where |ε| ≤ δ. With this, we have the following.

THEOREM 1.6

Let ⊙ denote the operation +, −, ×, or ÷, and let x and y be machine numbers. Then

    fl(x ⊙ y) = (x ⊙ y)(1 + ε), where |ε| ≤ δ = (p/2) β^(1−t).

Roundoff error that accumulates as a result of a sequence of arithmetic operations can be analyzed using this theorem. Such an analysis is called forward error analysis.

Because of the properties of floating point arithmetic, it is unreasonable to demand strict absolute tolerances when the exact result is large.

Example 1.13

Suppose β = 10 and t = 3 (3-digit decimal arithmetic), and suppose we wish to compute 10⁴π with a computed value x such that |10⁴π − x| < 10⁻². The closest floating point number in our system to 10⁴π is x = 0.314 × 10⁵ = 31400. However, |10⁴π − x| = 15.926.... Hence, it is impossible to find a number x in the system with |10⁴π − x| < 10⁻².

The error |10⁴π − x| in this example is called the absolute error in approximating 10⁴π. We see that absolute error is not an appropriate measure of error when using floating point arithmetic. For this reason, we use relative error.

DEFINITION 1.4 Let x* be an approximation to x. Then |x − x*| is called the absolute error, and |(x − x*)/x| is called the relative error.

For example,

    |(x − fl(x))/x| ≤ δ = (p/2) β^(1−t)   (unit roundoff error).

1.2.1.1 Expression Evaluation and Condition Numbers

We now examine some common situations in which roundoff error can become large, and explain how to avoid many of these situations.


Example 1.14

Let β = 10, t = 4, p = 1 (thus δ = (1/2) × 10⁻³ = 0.0005), x = 0.5795 × 10⁵, and y = 0.6399 × 10⁵. Then

    fl(x + y) = 0.1219 × 10⁶ = (x + y)(1 + ε₁),  ε₁ ≈ −3.28 × 10⁻⁴, |ε₁| < δ,

and

    fl(xy) = 0.3708 × 10¹⁰ = (xy)(1 + ε₂),  ε₂ ≈ −5.95 × 10⁻⁵, |ε₂| < δ.

(Note: x + y = 0.12194 × 10⁶, xy = 0.37082205 × 10¹⁰.)

Example 1.15

Suppose β = 10 and t = 4 (4-digit arithmetic), and suppose x₁ = 10000 and x₂ = x₃ = ··· = x₁₀₀₁ = 1. Then

    fl(x₁ + x₂) = 10000,
    fl(x₁ + x₂ + x₃) = 10000,
    ...
    fl(x₁ + x₂ + ··· + x₁₀₀₁) = 10000

when we sum forward from x₁. But, going backwards,

    fl(x₁₀₀₁ + x₁₀₀₀) = 2,
    fl(x₁₀₀₁ + x₁₀₀₀ + x₉₉₉) = 3,
    ...
    fl(x₁₀₀₁ + x₁₀₀₀ + ··· + x₁) = 11000,

which is the correct sum.

This example illustrates the point that large relative errors occur when a large number of small numbers is added to a large number, or when a very large number of small, almost-equal numbers is added. To avoid such large relative errors, one can sum from the smallest number to the largest number. However, this will not work if the numbers are all approximately equal. In such cases, one possibility is to group the numbers into sets of two adjacent numbers, summing two almost equal numbers together. One then groups those results into sets of two and sums these together, continuing until the total sum is reached. In this scheme, two almost equal numbers are always being summed, and the large relative error from repeatedly summing a small number to a large number is avoided.
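The forward and backward sums of Example 1.15 can be reproduced by simulating the 4-digit decimal arithmetic. A sketch in Python (the rounding helper fl is ours, defined via exponential formatting):

```python
def fl(v):
    # Round v to 4 significant decimal digits, simulating beta = 10, t = 4.
    return float(f"{v:.3e}")

xs = [10000.0] + [1.0] * 1000   # x_1 = 10000, x_2 = ... = x_1001 = 1

# Forward sum: each "+1" is lost, since 10001 rounds back to 10000.
forward = 0.0
for v in xs:
    forward = fl(forward + v)

# Backward sum: the ones accumulate exactly to 1000 before 10000 is added.
backward = 0.0
for v in reversed(xs):
    backward = fl(backward + v)

print(forward, backward)   # forward misses the small terms; backward is exact
```

Every partial sum of the backward loop (1, 2, ..., 1000, 11000) happens to be representable with 4 significant digits, which is why the backward result is exactly correct here.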


Example 1.16

Let x₁ = 15.314768, x₂ = 15.314899, β = 10, and t = 6 (6-digit decimal accuracy). Then x₂ − x₁ ≈ fl(x₂) − fl(x₁) = 15.3149 − 15.3148 = 0.0001. Thus,

    |[x₂ − x₁ − (fl(x₂) − fl(x₁))] / (x₂ − x₁)| = (0.000131 − 0.0001)/0.000131 ≈ 0.237,

i.e., 23.7% relative accuracy.

This example illustrates that large relative errors can occur when two nearly equal numbers are subtracted on a computer. Sometimes, an algorithm can be modified to reduce the rounding error occurring from this source, as the following example illustrates.

Example 1.17

Consider finding the roots of ax² + bx + c = 0, where b² is large compared with |4ac|. The most common formula for the roots is

    x = (−b ± √(b² − 4ac)) / (2a).

Consider x² + 100x + 1 = 0, with β = 10, t = 4, p = 2, and 4-digit chopped arithmetic. Then

    x₁ = (−100 + √9996)/2,  x₂ = (−100 − √9996)/2,

and √9996 ≈ 99.97 (4-digit chopped arithmetic). Thus,

    x₁ ≈ (−100 + 99.97)/2,  x₂ ≈ (−100 − 99.97)/2.

Hence, x₁ ≈ −0.015 and x₂ ≈ −99.98, but x₁ = −0.010001... and x₂ = −99.989999..., so the relative errors in x₁ and x₂ are about 50% and 0.01%, respectively.

Let's change the algorithm. Assume b ≥ 0 (we can always arrange that b ≥ 0). Then, multiplying the numerator and denominator by −b − √(b² − 4ac),

    x₁ = (−b + √(b² − 4ac)) / (2a)
       = [b² − (b² − 4ac)] / [2a(−b − √(b² − 4ac))]
       = −2c / (b + √(b² − 4ac)),

and

    x₂ = (−b − √(b² − 4ac)) / (2a)   (the same as before).


Then, for the above values,

    x₁ = −2(1) / (100 + 99.97) = −2/199.9 ≈ −0.0100.

Now the relative error in x₁ is also 0.01%.
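The same cancellation, and the same fix, occur in IEEE double precision when b² is huge compared with |4ac|. A sketch in Python (the coefficients are our illustrative choice; the true root of small magnitude is approximately −10⁻⁸):

```python
import math

def naive_small_root(a, b, c):
    # (-b + sqrt(b^2 - 4ac)) / (2a): cancels catastrophically when b >= 0, b^2 >> |4ac|
    return (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)

def stable_small_root(a, b, c):
    # -2c / (b + sqrt(b^2 - 4ac)): the rewritten form, no cancellation for b >= 0
    return -2 * c / (b + math.sqrt(b * b - 4 * a * c))

a, b, c = 1.0, 1.0e8, 1.0       # x^2 + 1e8 x + 1 = 0
x_naive = naive_small_root(a, b, c)
x_stable = stable_small_root(a, b, c)
print(x_naive, x_stable)         # naive value is visibly wrong in the leading digits
```

Here √(b² − 4ac) agrees with b in nearly all of its digits, so the subtraction in the naive formula leaves only roundoff, while the rewritten formula involves no subtraction of nearly equal quantities.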

Let us now consider error in function evaluation. Consider a single-valued function f(x), and let x* = fl(x) be the floating point approximation of x. Therefore, the machine evaluates f(x*) = f(fl(x)), which is an approximate value of f(x) at x = x*. The perturbation in f(x) for small perturbations in x can then be computed via Taylor's formula. This is illustrated in the next theorem.

THEOREM 1.7 The relative error in function evaluation is

    |(f(x) − f(x*)) / f(x)| ≈ |x f′(x) / f(x)| · |(x − x*)/x|.

PROOF The linear Taylor approximation of f(x*) about f(x), for small values of |x − x*|, is given by f(x*) ≈ f(x) + f′(x)(x* − x). Rearranging the terms immediately yields the result.

This leads us to the following definition.

DEFINITION 1.5 The condition number of a function f(x) is

    κ_f(x) = |x f′(x) / f(x)|.

The condition number describes how large the relative error in function evaluation is with respect to the relative error in the machine representation of x. In other words, κ_f(x) is a measure of the degree of sensitivity of the function at x.

Example 1.18

Let f(x) = √x. The condition number of f(x) about x is

    κ_f(x) = |x · (1/(2√x)) / √x| = 1/2.

This suggests that f(x) is well-conditioned.


Example 1.19

Let f(x) = √(x − 2). The condition number of f(x) about x is

    κ_f(x) = |x / (2(x − 2))|.

This is not defined at x = 2. Hence, the function f(x) is numerically unstable and ill-conditioned for values of x close to 2.
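κ_f(x) can also be estimated numerically, replacing f′(x) by a difference quotient. A sketch in Python (the step size h = 10⁻⁶ is an arbitrary illustrative choice): for f(x) = √(x − 2), the estimate grows without bound as x approaches 2, matching Example 1.19.

```python
import math

def cond(f, x, h=1e-6):
    # kappa_f(x) = |x f'(x) / f(x)|, with f'(x) estimated by a central difference.
    fprime = (f(x + h) - f(x - h)) / (2 * h)
    return abs(x * fprime / f(x))

print(cond(math.sqrt, 4.0))                          # ~0.5, as in Example 1.18
print(cond(lambda t: math.sqrt(t - 2.0), 2.01))      # large: x is close to 2
print(cond(lambda t: math.sqrt(t - 2.0), 2.0001))    # much larger still
```

Such a numerical estimate is useful as a quick check of hand-computed condition numbers, though near a singularity the difference quotient itself eventually suffers from the cancellation discussed above.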

REMARK 1.2 If x = f(x) = 0, then the condition number is simply |f′(x)|. If x = 0, f(x) ≠ 0 (or f(x) = 0, x ≠ 0), then it is more useful to consider the relation between absolute errors than relative errors; the condition number then becomes |f′(x)/f(x)| (or |x f′(x)|, respectively).

REMARK 1.3 Generally, if a numerical approximation z* to a quantity z is computed, the relative error is related to the number of significant digits that are correct. For example, if z = 0.0000123453 and z* = 0.00001234543, we say that z* is correct to 5 significant digits. Expressing z as 0.123453 × 10⁻⁴ and z* as 0.1234543 × 10⁻⁴, we see that if we round z* to the nearest number with five digits in its mantissa, all of those digits are correct, whereas if we do the same with six digits, the sixth digit is not correct. Significant digits is the more logical way to talk about accuracy in a floating point computation, where we are interested in relative error, rather than "number of digits after the decimal point," which can have a different meaning. (Here, one might say that z* is correct to 9 digits after the decimal point.)

1.2.2 Practicalities and the IEEE Floating Point Standard

Prior to 1985, different machines used different word lengths and different bases, and different machines rounded, chopped, or did something else to form the internal representation fl(x) for real numbers x. For example, IBM mainframes generally used hexadecimal arithmetic (β = 16), with 8 hexadecimal digits total (for the base, sign, and exponent) in "single precision" numbers, and 16 hexadecimal digits total in "double precision" numbers. Machines such as the Univac 1108 and Honeywell Multics systems used base β = 2, with 36 binary digits (or "bits") total in single precision numbers and 72 binary digits total in double precision numbers. An unusual machine designed at Moscow State University from 1955-1965, the "Setun," even used base-3 (β = 3, or "ternary") numbers. Some computers had 32 bits total in single precision numbers and 64 bits total in double precision numbers, while some "supercomputers" (such as the Cray-1) had 64 bits total in single precision numbers and 128 bits total in double precision numbers.

Some hand-held calculators in existence today (such as some Texas Instruments calculators) can be viewed as implementing decimal (base 10, β = 10) arithmetic, say, with L = −999, U = 999, and t = 14 digits in the mantissa.

Except for the Setun (the value of whose ternary digits corresponded to "positive," "negative," and "neutral" in circuit elements or switches), digital computers are mostly based on binary switches or circuit elements (that is, "on" or "off"), so the base β is usually 2 or a power of 2. For example, the IBM hexadecimal digit can be viewed as a group of 4 binary digits.²

Older floating point implementations did not even always fit exactly into the model we have previously described. For example, if x was a number in the system, then −x may not have been a number in the system, or, if x was a number in the system, then 1/x may have been too large to be representable in the system.

To promote predictability, portability, reliability, and rigorous error bounding in floating point computations, the Institute of Electrical and Electronics Engineers (IEEE) and the American National Standards Institute (ANSI) published a standard for binary floating point arithmetic in 1985: IEEE/ANSI 754-1985, Standard for Binary Floating Point Arithmetic, often referenced as "IEEE-754" or simply "the IEEE standard."³ Almost all computers in existence today, including personal computers and workstations based on Intel, AMD, Motorola, etc. chips, implement most of the IEEE standard.

In this standard, β = 2, 32 bits total are used in a single precision number (an "IEEE single"), and 64 bits total are used for a double precision number ("IEEE double"). In a single precision number, 1 bit is used for the sign, 8 bits are used for the exponent, and t = 23 bits are used for the mantissa. In double precision numbers, 1 bit is used for the sign, 11 bits are used for the exponent, and 52 bits are used for the mantissa. Thus, for single precision numbers, the stored exponent is between 0 and (11111111)₂ = 255, and 127 is subtracted from this to get an exponent between −127 and 128. In IEEE numbers, the minimum and maximum exponents are used to denote special symbols (such as infinity and "unnormalized" numbers), so the exponent in single precision represents magnitudes between 2⁻¹²⁶ ≈ 10⁻³⁸ and 2¹²⁷ ≈ 10³⁸. The mantissa for single precision numbers represents numbers between 2⁰ = 1 and 1 + 2⁻¹ + ··· + 2⁻²³ = 2(1 − 2⁻²⁴) ≈ 2. Similarly, the exponent for double precision numbers effectively represents magnitudes between 2⁻¹⁰²² ≈ 10⁻³⁰⁸ and 2¹⁰²³ ≈ 10³⁰⁸, while the mantissa for double precision numbers represents numbers between 2⁰ = 1 and 1 + 2⁻¹ + ··· + 2⁻⁵² ≈ 2.

Summarizing, the parameters for IEEE arithmetic appear in Table 1.1.

In many numerical computations, such as solving the large linear systems arising from partial differential equation models, more digits or a larger exponent range is required than is available with IEEE single precision. For

²An exception is in some systems for business calculations, where base 10 is implemented.
³An update to the 1985 standard was made in 2008. This update gives clarifications of certain ambiguous points, provides certain extensions, and specifies a standard for decimal arithmetic.


TABLE 1.1: Parameters for IEEE arithmetic

    precision   β      L       U     t
    single      2    −126     127   24
    double      2   −1022    1023   53

this reason, many numerical analysts at present have adopted IEEE double precision as the default precision. For example, underlying computations in the popular computational environment matlab are done in IEEE double precision.

IEEE arithmetic provides four ways of defining fl(x), that is, four "rounding modes": "round down," "round up," "round to nearest," and "round to zero." The four elementary operations +, −, ×, and ÷ must be such that fl(x ⊙ y) is implemented for all four rounding modes, for ⊙ ∈ {+, −, ×, ÷, √·}.

The default mode (if the rounding mode is not explicitly set) is normally "round to nearest," to give an approximation, after a long string of computations, that is hopefully near the exact value. If the mode is set to "round down" and a string of computations is done, then the result is less than or equal to the exact result. Similarly, if the mode is set to "round up," then the result of a string of computations is greater than or equal to the exact result. In this way, mathematically rigorous bounds on an exact result can be obtained. (This technique must be used astutely, since naive use could result in bounds that are too large to be meaningful.)

Several parameters more directly related to numerical computations than L, U, and t are associated with any floating point number system. These are:

HUGE: the largest representable number in the floating point system;

TINY: the smallest positive representable number in the floating point system;

ε_m: the machine epsilon, the smallest positive number which, when added to 1, gives something other than 1 when using the rounding mode "round to nearest."

These so-called "machine constants" appear in Table 1.2 for the IEEE single and IEEE double precision number systems.

For IEEE arithmetic, 1/TINY < HUGE, but 1/HUGE < TINY. This brings up the question of what happens when the result of a computation has absolute value less than the smallest number representable in the system, or has absolute value greater than the largest number representable in the system. In the first case, an underflow occurs, while in the second case, an overflow occurs. In floating point computations, it is usually (but not always) reasonable to replace the result of an underflow by 0, but it is usually more problematical


TABLE 1.2: Machine constants for IEEE arithmetic

    Precision   HUGE                     TINY                       ε_m
    single      2¹²⁷ ≈ 3.40 · 10³⁸       2⁻¹²⁶ ≈ 1.18 · 10⁻³⁸       2⁻²⁴ + 2⁻⁴⁵ ≈ 5.96 · 10⁻⁸
    double      2¹⁰²³ ≈ 1.79 · 10³⁰⁸     2⁻¹⁰²² ≈ 2.23 · 10⁻³⁰⁸     2⁻⁵³ + 2⁻¹⁰⁵ ≈ 1.11 · 10⁻¹⁶

when an overflow occurs. Many systems prior to the IEEE standard replaced an underflow by 0, but stopped when an overflow occurred.

The IEEE standard specifies representations for the special numbers ∞, −∞, +0, −0, and NaN, where the latter represents "not a number." The standard specifies that computations do not stop when an overflow or underflow occurs, or when quantities such as 1/0, −1/0, etc. are encountered (although many programming languages, by default or optionally, do stop). For example, the result of an overflow is set to ∞, whereas the result of √−1 is set to NaN, and computation continues. The standard also specifies "gradual underflow," that is, setting the result to a "denormalized" number, a number in the floating point format whose first digit in the mantissa is equal to 0. Computation rules for these special numbers, such as NaN × any number = NaN and ∞ × any positive normalized number = ∞, allow such "nonstop" arithmetic.

Although the IEEE nonstop arithmetic is useful in many contexts, the numerical analyst should be aware of it and be cautious in interpreting results. In particular, algorithms may not behave as expected if many intermediate results contain ∞ or NaN, and the accuracy is less than expected when denormalized numbers are used. In fact, many programming languages, by default or with a controllable option, stop if ∞ or NaN occurs, but implement IEEE nonstop arithmetic with an option.

Example 1.20

IEEE double precision floating point arithmetic underlies most computations in matlab. (This is true even if only four decimal digits are displayed.) One obtains the machine epsilon with the function eps, TINY with the function realmin, and HUGE with the function realmax. Observe the following matlab dialog:

    >> epsm = eps(1d0)
    epsm = 2.2204e-016
    >> TINY = realmin
    TINY = 2.2251e-308
    >> HUGE = realmax
    HUGE = 1.7977e+308
    >> 1/TINY
    ans = 4.4942e+307
    >> 1/HUGE
    ans = 5.5627e-309
    >> HUGE^2
    ans = Inf
    >> TINY^2
    ans = 0
    >> new_val = 1+epsm
    new_val = 1.0000
    >> new_val - 1
    ans = 2.2204e-016
    >> too_small = epsm/2
    too_small = 1.1102e-016
    >> not_new = 1+too_small
    not_new = 1
    >> not_new - 1
    ans = 0
    >>
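As an aside (not part of the matlab dialog above): Python, whose floats are also IEEE doubles, exposes the same machine constants through sys.float_info, and exhibits the same behavior when half of the spacing at 1 is added to 1:

```python
import sys

print(sys.float_info.epsilon)   # 2^-52, the spacing between 1 and the next double;
                                # this is what matlab's eps(1d0) returns
print(sys.float_info.min)       # smallest positive normalized double (cf. realmin)
print(sys.float_info.max)       # largest finite double (cf. realmax)

# Adding the spacing to 1 is visible; adding half of it rounds back to 1
# under the default "round to nearest" mode.
print(1.0 + sys.float_info.epsilon != 1.0)       # True
print(1.0 + sys.float_info.epsilon / 2 == 1.0)   # True
```

Note that this "epsilon" (2⁻⁵², the spacing of doubles near 1) is twice the ε_m of Table 1.2 (2⁻⁵³ + 2⁻¹⁰⁵, the smallest number whose rounded sum with 1 exceeds 1); both conventions are in common use.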

Example 1.21

(Illustration of underflow and overflow) Suppose, for the purposes of illustration, we have a system with β = 10, t = 2, and one digit in the exponent, so that the positive numbers in the system range from 0.10 × 10⁻⁹ to 0.99 × 10⁹, and suppose we wish to compute N = √(x₁² + x₂²), where x₁ = x₂ = 10⁶. Then both x₁ and x₂ are exactly represented in the system, and the nearest floating point number in the system to N is 0.14 × 10⁷, well within range. However, x₁² = 10¹², larger than the maximum floating point number in the system. In older systems, an overflow usually would result in stopping the computation, while in IEEE arithmetic, the result would be assigned the symbol "Infinity." The result of adding "Infinity" to "Infinity," then taking the square root, would be "Infinity," so N would be assigned "Infinity." Similarly, if x₁ = x₂ = 10⁻⁶, then x₁² = 10⁻¹², smaller than the smallest representable machine number, causing an "underflow." On older systems, the result is usually set to 0. On IEEE systems, if "gradual underflow" is switched on, the result either becomes a denormalized number, with less than full accuracy, or is set to 0; without gradual underflow on IEEE systems, the result is set to 0. When the result is set to 0, a value of 0 is stored in N, whereas the closest floating point number in the system is 0.14 × 10⁻⁵, well within range. To avoid this type of catastrophic underflow and overflow in the computation of N, we may use the following scheme:

1. s ← max{|x₁|, |x₂|};

2. η₁ ← x₁/s, η₂ ← x₂/s;

3. N ← s√(η₁² + η₂²).
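The three-step scaling scheme above translates directly into code. A sketch in Python doubles, where the roles of the system's largest and smallest positive numbers are played by roughly 1.8 · 10³⁰⁸ and 2.2 · 10⁻³⁰⁸:

```python
import math

def scaled_norm(x1, x2):
    # N = sqrt(x1^2 + x2^2) computed via the scaling scheme:
    # s = max(|x1|, |x2|); eta_i = x_i / s; N = s * sqrt(eta1^2 + eta2^2).
    s = max(abs(x1), abs(x2))
    if s == 0.0:
        return 0.0               # both inputs zero; avoid dividing by zero
    eta1, eta2 = x1 / s, x2 / s  # |eta_i| <= 1, so the squares cannot overflow
    return s * math.sqrt(eta1 * eta1 + eta2 * eta2)

big = 1.0e200
naive = math.sqrt(big * big + big * big)   # big*big overflows to Inf
print(naive)                               # inf
print(scaled_norm(big, big))               # ~1.4142e200, the correct value
```

This is the same idea used by library functions such as hypot; the scaled squares satisfy η₁² + η₂² ≤ 2, so no intermediate overflow is possible, and the only underflow that can occur is harmless.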


1.2.2.1 Input and Output

For examining the output of large numerical computations arising from mathematical models, plots, graphs, and movies comprised of such plots and graphs are often preferred over tables of values. However, to develop such models and study numerical algorithms, it is necessary to examine individual numbers. Because humans are trained to comprehend decimal numbers more easily than binary numbers, the binary format used in the machine is usually converted to a decimal format for display or printing. In many programming languages and environments (such as all versions of Fortran, C, C++, and matlab), the format is of a form similar to ±d₁.d₂d₃···d_m e±δ₁δ₂δ₃ or ±d₁.d₂d₃···d_m E±δ₁δ₂δ₃, where the "e" or "E" denotes the "exponent" of 10. For example, -1.00e+003 denotes −1 × 10³ = −1000. Numbers are usually also input either in a standard decimal form (such as 0.001) or in this exponential format (such as 1.0e-3). (This notation originates from the earliest computers, where the only output was a printer, and the printer could only print numerical digits and the 26 upper case letters of the Roman alphabet.)

Thus, for input, a decimal fraction needs to be converted to a binary floating point number, while for output, a binary floating point number needs to be converted to a decimal fraction. This conversion is necessarily inexact. For example, the exact decimal fraction 0.1 converts to the infinitely repeating binary expansion (0.0001100110011…)₂ (with the block 0011 repeating), which needs to be rounded into the binary floating point system. The IEEE 754 standard specifies that the result of a decimal to binary conversion, within a specified range of input formats, be the nearest floating point number to the exact result over a specified range, and that, within a specified range of formats, a binary to decimal conversion be the nearest number in the specified format (which depends on the number m of decimal digits requested to be printed).
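This inexactness can be observed directly in Python, whose standard-library Decimal type displays the exact value of the binary double nearest to 0.1:

```python
from decimal import Decimal

# Decimal(0.1) shows the exact value of the double that the literal 0.1
# rounds to; it is slightly larger than one tenth.
print(Decimal(0.1))

# The rounding of each converted literal shows up in arithmetic.
print(0.1 + 0.2 == 0.3)  # prints False
```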

Thus, the number that one sees as output is usually not exactly the number that is represented in the computer. Furthermore, while the floating point operations on binary numbers are usually implemented in hardware or "firmware" independently of the software system, the decimal to binary and binary to decimal conversions are usually implemented separately, as part of the programming language (such as Fortran, C, C++, Java, etc.) or software system (such as matlab). The individual standards for these languages, if there are any, may not specify accuracy for such conversions, and the languages sometimes do not conform to the IEEE standard. That is, the number that one sees printed may not even be the closest number in that format to the actual number.

This inexactness in conversion usually does not cause a problem, but may cause much confusion in certain instances. In those instances (such as in "debugging," or finding programming blunders), one may need to examine the binary numbers directly. One way of doing this is in an "octal," or base-8, format, in which each digit (between 0 and 7) is interpreted as a group of three binary digits, or in hexadecimal format (where the digits are 0-9, A, B, C, D, E, F), in which each digit corresponds to a group of four binary digits.
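For example, a small Python helper (our illustration; struct is from the standard library) can display the raw 64-bit pattern of an IEEE double as 16 hexadecimal digits:

```python
import struct

def double_to_hex(x):
    """Return the raw 64-bit IEEE 754 pattern of the double x as 16 hex
    digits; each hex digit corresponds to a group of four binary digits."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return f"{bits:016x}"

print(double_to_hex(1.0))  # prints 3ff0000000000000
print(double_to_hex(0.1))  # prints 3fb999999999999a
```

Reading the second line, 0.1 is stored with the repeating mantissa pattern 9999…99A, the trailing A reflecting the rounding of the infinite binary expansion.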

1.2.2.2 Standard Functions

To enable accurate computation of elementary functions such as sin, cos, and exp, IEEE 754 specifies that a "long" 80-bit register (with "guard digits") be available for intermediate computations. Furthermore, IEEE 754-2008, an official update to IEEE 754-1985, provides a list of functions it recommends be implemented, and specifies accuracy requirements (in terms of correct rounding) for those functions a programming language elects to implement.

REMARK 1.4 Alternative number systems, such as variable precision arithmetic, multiple precision arithmetic, rational arithmetic, and combinations of approximate and symbolic arithmetic, have been investigated and implemented. These have various advantages over the traditional floating point arithmetic we have been discussing, but also have disadvantages, and usually require more time, more circuitry, or both. Eventually, with the advance of computer hardware and better understanding of these alternative systems, their use may become more ubiquitous. However, for the foreseeable future, traditional floating point number systems will be the primary tool in numerical computations.
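Two of these alternatives happen to ship in Python's standard library, which gives a small taste of them (the exactness claims below apply to these specific operations):

```python
from fractions import Fraction
from decimal import Decimal, getcontext

# Rational arithmetic is exact; the analogous floating point sum is not.
print(Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))  # prints True
print(0.1 + 0.2 == 0.3)                                      # prints False

# Multiple (user-chosen) precision: 50 significant decimal digits.
getcontext().prec = 50
print(Decimal(1) / Decimal(7))
```

As the remark notes, the price is speed: each Fraction operation manipulates arbitrarily large integers, and each high-precision Decimal operation is done in software rather than hardware.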

1.3 Interval Computations

Interval computations are useful for two main purposes:

• to use floating point computations to compute mathematically rigorous bounds on an exact result (and hence to rigorously bound roundoff error);

• to use floating point computations to compute mathematically rigorous bounds on the ranges of functions over boxes.

In complicated traditional floating point algorithms, naive arrangement of interval computations usually gives bounds that are too wide to be of practical use. For this reason, interval computations have been ignored by many. However, used cleverly and where appropriate, interval computations are powerful, and provide rigor and validation when other techniques cannot.

Interval computations are based on interval arithmetic.
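As a preview, the following Python sketch adds two intervals so that the result is guaranteed to enclose the exact sum (the helper iadd and the use of math.nextafter for outward rounding, available in Python 3.9+, are our illustration, not notation from the book):

```python
import math

def iadd(a, b):
    """Add intervals a = [a1, a2] and b = [b1, b2].  Each endpoint is
    pushed one machine number outward (lower end down, upper end up),
    so the returned interval rigorously encloses the exact sum."""
    lo = math.nextafter(a[0] + b[0], -math.inf)
    hi = math.nextafter(a[1] + b[1], math.inf)
    return (lo, hi)

lo, hi = iadd((0.1, 0.1), (0.2, 0.2))
print(lo, hi)  # a thin interval containing the exact real sum 0.3
```

Production interval libraries set the hardware rounding mode instead of stepping one machine number outward, which gives tighter enclosures, but the principle of directed (outward) rounding is the same.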
