# Numerical Methods

## Methodological and roundoff errors

In the following, we will see typical examples of different types of errors which occur in computer programs. We will also discuss possible ways to recognize and minimize the effects of methodological and roundoff errors.

### Connection between roundoff errors and algorithm

The first example shows that the roundoff error on the evaluation of a mathematical expression can be reduced through an appropriate reformulation of the algorithm.

Let us imagine we wish to evaluate numerically the following expression:

$$$F(x) = \frac{\sqrt{1+x} - \sqrt{1-x}}{x},$$$

for $x \ll 1$. Since $F(x)$ has a well-defined analytical expression, in this case there is no methodological error. Furthermore, we can assume that in this case there are no input or output errors. If we code this formula without further reshaping, we obtain the following result for decreasing values of $x$:
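This behaviour can be reproduced with a short sketch (Python double precision here; the original interactive examples used JavaScript, and the function name is ours):

```python
import math

def F(x):
    """Naive evaluation of F(x) = (sqrt(1+x) - sqrt(1-x)) / x."""
    return (math.sqrt(1.0 + x) - math.sqrt(1.0 - x)) / x

# F(x) should tend to 1 for x -> 0, but the naive formula breaks down
for k in range(1, 18, 4):
    x = 10.0 ** (-k)
    print(f"x = 1e-{k:02d}   F(x) = {F(x):.15f}")
```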



Two features immediately catch our attention in the computed values of $F(x)$:

• The value of the function does not tend uniformly to 1 for $x \to 0$, as it should.
• For values of $x$ smaller than $x=10^{-16}$, $F$ is exactly zero.

This behaviour can be easily explained:

• For small $x$ the two square roots $\sqrt{1+x}$ and $\sqrt{1-x}$ represent two almost equal numbers. Subtracting these numbers from each other, and dividing by $x$, the two roundoff errors, small by themselves, produce a large effect, due to the law of error propagation ('subtractive cancellation').
• When $x$ is less than the machine precision $\tau$, we have that both $1+x=1$ and $1-x=1$; therefore, the value of $F(x)$ is exactly zero.

So far, we have only identified the error. The cure in this case is very simple:

If we rewrite $F(x)$ as:

$$$F1(x) \equiv F(x) = \frac{(\sqrt{1+x}-\sqrt{1-x})}{x} \cdot \frac{(\sqrt{1+x}+\sqrt{1-x})}{(\sqrt{1+x}+\sqrt{1-x})} = \frac{2}{\sqrt{1+x}+\sqrt{1-x}}$$$

we recover the correct results.
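The reformulated expression can be checked with the same kind of sketch (Python, double precision; function name ours):

```python
import math

def F1(x):
    """Reformulated expression: F1(x) = 2 / (sqrt(1+x) + sqrt(1-x)).
    The sum of the two square roots involves no cancellation."""
    return 2.0 / (math.sqrt(1.0 + x) + math.sqrt(1.0 - x))

# F1(x) tends smoothly to 1 for x -> 0, even below machine precision
for k in (4, 8, 12, 17):
    x = 10.0 ** (-k)
    print(f"x = 1e-{k:02d}   F1(x) = {F1(x):.15f}")
```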



Another example from [Press,1992] shows that we can end up in trouble even with seemingly very simple problems, if we program the formulas without thinking.

Let us consider the numerical solution of a quadratic equation: $$$ax^{2}+bx+c=0$$$ with real coefficients $a$, $b$, and $c$.

The well-known solutions are,

$$$x_{1}=\frac{-b+\sqrt{b^{2}-4ac}}{2a} \qquad x_{2}=\frac{-b-\sqrt{b^{2}-4ac}}{2a}.$$$

We immediately see that for very small values of the coefficients $a$ and/or $c$ there is the danger of a 'subtractive cancellation' on the values of $x_{1}$ and $x_{2}$ (for $b>0$ and $b<0$ respectively).

However, it is also easy to recast the above expression in an equivalent form that is stable against roundoff errors,

$$$x_{1}=\frac{c}{q} \qquad x_{2}=\frac{q}{a}$$$

where

$$$q \equiv -\frac{1}{2} \left[ b + \text{sgn}(b) \sqrt{b^{2}-4ac} \right].$$$

Try the two forms of the solutions for small values of $a$ and/or $c$ using the code below.
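In the absence of the original interactive code, a minimal Python sketch of both variants (function names ours; the test coefficients are an illustrative choice with roots near $-10^{-8}$ and $-10^{8}$):

```python
import math

def roots_standard(a, b, c):
    """Textbook formulas: prone to subtractive cancellation."""
    d = math.sqrt(b * b - 4.0 * a * c)
    return (-b + d) / (2.0 * a), (-b - d) / (2.0 * a)

def roots_stable(a, b, c):
    """Cancellation-free variant via q = -(1/2)[b + sgn(b)*sqrt(b^2-4ac)]."""
    d = math.sqrt(b * b - 4.0 * a * c)
    q = -0.5 * (b + math.copysign(d, b))
    return c / q, q / a

# illustrative coefficients: 4ac << b^2, so -b + sqrt(b^2-4ac) cancels badly
a, b, c = 1.0, 1.0e8, 1.0
print("standard:", roots_standard(a, b, c))
print("stable:  ", roots_stable(a, b, c))
```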




### Roundoff and methodological errors in numerical differentiation

The basic principle of numerical differentiation is to replace the required derivative with a suitable finite-difference expression, i.e.

$$\label{e4} \frac{d}{dx} f(x)\mid _{x=x_{o}} \approx \frac{f(x_{o}+h) - f(x_{o})}{h}.$$

The difference quotient (\ref{e4}) converges to the first derivative of the function $f(x)$ at $x_{o}$ as the increment $h \to 0$. A finite increment (or stepsize) $h$ introduces a methodological error $\epsilon_{V}$, which becomes smaller and smaller with decreasing $h$.

Theoretical considerations show that $\epsilon_{V}$ must have the form,

$$\label{e5} \epsilon_{V}=C_{V}(h) \cdot h,$$

where the quantity $C_{V}(h)$ in many cases depends only weakly on $h$, and can be replaced in practice by a constant:

$$$C_{V}(h) \approx C_{V}.$$$

If we plot $\epsilon_{V}$ vs $h$ in a log-log scale, the error curve satisfies the linear relation,

$$\label{e6} \log \epsilon_{V} = \log C_{V} + \log h$$

with slope +1.

The error is calculated for a concrete example,

$$$f(x)=\ln (x) \sin(5x),$$$ $$$\frac{d}{dx} f(x) \mid _{x_{o}=3} = \frac{\sin(15)}{3}+5 \ln 3 \cos(15).$$$

The error is,

$$$\epsilon_V = \frac{\sin(15)}{3}+5 \ln 3 \cos(15) -\frac{\ln (3+h) \sin(5(3+h)) - \ln (3) \sin(15)}{h}.$$$

The error is plotted below as $\log_{10}\epsilon_{V}$ vs $\log_{10} h$: the log of the error is proportional to $\log h$ in the interval $10^{-8} < h < 10^{-1}$.
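The error curve can be reproduced with a short sketch (Python, double precision; function names ours):

```python
import math

def f(x):
    return math.log(x) * math.sin(5.0 * x)

def fprime_exact(x):
    # analytic derivative of ln(x) * sin(5x)
    return math.sin(5.0 * x) / x + 5.0 * math.log(x) * math.cos(5.0 * x)

x0 = 3.0

def err_forward(h):
    """Absolute error of the forward difference quotient at x0."""
    return abs((f(x0 + h) - f(x0)) / h - fprime_exact(x0))

# the error first decreases linearly with h, then grows again
# once subtractive cancellation takes over
for k in range(1, 15):
    h = 10.0 ** (-k)
    print(f"h = 1e-{k:02d}   error = {err_forward(h):.3e}")
```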

Clearly the relation (\ref{e5}) is well fulfilled in the interval $10^{-8} < h < 10^{-1}$. For increments larger or smaller than these values, the 'real' behaviour of the error deviates greatly from this linear trend.

The causes for this behaviour are the following:

• The deviation in the interval $h > 10^{-1}$ is due to the fact that, if the increment is too large, $C_{V}$ is not constant.
• It is interesting to analyze the behaviour of the error in the interval $h<10^{-8}$, where the error does not follow the expected linear behaviour with $h$ -- Eq. (\ref{e5}). On the contrary, it fluctuates strongly, and this is due to a strong "subtractive cancellation" of the roundoff errors in (\ref{e4}).

The dependence of the roundoff error on $h$ does not follow a smooth curve, but tends to the expression:

$$\label{e7} \epsilon_{R}=\frac{C_{R}}{h}.$$

From what we said so far, we can conclude:

• Using an optimal increment step $h_{opt}$ we can reduce the error down to a minimal value which cannot be further reduced -- in concrete examples this error is around $5\cdot 10^{-8}$.
• Using an increment step $h < h_{opt}$ does not improve the numerical result, but - on the contrary - makes it worse!

There are two possible strategies to further reduce this minimal error:

• The roundoff error can be reduced increasing the precision with which real numbers are stored in memory (single $\to$ double; float $\to$ double etc.).
• The methodological error can be reduced using an improved algorithm, for example using the more efficient formula,

$$\label{e8} \frac{d}{dx}f(x)\mid_{x=x_{o}} \approx \frac{f(x_{o}+h)-f(x_{o}-h)}{2h}$$ for the difference quotient. In this formula $\epsilon_{V}$ decreases much faster with decreasing $h$ than in expression (\ref{e4}), i.e. $$\label{e9} \epsilon_{V} = \bar C_{V}(h) \cdot h^{2} \quad ,$$ while the roundoff error has the same dependence on $h$ as before; the minimal error can therefore be reduced by more than an order of magnitude -- see the red curve in the plot below.

The plot shows $\log_{10}\epsilon_{V}$ vs $\log_{10} h$ for the one-sided formula (\ref{e4}) and the two-sided formula (\ref{e8}), whose errors are respectively:

$$$\epsilon_{V1} = \frac{\sin(15)}{3}+5 \ln 3 \cos(15) -\frac{\ln (3+h) \sin(5(3+h)) - \ln (3) \sin(15)}{h}.$$$ $$$\epsilon_{V2} = \frac{\sin(15)}{3}+5 \ln 3 \cos(15) -\frac{\ln (3+h) \sin(5(3+h)) - \ln (3-h) \sin(5(3-h))}{2h}.$$$
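The comparison of the two formulas can be sketched as follows (Python, double precision; function names ours):

```python
import math

def f(x):
    return math.log(x) * math.sin(5.0 * x)

x0 = 3.0
exact = math.sin(15.0) / 3.0 + 5.0 * math.log(3.0) * math.cos(15.0)

def err_forward(h):
    """Error of the one-sided formula: first order in h."""
    return abs((f(x0 + h) - f(x0)) / h - exact)

def err_central(h):
    """Error of the two-sided formula: second order in h."""
    return abs((f(x0 + h) - f(x0 - h)) / (2.0 * h) - exact)

# the two-sided methodological error decreases as h^2 instead of h
for k in (2, 3, 4, 5):
    h = 10.0 ** (-k)
    print(f"h = 1e-{k}:  one-sided {err_forward(h):.2e}   "
          f"two-sided {err_central(h):.2e}")
```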

In summary, we can say: the total error of a numerical differentiation is the sum of a methodological part, which decreases with $h$, and a roundoff part, which grows as $1/h$. The best we can do is to choose $h$ close to $h_{opt}$, and to push the minimal error further down by increasing the machine precision or by using a higher-order formula.

### Error diagnostics in the numerical evaluation of the error function

In many applications one has to evaluate numerically the function

$$$\mbox{erfc}(x) = 1 - \mbox{erf}(x),$$$

where erf$(x)$ represents the Gaussian error function.

The integral representation and its expansion in a Taylor series are:

$$$\mbox{erfc}(x)=1-\frac{2}{\sqrt\pi} \int_{z=0}^{x} dz e^{-z^{2}} = 1-\frac{2}{\sqrt\pi} \sum_{n=0}^{\infty} \frac{(-1)^{n} x^{2n+1}} {n!(2n+1)}.$$$

The numerical evaluation of erfc$(x)$ can be done using this Taylor series, whose convergence in principle is assured for all real arguments $x$:

$$$\mbox{erfc}(x) \approx 1-\frac{2}{\sqrt\pi} \sum_{n=0}^{n_{max}} a_{n} \quad \text{with} \quad a_{n}=\frac{(-1)^{n} x^{2n+1}}{n!(2n+1)} \quad .$$$
• From the point of view of computational efficiency it is extremely inconvenient to evaluate each term $a_{n}$ of the series independently. It is much better to calculate each $a_{n}$ from the previous $a_{n-1}$: the ratio $a_{n}/a_{n-1}$ yields the recursive relation:

$$$a_{n} = \left[-\frac{x^{2}(2n-1)}{n(2n+1)}\right]\, a_{n-1}\qquad \mbox{for}\quad n=1,2,\ldots \qquad \text{with} \qquad a_{0} = x\,.$$$
• Truncating the series at $n=n_{max}$ introduces a methodological error. However, since the series is composed of terms with alternating sign and (eventually) monotonically decreasing magnitude, we can safely assume that the magnitude of $\epsilon_{V}$ is not larger than the first neglected term $a_{n_{max}+1}$.

Using this simple method to estimate the methodological error it is possible to compute the series up to machine precision, i.e. until the final result no longer changes when further terms are included in the sum. In conclusion: within machine precision we can eliminate the methodological error!
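The recursion and the stopping criterion can be sketched as follows (Python, double precision; function name ours):

```python
import math

def erfc_taylor(x):
    """erfc(x) from the Taylor series of erf(x), using the recursion
    a_n = -x^2 (2n-1) / (n (2n+1)) * a_{n-1},  a_0 = x, and summing
    until the partial sum no longer changes (machine precision)."""
    a = x                 # a_0 = x
    s = a
    n = 1
    while s + a != s:     # stop when the last term no longer changes s
        a *= -x * x * (2 * n - 1) / (n * (2 * n + 1))
        s += a
        n += 1
    return 1.0 - 2.0 / math.sqrt(math.pi) * s

print(erfc_taylor(1.0), math.erfc(1.0))   # agree closely
print(erfc_taylor(5.0), math.erfc(5.0))   # roundoff ruins the series
```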



We have thus shown that all deviations of the numerical results from the exact one can be attributed to the accumulation of roundoff errors. We can immediately verify this, repeating the same calculation with single and double precision. The results of these tests for different values of $x$ are given in the Table below.

| $x$ | Single precision | Double precision |
|------|------------------|------------------|
| 0.01 | 0.9887166E+00 | 0.9887166D+00 |
| 0.1  | 0.8875371E+00 | 0.8875371D+00 |
| 0.5  | 0.4795001E+00 | 0.4795001D+00 |
| 1.   | 0.1572992E+00 | 0.1572992D+00 |
| 2.   | 0.4677685E-02 | 0.4677735D-02 |
| 3.   | 0.2991446E-04 | 0.2209050D-04 |
| 4.   | 0.2364931E-02 | 0.1544033D-07 |
| 5.   | 0.4645361E+02 | 0.5458862D-07 |

Comparison of the evaluation of the function erfc$(x)$ through a Taylor series in single and double precision. All results are given in half-exponential form with 7 digits for the mantissa. The exponent E indicates a calculation in single precision, the exponent D a calculation in double precision. (Note: JavaScript only calculates with double precision numbers.)

While a Taylor series expansion approximates the value of erfc$(x)$ quite well for small arguments $x$, roundoff errors make it less accurate for large $x$. A different algorithm should be employed to calculate erfc$(x)$ for large $x$.

An alternative method for calculating erfc$(x)$ is to use a continued fraction,

$$$\text{erfc}(x) = \frac{e^{-x^{2}}}{\sqrt\pi} \frac{1}{x+ \frac{a_1}{x+\frac{a_2}{x+\frac{a_3}{x+\cdots} }}}\qquad a_m=\frac{m}{2}$$$

The numerical calculation of a continued fraction has the disadvantage that we have to decide how many terms of the fraction we want to retain before the beginning of the calculation. This means that if we want to increase the number of terms to reduce the methodological error, we have to restart the calculation from scratch, and cannot simply add further terms to an 'old' calculation as in the case of a Taylor series. The code below uses a continued fraction with $m$ terms to calculate the complementary error function.
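A bottom-up evaluation of the truncated continued fraction can be sketched as follows (Python, double precision; function name ours):

```python
import math

def erfc_cf(x, m):
    """erfc(x) from the continued fraction with m terms a_k = k/2,
    evaluated from the innermost level outwards."""
    t = x                         # innermost denominator
    for k in range(m, 0, -1):
        t = x + (0.5 * k) / t     # t <- x + a_k / t
    return math.exp(-x * x) / math.sqrt(math.pi) / t

# accurate for large x, poor for small x
print(erfc_cf(5.0, 31), math.erfc(5.0))
print(erfc_cf(0.1, 31), math.erfc(0.1))
```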



In the table below, the complementary error function is computed both in single and double precision, using the Taylor-expansion algorithm and the continued-fraction algorithm with $m=31$ and $m=61$ terms.

| $x$ | Prec. | Taylor series ($\epsilon_{V}=0$) | Cont. fract. 31 terms | Cont. fract. 61 terms |
|------|--------|---------------|---------------|---------------|
| 0.01 | single | 0.9887166E+00 | 0.1261318E+02 | 0.9042203E+01 |
|      | double | 0.9887166D+00 | 0.1261318D+02 | 0.9042202D+01 |
| 0.1  | single | 0.8875371E+00 | 0.1401417E+01 | 0.1131721E+01 |
|      | double | 0.8875371D+00 | 0.1401417D+01 | 0.1131721D+01 |
| 0.5  | single | 0.4795001E+00 | 0.4802107E+00 | 0.4795305E+00 |
|      | double | 0.4795001D+00 | 0.4802107D+00 | 0.4795305D+00 |
| 1.   | single | 0.1572992E+00 | 0.1572995E+00 | 0.1572992E+00 |
|      | double | 0.1572992D+00 | 0.1572995D+00 | 0.1572992D+00 |
| 2.   | single | 0.4677685E-02 | 0.4677735E-02 | 0.4677735E-02 |
|      | double | 0.4677735D-02 | 0.4677735D-02 | 0.4677735D-02 |
| 3.   | single | 0.2991446E-04 | 0.2209050E-04 | 0.2209050E-04 |
|      | double | 0.2209050D-04 | 0.2209050D-04 | 0.2209050D-04 |
| 4.   | single | 0.2364931E-02 | 0.1541726E-07 | 0.1541726E-07 |
|      | double | 0.1544033D-07 | 0.1541726D-07 | 0.1541726D-07 |
| 5.   | single | 0.4645361E+02 | 0.1537460E-11 | 0.1537460E-11 |
|      | double | 0.5458862D-07 | 0.1537460D-11 | 0.1537460D-11 |

All results are given in half-exponential form with 7 digits for the mantissa. The exponent E indicates a calculation in single precision, the exponent D a calculation in double precision.

By examining this table we can draw the following conclusions:

• Differences in the results with 31 and 61 terms $\to$ methodological error.
• Differences in the results with single vs double precision $\to$ roundoff errors.

We can summarize the above results as follows:

• Taylor series: Methodological errors are zero in the whole range of $x$, roundoff errors grow rapidly with increasing $x$.

The calculation with Taylor series gives results with at least 7 exact digits up to $\sim$ $x=1$.

• Continued Fraction: Methodological errors decrease with increasing $x$, the roundoff error is very small in the whole range of $x$ considered.

### Example of stable and unstable algorithms

Many numerical methods are based on recursion formulas of the type:

$$$y_{n} = a \cdot y_{n-1} + b \cdot y_{n-2} \qquad n=2,3,\ldots \quad .$$$

When employing such formulas, it is important to consider carefully the stability of the algorithm. An algorithm is defined as stable (or unstable), when the error with respect to the exact result at the $n$-th step of the calculation decreases (or increases) in the following steps.

A very instructive example in this context is the numerical calculation of spherical Bessel functions through a "forward recursion":

$$$j_{0}(x)=\frac{\sin x}{x} \qquad j_{1}(x)=\frac{\sin x}{x^{2}} - \frac{\cos x}{x}$$$ $$$j_{l}(x)=\frac{2l-1}{x} \cdot j_{l-1}(x) - j_{l-2}(x)$$$


The numerical evaluation of this formula obviously involves no methodological error; the only possible source of error is roundoff! To evaluate the magnitude of the roundoff error, we can compare the results calculated in single and double precision (see Table 1.3).
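The forward recursion can be sketched as follows (Python, double precision; function name ours):

```python
import math

def sph_jn_forward(l_max, x):
    """Spherical Bessel functions j_0 ... j_lmax by forward recursion."""
    j = [0.0] * (l_max + 1)
    j[0] = math.sin(x) / x
    if l_max >= 1:
        j[1] = math.sin(x) / x**2 - math.cos(x) / x
    for l in range(2, l_max + 1):
        j[l] = (2 * l - 1) / x * j[l - 1] - j[l - 2]
    return j

print(sph_jn_forward(9, 0.3)[9])    # unstable: true value is ~3e-14
print(sph_jn_forward(9, 10.0)[9])   # stable at larger x
```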

Tab.1.3: Numerical calculation of the spherical Bessel functions of order 0-9 through forward recursion.


single prec.                  double prec.
x =  0.3000
L=  0    0.9850674E+00             0    0.9850674D+00
1    0.9910274E-01             1    0.9910289D-01
2    0.5960023E-02             2    0.5961525D-02
3    0.2309625E-03             3    0.2558598D-03
4    -.5708978E-03             4    0.8536426D-05
5    -.1735790E-01             5    0.2329701D-06
6    -.6358853E+00             6    0.5811086D-08
7    -.2753767E+02             7    0.1884359D-07
8    -.1376248E+04             8    0.9363682D-06
9    -.7795982E+05             9    0.5304202D-04
x =  1.0000
L=  0    0.8414710E+00             0    0.8414710D+00
1    0.3011687E+00             1    0.3011687D+00
2    0.6203508E-01             2    0.6203505D-01
3    0.9006739E-02             3    0.9006581D-02
4    0.1012087E-02             4    0.1011016D-02
5    0.1020432E-03             5    0.9256116D-04
6    0.1103878E-03             6    0.7156936D-05
7    0.1332998E-02             7    0.4790142D-06
8    0.1988459E-01             8    0.2827691D-07
9    0.3367050E+00             9    0.1693277D-08
x = 10.0000
L=  0    -.5440211E-01             0    -.5440211D-01
1    0.7846694E-01             1    0.7846694D-01
2    0.7794219E-01             2    0.7794219D-01
3    -.3949584E-01             3    -.3949584D-01
4    -.1055893E+00             4    -.1055893D+00
5    -.5553451E-01             5    -.5553451D-01
6    0.4450132E-01             6    0.4450132D-01
7    0.1133862E+00             7    0.1133862D+00
8    0.1255780E+00             8    0.1255780D+00
9    0.1000964E+00             9    0.1000964D+00
x = 20.0000
L=  0    0.4564726E-01             0    0.4564726D-01
1    -.1812174E-01             1    -.1812174D-01
2    -.4836553E-01             2    -.4836552D-01
3    0.6030358E-02             3    0.6030359D-02
4    0.5047615E-01             4    0.5047615D-01
5    0.1668391E-01             5    0.1668391D-01
6    -.4130000E-01             6    -.4130000D-01
7    -.4352891E-01             7    -.4352891D-01
8    0.8653319E-02             8    0.8653319D-02
9    0.5088423E-01             9    0.5088423D-01


The results in Table 1.3 clearly show that this method is strongly unstable, in particular for small arguments $x$. This instability leads to a complete breakdown of the method for increasing order of the Bessel function.

On the other hand, the stability of the method improves considerably for larger values of $x$.

In order to obtain meaningful results also for small values of the argument, we have to switch from the "forward recursion" algorithm we used so far to a "backward" recursion:

$$$j_{l_{max}+1}(x)=0 \qquad j_{l_{max}}(x)=\delta$$$ $$$j_{l}(x)=\frac{2l+3}{x} \cdot j_{l+1}(x) - j_{l+2}(x)$$$


In these expressions, $j_{l_{max}}(x)=\delta$ is an arbitrary (small) number, which serves as the starting value for the backward recursion. The natural number $l_{max}$ must be larger than the largest order $l$ to be calculated.

Even though the starting value of the recursion is arbitrary, for decreasing $l$ the computed values rapidly converge to multiples $\alpha\, j_{l}(x)$ of the true values. The initially unknown constant $\alpha$ can be determined by normalizing the value for $l=0$ to the quantity $\sin x/x$.

Due to the arbitrariness of $\delta$, the results obtained in this way are affected by a methodological error, which becomes smaller as the starting index $l_{max}$ increases.
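The backward scheme, including the normalization, can be sketched as follows (Python, double precision; the default $l_{max}=27$ and $\delta$ are the arbitrary choices discussed above, and the function name is ours):

```python
import math

def sph_jn_backward(l_max_out, x, l_start=27, delta=1.0e-30):
    """Spherical Bessel functions j_0 ... j_lmax_out by backward
    recursion, started with an arbitrary small value delta at l_start."""
    j = [0.0] * (l_start + 2)     # j[l_start + 1] = 0 by construction
    j[l_start] = delta            # arbitrary starting value
    for l in range(l_start - 1, -1, -1):
        j[l] = (2 * l + 3) / x * j[l + 1] - j[l + 2]
    alpha = (math.sin(x) / x) / j[0]    # normalize j_0 to sin(x)/x
    return [alpha * j[l] for l in range(l_max_out + 1)]

print(sph_jn_backward(9, 0.3)[9])   # close to the true value ~3e-14
```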

To estimate $\epsilon_{V}$, we have repeated the calculation twice, i.e. for $l_{max}=18$ and $l_{max}=27$; in both cases we have performed calculations both in single and double precision.

The results of these tests are summarised in the table below.

             single precision                   double precision
backwards(18)   backwards(27)      backwards(18)   backwards(27)
x =  0.3000
L=  0   0.9850674E+00   0.9850674E+00      0.9850674D+00   0.9850674D+00
1   0.9910290E-01   0.9910290E-01      0.9910289D-01   0.9910289D-01
2   0.5961526E-02   0.5961525E-02      0.5961525D-02   0.5961525D-02
3   0.2558598E-03   0.2558598E-03      0.2558598D-03   0.2558598D-03
4   0.8536426E-05   0.8536425E-05      0.8536426D-05   0.8536426D-05
5   0.2329583E-06   0.2329583E-06      0.2329583D-06   0.2329583D-06
6   0.5378445E-08   0.5378444E-08      0.5378444D-08   0.5378444D-08
7   0.1076069E-09   0.1076069E-09      0.1076069D-09   0.1076069D-09
8   0.1899475E-11   0.1899474E-11      0.1899474D-11   0.1899474D-11
9   0.2999847E-13   0.2999847E-13      0.2999847D-13   0.2999847D-13
x =  1.0000
L=  0   0.8414710E+00   0.8414710E+00      0.8414710D+00   0.8414710D+00
1   0.3011687E+00   0.3011687E+00      0.3011687D+00   0.3011687D+00
2   0.6203505E-01   0.6203505E-01      0.6203505D-01   0.6203505D-01
3   0.9006580E-02   0.9006579E-02      0.9006581D-02   0.9006581D-02
4   0.1011016E-02   0.1011016E-02      0.1011016D-02   0.1011016D-02
5   0.9256115E-04   0.9256114E-04      0.9256116D-04   0.9256116D-04
6   0.7156936E-05   0.7156935E-05      0.7156936D-05   0.7156936D-05
7   0.4790134E-06   0.4790133E-06      0.4790134D-06   0.4790134D-06
8   0.2826499E-07   0.2826498E-07      0.2826499D-07   0.2826499D-07
9   0.1491376E-08   0.1491376E-08      0.1491377D-08   0.1491377D-08
x = 10.0000
L=  0   -.5440211E-01   -.5440211E-01      -.5440211D-01   -.5440211D-01
1   0.7846695E-01   0.7846695E-01      0.7846695D-01   0.7846694D-01
2   0.7794219E-01   0.7794220E-01      0.7794220D-01   0.7794219D-01
3   -.3949586E-01   -.3949586E-01      -.3949585D-01   -.3949584D-01
4   -.1055893E+00   -.1055893E+00      -.1055893D+00   -.1055893D+00
5   -.5553451E-01   -.5553451E-01      -.5553451D-01   -.5553451D-01
6   0.4450133E-01   0.4450133E-01      0.4450133D-01   0.4450132D-01
7   0.1133862E+00   0.1133862E+00      0.1133862D+00   0.1133862D+00
8   0.1255780E+00   0.1255780E+00      0.1255780D+00   0.1255780D+00
9   0.1000964E+00   0.1000964E+00      0.1000964D+00   0.1000964D+00
x = 20.0000
L=  0   0.4564727E-01   0.4564726E-01      0.4564726D-01   0.4564726D-01
1   -.8801447E-01   -.1812270E-01      -.8801444D-01   -.1812269D-01
2   -.5884944E-01   -.4836567E-01      -.5884943D-01   -.4836567D-01
3   0.7330211E-01   0.6031280E-02      0.7330208D-01   0.6031273D-02
4   0.8450518E-01   0.5047661E-01      0.8450516D-01   0.5047661D-01
5   -.3527479E-01   0.1668320E-01      -.3527476D-01   0.1668320D-01
6   -.1039063E+00   -.4130086E-01      -.1039063D+00   -.4130085D-01
7   -.3226432E-01   -.4352875E-01      -.3226432D-01   -.4352875D-01
8   0.7970807E-01   0.8654292E-02      0.7970804D-01   0.8654284D-02
9   0.1000162E+00   0.5088490E-01      0.1000161D+00   0.5088490D-01


Numerical calculation of the spherical Bessel functions of order 0-9 with backward recursion.

It appears that the backward recursion represents a very stable algorithm for the whole range of $x$ considered: the results with single and double precision differ by at most two units in the seventh digit of the mantissa. Furthermore, we see that the methodological error increases with increasing values of the argument $x$.

To summarize the Bessel function calculations:

• For not too large arguments $x$ (for $l=9$ up to $\sim x=10$) it is definitely better to employ the backward recursion, due to its higher stability.
• For the large-$x$ range (for $l=9$ from $x=10$ onwards) it is preferable to employ the forward recursion, which is not affected by a methodological error.