Floating-Point-Roundoff Accumulation in Digital-Filter Realizations
01 October 1967
The difference equation

$$w_n = \sum_{k=0}^{M} b_k x_{n-k} - \sum_{k=1}^{N} a_k w_{n-k}, \qquad n \ge N \tag{1}$$

with $M \le N$ defines the behavior of a general time-invariant discrete filter which acts on an input sequence $x_0, x_1, x_2, \ldots$ to produce an output sequence $w_N, w_{N+1}, w_{N+2}, \ldots$ that depends on the starting values $w_0, w_1, \ldots, w_{N-1}$. There is a vast literature concerned with techniques for designing discrete filters [i.e., for determining the $a_k$ and the $b_k$ in (1)] to meet specifications of various types (see, for example, Refs. 1, 2, and 3), and a good deal of material is available on the subject of roundoff effects in fixed-point realizations of discrete filters (see, for instance, Refs. 4 and 5). In this paper, we derive some bounds on a meaningful measure of the overall effect of roundoff errors for discrete filters realized as digital filters on a machine employing floating-point arithmetic operations. This type of realization, as opposed to the fixed-point kind, is of particular importance in connection with, for example, digital computer simulations of systems, as a result of the large dynamic range afforded by the floating-point mode. There are basic differences between the fixed-point and floating-point error estimation problems. These stem from the fact that the modulus of every individual arithmetic error in the fixed-point mode is bounded by a constant determined by the machine, whereas the maximum modulus of the error in forming, for example, the floating-point sum of two floating-point numbers is proportional to the magnitude of the true sum.
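As an illustration only, the recursion in (1) and the floating-point error property described above can be sketched in Python. The sketch is not taken from the paper: the function names `run_filter` and `sum_relative_error`, the example coefficients, and the sinusoidal input are all hypothetical. It also assumes that Python floats are IEEE 754 double-precision numbers with round-to-nearest, which is a modern stand-in for the machine arithmetic the paper considers. The filter is evaluated once in floating point and once in exact rational arithmetic, so the difference between the two outputs is due to roundoff alone.

```python
import math
from fractions import Fraction


def run_filter(b, a, x, w_init, num=float):
    """Evaluate w_n = sum_{k=0}^{M} b_k x_{n-k} - sum_{k=1}^{N} a_k w_{n-k}, n >= N.

    b = [b_0, ..., b_M], a = [a_1, ..., a_N], w_init = [w_0, ..., w_{N-1}].
    num selects the arithmetic: float (rounded) or Fraction (exact).
    """
    N, M = len(a), len(b) - 1
    assert M <= N and len(w_init) == N
    # Fraction(float) is exact, so both arithmetics see identical data.
    b = [num(v) for v in b]
    a = [num(v) for v in a]
    x = [num(v) for v in x]
    w = [num(v) for v in w_init]
    for n in range(N, len(x)):
        acc = num(0)
        for k in range(M + 1):
            acc += b[k] * x[n - k]
        for k in range(1, N + 1):
            acc -= a[k - 1] * w[n - k]
        w.append(acc)
    return w


def sum_relative_error(u, v):
    """Relative error of the floating-point sum u + v against the true sum."""
    exact = Fraction(u) + Fraction(v)
    if exact == 0:
        return Fraction(0)
    return abs(Fraction(u + v) - exact) / abs(exact)


# Hypothetical stable second-order example (poles at 0.8 +/- 0.4j).
b = [0.2, 0.3]
a = [-1.6, 0.8]
x = [math.sin(0.3 * n) for n in range(200)]

w_float = run_filter(b, a, x, [0.0, 0.0], num=float)
w_exact = run_filter(b, a, x, [0.0, 0.0], num=Fraction)

# Accumulated roundoff error in the output sequence.
max_err = max(abs(Fraction(f) - e) for f, e in zip(w_float, w_exact))
print("max accumulated output error:", float(max_err))

# Error of a single floating-point sum, relative to the true sum.
print("relative error of 0.1 + 0.2:", float(sum_relative_error(0.1, 0.2)))
```

Under the stated IEEE 754 assumption, the relative error of each individual sum is at most $2^{-53}$, which is the proportional-to-the-true-sum behavior noted above; in a fixed-point realization each operation would instead carry an absolute error bounded by a machine constant. The accumulated output error printed by the sketch is an observation for one input only, not a bound of the kind derived in the paper.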