Fixed Point Format and Floating Point Format Examples

  • It is found that the floating-point representation can cover a much larger range of numbers than is possible in the fixed-point representation.

    • In the floating-point representation the resolution decreases with an increase in the size of the range; this means that the distance between two successive floating-point numbers increases.

    • In the floating-point scheme, resolution is variable within the range. However, for the fixed-point format, resolution is fixed and uniform.

    • This variability in resolution provides a large dynamic range of the numbers.

    The ideas given above can be illustrated by considering the case of a 16-bit computer.

    Example 35: Consider a 16-bit computer. Obtain the dynamic range and resolution when the computer is operated (a) in the fixed-point format and (b) in the floating-point format.


    (a) For the 16-bit computer, with one bit reserved for representing the sign, the highest positive and negative numbers that can be represented in the fixed-point format are $+(2^{m-1} - 1)$ and $-(2^{m-1} - 1)$, respectively, where m = 16. Substituting m = 16 yields:

    The highest positive number = $2^{16-1} - 1 = 2^{15} - 1 = 32767$
    The highest negative number = $-(2^{16-1} - 1) = -(2^{15} - 1) = -32767$

    This means that we can represent all the whole numbers from -32,767 to +32,767. Since the numbers represented are of the form -32,767, -32,766, -32,765, …, 32,766, and 32,767 (i.e., successive numbers differ by 1), we find that in this scheme:

                                        Resolution = 1

    In this scheme, we also find that we can represent only whole numbers; we cannot represent fractions.
    Now, suppose we want to express fractions also through this scheme. For this, let us reserve 5 bits to represent the fractional part, 10 bits to represent the integer part, and 1 bit to represent the sign of the number. The 15 magnitude bits then hold an integer between 0 and $2^{15} - 1$ that is scaled by $2^{-5}$, so the extreme values that can be written in the fixed-point representation are

                            $X = \pm (2^{15} - 1) \times 2^{-5} = \pm 1023.96875$

    Thus the range in this case will be between -1023.96875 and +1023.96875. We have:

                                     Resolution = $2^{-5}$ = 0.03125 (binary 0.00001)

    We find that in this case, the range (sometimes called the dynamic range) has been considerably decreased, but resolution has been greatly increased.
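
    The figures in part (a) can be checked with a short Python sketch. The helper name `fixed_point_range` and its parameters are illustrative assumptions, not part of the original text; the sketch assumes a word with one sign bit and a sign-magnitude layout.

```python
def fixed_point_range(total_bits, frac_bits=0):
    """Return (largest magnitude, resolution) for a fixed-point word.

    Assumes 1 sign bit; the remaining bits hold the magnitude, and
    frac_bits of them lie to the right of the binary point.
    """
    magnitude_bits = total_bits - 1
    resolution = 2.0 ** -frac_bits
    largest = (2 ** magnitude_bits - 1) * resolution
    return largest, resolution


print(fixed_point_range(16))      # whole numbers only
print(fixed_point_range(16, 5))   # 10 integer bits and 5 fractional bits
```

    Moving bits from the integer part to the fractional part shrinks the largest magnitude and makes the resolution finer by the same factor.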

    (b) Floating-Point Format

    In the floating-point format, we reserve 5 bits to represent the exponent, 1 bit to represent its sign, 9 bits to represent the mantissa part, and 1 bit to represent its sign. Table 1.38 shows the floating-point representation of the given number.

    Table 1.38  Floating-point representation in a 16-bit computer

    M (9 bits)             E (5 bits)
    0.1 0 0 0 0 0 0 0 0    1 1 1 1 1
    0.1 0 0 0 0 0 0 0 0    1 1 1 1 1

    In Table 1.38, SM represents the sign bit of the mantissa, M represents the mantissa, SE represents the sign bit of the exponent, and E represents the exponent. We find that row 2 shows the smallest number that can be represented in this format. This is obtained as follows:

    We have, for the smallest possible number in this scheme, mantissa = 0.5, which is the minimum possible value of the mantissa in the floating-point scheme, as stated earlier. This is represented under the column M as .1 followed by eight 0s (i.e., .100000000). The sign of the mantissa is taken as positive and is represented by a 0 under the column SM. Since the exponent has 5 bits to represent it, its largest possible magnitude is $2^5 - 1 = 31$ (binary 11111, as shown under the column E). Using this as a negative exponent, we get the smallest number in our present case as

                            $X_{min} = 0.5 \times 2^{-31} = 2^{-32} \approx 2.33 \times 10^{-10}$

    We find that the large dynamic range of the floating-point format is achieved by sacrificing uniformity in resolution. Notice that in the floating-point format, small numbers have finer resolution than larger numbers, whose resolution is coarse.
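
    The variable resolution can be made concrete with a minimal Python sketch. It assumes the 9-bit mantissa of this example, so successive mantissas differ by $2^{-9}$ and successive numbers at exponent E differ by $2^{E-9}$. The function name `spacing` and the sample exponents -31, 0, and +31 are illustrative choices, not taken from the text.

```python
MANTISSA_BITS = 9  # mantissa width assumed from the 16-bit example


def spacing(exponent):
    """Gap between successive values M * 2**exponent for a 9-bit mantissa."""
    return 2.0 ** (exponent - MANTISSA_BITS)


for e in (-31, 0, 31):
    smallest = 0.5 * 2.0 ** e   # mantissa 0.5 is the smallest normalized mantissa
    print(f"E = {e:+3d}: values start at {smallest:.3e}, spacing {spacing(e):.3e}")
```

    The spacing doubles each time the exponent grows by one, which is the loss of uniform resolution described above.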

    The IEEE 754 Standard (Floating-point format) for 32-bit Machines

    IEEE 754 standard for floating-point arithmetic in 32-bit computers is shown in Table 1.39.

    Table 1.39 IEEE 754 standard for 32-bit machines

    Field            Sign (S)    Exponent (E)    Mantissa (M)
    Bit positions    0           1 to 8          9 to 31

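    In this scheme, 23 bits are reserved for the mantissa, one bit for the sign of the mantissa (S), and 8 bits for the exponent. The standard has no separate sign bit for the exponent: the 8 exponent bits hold the true exponent plus a bias of 127, which allows both positive and negative exponents. The maximum finite number in this scheme is $(2 - 2^{-23}) \times 2^{127} \approx 3.4 \times 10^{38}$.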
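
    The three fields can be inspected with Python's standard `struct` module, which can pack a value as a 32-bit IEEE 754 number. This is a minimal sketch; the helper name `float32_fields` is an assumption made for illustration. Note that the standard stores the exponent with a bias of 127 and normalizes the mantissa to lie between 1 and 2 with the leading 1 left implicit, which differs from the ½-to-1 convention used elsewhere in this text.

```python
import struct


def float32_fields(x):
    """Split x, rounded to 32-bit IEEE 754, into (sign, exponent, fraction) fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF        # 23 bits, leading 1 is implicit
    return sign, exponent, fraction


print(float32_fields(0.25))   # 0.25 = 1.0 * 2**-2
print(float32_fields(-7.0))   # -7 = -1.75 * 2**2
```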


    Binary Fractions and Floating Point

  • Expressing binary fractions in the floating-point format may also be done using the same method we have developed above. We shall illustrate the technique by using a numerical example.

    Example 34: Express decimal fraction ¼ in the binary floating-point format.

    Solution: As the first step, we convert the given decimal fraction into regular binary fraction. The conversion yields

    1/4 = $(0.25)_{10}$ = $(0.010)_2$

    Now, since the given number is a fraction, we employ the reverse of our previous technique, i.e., first multiply and then divide (instead of first dividing and then multiplying) for its floating-point representation. Also, since there are only two bits in the given fraction 0.01 (the 0 after the 1 is not counted), it seems quite natural that we use $2^2$ for multiplying and dividing it. However, we have imposed a restriction on the mantissa M that for the binary floating-point format it should lie between ½ and 1. This restriction means that we have to express the mantissa as 0.1, and not as 0.01. This further means that we have to multiply and divide the number by $2^1$, and not by $2^2$. Performing this operation yields

    $X = (0.01 \times 2^1)(1/2^1) = 0.1 \times 2^{-1}$

    Now, the exponent of X, which is a negative integer, must also be expressed in binary, as stated above. Carrying out this exercise, our floating-point representation of decimal fraction ¼ becomes

    $X = (0.01 \times 2^1)(1/2^1) = 0.1 \times 2^{1001}$

    where the first 1 in the exponent represents the negative sign, and the next three bits, viz. 001, represent decimal 1.
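
    The result of this example can be reproduced with a short Python sketch. It relies on the standard-library function `math.frexp`, which returns a mantissa m with 0.5 ≤ m < 1 and an exponent e such that x = m × 2^e. The helper name `to_text_format` and the 3-bit exponent magnitude are assumptions chosen to match the notation above.

```python
import math


def to_text_format(x, exp_bits=3):
    """Return (mantissa, exponent field) for x > 0.

    The exponent field is a sign bit (1 = negative) followed by the
    exponent magnitude written in exp_bits binary digits.
    """
    mantissa, exponent = math.frexp(x)      # x == mantissa * 2**exponent
    sign_bit = "1" if exponent < 0 else "0"
    exp_field = sign_bit + format(abs(exponent), f"0{exp_bits}b")
    return mantissa, exp_field


print(to_text_format(0.25))   # mantissa 0.5 is binary 0.1
```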


    Floating Point Representation Examples

  • Consider the decimal number 468. We can express this in the form

    $468 = (468/1000) \times 1000 = 0.468 \times 10^3$

    Here, we first divided and then multiplied 468 by 1000 so that its value does not change. In this process, the division by 1000 converted 468 into a fraction, and the multiplication by 1000 ensured that its value remained unchanged at 468 itself. Thus, in our example, we expressed 468 as the product of a fractional part (0.468) and an exponent part ($10^3$). This type of representation of numbers is called the floating-point representation. The fractional (decimal) part is called the mantissa, and the exponential part is called the exponent. Thus, to express a given decimal number in the floating-point format, the steps to be followed are:

    1.    First, divide the given number by an appropriate power of 10 (radix of the decimal-number system) so that it is converted into a fraction.
    2.   Multiply the resulting fraction by the same power of 10 ($10^3$, here) so that division is cancelled by multiplication; this brings the number back to its original value (468, here).

    The steps given above may be extended to the binary-number system also to represent a given number in the floating-point format.

    Example 33:  Express decimal number 7 in the binary floating-point format.

    Solution: Consider decimal number 7. The binary equivalent of 7 is 111. To express this in the floating-point format, we divide 111 by an appropriate power of 2. Since there are three bits in the given number, extending our theory from the decimal system given above, we have to divide and multiply 111 by $2^3$. Then X can be written in the form

                            $X = (111/2^3) \times 2^3 = 0.111 \times 2^3$

    In the binary floating-point format, we must express the exponent also in binary. The binary equivalent of decimal 3 is 011. As this is a positive exponent, we use sign bit 0 in the first bit position of the exponent. Thus the complete floating-point representation of decimal number 7 is:

                            $X = 0.111 \times 2^{0011}$

    To check whether our operation has yielded the correct answer, we expand the above relation

                            $X = 0.111 \times 2^3 = (1 \times 2^{-1} + 1 \times 2^{-2} + 1 \times 2^{-3}) \times 2^3 = 7$

    The result of this checking operation shows the correctness of our method. Now, we generalize the floating-point method with the expression

                            $X = M \times R^E$                   (1.19)

    where M = mantissa, R = radix of the number system used, and E = exponent. Now, to express X in the floating-point scheme (when it is a whole number), we multiply and divide X by $R^E$. Thus

                            $X = (X/R^E) \times R^E$

    where E is dependent on the number of bits in X. From the above, we find

                            $M = X/R^E$                       (1.20)

    In the example using decimal number 468, we had X = 468, R = 10, E = 3, and therefore M = 0.468.
    Let us now consider the binary floating-point scheme again. Even though we can use any number as the mantissa, in the binary floating-point format a restriction has been imposed on it: it should be at least ½ and less than 1. That is,

                            $0.5 \le M < 1$                         (1.21)

    This restriction imposes the condition that the first bit after the binary point must be a 1.
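
    Equations (1.19) to (1.21) can be sketched as a small Python routine that works for any radix. The function name `normalize` is hypothetical, and `fractions.Fraction` is used only so that the arithmetic stays exact; the routine divides whole numbers and multiplies small fractions until the mantissa lies between 1/R and 1.

```python
from fractions import Fraction


def normalize(x, radix):
    """Return (M, E) with x == M * radix**E and 1/radix <= M < 1, for x > 0."""
    m, e = Fraction(x), 0
    while m >= 1:                    # whole numbers: divide, then multiply back
        m, e = m / radix, e + 1
    while m < Fraction(1, radix):    # small fractions: multiply, then divide back
        m, e = m * radix, e - 1
    return m, e


print(normalize(468, 10))            # M = 117/250 = 0.468, E = 3
print(normalize(7, 2))               # M = 7/8 = binary 0.111, E = 3
print(normalize(Fraction(1, 4), 2))  # M = 1/2 = binary 0.1, E = -1
```

    For radix 2 the condition 1/R ≤ M < 1 is exactly the restriction that the first bit after the binary point is a 1.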


    Fixed Point Representation in Digital Electronics

  • Representation of Binary Numbers

    Binary numbers are represented in digital systems in one of the following two formats:

    ·         Fixed-point representation
    ·         Floating-point representation

    The Fixed-Point Representation

    Consider the decimal number
                                                                X = 10587.349
    This type of representation of a number as a string of digits with the decimal point in between two smaller strings (or groups) of digits is called fixed-point representation. The group of digits to the left of the decimal point is called the integer part, and the group to the right of the decimal point is called the fractional (decimal) part.
    We can represent binary numbers also in fixed-point representation using a similar format. For example, the binary number
                                        Y = 101001.01011
    is written in the fixed-point format. Here, the point symbol is called the binary point (similar to the decimal point). As in the case of the decimal-number representation, to the left of the binary point, we have the integer part, and to its right, the fractional part.

    The idea of using binary points to represent fractional parts exists only in theory; in practical systems, however, they are not represented in this format. Instead, we designate some locations in the memory of the computer to represent the integer part, and some other locations to represent the fractional part.
          We may generalize the representation of numbers in the fixed-point arithmetic by expressing the number X in the format.
    The most significant bit 0 indicates that the number represented in Eq. (1.13) is a positive number. Similarly, the general expression for representing negative numbers in the fixed-point arithmetic is:
    The general expression given in Eq. (1.14) may be modified in one of the three formats given below to represent negative numbers in the fixed-point arithmetic.

    Representation of Negative Numbers and Fractions

    Negative numbers or fractions are represented in binary systems in the following formats:

    ·         Sign-magnitude format
    ·         One’s-complement format
    ·         Two’s-complement format

    The Sign-Magnitude Format

    In digital computers, we have to represent both positive and negative numbers. For this, we use an additional bit, called the sign bit, in the most significant bit (MSB) position. We use 0 to represent positive and 1 to represent negative numbers in the MSB position. This scheme of representing numbers is known as the sign-magnitude format. For example, consider the decimal number 4. Its binary equivalent is 0100. In the sign-magnitude format, we can express this number in the form

    +4 = 00100
    -4 = 10100

    In the above example, we have represented whole number in the sign-magnitude format. We may also require sign-magnitude representation of fractions. Consider the binary fraction

      X = -0.11010001

    This can be represented in the sign-magnitude format as

     X = 1.11010001

    where the 1 to the left of the binary point (i.e., the MSB) indicates that the binary fraction following this bit is negative. The general expression with which negative numbers are represented in the sign-magnitude format can now be written as

    In the sign-magnitude scheme, multiplication of two numbers is a relatively straightforward process and does not involve the usage of any special algorithm. This is the advantage of this scheme. But as far as addition is concerned, this scheme requires a more complex procedure to be followed in which operations such as sign checks, complementing, and carry-generation are involved.   
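
    The sign-magnitude encoding of whole numbers can be sketched in Python as follows. The function name `sign_magnitude` and the default of four magnitude bits are illustrative assumptions chosen to match the +4 and -4 example above.

```python
def sign_magnitude(value, magnitude_bits=4):
    """Return the sign-magnitude bit string: sign bit first, then the magnitude."""
    if abs(value) >= 2 ** magnitude_bits:
        raise ValueError("magnitude does not fit in the given number of bits")
    sign = "1" if value < 0 else "0"
    return sign + format(abs(value), f"0{magnitude_bits}b")


print(sign_magnitude(+4))
print(sign_magnitude(-4))
```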

    The One’s-Complement Format

    In the 1’s-complement format, negative numbers are represented as 1’s-complement of the given number. For example, consider the binary number

    X = -0.11010001

    By changing 1s to 0s and 0s to 1s, we get its 1’s-complement as

    Y = 1.00101110

    The general expression with which negative numbers are represented in the one’s-complement format can now be written as
    where $\bar{b}_m$ is the 1’s complement of the bit $b_m$.

    It is easy to multiply two 1’s-complement numbers, as this is a straightforward procedure. But addition requires special algorithms. Even though the 1’s-complement scheme can be used for mathematical processing in digital systems, the 2’s-complement scheme is used in the majority of signal-processing computers that employ fixed-point arithmetic, because of its advantages in handling the carry.
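
    Bit inversion is simple to express in code. The sketch below, with the hypothetical helper name `ones_complement`, flips every bit of a bit string and leaves the binary point where it is; the input is the magnitude 0.11010001 from the example above, written with sign bit 0.

```python
def ones_complement(bits):
    """Flip every bit of a bit string; a '.' (binary point) is kept in place."""
    flip = {"0": "1", "1": "0", ".": "."}
    return "".join(flip[b] for b in bits)


print(ones_complement("0.11010001"))
```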

    The Two’s-Complement Format

    We have seen that the 2’s complement of a given number is obtained by adding a 1 to the LSB (least-significant bit) of the 1’s complement of that number. For example, consider the binary number

    X = 0.1101 0001

    The 1’s-complement representation of -0.1101 0001 is

    X′ = 1.0010 1110

    Now adding a 1 to the LSB of X′ yields its 2’s complement:

    Z = 1.0010 1111
    The general expression with which negative numbers are represented in the 2’s-complement format can now be written as

    where $\bar{b}_m$ is the 1’s complement of the given bit $b_m$. The symbol ⊕ represents modulo-2 addition of 1 with the 1’s complement of the given number. Modulo-2 addition is used so that the carry generated in the sign bit can be ignored. It can be seen from practical examples that the carry generated in the sign bit has to be ignored; only then will the mathematical operations employing 2’s complement produce correct results.
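
    The two steps (invert, then add 1 and ignore any carry out of the sign bit) can be sketched in Python. The helper name `twos_complement` is an assumption for illustration; the input is a bit string whose first bit is the sign bit, optionally with a binary point.

```python
def twos_complement(bits):
    """Two's complement of a bit string such as '0.11010001'."""
    digits = bits.replace(".", "")
    width = len(digits)
    inverted = int(digits, 2) ^ (2 ** width - 1)   # 1's complement
    result = (inverted + 1) % (2 ** width)         # add 1, ignore carry out of the sign bit
    out = format(result, f"0{width}b")
    point = bits.find(".")
    return out if point < 0 else out[:point] + "." + out[point:]


print(twos_complement("0.11010001"))
```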

    Range of Numbers in the Fixed-Point Arithmetic

    Consider the 4-bit representation of binary numbers in the fixed-point arithmetic. Since one bit has to be reserved for the sign, we can represent a maximum of $2^3$ (= 8) values with sign bit 0. These are 0.000 to 0.111. Similarly, there are $2^3$ (= 8) bit patterns with sign bit 1, namely 1.000 to 1.111; in the sign-magnitude format 1.000 is negative zero, so the distinct negative numbers are 1.001 to 1.111. Thus the total number of bit patterns in this case is sixteen.

    We may now obtain a general expression for the range of numbers in the fixed-point format. Let the number of bits that the given computer can accommodate be m. Of these m bits, we have to keep one bit for representing the sign. The remaining (m − 1) bits can be used to represent the magnitude. This means that we have a total of $2^{m-1}$ bit patterns with a positive sign and $2^{m-1}$ bit patterns with a negative sign. Thus in this scheme, the range of the numbers is

    $R = -(2^{m-1} - 1)$ to $(2^{m-1} - 1)$                                      (1.18)

    For example, consider the case of a 64-bit computer. With fixed-point format, the numbers that can be represented by it lie between

     $-(2^{63} - 1)$ and $(2^{63} - 1)$

    Even though this appears to be a large range of numbers, in reality, it is not so. It can be seen that the floating-point scheme can accommodate much larger numbers than the fixed-point scheme.
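
    The comparison can be made concrete in Python. This is a minimal sketch that assumes Python's `float` is a 64-bit IEEE 754 number, which holds on common platforms; `sys.float_info.max` is the largest finite value of that type.

```python
import sys

m = 64
fixed_max = 2 ** (m - 1) - 1       # Eq. (1.18) with m = 64
float_max = sys.float_info.max     # largest finite 64-bit floating-point value

print(fixed_max)
print(float_max)
print(float_max > fixed_max)
```
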
    In the fixed-point arithmetic, to restrict the maximum and minimum numbers so that they can be stored and processed in registers of finite lengths, we require two operations known as truncation and round off. Truncation is the operation of abruptly terminating a given number at some desired significant-bit (or digit) position. Round off (or simply, rounding) is the operation of approximating the truncated number at the same or next higher value. These are the same terms that we have been using in our day-to-day arithmetic. For example, consider the product
                                                              3.256789 × 7 = 22.797523

    Now, let us truncate this number to three significant digits after the decimal point. This means that we have to terminate the number abruptly at the third digit from the decimal point. In truncation operation, we stop at the desired digit, and do not look for the digit that comes after the truncated one. Using this theory, we find that the truncated number in our example is 22.797.

    However, we know that abrupt truncation is not a desired operation. In practice, for a better approximation of the given number, we always look for next digit after the truncated digit. If this digit is greater than or equal to 5, we add a 1 to the truncated digit; if it is less than 5, we discard it.  This operation is called rounding off. In our example, we had truncated the given number at the third digit after the decimal point (i.e., at 22.797). We now look for the fourth digit in the number, and find that it is a 5. Therefore, as per our round-off theory, we have to add a 1 to the truncated digit, and this operation makes the given number equal to 22.798 in the combined operations of truncation and rounded-off.

    The principle of truncation and round off can be extended to the binary number system also. We find that in the binary system, for rounding off, we add a 1 to the truncated bit if the next bit after the truncated bit is a 1; if it is a 0, we leave the truncated bit as such without any modification. Consider, for example, the binary number 1110.1101001. We want this to be truncated at the third bit after the binary point. Performing this operation, we get the truncated number as 1110.110.

    To round off the number at this point, we find that the next bit (i.e., the fourth bit after the binary point) is a 1. So, we add this to the third bit and the truncated and rounded-off number will be 1110.111. However, if the fourth bit were a 0, then the third bit will be left as such. This means that the truncated number 1110.110 itself will be the rounded-off number.          
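
    Both operations can be sketched on bit strings in Python. The helper names `truncate_bits` and `round_bits` are hypothetical, and the sketch assumes an input with a binary point and at least one kept fractional bit.

```python
def truncate_bits(bits, places):
    """Cut a binary string such as '1110.1101001' after `places` fractional bits."""
    whole, frac = bits.split(".")
    return whole + "." + frac[:places]


def round_bits(bits, places):
    """Truncate, then add 1 at the last kept bit if the first dropped bit is a 1."""
    whole, frac = bits.split(".")
    kept = int(whole + frac[:places], 2)
    if len(frac) > places and frac[places] == "1":
        kept += 1
    out = format(kept, f"0{len(whole) + places}b")
    return out[:-places] + "." + out[-places:]


print(truncate_bits("1110.1101001", 3))
print(round_bits("1110.1101001", 3))
```

    Working on the integer value of the kept bits lets a carry ripple upward naturally when the kept bits are all 1s.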


    Subtraction using 1's Complement & 2's Complement

  • Subtraction Using 1’s Complement (Indirect Subtraction)

    We can perform subtraction by adding minuend to the complement of subtrahend and doing some further manipulations in the sum so obtained. The steps involved in this procedure are:

    1. Find the complement of the subtrahend. In the decimal system, this is the 9’s complement: subtract the subtrahend from 9 if it is a single-digit number, from 99 if it is a double-digit number, and so on. In the binary system, it is the 1’s complement, obtained by changing 0s to 1s and 1s to 0s.
    2. Next, add the minuend and the complement of the subtrahend. This is called 9’s-complement addition in the decimal system and 1’s-complement addition in the binary system.
    3. Finally, remove the most significant digit (or bit) of the sum and add it to the least significant digit (or bit) of the remaining portion of the number to obtain the difference.

    Example 30: Subtract decimal number 4 from decimal number 7.

    Solution: Following step 1, we find the 9’s complement of 4, which is 9 ‒ 4 = 5. Next, add 7 and 5 to yield 12. Finally, add MSD 1 to LSD 2 to get 3. We find that this is the desired result. 

    Example 31: Subtract binary number 100 from binary number 111.

    Solution: To solve this problem, we use the steps given in Section 1.17.2.
             1. 1’s complement of 100  =  011
             2. 111 + 011 = 1010
             3. Removing the MSB 1 from 1010 and adding it to the remaining portion 010, we get

     Difference = 010 + 1 = 011
     We find that 011 is the desired result.
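
    The three steps can be sketched in Python for bit strings of equal width. The function name `subtract_ones_complement` is an assumption for illustration, and the sketch assumes the minuend is larger than the subtrahend, as in the examples above.

```python
def subtract_ones_complement(minuend, subtrahend):
    """Subtract equal-width bit strings (minuend > subtrahend) via 1's complement."""
    width = len(minuend)
    mask = 2 ** width - 1
    complement = int(subtrahend, 2) ^ mask        # step 1: 1's complement
    total = int(minuend, 2) + complement          # step 2: add
    carry, rest = total >> width, total & mask    # split off the MSB of the sum
    return format(rest + carry, f"0{width}b")     # step 3: add it to the LSB


print(subtract_ones_complement("111", "100"))
```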

    Subtraction using 2’s Complement

    Step 3 of the 1’s-complement procedure requires that the MSB of the sum be removed and added to the LSB. Binary subtraction can also be performed using 2’s complement, in which the MSB of the sum is simply discarded.

    2’s complement of a given number is obtained by adding binary 1 to its 1’s complement. For example, we know that 1’s complement of 101 is 010. Adding 1 to 010 yields the 2’s complement of 101 as 010 + 1= 011. The steps involved in binary subtraction using 2’s complement are:

    1. Find the 1’s complement of subtrahend.
    2. Add 1 to the 1’s complement to get the 2’s complement of subtrahend.
    3. Add minuend and 2’s complement of subtrahend.
    4. Discard the MSB to get the desired difference.

    Example 32: Subtract $(100)_2$ from $(111)_2$ using 2’s complement.

    Solution:  To solve this problem, we use the steps given in Section 'Subtraction using 2's Complement'.
     1. 1’s complement of 100 = 011.
     2. 1 + 011 = 100.
     3. 111 + 100 = 1011
     4. Discard the MSB 1 from 1011. We then get
         Difference = 011

    This is the desired result.
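
    The four steps translate directly into a short Python sketch. The function name `subtract_twos_complement` is hypothetical; both operands are assumed to be bit strings of the same width with the minuend not smaller than the subtrahend.

```python
def subtract_twos_complement(minuend, subtrahend):
    """Subtract equal-width bit strings (minuend >= subtrahend) via 2's complement."""
    width = len(minuend)
    mask = 2 ** width - 1
    ones = int(subtrahend, 2) ^ mask             # step 1: 1's complement
    twos = (ones + 1) & mask                     # step 2: add 1
    total = int(minuend, 2) + twos               # step 3: add to the minuend
    return format(total & mask, f"0{width}b")    # step 4: discard the carry out of the MSB


print(subtract_twos_complement("111", "100"))
```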


    Binary Subtraction with Examples

  • Binary Subtraction with Examples:

    Binary subtraction is performed in the same way we perform decimal subtraction. There are two methods for decimal and binary subtraction. In the first method, we use direct subtraction. In the second method, we use an indirect subtraction method called the complement method. We first discuss direct subtraction.

    Direct Subtraction

    In this scheme, we subtract a smaller number from a larger number. This is illustrated using an example of decimal subtraction.

    Example 28: Subtract decimal number 657 from decimal number 735.

    Solution: To subtract, we write the numbers one below the other as shown in Table 1.33. Then we first try to subtract the least-significant digit (LSD) 7 from LSD 5. This subtraction is not possible. Therefore, we borrow a 1 from digit 3 in the second digit position and transfer it to the LSD position. When this reaches the LSD, it gains a weight of 10 and we add this to LSD 5 to make it 10 + 5 = 15. We can now subtract 7 from 15 to yield the difference 8 in the LSD position.

    We then move to the second digit position of 3. Since a 1 was borrowed from this place to LSD, we have only a 2 remaining in this position. We now try to subtract 5 from 2 and this is not possible. Therefore, as in the previous case, we borrow a 1 from the most significant digit (MSD) position and add the new weight of 10 to 2 to make it 12. We can subtract 5 from 12 to yield the difference of 7 in the second digit position, as shown in Table 1.34.

    Finally, we notice that when a 1 is borrowed from the MSD, the digit remaining there would be 6 and 6 ‒ 6 = 0 as the difference in MSD.

    The steps described above can be used for direct subtraction of smaller numbers from larger numbers. Hence, we will adopt the same procedure to perform binary subtraction. As in the case of binary addition, for subtraction also we prepare subtraction tables. Table 1.35 shows the 2-bit subtraction, and Table 1.36 shows the 3-bit subtraction.

    Example 29: Subtract binary number 110 from binary number 1110.

    Solution: To subtract, we write the numbers one below the other as shown in Table 1.37.
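
    The borrow procedure described for direct subtraction can be sketched in Python for any radix up to 10. The function name `subtract_with_borrow` is an assumption for illustration, and the sketch assumes the minuend is not smaller than the subtrahend. It is applied here to the binary numbers of Example 29.

```python
def subtract_with_borrow(minuend, subtrahend, radix=2):
    """Digit-by-digit subtraction (minuend >= subtrahend), starting at the LSD."""
    a = [int(d, radix) for d in reversed(minuend)]
    b = [int(d, radix) for d in reversed(subtrahend)]
    b += [0] * (len(a) - len(b))     # pad the shorter number with leading zeros
    digits, borrow = [], 0
    for x, y in zip(a, b):
        x -= borrow                  # repay a borrow taken by the previous position
        if x < y:                    # not possible: borrow from the next position
            x += radix
            borrow = 1
        else:
            borrow = 0
        digits.append(x - y)
    return "".join(str(d) for d in reversed(digits))


print(subtract_with_borrow("1110", "110"))
```

    The same routine handles decimal digits when called with radix 10.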