The IEEE Standard for Binary Floating-Point Arithmetic (IEEE 754) is the most widely used standard for floating-point computation, and is followed by many CPU and FPU implementations. The standard defines formats for representing floating-point numbers (including negative zero and denormal numbers) and special values (infinities and NaNs), together with a set of floating-point operations that operate on these values. It also specifies four rounding modes and five exceptions (including when the exceptions occur, and what happens when they do occur).

Summary
IEEE 754 specifies four formats for representing floating-point values: single-precision (32-bit), double-precision (64-bit), single-extended precision (≥ 43-bit, not commonly used) and double-extended precision (≥ 79-bit, usually implemented with 80 bits). Only 32-bit values are required by the standard; the others are optional. Many languages specify that IEEE formats and arithmetic be implemented, although sometimes it is optional. For example, the C programming language, which pre-dated IEEE 754, now allows but does not require IEEE arithmetic (the C float typically is used for IEEE single-precision and double uses IEEE double-precision).
The full title of the standard is IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985), and it is also known as IEC 60559:1989, Binary floating-point arithmetic for microprocessor systems (originally the reference number was IEC 559:1989).[1] Later there was an IEEE 854-1987 for "radix independent floating point" as long as the radix is 2 or 10. In June 2008, a major revision to IEEE 754 and IEEE 854 was approved by the IEEE. See IEEE 754r.

Structure of a floating-point number
Following is a description of the standard's format for floating-point numbers.

Bit conventions used in this article
Bits within a word of width W are indexed by integers in the range 0 to W−1 inclusive. The bit with index 0 is drawn on the right. The lowest indexed bit is usually the lsb (Least Significant Bit, the one that if changed would cause the smallest variation of the represented value).

General layout
Binary floating-point numbers are stored in a sign-magnitude form where the most significant bit is the sign bit, exponent is the biased exponent, and "fraction" is the significand without the most significant bit.

Exponent biasing
The exponent is biased by 2^(e−1) − 1, where e is the number of bits used for the exponent field (e.g. if e = 8, then 2^(8−1) − 1 = 128 − 1 = 127). See also Excess-N.
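Biasing is done because exponents have to be signed values in order to be able to represent both tiny and huge values, but two's complement, the usual representation for signed values, would make comparison harder. To solve this, the exponent is biased before being stored by adjusting its value to put it within an unsigned range suitable for comparison. For example, to represent a number which has an exponent of 17 in an exponent field 8 bits wide, the stored value is 17 + (2^(8−1) − 1) = 17 + 127 = 144.

Cases
The most significant bit of the significand (not stored) is determined by the value of exponent. If 0 < exponent < 2^e − 1, the most significant bit of the significand is 1, and the number is said to be normalized. If exponent is 0, the most significant bit of the significand is 0 and the number is said to be de-normalized. Three special cases arise:
1. If exponent is 0 and fraction is 0, the number is ±0 (depending on the sign bit).
2. If exponent = 2^e − 1 and fraction is 0, the number is ±infinity (again depending on the sign bit).
3. If exponent = 2^e − 1 and fraction is not 0, the value is not a number (NaN).
This can be summarized as:
Type                  Exponent        Fraction
Zeroes                0               0
Denormalized numbers  0               non-zero
Normalized numbers    1 to 2^e − 2    any
Infinities            2^e − 1         0
NaNs                  2^e − 1         non-zero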
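The cases above can be sketched in a few lines of Python. This is only an illustration, not part of the standard: the helper names bias, split_fields, classify and float32_bits are made up for this example, and the default field widths assume the 32-bit single-precision layout (8 exponent bits, 23 fraction bits).

```python
import struct

def bias(exp_bits):
    """Exponent bias for an exponent field of exp_bits bits: 2^(exp_bits-1) - 1."""
    return 2 ** (exp_bits - 1) - 1

def split_fields(bits, exp_bits=8, frac_bits=23):
    """Split a raw bit pattern into its (sign, exponent, fraction) fields."""
    sign = bits >> (exp_bits + frac_bits)
    exponent = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    fraction = bits & ((1 << frac_bits) - 1)
    return sign, exponent, fraction

def classify(bits, exp_bits=8, frac_bits=23):
    """Name the case a bit pattern falls into, following the summary above."""
    _, exponent, fraction = split_fields(bits, exp_bits, frac_bits)
    all_ones = (1 << exp_bits) - 1
    if exponent == 0:
        return "zero" if fraction == 0 else "denormalized"
    if exponent == all_ones:
        return "infinity" if fraction == 0 else "NaN"
    return "normalized"

def float32_bits(x):
    """Bit pattern of x after conversion to single precision (hypothetical helper)."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

# An exponent of 17 in an 8-bit field is stored as 17 + 127.
print(17 + bias(8))
print(classify(float32_bits(1.0)), classify(float32_bits(-0.0)))
```

Passing exp_bits=11 and frac_bits=52 would apply the same logic to a 64-bit pattern.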

Single-precision 32-bit
A single-precision binary floating-point number is stored in 32 bits: 1 sign bit, 8 exponent bits and 23 fraction bits. The exponent is biased by 2^(8−1) − 1 = 127 in this case, so exponents in the range −126 to +127 are representable (see the explanation above for why biasing is done). An exponent of −127 would be biased to the value 0, but this is reserved to encode that the value is a denormalized number or zero. An exponent of 128 would be biased to the value 255, but this is reserved to encode an infinity or not a number (NaN). See the summary of cases above.
For normalized numbers, which are the most common, exponent is the biased exponent and fraction is the significand without the most significant bit. The number has value v:
v = s × 2^e × m
or, written in terms of the stored fields,
v = (−1)^sign × 2^(exponent − 127) × 1.fraction (significand in binary)
Where
s = +1 when the sign bit is 0, and s = −1 when the sign bit is 1;
e = exponent − 127;
m = 1.fraction in binary, so that 1 ≤ m < 2.
For example, take the bit pattern 0 01111100 01000000000000000000000. The sign bit is zero so s is +1, the exponent field is 124 so e is −3, and the significand m is 1.01 (in binary, which is 1.25 in decimal). The represented number is therefore +1.25 × 2^−3, which is +0.15625.
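As a rough check of the value rule for normalized single-precision numbers, the following Python sketch applies v = s × 2^e × m to a raw 32-bit pattern. The function name decode_single is hypothetical, and the sketch deliberately rejects the reserved exponent values 0 and 255 instead of handling them.

```python
def decode_single(bits):
    """Value of a normalized single-precision bit pattern, v = s * 2^e * m."""
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0 or exponent == 255:
        # Zero, denormalized, infinity and NaN are outside this sketch.
        raise ValueError("not a normalized number")
    s = -1 if sign else 1
    e = exponent - 127           # remove the bias
    m = 1 + fraction / 2**23     # restore the implicit leading 1
    return s * m * 2.0**e

# Sign 0, exponent field 124, fraction 0100...0
print(decode_single(0x3E200000))  # 0.15625
```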
Here is the summary table from the previous section with some 32-bit single-precision examples:
Range and Precision Table: Some example range and precision values for given exponents:
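Note that 16,777,217 cannot be encoded as a 32-bit float (it will be rounded to 16,777,216), because it needs 25 significant bits while the format holds only 24.

A more complex example
The decimal number −118.625 is encoded using the IEEE 754 system as follows:
1. The number is negative, so the sign bit is 1.
2. The magnitude 118.625 is 1110110.101 in binary.
3. Moving the binary point left until a single 1 remains to its left gives 1.110110101 × 2^6. This is a normalized number; the fraction is the part to the right of the binary point, padded with zeros on the right to 23 bits: 11011010100000000000000.
4. The exponent is 6 and the bias is 127, so the exponent field is 6 + 127 = 133, which is 10000101 in binary.
The result is the bit pattern 1 10000101 11011010100000000000000 (hexadecimal C2ED4000).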
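One way to check an encoding such as that of −118.625, and the rounding of 16,777,217, is to let Python's struct module perform the conversion to single precision. This is only a sketch: the helper names to_float32_bits and round_to_float32 are invented for the example, and it assumes the platform converts with the default round-to-nearest mode.

```python
import struct

def to_float32_bits(x):
    """Bit pattern obtained when x is stored as a 32-bit float."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def round_to_float32(x):
    """Nearest single-precision value to x, returned as a Python float."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

bits = to_float32_bits(-118.625)
sign = bits >> 31
exponent = (bits >> 23) & 0xFF
fraction = bits & 0x7FFFFF

print(f"{bits:08X}")                        # C2ED4000
print(sign, exponent, f"{fraction:023b}")   # 1 133 11011010100000000000000

# 16,777,217 = 2^24 + 1 does not fit in a 24-bit significand.
print(round_to_float32(16777217.0))         # 16777216.0
```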

Double-precision 64 bit
Double precision is essentially the same except that the fields are wider: 1 sign bit, 11 exponent bits and 52 fraction bits. The fraction part is much larger, while the exponent is only slightly larger. NaNs and infinities are represented with the exponent field being all 1s (2047). If the fraction part is all zero then the value is an infinity, else it is a NaN. For normalized numbers the exponent bias is +1023 (so e is exponent − 1023). For denormalized numbers the exponent is −1022 (the minimum exponent for a normalized number; it is not −1023 because normalized numbers have a leading 1 digit before the binary point and denormalized numbers do not). As before, both infinity and zero are signed.
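The double-precision rules above can be illustrated with a small Python sketch that splits a 64-bit value into its fields and rebuilds the value from them. The function names decode_double and value_from_fields are hypothetical; Python's own float is assumed to be an IEEE double, which holds on common platforms.

```python
import struct

def decode_double(x):
    """Return the (sign, exponent, fraction) fields of a 64-bit double."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    return sign, exponent, fraction

def value_from_fields(sign, exponent, fraction):
    """Rebuild the value from the fields, following the rules in the text."""
    s = -1.0 if sign else 1.0
    if exponent == 2047:
        # All-ones exponent: infinity if the fraction is zero, otherwise NaN.
        return s * float("inf") if fraction == 0 else float("nan")
    if exponent == 0:
        # Zero or denormalized: no implicit leading 1, exponent fixed at -1022.
        return s * (fraction / 2**52) * 2.0**-1022
    # Normalized: implicit leading 1, bias of 1023.
    return s * (1 + fraction / 2**52) * 2.0**(exponent - 1023)

print(decode_double(1.0))                              # (0, 1023, 0)
print(value_from_fields(*decode_double(-118.625)))     # -118.625
```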

Comparing floating-point numbers
Every possible bit combination is either a NaN or a number with a unique value in the affinely extended real number system with its associated order, except for the two bit combinations negative zero and positive zero, which sometimes require special attention (see below). The binary representation has the special property that, excluding NaNs, any two numbers can be compared like sign and magnitude integers (although with modern computer processors this is no longer directly applicable): if the sign bits differ, the negative number precedes the positive number (except that negative zero and positive zero should be considered equal); otherwise, the relative order is the same as lexicographical order, but inverted for two negative numbers. Endianness issues apply.
Floating-point arithmetic is subject to rounding that may affect the outcome of comparisons on the results of the computations.
Although negative zero and positive zero are generally considered equal for comparison purposes, some programming language relational operators and similar constructs may treat them as distinct. According to the Java Language Specification,[3] comparison and equality operators treat them as equal, but Math.min() and Math.max() distinguish them (officially starting with Java version 1.1 but actually with 1.1.1), as do the comparison methods equals(), compareTo() and even compare() of classes Float and Double. For C++, the standard does not have anything to say on the subject, so the behavior should be verified for the environment in use (one environment tested treated them as equal when using a floating-point variable, and treated them as distinct, with negative zero preceding positive zero, when comparing floating-point literals).

Rounding floating-point numbers
The IEEE standard has four different rounding modes; the first is the default; the others are called directed roundings.
1. Round to nearest: rounds to the nearest representable value; if the number falls exactly midway between two values, it is rounded to the one whose least significant bit is even (zero).
2. Round toward 0
3. Round toward +∞
4. Round toward −∞
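The comparison property described in this section (for non-NaN values, bit patterns order like sign-and-magnitude integers) can be sketched in Python as follows. The function order_key is a made-up name, it works on the single-precision pattern only, and NaNs are assumed to have been excluded by the caller.

```python
import struct

def order_key(x):
    """Integer whose ordering matches the ordering of non-NaN float32 values."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    magnitude = bits & 0x7FFFFFFF
    # Negative numbers invert the order of their magnitudes;
    # -0.0 and +0.0 both map to 0 and therefore compare equal.
    return -magnitude if sign else magnitude

values = [1.0, float("-inf"), 0.15625, -118.625, 16777216.0, float("inf"), -1.0, 0.0]
print(sorted(values, key=order_key) == sorted(values))  # True
print(order_key(-0.0) == order_key(0.0))                # True
```

Mapping both zeros to the same key is a choice made for this sketch; a key that keeps the sign would instead place negative zero before positive zero.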

Extending the real numbers
The IEEE standard employs (and extends) the affinely extended real number system, with separate positive and negative infinities. During drafting, there was a proposal for the standard to incorporate the projectively extended real number system, with a single unsigned infinity, by providing programmers with a mode selection option. In the interest of reducing the complexity of the final standard, the projective mode was dropped, however. The Intel 8087 and Intel 80287 floating point co-processors both support this projective mode.[4][5][6]

Recommended functions and predicates
References
This article is from Wikipedia. All text is available under the terms of the GNU Free Documentation License.