Smallest positive single-precision IEEE-754 number

I'm trying to find the smallest positive single-precision IEEE-754 number S such that 1 + S ≠ 1.
So far I have arrived at 0 00000000 00000000000000000000001, which is a denormalized number (indicated by the all-zero exponent). How do I compute 1 + S (0 01111111 00000000000000000000000 + 0 00000000 00000000000000000000001)? I have trouble applying the formula (mA * 2^(eA-eB) + mB) * 2^eB, where mA/mB are the mantissas and eA/eB are the corresponding exponents.
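The formula itself shows why this particular S doesn't work: aligning the exponents shifts S's significand 149 bits to the right, far past the 24 significand bits of 1.0, so the rounded sum is exactly 1. A minimal C sketch of both cases, assuming IEEE-754 floats and round-to-nearest (volatile forces each sum to be rounded to float):

#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void) {
    /* S = smallest positive subnormal: 0 00000000 00000000000000000000001 = 2^-149 */
    float S = nextafterf(0.0f, 1.0f);

    /* Aligning exponents shifts S's significand 149 bits right, past all
       24 significand bits of 1.0f, so the rounded sum is exactly 1.0f. */
    volatile float sum = 1.0f + S;
    printf("1 + 2^-149 == 1 ? %d\n", sum == 1.0f);       /* 1 */

    /* The smallest power of two that survives is FLT_EPSILON = 2^-23. */
    volatile float a = 1.0f + FLT_EPSILON;
    volatile float b = 1.0f + FLT_EPSILON / 2;  /* 2^-24 ties, rounds back to 1 */
    printf("1 + 2^-23  != 1 ? %d\n", a != 1.0f);         /* 1 */
    printf("1 + 2^-24  != 1 ? %d\n", b != 1.0f);         /* 0 */
    return 0;
}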

Related

Normalized and denormalized arithmetic

I have two numbers in a format similar to IEEE 754 (1 bit for the sign, 8 for the exponent, 7 for the mantissa); the first is denormalized and the second is normalized. How can I add/divide them?
S E        M         value
0 00000000 1100000 = 2^(-127) * 0.1100000   (denormalized)
0 11000000 0000001 = 2^65 * 1.0000001       (normalized)
BIAS = 2^(8-1)-1 = 127
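One way to see what happens is to decode both bit patterns into a wider type and let the hardware do the alignment. A sketch in C, assuming the 16-bit layout above (1 sign, 8 exponent, 7 mantissa bits, bias 127) and following the question's own reading that an all-zero exponent means 0.fffffff * 2^-127:

#include <stdio.h>
#include <math.h>

/* Decode a 16-bit pattern: 1 sign, 8 exponent, 7 mantissa bits, bias 127.
   Assumption (the question's reading): exponent field 0 marks a
   denormalized value 0.fffffff * 2^-127; otherwise 1.fffffff * 2^(e-127). */
double decode16(unsigned bits) {
    int sign = (bits >> 15) & 1;
    int e    = (bits >> 7) & 0xFF;
    double f = (bits & 0x7F) / 128.0;       /* the 7 fraction bits */
    double m = (e == 0) ? f : 1.0 + f;      /* implicit 1 only if normalized */
    return (sign ? -1.0 : 1.0) * ldexp(m, (e == 0) ? -127 : e - 127);
}

int main(void) {
    double a = decode16(0x0060);  /* 0 00000000 1100000 = 0.75      * 2^-127 */
    double b = decode16(0x6001);  /* 0 11000000 0000001 = 1.0078125 * 2^65   */
    /* Addition requires aligning exponents: a's mantissa would shift ~192
       bits right, so it vanishes entirely and a + b comes out as just b. */
    printf("a   = %g\nb   = %g\n", a, b);
    printf("a+b = %g\na/b = %g\n", a + b, a / b);
    return 0;
}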

How is the fractional part of the mantissa represented?

When many people try to explain floating-point number representation, they break it down into three parts:
Sign bit indicating negative or positive
Exponent indicating the scale (e.g. 8)
Mantissa holding the significant digits (e.g. 1.2345)
I understand that these values are treated as integers in a single 64-bit word (for doubles). What I have not seen explained is how you would represent a mantissa of 1.2345 in binary format when computers know nothing about the decimal separator or where it should be "placed" within the mantissa.
I am looking for a complete step-by-step explanation of how to construct a (say, 32-bit) floating-point representation of a decimal number, and vice versa.
The following assumes a binary format.
Except for a few special cases (0 and subnormal numbers), the significand m (the technical name; a mantissa is a similar but slightly different concept) lies in the interval 1 ≤ m < 2, so the leading digit doesn't need to be stored (it is always 1).
The remaining bits give the fractional part, stored as binary. You can think of this as subtracting off decreasing powers of 2:
0.2345 < 2^-1, so the first bit is 0
0.2345 < 2^-2, so the second bit is 0
0.2345 ≥ 2^-3, so the third bit is 1
0.2345 - 2^-3 = 0.1095
0.1095 ≥ 2^-4, so the 4th bit is 1
0.1095 - 2^-4 = 0.047
0.047 ≥ 2^-5, so the 5th bit is 1
0.047 - 2^-5 = 0.01575
Continuing this process gives the expansion
00111100000010000011000100100110111010010111100011010100111111...
(from Rick Regan's decimal/binary converter)
For a double this is rounded to 52 bits (or 53 if you count the implicit leading 1):
0011110000001000001100010010011011101001011110001101
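You can verify this against what a compiler actually stores. A short C sketch, assuming doubles are IEEE-754 binary64, that dumps the 52-bit fraction field of 1.2345:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    double x = 1.2345;
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);       /* reinterpret the binary64 encoding */
    for (int i = 51; i >= 0; i--)         /* low 52 bits are the fraction field */
        putchar('0' + (int)((bits >> i) & 1));
    putchar('\n');
    /* Should print: 0011110000001000001100010010011011101001011110001101 */
    return 0;
}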
Actually, the decimal point is not stored in a number's float or double representation at all. You are right about the three parts of a floating-point representation, but there are some important nuances:
First bit indicates sign - 0 for positive value, 1 for negative.
The exponent is in base 2 instead of base 10, so that 3 means 2^3 rather than 10^3. In the case of 32-bit floats we have 8 bits for the exponent, with 10000000 taken as zero; this gives us exponents between -128 and 127 (we can encode magnitudes from 2^-128 to 2^127 using them).
The mantissa (or significand) is stored as in scientific notation, which means it lies in [0, 1): greater than or equal to 0 and less than 1. This way we don't have to store the leading "0." part; i.e. 0.1234 can be stored as 1234.
Now, how is the above stored in an actual floating-point number? Let's look at an example: say we have the number x = 123.456.
It is positive, so the sign bit is 0.
Now we look for y, the smallest power of 2 such that 0 ≤ x/y < 1. In our case it is 128 (or 2^7): 123.456 / 128 = 0.9645, which gives us an exponent equal to 7, stored as 10000111 in binary (7 added to the zero point 10000000).
Now we have to encode 0.9645, but we don't need to actually store the "0." part. As you may know, the fractional part of a number is represented in binary by negative powers of 2 counting from the left, i.e. 0.1101 in binary means 1 * 2^-1 + 1 * 2^-2 + 0 * 2^-3 + 1 * 2^-4 = 1/2 + 1/4 + 0 + 1/16 = 0.5 + 0.25 + 0.0625 = 0.8125. This is how we convert our 0.9645 to binary: we repeatedly subtract negative powers of 2 from it until we reach 0 or run out of bits (this is where precision issues arise): 0.9645 - 2^-1 - 2^-2 - 2^-3 - 2^-4 - 2^-6 - ..., or 0.111101... in binary. Looking ahead, 0.9645 is 0.11110110111010010111100 in base 2 with 23-bit precision (that's how many bits are left of the 32 after we've used 1 for the sign and 8 for the exponent).
Wrapping up, we get the following floating-point representation of our initial number 123.456:
0 10000111 11110110111010010111100
^    ^               ^
|    |               |
|    |               +---- significand = 0.9644999504089356 ~ 0.9645
|    |
|    +-------------------- exponent = 7 (stored as 10000000 + 7 = 10000111)
|
+-------------------------- sign = 0 (positive)
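Note that real IEEE-754 single precision does this slightly differently: the significand is normalized into [1, 2) with an implicit leading 1, and the bias is 127, so the bits actually stored for 123.456f are not the ones in the simplified scheme above. A C sketch that dumps the real fields:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float x = 123.456f;
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);       /* reinterpret the binary32 encoding */

    printf("sign     = %u\n", bits >> 31);                          /* 0 */
    printf("exponent = ");
    for (int i = 30; i >= 23; i--) putchar('0' + (int)((bits >> i) & 1));
    printf(" (unbiased: %d)\n", (int)((bits >> 23) & 0xFF) - 127);  /* 10000101, i.e. 6 */
    printf("fraction = ");
    for (int i = 22; i >= 0; i--) putchar('0' + (int)((bits >> i) & 1));
    putchar('\n');                   /* 11101101110100101111001, with the 1. implicit */
    return 0;
}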

Float vs Double data type in Hive

As per the Hive's documentation:
FLOAT (4-byte single precision floating point number)
DOUBLE (8-byte double precision floating point number)
What does a 4-byte single precision or 8-byte double precision floating point number mean?
4 bytes and 8 bytes are the amount of space they take up. In all likelihood, the numbers are represented in IEEE 754 single-precision and double-precision format.
In short, floating-point numbers are represented as ±d * 2^e,
with d limited to 0 ≤ d < 2^24 in the single-precision case and 0 ≤ d < 2^53 in the double-precision case. Note that even fractional numbers with a simple expression in decimal, like 0.1, are not automatically representable exactly in these formats. Instead, 0.1 ends up represented as 13421773 * 2^-27 in single precision, and as 3602879701896397 * 2^-55 in double precision. These are good approximations, because 2^27 is 134217728 and 2^55 is 36028797018963968. The double-precision approximation is better, but neither is exact, because 0.1 can never be written as d * 2^e for any integers d and e.
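A two-line C check makes the approximation visible (output shown in the comments, assuming IEEE-754 and round-to-nearest):

#include <stdio.h>

int main(void) {
    float  f = 0.1f;  /* stored as 13421773 * 2^-27 */
    double d = 0.1;   /* stored as 3602879701896397 * 2^-55 */
    printf("float : %.17f\n", f);  /* 0.10000000149011612 */
    printf("double: %.17f\n", d);  /* 0.10000000000000001 */
    return 0;
}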

Maximum and minimum exponents in double-precision floating-point format

According to the IEEE Std 754-2008 standard, the exponent field width of the binary64 double-precision floating-point format is 11 bits, which is compensated by an exponent bias of 1023. The standard also specifies that the maximum exponent is 1023, and the minimum is -1022. Why is the maximum exponent not:
2^10 + 2^9 + 2^8 + 2^7 + 2^6 + 2^5 + 2^4 + 2^3 + 2^2 + 2^1 + 2^0 - 1023 = 1024
And the minimum exponent not:
0 - 1023 = -1023
Thanks!
The bits for the exponent have two reserved values, one for encoding 0 and subnormal numbers, and one for encoding ∞ and NaNs. As a result of this, the range of normal exponents is two smaller than you would otherwise expect. See §3.4 of the IEEE-754 standard (w is the number of bits in the exponent — 11 in the case of binary64):
The range of the encoding's biased exponent E shall include:
― Every integer between 1 and 2^w − 2, inclusive, to encode normal numbers
― The reserved value 0 to encode ±0 and subnormal numbers
― The reserved value 2^w − 1 to encode ±∞ and NaNs.
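A quick C sketch (assuming binary64) that reads the biased exponent field E out of a few doubles and shows the two reserved values in action:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

/* Print the 11-bit biased exponent field E of a binary64 double. */
static void show(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    int E = (int)((bits >> 52) & 0x7FF);
    if (E == 0)          printf("E = %4d  -> zero or subnormal\n", E);
    else if (E == 0x7FF) printf("E = %4d  -> infinity or NaN\n", E);
    else                 printf("E = %4d  -> normal, exponent %d\n", E, E - 1023);
}

int main(void) {
    show(0.0);                       /* E = 0    (reserved) */
    show(4.9e-324);                  /* smallest subnormal, also E = 0 */
    show(1.0);                       /* E = 1023 -> exponent 0 */
    show(1.7976931348623157e308);    /* DBL_MAX, E = 2046 -> exponent 1023 */
    show(INFINITY);                  /* E = 2047 (reserved) */
    return 0;
}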

Floating point mantissa bias

Does anybody know how to go about solving this problem?
* a = 1.0 × 2^9
* b = −1.0 × 2^9
* c = 1.0 × 2^1
Using this floating-point representation (a 14-bit format: 5 bits for the exponent with a bias of 16, a normalized mantissa of 8 bits, and a single sign bit), perform the following two calculations, paying close attention to the order of operations.
* b + (a + c) = ?
* (b + a) + c = ?
To go through this exercise, you just follow the addition steps, as explained e.g. here:
http://en.wikipedia.org/wiki/Floating_point#Addition_and_subtraction
S EEEEE MMMMMMMM
0 11001 10000000 a
1 11001 10000000 b
0 10001 10000000 c
0 11001 00000000 c, denormalized (uh oh!)
If I'm doing this right, it looks like c's mantissa is shifted out entirely when you denormalize it to a's exponent, so a + c leaves just a; you then end up adding 1 to -1 at the same exponent, and so you end up with 0. I believe this is a lesson on the limitations of adding a small number to a large one in a floating-point format.
I will leave the second problem to you...
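The same cancellation is easy to reproduce in ordinary 32-bit floats. This is an analogue, not the 14-bit format from the exercise: the magnitudes are chosen so that c is more than 2^24 times smaller than a, which plays the same role the 8-bit mantissa does above.

#include <stdio.h>

int main(void) {
    /* a and c differ by more than the 24-bit significand can span,
       so a + c rounds straight back to a, just as in the exercise. */
    float a = 0x1.0p25f;    /* 2^25  */
    float b = -0x1.0p25f;   /* -2^25 */
    float c = 1.0f;

    volatile float t1 = a + c;   /* == a: c is absorbed */
    volatile float r1 = b + t1;
    volatile float t2 = b + a;   /* == 0: exact cancellation */
    volatile float r2 = t2 + c;
    printf("b + (a + c) = %g\n", r1);   /* 0 */
    printf("(b + a) + c = %g\n", r2);   /* 1 */
    return 0;
}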
