06 Optional: IEEE Floating-Point Standard


Alright, time to turn our attention to how we represent floating-point numbers, and specifically to the IEEE standard notation for floating-point numbers, which all modern computing systems use.

Floating-point is analogous to scientific notation. You may remember from your science classes that you wouldn't write a number like 12 million with all those zeros behind it, but rather as 1.2 times 10 to the 7th. Similarly, a really tiny number like 0.0000012 we can represent as 1.2 times 10 to the minus 6. In fact, C supports this notation, letting you write floating-point numbers as 1.2e7 and 1.2e-6 for the two examples above.

This goes back to IEEE standard 754, which was established in 1985 as a uniform standard for floating-point arithmetic. Before that, there were all kinds of different formats that were very difficult to combine, but today's CPUs all use this same standard. The standardization was really driven by numerical concerns: standards for handling rounding, overflow, and underflow, representing things like division by zero, and so on. It ended up creating a standard that is very hard to make fast in hardware, but is numerically very well behaved; those concerns dominated the standardization effort.

Let's take a look at the details of the IEEE floating-point representation. We represent a value as a sign, a magnitude, and an exponent for a power of 2, since we are working in binary: the number is (-1)^s times M times 2^E. So this is back to sign-and-magnitude notation. The sign bit s determines whether the number is negative or positive. The significand, or mantissa, M is normally a fractional value in the range [1.0, 2.0). Notice that it can be exactly 1.0, but only a smidgen less than 2.0; that's why we use the rounded parenthesis on that side. And the exponent E, which can of course be negative, multiplies the mantissa by that power of 2.

The representation in memory, then, uses one bit for the sign, since that's all we need. Some number of bits for the exponent go in a field called exp; it encodes the value of E, but is not exactly E. We'll see what I mean by that in a bit. And then a fractional field, frac, encodes the mantissa, but again is not exactly equal to the mantissa. We'll see what the difference is in just a sec.

So how many bits do we assign to each of these fields? We said we have one bit for the sign; that's easy enough. For a floating-point number represented in 32 bits, the actual IEEE standard says we use 8 bits for the exponent. That limits how large and how small our numbers can get. And then we use the remaining 23 bits for representing the mantissa, or fractional part, and that determines our precision. So we have range and precision, and of course the trade-off between the two is how many bits we use for each. In IEEE floating-point there's also a 64-bit representation, for doubles, that uses 11 bits for the exponent and 52 for the fraction: quite a bit more precision, and also more range.
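To make that layout concrete, here's a minimal C sketch (the variable names are just illustrative) that stores a float written in e-notation and pulls apart the three fields of its 32-bit pattern using the shifts and masks implied by the 1/8/23 split:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float f = 1.2e7f;               /* C's e-notation: 1.2 x 10^7 */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits); /* reinterpret the float's 32 bits */

    uint32_t sign = bits >> 31;          /* 1 sign bit       */
    uint32_t exp  = (bits >> 23) & 0xFF; /* 8 exponent bits  */
    uint32_t frac = bits & 0x7FFFFF;     /* 23 fraction bits */

    printf("sign=%u exp=%u frac=0x%06X\n", sign, exp, frac);
    return 0;
}
```

Using memcpy, rather than a pointer cast, is the portable way to look at a float's raw bits in C; a cast can run afoul of strict-aliasing rules.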
Alright, so let's talk about the mantissa first, the significand. We're going to talk about normalized numbers, meaning the mantissa is always of the form 1.xxxxx, for some binary bits x. This is analogous to what we do with scientific notation in decimal, where we normalize so there's a single digit before the point. So if we wanted to represent the number 0.011 times 2 to the 5th, we would normalize that to 1.1 times 2 to the 3rd. Okay? Those are exactly the same value, but the latter makes better use of the available bits, because we don't have to bother with those leading zeros. And since we know the mantissa is always going to start with that "1." at the beginning, we're not even going to bother to store it in our representation. Why waste a bit on something we know is always going to be there? That's why the frac field doesn't encode the mantissa exactly: it only encodes the binary digits to the right of the binary point, not the implied 1 to the left.

But now we have to ask ourselves a question: how do we represent the number 0.0? Ideally we'd like it to be the all-zeros bit pattern; if we have zeros throughout our 32 bits, I would still like that to correspond to zero. So we have to figure out how to get that to work out exactly, and that's going to pose some challenges for us. And what about values like 1 divided by 0, which yield, basically, something that is not a number? How are we going to encode that?

What we're going to do is reserve a couple of exponent-field values to handle these cases. The special value we're most interested in, as I've already mentioned, is having the bit pattern of all zeros represent zero. So an exponent field of all zero bits is used to help us represent zero. We're also going to reserve an exponent of all ones for two other kinds of values we need. If the exponent is all ones and the fractional part is all zeros, that represents infinity, or a very large overflowed value; and of course we have both positive infinity and negative infinity, because the sign bit can represent that for us. Similarly, if the exponent is all ones and the fraction is not zero, we use that to represent "not a number" (NaN). NaN is an important value for operations that have an undefined result, things like the square root of minus one, infinity minus infinity, or infinity times zero; those are clearly not ones we can come up with a numeric value for. So we reserve the exponent fields of all zeros and all ones for this purpose.

Now let's turn our attention to how we deal with that exponent field. Since we can't use all zeros or all ones, because we need those for the special values, we encode the exponent using a bias. Basically, the real exponent we want on the number, the value E, is represented as the exponent field minus a bias: E = exp - Bias. Here exp is an unsigned value ranging from 1 to 2^k - 2, where k is the number of bits in the exponent field, and we use a bias of 2^(k-1) - 1.
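Here's a small C sketch of that decoding rule, assuming single precision (k = 8, so a bias of 127, as we'll work out next); the classify helper is just an illustrative name, not a standard function:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Classify a 32-bit float by its exponent field, as described above:
   exp == 0   -> the reserved all-zeros case (zero)
   exp == 255 -> infinity if frac == 0, NaN if frac != 0
   otherwise  -> a normalized number with E = exp - 127 (the bias) */
static void classify(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    uint32_t exp  = (bits >> 23) & 0xFF;
    uint32_t frac = bits & 0x7FFFFF;

    if (exp == 0)
        printf("%g: reserved exponent of all zeros (zero)\n", f);
    else if (exp == 255)
        printf("%g: %s\n", f, frac != 0 ? "NaN (not a number)" : "infinity");
    else
        printf("%g: normalized, E = exp - Bias = %d\n", f, (int)exp - 127);
}

int main(void) {
    volatile float zero = 0.0f; /* volatile so the divisions happen at run time */
    classify(0.0f);             /* all-zeros bit pattern */
    classify(12345.0f);         /* a normalized number   */
    classify(1.0f / zero);      /* +infinity             */
    classify(zero / zero);      /* NaN                   */
    return 0;
}
```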
Let's see what that really means. For single precision, the bias turns out to be 127. Since we can have exp values from 1 to 254 using 8 bits (remember, we're not using 0 and we're not using 255, because those are the reserved special values), that corresponds to exponents E from minus 126 to 127. So what the bias lets us do is represent both positive and negative exponents within that range of 1 to 254 for the bit patterns in the exponent field. For double precision, of course, we have 11 bits, so exp goes from 1 to 2046, and the bias is a bit more, 1023, so the exponents we can represent run from minus 1022 to positive 1023. These enable both large positive exponents, for representing large numbers, and very small values, by having a negative exponent.

Okay, so the significand, as I've mentioned, is encoded without that leading 1 of the mantissa; we just store the remaining bits. A frac field of all zeros corresponds to M = 1.0, because the "1." is assumed and only zeros follow. A frac field of all ones is equivalent to 1.11111..., which is very close to 2, but not quite 2. So we get that leading extra bit for free. Now we've seen how we encode both E and M in our exponent and fraction fields; that's why they're not an exact representation of those values but rather an encoding.

Alright, so let's look at the floating-point number 12345.0. Remember, that's the same old bit pattern as the integer 12345: 11000000111001 in binary. Now we have to normalize it, putting it in a form where the significand starts with "1.". The way we do that is by moving the binary point 13 positions over, to sit right after the leading one, giving 1.1000000111001 times 2 to the 13th. That's our normalized form, and now we can encode the significand, which is just that value brought down. Of course we're not going to bother with the leading 1; we just use the rest of the bits for the fractional part. That leads to the 23-bit frac field 10000001110010000000000. Notice we've padded with trailing zeros at the end, because we have to have some bit values there, and since we don't want to change the value, we use all zeros.

Alright, the exponent: remember we have to use that bias, so our exponent field is the value of E plus the bias. The bias, remember, was 127; our exponent E is 13, and when we add 127 we get 140. The bit pattern for 140, 10001100, is what goes in the 8-bit exponent field. So the result is this representation for our floating-point number 12345.0: 0 10001100 10000001110010000000000. Okay, not immediately obvious at all by looking at those bits, but you can see the process we go through: first the normalization, then taking the fractional part of the mantissa, and then adding the bias to the exponent, okay?
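If you want to check the worked example yourself, a short C program along these lines (again, the variable names are just illustrative) prints the three fields of 12345.0f and should show exactly the sign, exponent, and fraction derived above:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float f = 12345.0f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);

    /* Expect: sign 0, exp 140 (E = 13 plus bias 127), and the frac
       bits of 1.1000000111001 padded with trailing zeros. */
    printf("bits = 0x%08X\n", bits);            /* 0x4640E400 */
    printf("sign = %u\n", bits >> 31);          /* 0          */
    printf("exp  = %u\n", (bits >> 23) & 0xFF); /* 140        */
    printf("frac = 0x%06X\n", bits & 0x7FFFFF); /* 0x40E400   */
    return 0;
}
```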