Do we really want to return float32 in all cases/platforms? (Changing this could break existing usage of cfunits.)
I believe that every 4-byte integer can be stored in an 8-byte float (double) without losing precision, so it would be more accurate to use numpy.float64 in all cases. Furthermore, for integers with more than 4 bytes a float32 is not sufficient. What do you think?
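A quick check of this claim (a minimal standalone sketch using numpy directly, not cfunits; the example value is mine):

```python
import numpy as np

# 2**24 + 1 still fits in an int32, but needs 25 significant bits
x = np.int32(2**24 + 1)
print(int(np.float32(x)))   # 16777216 -> the last bit is lost in float32
print(int(np.float64(x)))   # 16777217 -> float64 stores any int32 exactly
```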
Recently I was talking with a friend about floating-point numbers, and suddenly this issue became clear to me.
If we assume IEEE 754, a 4-byte float (float32) is laid out as:
s: 1 bit for the sign
m_i: 23 bits for the mantissa
e_i: 8 bits for the exponent
With a few formulas we get the decimal number from these bits:
sign: S = (-1)^s
mantissa: M = 1 + \frac{\sum_{i=0}^{22} m_i \cdot 2^i}{2^{23}}
exponent: E = \sum_{i=0}^{7} e_i \cdot 2^i - (2^7 - 1) = \sum_{i=0}^{7} e_i \cdot 2^i - 127
decimal number: S \cdot M \cdot 2^E
And similarly for a double (float64):
sign: S = (-1)^s
mantissa: M = 1 + \frac{\sum_{i=0}^{51} m_i \cdot 2^i}{2^{52}}
exponent: E = \sum_{i=0}^{10} e_i \cdot 2^i - (2^{10} - 1) = \sum_{i=0}^{10} e_i \cdot 2^i - 1023
decimal number: S \cdot M \cdot 2^E
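As a concrete check of these formulas, here is a small standalone Python sketch (my own illustration using the struct module, not anything from cfunits) that extracts the sign, exponent, and mantissa bits of a float32 and reconstructs the value:

```python
import struct

# Minimal sketch: decompose a float into the IEEE 754 float32 fields described above
# (only handles normalized numbers; the helper name float32_fields is hypothetical).
def float32_fields(value):
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    s = bits >> 31                  # 1 sign bit
    e = (bits >> 23) & 0xFF         # 8 exponent bits, biased by 127
    m = bits & 0x7FFFFF             # 23 mantissa bits
    return s, e, m

s, e, m = float32_fields(0.15625)
print(s, e, m)                                      # 0 124 2097152
print((-1)**s * (1 + m / 2**23) * 2**(e - 127))     # 0.15625
```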
So, this means a float32 can only hold 24 significant bits of an integer (the 23 stored mantissa bits plus the implicit leading 1). Converting an int32 to float32 therefore leads to a relative error of up to \left| \frac{2^{24}+1 - \mathrm{float32}(2^{24}+1)}{2^{24}+1} \right| \approx 6 \cdot 10^{-8}.
Whereas using a float64 we could store the full precision of an int32.
But converting an int64 to float64 leads again to a relative error of up to \left| \frac{2^{53}+1 - \mathrm{float64}(2^{53}+1)}{2^{53}+1} \right| \approx 1 \cdot 10^{-16}.
So in both cases (int32 -> float32, int64 -> float64) we end up with a relative error on the order of the machine epsilon.
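A quick numerical check of these error bounds (again just a standalone numpy sketch with illustrative worst-case values, not cfunits code):

```python
import numpy as np

x = 2**24 + 1                                   # worst case near the float32 precision limit
print(abs(x - int(np.float32(x))) / x)          # ~5.96e-08
print(np.finfo(np.float32).eps)                 # 1.1920929e-07

y = 2**53 + 1                                   # worst case near the float64 precision limit
print(abs(y - int(np.float64(y))) / y)          # ~1.11e-16
print(np.finfo(np.float64).eps)                 # 2.220446049250313e-16
```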
There are some tools where you can play around with these representations (and there are probably more and better ones):