Improve accuracy of rem with Normed types (e.g. ::Float32 % N0f32) #166
Subsequent commits: master...kimikage:ctor_fixed I want to modify some constructor-style conversion methods of |
I'm not certain I understand the strategy, so some comments would be useful. Does this essentially correspond to an affine rescaling? If so does it fix values near 1.0 at the expense of others? (The answers to these questions don't block merging, but let's document this a bit more.) |
No, this is based on the "pure" linear transformation, so this is effective for almost all values other than the exception mentioned above. First, the root of this problem is that while the width of
julia> rawone(Normed{UInt32, 24}) == Float32(0xFFFFFF)
true
julia> rawone(Normed{UInt32, 25}) == Float32(0x1FFFFFF)
false
Therefore, if
f <= 24 && return reinterpret(N, _unsafe_trunc(UInt32, round(rawone(N) * x)))
Well, the
x * rawone == x * 2^f - x
           == x * 2^(f + k - k) - x
           == x * 2^(f + k) * 2^-k - x * 2^(f + k) * 2^-(f + k)
where
x * rawone == x * 2^24 * 2^(f - 24) - x * 2^24 * 2^-24
           == r * 2^(f - 24) - r * 2^-24
where
r = _unsafe_trunc(UInt32, round(x * @f32(0x1p24)))
reinterpret(N, r << UInt8(f - 24) - unsigned(signed(r) >> 0x18))
Do I make sense? And what and how much should I comment? |
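The decomposition above can be tried out in isolation. The following is only a minimal sketch under stated assumptions, not the code from this PR: `raw_from_float32` is a hypothetical helper name, the input is restricted to 0 <= x <= 1, Base's `round(UInt32, ...)` stands in for the package-internal `_unsafe_trunc`, and a plain logical shift replaces the signed shift used in the snippet above. It computes the raw `UInt32` value for 24 < f <= 32 without ever forming `rawone` as a `Float32`.

```julia
# Hypothetical helper, not part of FixedPointNumbers.
# Computes the raw value x * (2^f - 1) as r * 2^(f - 24) - r * 2^-24,
# where r = round(x * 2^24) is exactly representable in Float32.
function raw_from_float32(x::Float32, f::Int)
    24 < f <= 32 || throw(ArgumentError("this sketch only covers 24 < f <= 32"))
    0 <= x <= 1 || throw(ArgumentError("this sketch only covers 0 <= x <= 1"))
    # Scaling by a power of two is exact, so the only rounding happens here.
    r = round(UInt32, x * Float32(2^24))
    # UInt32 arithmetic wraps around: for x == 1 and f == 32 the shift
    # overflows to 0 and the subtraction wraps to 0xffffffff.
    return (r << (f - 24)) - (r >> 24)
end

raw_from_float32(1.0f0, 32)  # 0xffffffff, i.e. 2^32 - 1
raw_from_float32(1.0f0, 25)  # 0x01ffffff, i.e. 2^25 - 1
raw_from_float32(0.5f0, 32)  # 0x80000000
```

For x = 1 the result equals 2^f - 1 in both cases, which is the value that cannot be reached by multiplying with a `Float32` approximation of `rawone` once f exceeds 24.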
BTW, we may want to rename |
That's a great explanation. I'd say just insert a comment linking to that post and call it good. |
Codecov Report
@@            Coverage Diff             @@
##           master     #166      +/-   ##
==========================================
+ Coverage   87.93%   88.12%   +0.19%
==========================================
  Files           5        5
  Lines         373      379       +6
==========================================
+ Hits          328      334       +6
  Misses         45       45
|
This fixes #150. The Float32 and Float64 versions can be unified, but the unified method may be difficult to read, so I implemented them as two separate methods.
As I mentioned in #150 (comment), there are still numerical errors.