Generational behavior for the garbage collector #8699
Conversation
Looking forward to this!
```c
heaps_lb[heap_i] = i;
if (heaps_ub[heap_i] < i)
    heaps_ub[heap_i] = i;
int j = (ffs(heap->freemap[i]) - 1);
```
Where is `ffs` defined? I'm getting an implicit declaration compiler warning on Windows.
`ffs` stands for find first set (closely related to count trailing zeros). I thought this was provided by the libc everywhere. We may need a few defines in some headers if the name of the function is not the same (or provide it ourselves).
It looks like we might be able to use `__builtin_ffs` for MinGW (see https://bugs.freedesktop.org/show_bug.cgi?id=30277), but that won't help for MSVC. We could maybe grab musl's implementation (http://git.musl-libc.org/cgit/musl/tree/src/misc/ffs.c); I'm still trying to find atomic.h and the definition of `a_ctz_l`, though.

Edit: ok, found atomic.h; it looks like musl is using arch-dependent assembly, so never mind. Maybe something with `__lzcnt` for MSVC would work: http://msdn.microsoft.com/en-us/library/bb384809(v=vs.120).aspx
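For what it's worth, `_BitScanForward` may be a closer fit than `__lzcnt` here, since it scans from the least significant bit. A minimal sketch of a shim (not this PR's code; `jl_ffs` is a hypothetical name):

```c
/* Hypothetical ffs() shim, not the PR's code. MSVC has no ffs(), but
 * _BitScanForward() from <intrin.h> writes the index of the lowest set
 * bit, so ffs(x) == index + 1, with ffs(0) == 0. */
#ifdef _MSC_VER
#include <intrin.h>
static int jl_ffs(unsigned long x)
{
    unsigned long index;
    return _BitScanForward(&index, x) ? (int)index + 1 : 0;
}
#else
#define jl_ffs(x) __builtin_ffs(x)
#endif
```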
There's an implementation of ntz (aka ctz) in libsupport (line 84 in 6ad8410):

```c
static int ntz(uint32_t x)
```

Wikipedia gives a list of transformations for easily converting from one formula to another in various ways:
http://en.wikipedia.org/wiki/Find_first_set#Properties_and_relations
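Given that routine, the relevant transformation from that list is a one-liner. A sketch, assuming libsupport's `ntz` is in scope here:

```c
#include <stdint.h>

/* Sketch only: derive ffs() from libsupport's ntz() (count trailing
 * zeros) via the identity ffs(x) == ntz(x) + 1 for x != 0, with
 * ffs(0) defined as 0. */
static int ffs_from_ntz(uint32_t x)
{
    return x == 0 ? 0 : ntz(x) + 1;
}
```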
Great, yeah, let's just not leave in anything that assumes POSIX.
Well, I completely forgot about this. I just added something which looks right, but I don't have MinGW/MSVC so I didn't even compile the code...

(force-pushed from 2f9a79d to cc8d6f6)
In #9270 @JeffBezanson proposed the following example as a benchmark for this PR: the LU factorization of a matrix of BigFloats. On master the numbers are

```julia
julia> @time lufact!(A);
elapsed time: 0.403522395 seconds (67595000 bytes allocated)

julia> @time lufact!(A);
elapsed time: 0.707745403 seconds (64033072 bytes allocated, 67.91% gc time)

julia> @time lufact!(A);
elapsed time: 0.40898759 seconds (64033072 bytes allocated, 46.39% gc time)

julia> @time lufact!(A);
elapsed time: 0.611059059 seconds (64033072 bytes allocated, 63.41% gc time)

julia> @time lufact!(A);
elapsed time: 0.396888493 seconds (64033072 bytes allocated, 45.54% gc time)

julia> @time lufact!(A);
elapsed time: 0.430425332 seconds (64033072 bytes allocated, 47.28% gc time)

julia> @time lufact!(A);
elapsed time: 0.60929912 seconds (64033072 bytes allocated, 63.39% gc time)

julia> @time lufact!(A);
elapsed time: 0.412151807 seconds (64033072 bytes allocated, 44.69% gc time)

julia> @time lufact!(A);
elapsed time: 0.632113227 seconds (64033072 bytes allocated, 63.05% gc time)

julia> @time lufact!(A);
elapsed time: 0.41026657 seconds (64033072 bytes allocated, 46.10% gc time)
```

and with this branch the numbers are

```julia
julia> A = big(randn(100, 100));

julia> @time lufact!(A);
elapsed time: 0.578027366 seconds (64 MB allocated, 43.96% gc time in 2 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.367875097 seconds (61 MB allocated, 37.41% gc time in 1 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.412482129 seconds (61 MB allocated, 41.65% gc time in 1 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.589456554 seconds (61 MB allocated, 60.84% gc time in 2 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.411744976 seconds (61 MB allocated, 42.76% gc time in 1 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.560099494 seconds (61 MB allocated, 58.66% gc time in 2 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.419516315 seconds (61 MB allocated, 44.61% gc time in 1 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.405336 seconds (61 MB allocated, 42.50% gc time in 1 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.573393226 seconds (61 MB allocated, 59.15% gc time in 2 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.412730634 seconds (61 MB allocated, 41.91% gc time in 1 pauses with 0 full sweep)
```

The numbers look very similar. Another thing: from the discussion in another thread some time ago, I got the impression that decimal MBs were the standard, but it appears that this branch uses binary MBs.
@andreasnoack The mean and variance of those timings are interesting (from the numbers you have supplied):

master:
generational:

Of course, due to the sample size this is not conclusive, but the tendency seems to be slightly faster and more consistent performance. If the consistent performance holds with a larger sample size, it would be a very nice thing for real-time audio and video software.
Hey, thanks for testing this branch; it badly needs it. I'm pretty sure there are performance and correctness regressions lurking.
A quick run through perf seems to show that finalizer management overhead (mostly the hashtable lookup for registration) is actually not negligible; however, most of the time is still in mpfr_clear and malloc/free.
To me it seems that they must be doing something different: I can't find it right now, but I swear I remember a recent conversation on one of our mailing lists in which Python handily beat Julia for BigInt computations, by a very large factor.
We haven't really tried to optimize BigInts yet. The first step is to be
@carnaval Thanks for the feedback, and sorry for going off topic in your PR. Just to get a sense of the price we are paying for reallocation of BigInts, I compared immutable and mutable arithmetic; the code is in this gist. Note that we would probably be able to get (at least a good part of) the speedup without changing the code if we allowed

```julia
julia> @time GMP2.pidigits(10000); # Immutable arithmetic
elapsed time: 4.470367684 seconds (8498375224 bytes allocated, 64.35% gc time)

julia> @time GMP2.pidigits2(10000); # Mutable arithmetic
elapsed time: 0.913993363 seconds (262817612 bytes allocated, 13.60% gc time)
```

In contrast, the GHC numbers on my machine are

but this is not with the LLVM backend. Notice that the allocation is similar to the non-mutating Julia version.
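For readers wondering why the mutating version allocates roughly 30x less: GMP's C API writes results into caller-supplied operands, so a hot loop can keep reusing one result buffer instead of allocating a fresh bignum per operation. A hedged sketch at the C level (not the gist's code; `sum_in_place` is illustrative):

```c
#include <stddef.h>
#include <gmp.h>

/* Illustrative only, not the gist's code: mpz_add stores its result in a
 * caller-owned mpz_t, so the accumulator's limb buffer is reused across
 * iterations rather than a new BigInt being allocated each time. */
void sum_in_place(mpz_t acc, const mpz_t *xs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        mpz_add(acc, acc, xs[i]);
}
```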
While using explicit mutating arithmetic is surely unbeatable performance-wise, I think that using it for
Ha, @timholy is right: we could have bignums register as dead in their finalizers by putting themselves in a recycling pool. I believe this can be done purely in Julia. There is still the problem of growing/shrinking this pool, but if it is the right size it would remove a lot of malloc/free calls.
Yes. On my MacBook the Haskell version takes 3s. It is quite a bit slower than expected from the blog post.
You might be right, but could you give an example? While modifying the benchmark code I almost convinced myself that it wouldn't be a problem.
Mutable numbers are semantically unacceptable and simply aren't going to happen. We clearly have a significant list of performance improvements to try here.
Another example: `A += A+B`. Any expression where the input and output are allowed to alias could be a problem. I suspect that even
The issue is that you want to be able to use the same code for Int and BigInt, and mutable BigInts would break that. If

At this point Julia is in the shockingly rare position that generic code really is generic – you can write something generically and actually expect it to work for all kinds of types. Making
@andreasnoack I've pushed a change for finalizers. With it and the few changes that I hadn't pushed yet, the cholfact! bench looks around 2x faster with 2x less memory usage compared to master. About the same for the (non-mutating) pidigits(10000). Can you confirm?

By the way, this should break finalizer ordering; do we make any guarantee about it?

@carnaval, my guess is the finalizer is too late: where this would make the biggest difference is in a loop inside a function, and of course the finalizer won't run frequently inside the loop. I fear this may need static analysis.
Thanks for the examples. I can see the problem. I was thinking that it wouldn't break genericness to update
It appears that if allowing

@JeffBezanson I'm only trying to understand the reasoning and make the costs visible here. I appreciate the explanations, and you are probably completely right, but "semantically unacceptable" carries almost no meaning in my CS-untrained head. In contrast, examples are really good for my understanding.

@StefanKarpinski I don't want to break genericness here, not even for performance purposes. I want to understand why the "proposal" would break genericness. As explained above, the mutating behavior was only meant for the

@vtjnash I don't see how
Awesome. Thanks for this contribution.

Hooray! Quite a long saga, but I am really looking forward to this.

The angry mob is hopefully all on release-0.3.
```diff
@@ -217,7 +219,9 @@ static jl_value_t *eval(jl_value_t *e, jl_value_t **locals, size_t nl)
     size_t i;
     for (i=0; i < nl; i++) {
         if (locals[i*2] == sym) {
-            return (locals[i*2+1] = eval(args[1], locals, nl));
+            locals[i*2+1] = eval(args[1], locals, nl);
+            gc_wb(jl_current_module, locals[i*2+1]); // not sure about jl_current_module
```
I'm not sure either. This is a stack variable slot; `locals` is a `JL_GC_PUSHARGS` alloca'd location.
Since `locals` is a stack location, this `gc_wb` seems unnecessary? (Or perhaps it should target `jl_current_task`?)
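For context on what `gc_wb` enforces: a generational collector's minor pass scans only young objects plus roots, so any old object that is mutated to point at a young object must be remembered and re-scanned. A minimal sketch of that invariant, with hypothetical names rather than the PR's actual data structures (stack slots like `locals` are always scanned as roots, which is why a barrier there would be redundant):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of the write-barrier invariant, not the PR's code. */
typedef struct { uintptr_t gc_bits; } obj_t;
#define GC_OLD 0x1

static obj_t *remset[1024];   /* remembered set of old objects to re-scan */
static size_t remset_len;

static void gc_wb_sketch(obj_t *parent, obj_t *child)
{
    /* Record an old->young edge so the next minor collection re-scans
     * the parent; clear the bit so the parent isn't queued twice. */
    if ((parent->gc_bits & GC_OLD) && !(child->gc_bits & GC_OLD)) {
        parent->gc_bits &= ~GC_OLD;
        remset[remset_len++] = parent;
    }
}
```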
@carnaval I'm seeing some major performance improvements with the new GC. The performance test numbers in our quantum system simulator (https://github.com/BBN-Q/QSimulator.jl) all dropped by nearly a factor of 2. Nice work!

While I'm still adjusting to this change, it is already subtly changing my Julia programming style: there's a whole "mid-layer" of problems where I'm finding that I'm noticeably less worried about allocating memory than I used to be. I'd call that a pretty big impact.

I'd love to make a more comprehensive performance test bed that has a few algorithms implemented in a few different styles, some much more garbage-generating than others. The idea is to test how performance varies for code that isn't written optimally (like the kinds of code you see on StackOverflow).

+1 to @IainNZ's idea

There are improvements in perf benchmarks of vectorized code. While expected, it is nice to actually see it realized. The stockcorr benchmark is now equally fast in both the vectorized and devectorized cases.
```diff
@@ -384,11 +391,13 @@ static jl_value_t *eval(jl_value_t *e, jl_value_t **locals, size_t nl)
     // temporarily assign so binding is available for field types
     check_can_assign_type(b);
     b->value = (jl_value_t*)dt;
+    gc_wb_binding(b,dt);
```
Why do you use `gc_wb_binding(((void**)b)-1, dt);` in the other two usages of gc_wb_binding, but not here? @carnaval
My mistake. Thanks!
```diff
-            rval = boxed(emit_expr(r, ctx, true),ctx,rt);
+            rval = boxed(emit_expr(r, ctx, true), ctx, rt);
+            if (!is_stack(bp)) {
+                Value* box = builder.CreateGEP(bp, ConstantInt::get(T_size, -1));
```
I think this should be calling `gc_queue_binding` on `-1-offsetof(jl_binding_t,value)/sizeof(jl_value_t*)`, no?

Edit: eep, this could be a `jl_binding_t`, or a closure location (`Box`).
It's probably time for me to understand a bit more about the different ways we store variables.

The case of a closure location is when a captured variable is assigned in the child scope. At this point, if I understand correctly, we store every such variable in a separate box? For those cases the code here seems correct?

I don't see how a `jl_binding_t` could get here, since I thought they were only used for globals, which are handled by the runtime in jl_checked_assign. The rest is stored in the local GC frame on the stack and doesn't need a write barrier.

Is there a case I'm not considering? Thanks for taking the time to go through this, since the codegen can be a bit opaque to me at times :-)
Oops, no, you are right. I forgot there is a branch in emit_assignment on whether this is a `jl_binding_t` or a `Box`.

/me goes looking elsewhere for the source of his bug
This is really odd: bisect points to this merge as breaking the Linux-to-Windows cross-compile. During bootstrap of the system image, we get a segfault while compiling inference. This could easily be a bug in Wine. https://gist.github.com/c361c8157820e4e8734c

Heh, I had just come to the exact same conclusion a few minutes ago.
This really deserves an entry in NEWS. A blog post with some performance examples would be neat too; it could be written by anyone who sees a big boost from this in an interesting use case. (@ssfrr?)

I think that the perf benchmark on European option pricing saw a major increase, and it is a good one to talk about. I have generally seen many vectorized codes speed up. I was just sitting with someone working on a wave equation solver that was largely vectorized code; it was slower than Octave on 0.3, but became faster than Octave with 0.4.
On a slightly unrelated note, it would be great to rebase the threading branch. Before the generational GC, the runtime was largely thread-safe, and we probably want to merge it into master (disabled by default) so that it is easier to maintain and can receive more contributions. Cc: @kpamnany @ArchRobison, who have worked on the threading branch.
```c
#endif

#ifdef __cplusplus
extern "C" {
#endif

#pragma pack(push, 1)
typedef struct _gcpage_t {
    char data[GC_PAGE_SZ];
```
Why do you forcibly un-align all of these data structures (including overriding the attempted `__attribute__((aligned (64)))` below)? Although aligning to 64 bytes was probably a bit overkill too.
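One way to check what a given compiler actually does with that combination is a tiny probe like the one below (GCC/Clang attribute syntax; a toy struct, not the real gcpage_t):

```c
#include <stdio.h>

/* Toy probe, not the PR's struct: wrap a 64-byte alignment request in a
 * 1-byte pack pragma and print what the compiler decided. On GCC/Clang,
 * pack(1) packs the members, while aligned(64) still raises the
 * alignment (and hence the padded size) of the struct type itself. */
#pragma pack(push, 1)
typedef struct {
    char c;
    long l;
} __attribute__((aligned(64))) probe_t;
#pragma pack(pop)

int main(void)
{
    printf("sizeof=%zu alignof=%zu\n", sizeof(probe_t), _Alignof(probe_t));
    return 0;
}
```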
Same as #5227 but living in this repo & squashed.