Generational behavior for the garbage collector #8699
Conversation
Looking forward to this!
```c
heaps_lb[heap_i] = i;
if (heaps_ub[heap_i] < i)
    heaps_ub[heap_i] = i;
int j = (ffs(heap->freemap[i]) - 1);
```
Where is `ffs` defined? I'm getting an implicit declaration compiler warning on Windows.
`ffs` stands for find first set (closely related to count trailing zeros). I thought this was provided by the libc everywhere. We may need a few defines in some headers if the name of the function is not the same (or provide it ourselves).
It looks like we might be able to use `__builtin_ffs` for MinGW (see https://bugs.freedesktop.org/show_bug.cgi?id=30277), but that won't help for MSVC. We could maybe grab musl's implementation (http://git.musl-libc.org/cgit/musl/tree/src/misc/ffs.c); I'm still trying to find atomic.h and the definition of `a_ctz_l`, though.

Edit: ok, found atomic.h; it looks like musl is using arch-dependent assembly, so never mind. Maybe something with `__lzcnt` for MSVC would work: http://msdn.microsoft.com/en-us/library/bb384809(v=vs.120).aspx
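For what it's worth, `_BitScanForward` may be a closer fit than `__lzcnt` here, since it scans from the least significant bit. A minimal sketch of a shim (not this PR's code; `jl_ffs` is a hypothetical name):

```c
/* Hypothetical ffs() shim, not the PR's code. MSVC has no ffs(), but
 * _BitScanForward() from <intrin.h> writes the index of the lowest set
 * bit, so ffs(x) == index + 1, with ffs(0) == 0. */
#ifdef _MSC_VER
#include <intrin.h>
static int jl_ffs(unsigned long x)
{
    unsigned long index;
    return _BitScanForward(&index, x) ? (int)index + 1 : 0;
}
#else
#define jl_ffs(x) __builtin_ffs(x)
#endif
```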
There's an implementation of ntz (aka ctz) in libsupport (line 84 in 6ad8410):

```c
static int ntz(uint32_t x)
```

Wikipedia gives a list of transformations for easily converting from one formula to another in various ways:
http://en.wikipedia.org/wiki/Find_first_set#Properties_and_relations
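Given that routine, the relevant transformation from that list is a one-liner. A sketch, assuming libsupport's `ntz` is in scope here:

```c
#include <stdint.h>

/* Sketch only: derive ffs() from libsupport's ntz() (count trailing
 * zeros) via the identity ffs(x) == ntz(x) + 1 for x != 0, with
 * ffs(0) defined as 0. */
static int ffs_from_ntz(uint32_t x)
{
    return x == 0 ? 0 : ntz(x) + 1;
}
```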
Great, yeah, let's just not leave in anything that assumes POSIX.
Well, I completely forgot about this. I just added something which looks right, but I don't have MinGW/MSVC so I didn't even compile the code...

(force-pushed from 2f9a79d to cc8d6f6)
In #9270 @JeffBezanson proposed the following example as a benchmark for this PR: the LU factorization of a matrix of BigFloats. On master the numbers are

```julia
julia> @time lufact!(A);
elapsed time: 0.403522395 seconds (67595000 bytes allocated)

julia> @time lufact!(A);
elapsed time: 0.707745403 seconds (64033072 bytes allocated, 67.91% gc time)

julia> @time lufact!(A);
elapsed time: 0.40898759 seconds (64033072 bytes allocated, 46.39% gc time)

julia> @time lufact!(A);
elapsed time: 0.611059059 seconds (64033072 bytes allocated, 63.41% gc time)

julia> @time lufact!(A);
elapsed time: 0.396888493 seconds (64033072 bytes allocated, 45.54% gc time)

julia> @time lufact!(A);
elapsed time: 0.430425332 seconds (64033072 bytes allocated, 47.28% gc time)

julia> @time lufact!(A);
elapsed time: 0.60929912 seconds (64033072 bytes allocated, 63.39% gc time)

julia> @time lufact!(A);
elapsed time: 0.412151807 seconds (64033072 bytes allocated, 44.69% gc time)

julia> @time lufact!(A);
elapsed time: 0.632113227 seconds (64033072 bytes allocated, 63.05% gc time)

julia> @time lufact!(A);
elapsed time: 0.41026657 seconds (64033072 bytes allocated, 46.10% gc time)
```

and with this branch the numbers are

```julia
julia> A = big(randn(100, 100));

julia> @time lufact!(A);
elapsed time: 0.578027366 seconds (64 MB allocated, 43.96% gc time in 2 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.367875097 seconds (61 MB allocated, 37.41% gc time in 1 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.412482129 seconds (61 MB allocated, 41.65% gc time in 1 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.589456554 seconds (61 MB allocated, 60.84% gc time in 2 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.411744976 seconds (61 MB allocated, 42.76% gc time in 1 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.560099494 seconds (61 MB allocated, 58.66% gc time in 2 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.419516315 seconds (61 MB allocated, 44.61% gc time in 1 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.405336 seconds (61 MB allocated, 42.50% gc time in 1 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.573393226 seconds (61 MB allocated, 59.15% gc time in 2 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.412730634 seconds (61 MB allocated, 41.91% gc time in 1 pauses with 0 full sweep)
```

The numbers look very similar. Another thing: from the discussion in another thread some time ago, I got the impression that decimal MBs were the standard, but it appears that this branch uses binary MBs.
@andreasnoack The mean and variance of those timings are interesting (from the numbers you have supplied):

master:
generational:

Of course, due to the sample size this is not conclusive, but the tendency seems to be slightly faster and more consistent performance. If the consistent performance holds with a larger sample size, it would be a very nice thing for real-time audio and video software.
Hey, thanks for testing this branch; it badly needs it. I'm pretty sure there are performance and correctness regressions lurking.
A quick run through perf seems to show that finalizer management overhead (mostly the hashtable lookup for registration) is actually not negligible; however, most of the time is still in mpfr_clear and malloc/free.
To me it seems that they must be doing something different: I can't find it right now, but I swear I remember a recent conversation on one of our mailing lists in which Python handily beat Julia for BigInt computations, by a very large factor.
We haven't really tried to optimize BigInts yet. The first step is to be
@carnaval Thanks for the feedback, and sorry for going off topic in your PR. Just to get a sense of the price we are paying for reallocation of BigInts, I compared immutable and mutable arithmetic; the code is in this gist. Note that we would probably be able to get (at least a good part of) the speedup without changing the code if we allowed

```julia
julia> @time GMP2.pidigits(10000); # Immutable arithmetic
elapsed time: 4.470367684 seconds (8498375224 bytes allocated, 64.35% gc time)

julia> @time GMP2.pidigits2(10000); # Mutable arithmetic
elapsed time: 0.913993363 seconds (262817612 bytes allocated, 13.60% gc time)
```

In contrast, the GHC numbers on my machine are

but this is not with the LLVM backend. Notice that the allocation is similar to the non-mutating Julia version.
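For readers wondering why the mutating version allocates roughly 30x less: GMP's C API writes results into caller-supplied operands, so a hot loop can keep reusing one result buffer instead of allocating a fresh bignum per operation. A hedged sketch at the C level (not the gist's code; `sum_in_place` is illustrative):

```c
#include <stddef.h>
#include <gmp.h>

/* Illustrative only, not the gist's code: mpz_add stores its result in a
 * caller-owned mpz_t, so the accumulator's limb buffer is reused across
 * iterations rather than a new BigInt being allocated each time. */
void sum_in_place(mpz_t acc, const mpz_t *xs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        mpz_add(acc, acc, xs[i]);
}
```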
While using explicit mutating arithmetic is surely unbeatable performance-wise, I think that using it for
Ha, @timholy is right: we could have bignums register as dead in their finalizers by putting themselves in a recycling pool. I believe this can be done purely in Julia. There is still the problem of growing/shrinking this pool, but if it is the right size it would remove a lot of malloc/free calls.
Yes. On my MacBook the Haskell version takes 3s. It is quite a bit slower than expected from the blog post.
You might be right, but could you give an example? While modifying the benchmark code I almost convinced myself that it wouldn't be a problem.
Mutable numbers are semantically unacceptable and simply aren't going to happen. We clearly have a significant list of performance improvements to try here.
Another example: `A += A+B`. Any expression where the input and output are allowed to alias could be a problem. I suspect that even
The issue is that you want to be able to use the same code for Int and BigInt, and mutable BigInts would break that. If

At this point Julia is in the shockingly rare position that generic code really is generic – you can write something generically and actually expect it to work for all kinds of types. Making
@andreasnoack I've pushed a change for finalizers. With it and the few changes that I hadn't pushed yet, the cholfact! bench looks around 2x faster with 2x less memory usage compared to master. About the same for the (non-mutating) pidigits(10000). Can you confirm?

By the way, this should break finalizer ordering; do we make any guarantee about it?

@carnaval, my guess is the finalizer is too late: where this would make the biggest difference is in a loop inside a function, and of course the finalizer won't run frequently inside the loop. I fear this may need static analysis.
Thanks for the examples. I can see the problem. I was thinking that it wouldn't break genericness to update
It appears that if allowing

@JeffBezanson I'm only trying to understand the reasoning and make the costs visible here. I appreciate the explanations, and you are probably completely right, but "semantically unacceptable" carries almost no meaning in my CS-untrained head. In contrast, examples are really good for my understanding.

@StefanKarpinski I don't want to break genericness here, not even for performance purposes. I want to understand why the "proposal" would break genericness. As explained above, the mutating behavior was only meant for the

@vtjnash I don't see how
Awesome. Thanks for this contribution.

Hooray! Quite a long saga, but I am really looking forward to this.

The angry mob is hopefully all on release-0.3.
```diff
@@ -217,7 +219,9 @@ static jl_value_t *eval(jl_value_t *e, jl_value_t **locals, size_t nl)
     size_t i;
     for (i=0; i < nl; i++) {
         if (locals[i*2] == sym) {
-            return (locals[i*2+1] = eval(args[1], locals, nl));
+            locals[i*2+1] = eval(args[1], locals, nl);
+            gc_wb(jl_current_module, locals[i*2+1]); // not sure about jl_current_module
```
I'm not sure either. This is a stack variable slot; `locals` is a `JL_GC_PUSHARGS` alloca'd location.
Since `locals` is a stack location, this `gc_wb` seems unnecessary? (Or perhaps it should target `jl_current_task`?)
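For context on what `gc_wb` enforces: a generational collector's minor pass scans only young objects plus roots, so any old object that is mutated to point at a young object must be remembered and re-scanned. A minimal sketch of that invariant, with hypothetical names rather than the PR's actual data structures (stack slots like `locals` are always scanned as roots, which is why a barrier there would be redundant):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of the write-barrier invariant, not the PR's code. */
typedef struct { uintptr_t gc_bits; } obj_t;
#define GC_OLD 0x1

static obj_t *remset[1024];   /* remembered set of old objects to re-scan */
static size_t remset_len;

static void gc_wb_sketch(obj_t *parent, obj_t *child)
{
    /* Record an old->young edge so the next minor collection re-scans
     * the parent; clear the bit so the parent isn't queued twice. */
    if ((parent->gc_bits & GC_OLD) && !(child->gc_bits & GC_OLD)) {
        parent->gc_bits &= ~GC_OLD;
        remset[remset_len++] = parent;
    }
}
```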
@carnaval I'm seeing some major performance improvements with the new GC. The performance test numbers in our quantum system simulator (https://github.com/BBN-Q/QSimulator.jl) all dropped by nearly a factor of 2. Nice work!

While I'm still adjusting to this change, it is already subtly changing my Julia programming style: there's a whole "mid-layer" of problems where I'm finding that I'm noticeably less worried about allocating memory than I used to be. I'd call that a pretty big impact.

I'd love to make a more comprehensive performance test bed that has a few algorithms implemented in a few different styles, some much more garbage-generating than others. The idea is to test how performance varies for code that isn't written optimally (like the kinds of code you see on StackOverflow).

+1 to @IainNZ's idea

There are improvements in perf benchmarks of vectorized code. While expected, it is nice to actually see it realized. The stockcorr benchmark is now equally fast in both the vectorized and devectorized cases.
```diff
@@ -384,11 +391,13 @@ static jl_value_t *eval(jl_value_t *e, jl_value_t **locals, size_t nl)
     // temporarily assign so binding is available for field types
     check_can_assign_type(b);
     b->value = (jl_value_t*)dt;
+    gc_wb_binding(b,dt);
```
Why do you use `gc_wb_binding(((void**)b)-1, dt);` in the other two usages of gc_wb_binding, but not here? @carnaval
My mistake. Thanks!
```diff
-            rval = boxed(emit_expr(r, ctx, true),ctx,rt);
+            rval = boxed(emit_expr(r, ctx, true), ctx, rt);
+            if (!is_stack(bp)) {
+                Value* box = builder.CreateGEP(bp, ConstantInt::get(T_size, -1));
```
I think this should be calling `gc_queue_binding` on `-1-offsetof(jl_binding_t,value)/sizeof(jl_value_t*)`, no?

Edit: eep, this could be a `jl_binding_t`, or a closure location (`Box`).
It's probably time for me to understand a bit more about the different ways we store variables.

The case of a closure location is when a captured variable is assigned in the child scope. At this point, if I understand correctly, we store every such variable in a separate box? For those cases the code here seems correct?

I don't see how a `jl_binding_t` could get here, since I thought they were only used for globals, which are handled by the runtime in jl_checked_assign. The rest is stored in the local GC frame on the stack and doesn't need a write barrier.

Is there a case I'm not considering? Thanks for taking the time to go through this, since the codegen can be a bit opaque to me at times :-)
Oops, no, you are right. I forgot there is a branch in emit_assignment on whether this is a `jl_binding_t` or a `Box`.

/me goes looking elsewhere for the source of his bug
This is really odd: bisect points to this merge as breaking the Linux-to-Windows cross-compile. During bootstrap of the system image, we get a segfault while compiling inference. This could easily be a bug in Wine. https://gist.github.com/c361c8157820e4e8734c

Heh, I had just come to the exact same conclusion a few minutes ago.
This really deserves an entry in NEWS. A blog post with some performance examples would be neat too; it could be written by anyone who sees a big boost from this in an interesting use case. (@ssfrr?)

I think that the perf benchmark on European option pricing saw a major increase, and it is a good one to talk about. I have generally seen many vectorized codes speed up. I was just sitting with someone working on a wave equation solver that was largely vectorized code; it was slower than Octave on 0.3, but became faster than Octave with 0.4.
On a slightly unrelated note, it would be great to rebase the threading branch. Before the generational GC, the runtime was largely thread-safe, and we probably want to merge it into master (disabled by default) so that it is easier to maintain and can receive more contributions. Cc: @kpamnany @ArchRobison, who have worked on the threading branch.
```c
#endif

#ifdef __cplusplus
extern "C" {
#endif

#pragma pack(push, 1)
typedef struct _gcpage_t {
    char data[GC_PAGE_SZ];
```
Why do you forcibly un-align all of these data structures (including overriding the attempted `__attribute__((aligned (64)))` below)? Although aligning to 64 bytes was probably a bit overkill too.
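One way to check what a given compiler actually does with that combination is a tiny probe like the one below (GCC/Clang attribute syntax; a toy struct, not the real gcpage_t):

```c
#include <stdio.h>

/* Toy probe, not the PR's struct: wrap a 64-byte alignment request in a
 * 1-byte pack pragma and print what the compiler decided. On GCC/Clang,
 * pack(1) packs the members, while aligned(64) still raises the
 * alignment (and hence the padded size) of the struct type itself. */
#pragma pack(push, 1)
typedef struct {
    char c;
    long l;
} __attribute__((aligned(64))) probe_t;
#pragma pack(pop)

int main(void)
{
    printf("sizeof=%zu alignof=%zu\n", sizeof(probe_t), _Alignof(probe_t));
    return 0;
}
```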
Same as #5227 but living in this repo & squashed.