Increase default stack size limit on 64-bit systems #55185
Conversation
This comes with a high risk of blowing up the page tables, which blows up kernel memory usage on Linux and causes OOM kills with half the number of tasks. We could reduce this default, but increasing it seems like a bad idea, as applications that benefit from it are likely to benefit more substantially from a rewrite anyway.
Edit: A stack size limit of 8 MB has no effect on OOM errors; see experiments in #55185 (comment). This is virtual memory rather than physical memory, so it would only cause these issues if someone were to actually use a larger stack. The limit itself just determines what is allowed/disallowed for the user.

By the way, I very much agree about rewriting code to avoid deep stacks. However, even Julia inference itself can get very deep – as seen in EnzymeAD/Enzyme.jl#1156 (comment) – which can cause AD tools like Enzyme to run into stack overflows for normal code. (Hence why it would be nice to raise the limit.)

I guess the main quality-of-life improvement of an 8 MB stack for tasks is that the root task in Julia already has an 8 MB stack. This inconsistency means that sometimes a bug you run into in multithreaded code can't be reproduced by the serial version. (And depending on the package, sometimes the debug info doesn't make it clear that a stack overflow was hit.) I think this is actually why the bug I had in that Enzyme issue was hard to reproduce, as the stack size required for compilation was about 6 MB. So if the compilation was performed from the root task, it was fine, but if the compilation was performed from a secondary task, it overflowed.

Maybe the best thing to do in the future would be to have the default task stack size limit match the root task's.
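For reference, a minimal sketch of a per-task workaround, assuming the two-argument `Task` constructor used later in this thread (whose second argument requests a stack size in bytes); `deep_work` is a hypothetical stand-in for a deeply recursive function:

```julia
# Sketch, assuming Task(f, stacksize::Int) reserves the requested stack size
# for that one task (as used elsewhere in this thread).
deep_work() = nothing  # hypothetical placeholder for a deeply recursive call

t = Task(deep_work, 8 * 1024 * 1024)  # ask for an 8 MB stack for this task
t.sticky = false                      # allow it to migrate between threads
schedule(t)
fetch(t)                              # rethrows any error raised in the task
```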
Julia inference is indeed one of those things that has needed to be rewritten for several reasons, though it is a bit of a slog to fix that.
Got it. In the meantime it would be nice to at least have #55184, so it's clear there are officially supported ways to work around such issues. I do think it would be nice to have an 8 MB stack in tasks to make limits consistent with the root task. Since it's virtual memory, there wouldn't be inherent performance changes, right? Or does libuv do anything different with virtual address space? Otherwise, I guess it comes down to which is the worse footgun:

1. A 2x mismatch in stack size limits between the root task and secondary tasks, so code stack-overflows only when it happens to run on a secondary task.
2. A larger stack size limit per task, which could in principle make OOM errors occur with fewer tasks.
Both are annoying, but I think (1) is worse because it depends on whether a secondary thread or the root thread reaches a function first – since they have 2x different stack size limits. And having stack overflows in a secondary thread can result in confusing debugging info. (2) seems less of an issue (though still problematic), because if you are hitting OOM errors with 8 MB stacks, you should already notice high memory consumption with a 4 MB stack. (1) is also easier to run into – you only need a single deep stack, called from a single task. But (2) requires an additional condition – you also have to be spawning a lot of tasks, all of which go deep. With an 8 MB stack size limit, we would basically swap (1) for (2). What do you think?
Also: OOM errors are loud, and show up in the root task. But if a secondary task is the only one to experience a stack overflow, and the user doesn't explicitly `fetch` or `wait` on that task, the failure can go unnoticed:

```julia
julia> function test_recursion_depth(maxdepth, depth=0, args...)
depth >= maxdepth && return nothing
print("\33[2K\rHello from depth $(depth)")
test_recursion_depth(maxdepth, depth + 1, args...)
print(devnull, "$(depth) $(args)") # Just to prevent LLVM removing args
end;
julia> test_recursion_depth(60_000) # Works fine since root task is 8 MB
Hello from depth 59999
julia> t = Threads.@spawn test_recursion_depth(60_000) # Crashes since thread is 4 MB
Task (runnable) @0x0000000280b0c1a0
Hello from depth 43325
julia>
```

Which means you might have non-root threads crash and not realise it, apart from reduced performance.
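As a side note (a sketch, not part of the original demo): explicitly waiting on the spawned task does surface the failure instead of letting it pass silently – roughly:

```julia
julia> t = Threads.@spawn test_recursion_depth(60_000);

julia> wait(t)   # waiting on (or fetching) a failed task rethrows its error
ERROR: TaskFailedException
    nested task error: StackOverflowError:
```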
That is also usually some sort of implementation bug, either with failing to call …

This proposal does nothing to actually guarantee task space, as that may already be consumed by any arbitrary amount of other code. So if the code is using recursion badly, then that needs to be fixed in the user code, as no amount of stack space will ever be sufficient to correct for it.
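To illustrate the "fix it in the user code" point, here is a sketch (not from this thread) of rewriting the earlier recursive demo as a loop, so the call depth stays constant no matter how large the input is:

```julia
# Sketch: the recursive test_recursion_depth above is tail-recursive, so it
# can be rewritten as a plain loop that uses O(1) stack regardless of maxdepth.
function test_recursion_depth_iterative(maxdepth)
    for depth in 0:maxdepth-1
        print("\33[2K\rHello from depth $(depth)")
    end
    return nothing
end
```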
Of course – a footgun still requires the programmer to pull the trigger. Having smaller footguns is still a good thing, though! Two questions:
Even 8 MB is comparatively small next to other languages when you consider the larger stack frame size in Julia. Here's C++:

```cpp
#include <iostream>
void test_recursion_depth(long long maxdepth, long long depth = 0) {
if (depth >= maxdepth) return;
std::cout << "\33[2K\rHello from depth " << depth << std::flush;
test_recursion_depth(maxdepth, depth + 1);
std::cout << ""; // Prevent compiler optimizations
}
int main() { test_recursion_depth(1000000); }
```

which goes up to a depth of 174,271 on my machine. In Julia, the analogous code goes up to 86,649. And within a thread, it only goes up to 42,984.
I think a lot of the reason Julia opts for a smaller stack is that, when creating lots of tasks, you don't want too high of a memory footprint.
This isn't the stack size though; it's the stack size limit. Changing the limit by itself would have no effect on memory. See #54998 (comment).
Here, you can try it for yourself by creating a 10 TB stack limit for a task:

```julia
julia> Task(() -> sleep(10), 10 * 1024 ^ 4) |> schedule |> fetch
```

In other words, if you don't actually use the larger stack, there are no extra allocations. At the same time, if you are launching many, many tasks, those tasks probably do something small and aren't going to make function calls 50,000 recursions deep (or else you would have other problems).

The benefit of a larger default task stack size limit is that you don't run into hard-to-debug errors like those described above – due to the significant mismatch in stack size limits between root and secondary threads. Especially since Julia inference involves some deep recursive calls, sometimes it's not even the user's fault, and they end up with a stack overflow in a thread without any error in the root process.
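A rough way to check this claim (a sketch; `Sys.maxrss` reports the maximum resident set size, so this is only a coarse check): reserve a huge stack limit for a task that never recurses, and confirm that resident memory barely moves:

```julia
# Sketch: reserving a large stack *limit* should not commit physical memory
# unless the task actually recurses deeply enough to touch those pages.
before = Sys.maxrss()
t = Task(() -> sleep(1), 10 * 1024^4)  # request a 10 TiB stack limit
schedule(t)
fetch(t)
println("maxrss grew by ~", (Sys.maxrss() - before) ÷ 1024^2, " MiB")
```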
Is it possible to make this runtime-configurable for folks who need it / want to experiment with it, with appropriate warnings?
Perhaps you're correct, I don't know, but I notice you ignored the first sentence by vtjnash above:

> This comes with a high risk of blowing up the page tables, which blows up kernel memory usage on Linux and causes OOM kills with half the number of tasks.
To properly test the statement by @vtjnash, I devised an experiment like this, trying to see for what values I get an OOM:

```julia
function make_task(stack_size::Int)
f = () -> sleep(10)
r = Task(f, stack_size)
r.sticky = false
r
end
function make_tasks(task_count::Int, stack_size::Int)
[make_task(stack_size) for _ ∈ 1:task_count]
end
function run_tasks(tasks)
foreach(schedule, tasks)
end
function experiment(task_count::Int, stack_size::Int)
tasks = make_tasks(task_count, stack_size)
run_tasks(tasks)
tasks
end
```

I get this:

```julia
julia> experiment(30000, 4*1024*1024)
ERROR: OutOfMemoryError()
Stacktrace:
[1] _Task
@ ./boot.jl:523 [inlined]
[2] Task
@ ./task.jl:5 [inlined]
[3] make_task
@ ./REPL[1]:3 [inlined]
[4] #3
@ ./none:-1 [inlined]
[5] iterate
@ ./generator.jl:48 [inlined]
[6] collect_to!
@ ./array.jl:829 [inlined]
[7] collect_to_with_first!
@ ./array.jl:807 [inlined]
[8] collect(itr::Base.Generator{UnitRange{Int64}, var"#3#4"{Int64}})
@ Base ./array.jl:781
[9] make_tasks
@ ./REPL[2]:2 [inlined]
[10] experiment(task_count::Int64, stack_size::Int64)
@ Main ./REPL[4]:2
[11] top-level scope
@ REPL[5]:1
```

I think there may be a bug in the `Task` constructor here.
That's a good idea. Here is a modified version with printing, so we can measure the number of tasks before OOM errors:

```julia
const TASK_NUM = Ref(0)
function make_task(stack_size::Int)
TASK_NUM[] += 1
print("\33[2K\rHello from task ", TASK_NUM[])
f = () -> sleep(10)
r = Task(f, stack_size)
r.sticky = false
r
end
function make_tasks(task_count::Int, stack_size::Int)
TASK_NUM[] = 0
[make_task(stack_size) for _ ∈ 1:task_count]
end
function run_tasks(tasks)
foreach(schedule, tasks)
end
function experiment(task_count::Int, stack_size::Int)
tasks = make_tasks(task_count, stack_size)
run_tasks(tasks)
tasks
end
experiment(1000000, parse(Int, ARGS[1]) * 1024 * 1024)
```

With this I get

```
> julia --startup-file=no test_memory.jl 4 # 4 MB tasks (default)
Hello from task 29983
ERROR: OutOfMemoryError()
```

and with an 8 MB stack,

```
> julia --startup-file=no test_memory.jl 8 # 8 MB tasks
Hello from task 29983
ERROR: OutOfMemoryError()
```

So, it doesn't seem to change things. Increasing the stack size limit further, I get identical behavior for every value up to 4096 MB per task.
Then, when I reach an 8192 MB stack size limit per task, only then does the OOM start to occur earlier, down to 16,334 tasks. I see the same behavior on Julia 1.6.7 through 1.11-rc1.

So, my feeling is that changing from 4 MB to 8 MB is pretty harmless, because the OOM error seems to be primarily a function of the number of tasks rather than the stack size limit per task (which shouldn't affect things anyway, unless we were asking for terabytes of address space per task – virtual address space is huge). I think it's a clear win for making debugging easier, as threads will no longer experience stack overflows 2x earlier than the root task. Wdyt?
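For reference, a sketch of a driver for the sweep described above (assuming the script is saved as `test_memory.jl`, as in the invocations shown earlier; the intermediate sizes are just example values). `ignorestatus` keeps the loop going when a run dies with the OOM error:

```julia
# Sketch: run the experiment script once per stack size limit (in MB) and let
# each run print how many tasks it reached before the OutOfMemoryError.
for mb in (4, 8, 16, 64, 256, 1024, 4096, 8192)
    println("\n=== stack size limit: $(mb) MB ===")
    run(ignorestatus(`julia --startup-file=no test_memory.jl $mb`))
end
```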
With my PR #55201 we can also test this with:

```julia
function make_task(stack_size::Int)
TASK_NUM[] += 1
print("\33[2K\rHello from task ", TASK_NUM[])
Threads.@spawn reserved_stack=stack_size sleep(100)
end
```

The results are identical to the above with regular `Threads.@spawn`. I tested the same range of stack size limits as before.
I have reproduced these experiments on a Linux machine with a very different memory profile than my Mac. The number of tasks before the OOM error is nearly the same (29,888 vs 29,982) – up to a ~4,000 MB stack size limit per task. A 3,742 MB stack size limit is precisely the point at which I start to see a reduction in the max number of tasks; before that, it's a flat 29,982-task limit on my Linux machine.

So from these experiments it seems like an adjusted stack size limit of 8 MB per task does not actually change the occurrence of OOM errors, across Julia versions (tested v1.6.7 – v1.11.0-rc1) and operating systems (tested Linux and macOS – both 64-bit). cc @vtjnash
Also… should this be flagged in a bug report? 30,000 tasks before an OOM error seems small, no? And clearly the main contributor is not the stack size; it must be something else.
Why limit this to 64-bit systems? Given, on 32-bit: "glibc i386, x86_64: 7.4 MB". Is it to limit testing? It seems that if it's good for 64-bit it should also be good for 32-bit, since it's only a limit.
So, 32-bit systems have only 2^32 bytes of available virtual address space, which is 4 GB. This means that stack space limits are actually something to worry about on 32-bit; even the current value of 2 MB is perhaps a bit large there. However, 64-bit systems have 2^64 bytes of available virtual address space, which is 16 exabytes. Basically, it's so large we don't need to worry about it at all. The practical reason for setting the default stack size limit to a small number of MBs is to discourage users from using large stacks. But since the root task in Julia already has an 8 MB stack size limit, it would make life much simpler if threads had 8 MB stack size limits too.
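A quick back-of-the-envelope check of those figures:

```julia
julia> 2^32 / 1024^3             # 32-bit virtual address space, in GiB
4.0

julia> big(2)^64 / 1024^6        # 64-bit virtual address space, in EiB
16.0
```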
This seems sensible to me and I see no real downside. All arguments against it that I have heard so far seem (to my understanding – of course I may have misunderstood something) to either miss that this is about a limit, not actual allocations, or to discuss hypothetical issues (e.g. with OOM) that are disproven by experiments (and, to my mind, also don't make sense from a theoretical point of view). And while I agree that if you need many tasks with large stacks then perhaps you need to go back to the drawing board, I think this kind of argument can equally be brought against a 4 MB stack (why not make the stack just 2 MB, matching 32-bit systems, or just 1 MB?). Given this flexible nature, I find the argument "this matches the default stack size of the main thread on most 64-bit systems" to be very compelling.
I also feel that if we merge this now – it is still early in the 1.12 release cycle – we will have enough time to react to issues, or undo it if necessary.
A smaller number would also be beneficial, since it would permit doing more optimizations. We are at a sort of unfortunately large size right now, where the possible optimizations aren't quite as substantial.
How substantial are those possible optimizations?
It doesn't fix the issue though; it merely pushes off slightly when the underlying issue needs to be fixed, and in the meantime makes the time until it finally crashes slightly longer, the eventual stacktrace slower, and it more likely that tooling which tries to point at the actual cause of failure will itself fail (as such tools tend to have limits in the 10k-frame range).
Can we measure these things? Since the effect on OOM errors evidently does not appear until a 1000x increase in stack size limits, perhaps the same will be true for these other theoretical optimizations. With #55184 merged, I think it's good to note that this is just the default behavior. If someone needs custom stack size limits – for some nonstandard OS where having small stack size limits is very important – there's now a documented way for an advanced user to do that. But otherwise I think the default stack size limit should be the same as the root task's; otherwise it's a needless footgun.
Ok, I just spun up a Windows machine on AWS to test this, running the code from the comment above. The results are as follows:
So, again, it seems that changing from a 4 MB default to an 8 MB default does not affect the OOM error; something other than the stack size limit is the main cause. If Windows is still a concern, though, maybe a compromise is that we could raise it to 8 MB on Linux and macOS, and leave it at 4 MB on Windows? (That being said, the maximum number of tasks on Windows before an OOM is tiny regardless of the stack size limit... Any idea why? Is this just an anomalous measurement from my cloud environment?)
Why is this the case? This doesn't seem supported by the above experiments. Does Julia do something non-standard with call stacks that can cause reliability issues when asking for more virtual address space? Maybe you could share an example of this behavior so we can analyze it?
For the record – I would also be reasonably content with a 4 MB/4 MB stack size limit for both the main thread and secondary threads. But, of course, that's not possible, because 8 MB is the default for the main thread on most modern systems. And I think 8 MB/8 MB is much better than 8 MB/4 MB, which is a big footgun. I think the very best solution would be to have the stack size limit of a task default to whatever the root task's limit is.
@MilesCranmer regarding our experiments above: they are not valid, because setting the second parameter of the `Task` constructor doesn't exercise the same stack-allocation path as the default, so it doesn't measure what changing the default would do. This also explains the overly pessimistic results.
That makes much more sense, thanks! I'll try with a custom build.
New experiments:

```julia
const TASK_NUM = Ref(0)
function make_task()
TASK_NUM[] += 1
TASK_NUM[] % 1000 == 0 && print("\33[2K\rHello from task ", TASK_NUM[])
r = Task(() -> sleep(10^10))
r.sticky = false
r
end
experiment(task_count::Int) = (TASK_NUM[] = 0; [make_task() for _ ∈ 1:task_count])
experiment(1000000000)
```

Now, rather than using the `Task` constructor's stack-size argument, I changed the default stack size in the source and rebuilt Julia for each value.
So I basically get the same number of tasks before OOM, within statistical noise.
If it helps with your experiments: if you only want to change the default stack size constant (Line 113 in 1dee000), it should be enough to run `make -j -C src` instead of rebuilding the whole of Julia.
Thanks. I also tried with an explicit …
If we really wanted to, we could probably manage to limit the stack size of the main task to something less than the default, but I don't see a really compelling reason that 4 MB is the right size, so we might as well go with this.
That is just an rlimit, which is trivial to set from Julia (normally we only query it), though I don't see much value in doing so
That is because Julia stops honoring it once you hit about 10k Tasks and changes to a different internal Task mode (slow, because we cannot optimize it while this limit is above 2 MB), because it is trying to avoid a different limit in the kernel.
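For context, here is a sketch of how the stack rlimit can be queried from Julia via libc (assumptions: a POSIX system where `RLIMIT_STACK == 3`, as on Linux and macOS, and a 64-bit `rlim_t`); this mirrors what `ulimit -s` reports:

```julia
# Sketch: query the soft stack size limit (in bytes) via getrlimit from libc.
function stack_rlimit_bytes()
    RLIMIT_STACK = Cint(3)             # 3 on Linux and macOS (assumption)
    rlim = Vector{UInt64}(undef, 2)    # struct rlimit: (rlim_cur, rlim_max)
    ret = ccall(:getrlimit, Cint, (Cint, Ptr{UInt64}), RLIMIT_STACK, rlim)
    ret == 0 || error("getrlimit failed")
    return rlim[1]                     # soft limit, in bytes
end

stack_rlimit_bytes() ÷ 1024^2          # e.g. 8 MiB on many Linux/macOS setups
```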
That is a Julia-imposed limit, not a kernel limit, so you aren't really measuring what you set out to measure.
@vtjnash here are the new experiments (above), after our realization that the explicit stack-size argument to `Task` wasn't testing the default code path.
Right now we just fail to allocate them entirely after about 10k (the exact number is OS-dependent) and have some tricks to lie about it instead, so that you don't notice too easily. This lie is necessary because the 4 MB default is just a little too large to effectively pool the allocations. This is all transparent to you in your tests – it was supposed to be hidden from you – but it does negate your attempts to test it.
I'm not sure I follow. If the testing has an issue, perhaps it would be easier if you describe how it should be modified? Earlier you explained how a larger stack size limit could increase the occurrence of OOM errors, so these experiments are designed to test the magnitude of that potential problem. So far it seems not to be an issue in practice.
IIUC, if we are not allocating after 10k threads, then what is the harm in increasing the limit per thread anyway? @MilesCranmer Just out of curiosity, if we reduce the main thread to 4 MB as well, does that fix the Enzyme crash?
I can check. My guess is that it would just cause the type-inference stack overflow to show up in the main thread in addition to the secondary one, which would be much easier to debug. Basically, the 2x mismatch in stack size limits between the main thread and secondary threads resulted in this race-conditioned stack overflow that I found quite challenging to debug. The difficulty was compounded by caching, which meant the secondary thread could run fine so long as the main thread was the one to compile the code first. (This was the main trigger for me making this PR.)

A stack size limit of 1 MB, 2 MB, etc. would all be fine, so long as it is the same as the secondary thread stack size limit. This PR sets it to 8 MB since that's the usual main thread stack size limit, but in principle I'd be ok with reducing both. It's just that reducing the main thread stack size would be a major breaking change, so it is Julia 2.0 material. And from all the experiments I've run, it seems Julia's call stacks are normal in that they only allocate physical memory when used, not when they are instantiated. So, based on the numbers, there seems to be no downside.
I agree we should make stack sizes the same for all tasks; we don't have to promise anything about whether you get stack overflows, but it should at least not depend on which task the code runs on.
This is interesting to think about, tangentially --- I personally don't think changes involving resource use can count as breaking changes. For example, the representation of an object getting larger could lead to more OOMs, but changing the representation of an object is clearly allowed, at least as a minor version change. A reasonable stack size decrease, to me, would be similar.
Yeah, I guess there's a sort of blurred decision boundary here. The main thread stack size changing from 8.0 MB to 7.5 MB seems in line with normal minor version changes, whereas 8.0 MB to 1.0 MB would mean some libraries that rely on deep stacks might need to be rewritten entirely.
Fwiw, it is actually abnormal for other threads to have as much stack space as the main thread. In glibc the default is potentially as low as 16 kB! So if you use threads, at least some subset of your tasks may have much smaller stacks that are impossible to increase.
I think we're in a very different situation here, since Julia doesn't let you control (natively, at least) what thread your task spawns on.
Also, from the post you refer to, it looks like this is from an old operating system (HP-UX), where I am guessing the main thread stack size is probably equivalently tiny.
One problem here is that, while we'd all in theory be OK with reducing thread 1's stack to make them consistent, we can't do that because the user might have manually set the limit with `ulimit -s`. So the options are (1) raise the default task stack size limit to match thread 1, or (2) leave the mismatch as it is.

Based on the experiments here so far, (1) looks fine, but 8 MB is really very big compared to how much stack typical programs use. It's entirely possible we will get some advantages in the future by reducing the size. For example, we currently allocate one stack at a time with mmap, but we should be pooling them and allocating many in a single mapping. At that point, decreasing the stack size directly increases the number of tasks we can efficiently handle. Unfortunately we won't have those performance numbers until that is actually implemented. But I'm not sure it will always be the best tradeoff to promise consistent stack sizes for all tasks.
I will merge this for now in case it's helpful to anyone. However, in the future we may try to optimize tasks further, and if we can show good numbers from shrinking the stacks again, we'll consider doing it.
Sounds very reasonable to me.
This increases the default stack size limit on 64-bit systems from 4 MB to 8 MB, matching glibc and typical modern Linux and macOS machines, as well as the stack size limit of the root Julia process. Note that this is a limit rather than an allocation, and only results in memory usage if used. A larger limit by itself does not change the memory usage of Julia [1] [2].
Since the root task already has an 8 MB default limit, a different stack size limit in tasks can lead to some hard-to-debug errors in multithreaded code which can't be reproduced in serial versions.
#55184 will also help address this issue, so that a user can manually adjust their own stack limits with a documented API. However, I think an 8 MB stack size limit is a better default, and it matches the default on a variety of systems (you can check your system's default with `ulimit -s`). 64-bit systems have 16 exabytes of virtual address space available, and stack size limits do not inherently affect performance; see my note here. They only affect performance if one actually uses the larger stack size with deeper function calls.

Also see some stack size limit experiments here: https://discourse.julialang.org/t/experiments-with-julia-stack-sizes-and-enzyme/116511/2, which look at function nesting limits (which I have run into myself when using AD libraries).
One alternative is for the stack size to be system-dependent, computed based on the same information used by `ulimit -s`. However, I think this would make bugs harder to reproduce if workflows get close to stack size limits. A single limit across 64-bit systems seems reasonable (as is done currently).

Fixes #54998. cc @ViralBShah @nsajko
It seems like there is an 8 MB default for the signal stack too: see julia/src/signals-unix.c, lines 40 to 41 (at 3290904).