-
Hmm... good question. My gut feeling is that this particular cache is not the culprit, but it should be easy to test. Since the cache is only used as an optimization, you could rewrite pallene_getstr so that it always ignores the cache.

One thing that I haven't done yet, but that might help to debug these nasty bugs, would be to run Lua in debug mode with assertions turned on. See the LUAI_ASSERT define and the lua_assert macro that's inside some Lua and Pallene C code. If you could figure out the appropriate incantations to turn on these asserts, it would be quite helpful for us in the future.
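For reference, here is roughly how that machinery looks in the stock Lua 5.4 sources (from memory, so worth double-checking against the Lua copy that Pallene actually builds):

```c
/* Roughly what the vanilla Lua 5.4 sources do (in llimits.h, IIRC):
 * if LUAI_ASSERT is defined, lua_assert() becomes a real C assert(),
 * otherwise it compiles away to nothing. */
#if defined(LUAI_ASSERT)
#include <assert.h>
#define lua_assert(c)    assert(c)
#else
#define lua_assert(c)    ((void)0)
#endif

/* So building the interpreter (and the Pallene runtime files) with
 * something like -DLUAI_ASSERT in the C flags should be enough to turn
 * the internal consistency checks on. */
```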
-
Thanks!!! Could you please open a PR documenting how to do the assertion thing? We should probably start testing Pallene with these assertions from now on, to avoid similar trouble in the future. By the way, do you know if the existing test suite can trigger these assertions?
Since this is a stack overflow, perhaps we could try a Pallene program that's a single recursive function, called recursively many times? Possibly with 7+ return values, if that helps.
Would it be possible to find out how big the stack was when the crash happened? Total number of Lua and C calls? If it's a smallish number (less than 100), I think that would narrow the possibilities for the bug.
Next step we need to find this bug and squash it >:) I wouldn't be surprised if it's some embarrassing bug in the Pallene generated code for function calls. That code needs to carefully ensure that it grows the stack when it enters the function, and that it updates the various stack pointers whenever Lua reallocates the stack (which happens when the stack grows large enough). This is tricky and has been a source of bugs before. If we could find a small reproducible example it would sure help. (I'd try a recursive function)
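To make the stack-growth requirement concrete, here is a rough sketch in plain C-API terms. This is not Pallene's generated code, and the function name and sizes are made up for illustration; it is just the kind of deeply recursive, many-return-values shape that should stress the same machinery:

```c
#include <lua.h>
#include <lauxlib.h>

/* Illustrative only: a C function that returns 8 values and recurses
 * through the Lua stack.  Deep recursion through something like this is
 * one way to force the Lua stack to be reallocated mid-run. */
static int return_eight(lua_State *L) {
    int n = (int)luaL_checkinteger(L, 1);

    /* Make sure there is room for everything we will push.  Generated
     * code has to do the same thing on entry; asking for too little
     * space here is exactly the kind of bug being discussed. */
    luaL_checkstack(L, 10, "return_eight");

    if (n > 0) {
        /* Recurse with n - 1.  After any call that can grow the stack,
         * raw pointers into the stack may be stale; only indices are safe. */
        lua_pushcfunction(L, return_eight);
        lua_pushinteger(L, n - 1);
        lua_call(L, 1, 8);
        return 8;   /* forward the callee's 8 results */
    }

    for (int i = 1; i <= 8; i++)
        lua_pushinteger(L, i);
    return 8;
}
```

Each nested C call is guaranteed some spare stack slots of its own, so a couple hundred levels of this should already force Lua to reallocate its stack a few times along the way.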
-
Hmm. Once we figure out that reproducible test case we should add it to the suite. Maybe it's something that the test suite isn't testing well right now... For example, back-and-forth Lua<->Pallene calls, or functions with a large number of return values.
Currently the Pallene documentation is inside our git repository. I was thinking about adding those debugging tips to the CONTRIBUTING.md file.
This sounds about right! Just test it to double-check before opening the PR.
What operating system are you running it on?
That's ok.
It does. There is also a mechanism for limiting the number of Lua->C calls, which may or may not be related to our bug here. (LUAI_MAXCCALLS)
Sure. The idea of using a recursive program is to make a smaller test case that also has many function calls and grows the call stack.
Perhaps... Could you please provide your minimal test case that segfaults so I can test it here? When I tried, I couldn't get my programs to segfault on my Fedora laptop. What OS are you using now?
Yes, IIRC they are pointers. I believe you can printf them with %p.
The pointer arithmetic for pointers to unions works the same as for pointers to non-unions. For our purposes, you can sort of pretend that the StackValue is the same thing as a TValue. The other case in the union is for to-be-closed variables, which were added in 5.4 but which Pallene doesn't implement yet.
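If it helps, here is a tiny standalone illustration of that point; the union below is a simplified stand-in for Lua's real StackValue, not a copy of it:

```c
#include <stdio.h>

/* Simplified stand-ins for Lua's TValue / StackValue, just to show that
 * pointer arithmetic over a pointer-to-union behaves like pointer
 * arithmetic over any other type: each step moves by sizeof(union). */
typedef struct { double n; int tt; } FakeTValue;

typedef union {
    FakeTValue val;                                            /* the normal case */
    struct { FakeTValue val; unsigned short delta; } tbclist;  /* to-be-closed bookkeeping */
} FakeStackValue;

int main(void) {
    FakeStackValue stack[4];
    FakeStackValue *base = &stack[0];
    FakeStackValue *top  = &stack[3];

    /* Pointers into the stack can be printed with %p and subtracted to
     * get a slot count, exactly as with non-union element types. */
    printf("base = %p, top = %p, top - base = %td slots\n",
           (void *)base, (void *)top, top - base);
    return 0;
}
```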
Indeed, they are not arbitrary; they have to point to the same Lua stack. IMO the best documentation of what top, ci->top, etc. are is Dibyendu's bytecode reference, found here: https://github.com/dibyendumajumdar/ravi/blob/master/readthedocs/lua_bytecode_reference.rst
My gut feeling is that an integer overflow bug doesn't sound likely. My first expectation would be a bug that's an out-of-bounds pointer access. One kind of bug that has happened before is if we don't grow the stack enough at the start of the function, perhaps because we told it to grow by less than it needed. Another thing that has happened before is if Lua reallocates the stack in response to the stack growing, but we don't update the corresponding stack pointers.
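Not Pallene code, but here is a plain-C picture of why that second case corrupts memory; IIRC Lua's own sources have savestack / restorestack macros for exactly this save-an-offset-and-recompute dance:

```c
#include <stdio.h>
#include <stdlib.h>

/* Plain-C analogy: once a buffer can be realloc'd (like the Lua stack
 * when it grows), raw pointers saved before the reallocation may be
 * dangling afterwards; only offsets stay valid. */
int main(void) {
    size_t cap = 8;
    double *stack = malloc(cap * sizeof *stack);
    if (stack == NULL) return 1;

    size_t off = 3;              /* GOOD: an offset survives reallocation */
    double *slot = &stack[off];  /* BAD: a raw pointer might not          */

    /* "The stack grows": Lua does the equivalent of this internally. */
    double *grown = realloc(stack, cap * 16 * sizeof *grown);
    if (grown == NULL) { free(stack); return 1; }
    stack = grown;

    /* Using the old 'slot' now would be undefined behaviour if realloc
     * moved the block.  The safe pattern is to recompute from the offset,
     * which is what the generated code has to do for its stack pointers. */
    slot = &stack[off];
    *slot = 42.0;
    printf("%g\n", stack[off]);

    free(stack);
    return 0;
}
```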
-
Hello, I was just wondering if you've made any progress on this bug.

A slight tangent, but I decided to try using LuaJIT as a temporary workaround. I just did a benchmark of a small/medium size run comparing LuaJIT and Pallene, no Lua Lanes in this run... Pallene was almost twice as fast as LuaJIT for my program (ignoring runs where Pallene fails due to this bug). (And I disabled LUAI_ASSERT for this benchmark.)

LuaJIT 2.1.0-beta3:
Pallene: Lua 5.4

I think that's pretty impressive. I don't really know the reasons for this. The program is not optimized... just written to try to solve my problem. The program does a lot of simple math in nested for-loops, but the program as a whole also has a lot of control flow stuff. If I had to guess at the reason, the program is complicated enough that it is probably hard to unroll and vectorize a lot of things, and probably the majority of the time is spent dealing with the control flow. And out of habit and training, my development style in general tends to avoid lots of memory allocations and deallocations, so I'm skeptical the timing differences are related to the garbage collector differences.

Anyway, I think this performance is pretty promising for Pallene. I think my program is more representative of what real-world programs look like than micro-benchmarks are, so hopefully Pallene will be able to bring substantial performance gains like these to more real-world programs.
-
Unfortunately I haven't had a chance to look into it yet. I'm writing a paper and there's a close deadline coming up...
-
I started using Lua Lanes with Pallene. I sometimes hit assertion failures in Pallene, such as:
"wrong type for array element, expected float but found table"
For my code, these things were created in Pallene, so the types should never change. So it felt almost like a race condition, as if somehow code was being run before Pallene finished initializing the type.
This case is really hard to reproduce. If I re-run exactly as before, often I will not see the problem. And I never get this problem if I try to do a non-Lanes run. But I am trying to do some big runs that take many hours (hence why I am using Lanes), so after a long enough time, this assertion failure can pop up and completely ruin a run.
I've been banging my head on this for a week now. I've tried to make a simple isolated reproducible test, but I have been unable to trigger it with that so far.
So I tried to think what in Pallene could cause Lanes (multiple threads or multiple VMs) to break. Searching through the code, one thing that caught my eye is the use of static int, e.g.
Thinking out loud, I'm wondering if the multiple Lanes VMs are fighting over this same static int in C, which is causing the race-condition-like behavior I'm experiencing.
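To be clear about what I mean, here is a hypothetical sketch (the names are invented for illustration, not copied from the Pallene sources) of why a file-level static is dangerous once two independent lua_State VMs run on different OS threads:

```c
#include <lua.h>
#include <lauxlib.h>

/* Hypothetical pattern, NOT actual Pallene code: a process-wide cache. */
static int cached_ref = LUA_NOREF;   /* shared by every lua_State in the process */

/* Two Lanes (two independent VMs on two OS threads) running this race on
 * cached_ref, and a registry reference created in one VM can end up being
 * used to index the *other* VM's registry, where it refers to something
 * completely unrelated. */
static void get_cached(lua_State *L) {
    if (cached_ref == LUA_NOREF) {
        lua_newtable(L);
        cached_ref = luaL_ref(L, LUA_REGISTRYINDEX);
    }
    lua_rawgeti(L, LUA_REGISTRYINDEX, cached_ref);
}

/* A per-VM alternative: key the cache through each state's own registry,
 * using the address of a static as a unique light-userdata key. */
static void get_cached_per_vm(lua_State *L) {
    static const char key = 'k';
    if (lua_rawgetp(L, LUA_REGISTRYINDEX, &key) == LUA_TNIL) {
        lua_pop(L, 1);                            /* drop the nil            */
        lua_newtable(L);                          /* create the cache        */
        lua_pushvalue(L, -1);                     /* keep a copy on the stack */
        lua_rawsetp(L, LUA_REGISTRYINDEX, &key);  /* registry[&key] = cache  */
    }
    /* the cache table is now on top of the stack, one per lua_State */
}
```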
If you think this is a legit problem, can we try to fix this so it can become multiple-VM and Lanes friendly?
If not, any other ideas on what might be the root of my problem?
Thanks