-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Memory corruption was detected in ERIS #27
Comments
Hi, do you think you can show the code that demonstrates the problem? That would give others a better chance to reproduce the problem and try to fix it. |
@cbeck88 , Unfortunately I cannot show original lua script which I debugged. I suppose it can be reproduced if lua script will contain many nested prototypes, or any other objects. Main purpose of such script is just to contain many objects to cause garbage collection. I can show workaround code, but it has no meaning without main lua script. In free time I will try to create simplified version of lua script which will break undumping. |
To clarify, is this with a C Closure or a Lua function? |
This is not a useful bug report if there isn't any kind of code example given -- it isn't reproduceable. @LmTinyToon If you can't give a MVCE (https://stackoverflow.com/help/mcve) then if it were my repository I would close the issue If OP has really disappeared then I would be inclined to assume that he figured it out it was a bug in his code and moved on. |
@cbeck88 Well, agreed that the bug report is lacking. But, I was drawn to this because I seem to be getting a similar issue in my code and have spent a few days trying to reproduce it. |
I'm having what is possibly the same issue. Unfortunately, it would be hard for me to post my whole test case (and I've tried to reproduce it in a simplified test case). However, running my production code in Valgrind seems to corroborate @LmTinyToon 's story.
|
Note that the error appears in multiple places, but the issue it is always the the same "block was free'd at". So I suspect this is indeed a bug. Further, when I run my code under heavy load it almost always crashes with a heap corruption message. Sometimes it's a double free, sometimes it's a corrupted double-linked list. Basically it's always heap corruption and it has only started happening since I introduced Eris on this project. |
Even more, all seems to be allocated at the same place and ultimately runs up against a double-free error. https://gist.github.com/ptwohig/5ddc97166a61af608b19cdfc346c9761 |
Actually, I just figured out the reproduction steps which work 100% of the time:
Same place in u_closure. I think @LmTinyToon is correct. I'm not sure that disabling garbage collection during will be a suitable workaround because this basically makes for a double-free regardless. |
Awesome, thank you 👍 I am interested to try to help fix this if that is welcome but I can't make time to do anything here until the weekend |
Here's where I think the offending lines are: LClosure *cl;
Proto *p;
eris_checkstack(info->L, 4);
/* Create closure and anchor it on the stack (avoid collection via GC). */
cl = eris_newLclosure(info->L, nups);
eris_setclLvalue(info->L, info->L->top, cl); /* ... lcl */
api_incr_top(info->L);
/* Preregister closure for handling of cycles (upvalues). */
registerobject(info);
/* Read prototype. In general, we create protos (and upvalues) before
* trying to read them and pass a pointer to the instance along to the
* unpersist function. This way the instance is safely hooked up to an
* object, so we don't have to worry about it getting GCed. */
pushpath(info, ".proto");
cl->p = eris_newproto(info->L);
/* Push the proto into which to unpersist as a parameter to u_proto. */
lua_pushlightuserdata(info->L, cl->p); /* ... lcl nproto */
unpersist(info); /* ... lcl nproto nproto/oproto */ Consider this scenario:
I suspect that's why the author's of Lua included that HARDMEMTESTS to force a failure as soon as possible in testing. |
I am sorry for late. Unfortunately I cannot give original code which represents bug. I tried to reproduce but it does not have effect. |
@ptwohig, Yes, it seems like you have the same problem (I remember similar stack after corruption). It was "floating" bug. Sometimes it worked sometimes not. Temporal disabling of garbage collector during work of eris fixed problem (anyway it helped me and bug disappeared) |
@LmTinyToon I tried the disabling of GC and it still crashed on me. Granted I had a few hundred threads working a few hundred separate Here is something that is definitely wrong, though: p->sizecode = READ_VALUE(int);
eris_reallocvector(info->L, p->code, 0, p->sizecode, Instruction); And observe what happens beginning on lgc.c:479 static int traverseproto (global_State *g, Proto *f) {
int i;
if (f->cache && iswhite(f->cache))
f->cache = NULL; /* allow cache to be collected */
markobjectN(g, f->source);
for (i = 0; i < f->sizek; i++) /* mark literals */
markvalue(g, &f->k[i]);
for (i = 0; i < f->sizeupvalues; i++) /* mark upvalue names */
markobjectN(g, f->upvalues[i].name);
for (i = 0; i < f->sizep; i++) /* mark nested protos */
markobjectN(g, f->p[i]);
for (i = 0; i < f->sizelocvars; i++) /* mark local-variable names */
markobjectN(g, f->locvars[i].varname);
return sizeof(Proto) + sizeof(Instruction) * f->sizecode +
sizeof(Proto *) * f->sizep +
sizeof(TValue) * f->sizek +
sizeof(int) * f->sizelineinfo +
sizeof(LocVar) * f->sizelocvars +
sizeof(Upvaldesc) * f->sizeupvalues;
} The garbage collector attempts to iterate each array of objects and mark them individually, however if a value exists (for example) for p->sizek and the pointer is 0, a segfault will happen. More specifically if it is possible that a garbage collection cycle hits between the assignment of one of the sizes and the assignment/allocation of the array holding the data we end up with a hard crash. Changing everywhere in eris.c to code similar to this seems to have fixed that problem and all tests pass now. {
int psizek = READ_VALUE(int);
eris_reallocvector(info->L, p->k, 0, psizek, TValue);
p->sizek = psizek;
} I'm still not convinced that there isn't a need for barriers to prevent memory corruption but this is definitely problematic as well. Granted that the particular above segment can't trigger a garbage collection cycle, similar parts of the code can potentially do so as well. I also noticed that between Lua 5.2 and Lua 5.3 they removed a proto barrier function, so maybe this was the cause of it. It's very hard to say right now until I do more digging. |
I had the same problem. Turns out, this is just a missing luaC_objbarrier. In u_proto, look for this line of code: p->p[i] = eris_newproto(info->L); In this case, if proto p is marked "black", we need a luaC_objbarrier. I preceded this line of code with this: assert(!isblack(obj2gco(p))); And sure enough, I got a couple of assertion fails. So definitely, proto p can be black. I was able to fix it by changing the code as follows: p->p[i] = cp = eris_newproto(info->L); If you want to see the bug yourself, this is very hard to do. You have to get the garbage collector to mark proto p black at just the wrong moment. It's extremely unlikely, unless you're saving and loading a lot of nested protos. I have not been able to write a small test case that triggers the bug. Despite that, I can assure you that the bug does happen, and that the fix above works. Note that the "reallocvector" bug documented above, by ptwohig, is a completely separate bug. His repair works for that bug. |
Hi. I learned internals of eris little bit during debugging undump of lua script. As I understood, there is memory corruption which occurs during undumping script (script creates not many objects (like closures etc). This is happened due inproper (I guess) working with marks of gco objects. Here is brief example of corruption:
I suppose that some work with garbage collector barriers is needed because following steps fix problem:
The text was updated successfully, but these errors were encountered: