Skip to content

corerun with DOTNET_EnableEventLog: "Heap contamination detected! "HeapFree was called on a heap other than the one that memory was allocated from." #56888

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
am11 opened this issue Aug 5, 2021 · 34 comments
Assignees
Milestone

Comments

@am11
Copy link
Member

am11 commented Aug 5, 2021

Based on current main (5bef921), build product and tests in Debug configuration on Ubuntu 18.04 x64:

./build.sh
./src/tests/build.sh -p:LibrariesConfiguration=Debug

Running a simple hello world app works:

cd artifacts/bin/testhost/net6.0-Linux-Debug-x64/shared/Microsoft.NETCore.App/6.0.0
corerun /tmp/helloworld.dll

running the same app with event log enabled fails an assertion:

DOTNET_EnableEventLog=1 corerun /tmp/helloworld.dll

Assert failure(PID 543 [0x0000021f], Thread: 543 [0x021f]): !"Heap contamination detected! HeapFree was called on a heap other than the one that memory was allocated from.\n" "Possible cause: you used new (executable) to allocate the memory, but didn't use DeleteExecutable() to free it."
    File: /runtime/src/coreclr/utilcode/clrhost_nodependencies.cpp Line: 296
    Image: /runtime/artifacts/bin/testhost/net6.0-Linux-Debug-x64/shared/Microsoft.NETCore.App/6.0.0/corerun

corehost, OTOH, has no issue with DOTNET_EnableEventLog=1 (or COMPlus_EnableEventLog=1).

@ghost
Copy link

ghost commented Aug 5, 2021

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Aug 5, 2021
@am11
Copy link
Member Author

am11 commented Aug 5, 2021

cc @josalem, @janvorli
I found this when writing LTTng tests (wip branch: main...am11:feature/tracing/tests/lttng-events) and reduced it to a simple hello world. Setting DOTNET_EnableEventLog=1 on anything chokes corerun.

@josalem
Copy link
Contributor

josalem commented Aug 5, 2021

I'm not sure what would be different for LTTng between corehost and corerun. That seems odd. Is there any more to the stack when the assert is thrown? clrhost_nodependencies.cpp:296 is just ClrFree. I'll see if I can get an LTTng environment set up. If you can easily reproduce this though, it might be faster if you can run it under LLDB and get a bt at the assert.

(lldb corerun -- helloworld.dll -> (lldb) settings set target.env-vars DOTNET_EnableEventLog=1 -> etc.)

@am11
Copy link
Member Author

am11 commented Aug 5, 2021

This is only happening in debug build, I tested with release build and the test passed.

Here is the backtrace of helloworld.dll from lldb, when launching with corerun:

root@968b4a55cce2:/runtime/artifacts/tests/coreclr/Linux.x64.Debug/Tests/Core_Root# lldb -f ./corerun /tmp/helloworld/bin/Release/net6.0/helloworld.dll
(lldb) target create "./corerun"
Current executable set to './corerun' (x86_64).
(lldb) settings set -- target.run-args  "/tmp/helloworld/bin/Release/net6.0/helloworld.dll"
(lldb) settings set target.env-vars DOTNET_EnableEventLog=1
(lldb) r
Process 1756 launched: './corerun' (x86_64)
      
Assert failure(PID 1756 [0x000006dc], Thread: 1756 [0x06dc]): !"Heap contamination detected! HeapFree was called on a heap other than the one that memory was allocated from.\n" "Possible cause: you used new (executable) to allocate the memory, but didn't use DeleteExecutable() to free it."
    File: /runtime/src/coreclr/utilcode/clrhost_nodependencies.cpp Line: 296
    Image: /runtime/artifacts/tests/coreclr/Linux.x64.Debug/Tests/Core_Root/corerun

Process 1756 stopped
* thread #1, name = 'corerun', stop reason = signal SIGTRAP
    frame #0: 0x00007ffff6b52559 libcoreclr.so`DBG_DebugBreak at debugbreak.S:9
   6   	
   7   	LEAF_ENTRY DBG_DebugBreak, _TEXT
   8   	        int3
-> 9   	        ret
   10  	LEAF_END_MARKED DBG_DebugBreak, _TEXT
   11  	
(lldb) bt all
* thread #1, name = 'corerun', stop reason = signal SIGTRAP
  * frame #0: 0x00007ffff6b52559 libcoreclr.so`DBG_DebugBreak at debugbreak.S:9
    frame #1: 0x00007ffff6aa43cb libcoreclr.so`::DebugBreak() at debug.cpp:406
    frame #2: 0x00007ffff680e828 libcoreclr.so`::DbgAssertDialog(szFile="/runtime/src/coreclr/utilcode/clrhost_nodependencies.cpp", iLine=296, szExpr="!\"Heap contamination detected! HeapFree was called on a heap other than the one that memory was allocated from.\\n\" \"Possible cause: you used new (executable) to allocate the memory, but didn't use DeleteExecutable() to free it.\"") at debug.cpp:697
    frame #3: 0x00007ffff67c13c2 libcoreclr.so`ClrFree(p=0x0000555555661c30) at clrhost_nodependencies.cpp:295
    frame #4: 0x00007ffff67c1175 libcoreclr.so`operator delete[](p=0x0000555555661c40) at clrhost_nodependencies.cpp:410
    frame #5: 0x00007ffff6bab80d libcoreclr.so`::FireEtXplatRuntimeInformationStart(ClrInstanceID=0, Sku=2, BclMajorVersion=6, BclMinorVersion=0, BclBuildNumber=0, BclQfeNumber=0, VMMajorVersion=42, VMMinorVersion=42, VMBuildNumber=42, VMQfeNumber=42424, StartupFlags=8388611, StartupMode='\0', CommandLine=u"", ComObjectGuid=0x00007ffff5fa9600, RuntimeDllPath=u"/runtime/artifacts/tests/coreclr/Linux.x64.Debug/Tests/Core_Root/libcoreclr.so") at eventprovdotnetruntime.cpp:4905
    frame #6: 0x00007ffff665ee35 libcoreclr.so`FireEtwRuntimeInformationStart(ClrInstanceID=0, Sku=2, BclMajorVersion=6, BclMinorVersion=0, BclBuildNumber=0, BclQfeNumber=0, VMMajorVersion=42, VMMinorVersion=42, VMBuildNumber=42, VMQfeNumber=42424, StartupFlags=8388611, StartupMode='\0', CommandLine=u"", ComObjectGuid=0x00007ffff5fa9600, RuntimeDllPath=u"/runtime/artifacts/tests/coreclr/Linux.x64.Debug/Tests/Core_Root/libcoreclr.so", ActivityId=0x0000000000000000, RelatedActivityId=0x0000000000000000) at clretwallmain.h:2811
    frame #7: 0x00007ffff664b58e libcoreclr.so`ETW::InfoLog::RuntimeInformation(type=0) at eventtrace.cpp:5111
    frame #8: 0x00007ffff6a73c1f libcoreclr.so`EEStartupHelper() at ceemain.cpp:767
    frame #9: 0x00007ffff6a75684 libcoreclr.so`EEStartup(this=0x00007fffffffda88, p=0x0000000000000000)::$_1::operator()(void*) const at ceemain.cpp:1153
    frame #10: 0x00007ffff6a72a5b libcoreclr.so`EEStartup() at ceemain.cpp:1155
    frame #11: 0x00007ffff6a72862 libcoreclr.so`EnsureEEStarted() at ceemain.cpp:321
    frame #12: 0x00007ffff6219bed libcoreclr.so`CorHost2::Start(this=0x0000555555634690) at corhost.cpp:101
    frame #13: 0x00007ffff6153006 libcoreclr.so`::coreclr_initialize(exePath="/runtime/artifacts/tests/coreclr/Linux.x64.Debug/Tests/Core_Root/corerun", appDomainFriendlyName="corerun", propertyCount=4, propertyKeys=0x000055555558a0c0, propertyValues=0x000055555558a140, hostHandle=0x0000555555566068, domainId=0x0000555555566070) at unixinterface.cpp:251
    frame #14: 0x000055555555f127 corerun`run(config=0x00007fffffffe460) at corerun.cpp:352
    frame #15: 0x000055555555c616 corerun`main(argc=2, argv=0x00007fffffffe5d8) at corerun.cpp:554
    frame #16: 0x00007ffff6ca1bf7 libc.so.6`__libc_start_main(main=(corerun`main at corerun.cpp:546), argc=2, argv=0x00007fffffffe5d8, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007fffffffe5c8) at libc-start.c:310
    frame #17: 0x000055555555b02a corerun`_start + 42
  thread #2, name = 'corerun-ust'
    frame #0: 0x00007ffff79c2657 libpthread.so.0`__libc_recvmsg + 71
    frame #1: 0x00007ffff5985c23 liblttng-ust.so.0`ustcomm_recv_unix_sock + 179
    frame #2: 0x00007ffff598b241 liblttng-ust.so.0`___lldb_unnamed_symbol62$$liblttng-ust.so.0 + 2241
    frame #3: 0x00007ffff79b76db libpthread.so.0`start_thread + 219
    frame #4: 0x00007ffff6da171f libc.so.6`__GI___clone at clone.S:95
  thread #3, name = 'corerun-ust'
    frame #0: 0x00007ffff6d9b639 libc.so.6`syscall at syscall.S:38
    frame #1: 0x00007ffff598abd3 liblttng-ust.so.0`___lldb_unnamed_symbol62$$liblttng-ust.so.0 + 595
    frame #2: 0x00007ffff79b76db libpthread.so.0`start_thread + 219
    frame #3: 0x00007ffff6da171f libc.so.6`__GI___clone at clone.S:95
  thread #4, name = 'corerun'
    frame #0: 0x00007ffff6d94cb9 libc.so.6`__GI___poll(fds=0x00007fffed55ca20, nfds=1, timeout=-1) at poll.c:29
    frame #1: 0x00007ffff6b30459 libcoreclr.so`CorUnix::CPalSynchronizationManager::ReadBytesFromProcessPipe(this=0x0000555555638900, iTimeout=-1, pRecvBuf="", iBytes=1) at synchmanager.cpp:2233
    frame #2: 0x00007ffff6b2f618 libcoreclr.so`CorUnix::CPalSynchronizationManager::ReadCmdFromProcessPipe(this=0x0000555555638900, iPollTimeout=-1, pswcWorkerCmd=0x00007fffed55cde4, pshridMarshaledData=0x00007fffed55cdd8, pdwData=0x00007fffed55cdd4) at synchmanager.cpp:2011
    frame #3: 0x00007ffff6b2e1c6 libcoreclr.so`CorUnix::CPalSynchronizationManager::WorkerThread(pArg=0x0000555555638900) at synchmanager.cpp:1714
    frame #4: 0x00007ffff6b4d315 libcoreclr.so`CorUnix::CPalThread::ThreadEntry(pvParam=0x0000555555639620) at thread.cpp:1862
    frame #5: 0x00007ffff79b76db libpthread.so.0`start_thread + 219
    frame #6: 0x00007ffff6da171f libc.so.6`__GI___clone at clone.S:95
  thread #5, name = 'corerun'
    frame #0: 0x00007ffff6d94cb9 libc.so.6`__GI___poll(fds=0x00007fff70008ee0, nfds=1, timeout=-1) at poll.c:29
    frame #1: 0x00007ffff689db13 libcoreclr.so`ipc_poll_fds(fds=0x00007fff70008ee0, nfds=1, timeout=4294967295) at ds-ipc-pal-socket.c:485
    frame #2: 0x00007ffff689d904 libcoreclr.so`ds_ipc_poll(poll_handles_data=0x00007fffecd5bb60, poll_handles_data_len=1, timeout_ms=4294967295, callback=(libcoreclr.so`server_warning_callback(char const*, unsigned int) at ds-server.c:110))(char const*, unsigned int)) at ds-ipc-pal-socket.c:1082
    frame #3: 0x00007ffff6a5177b libcoreclr.so`ds_ipc_stream_factory_get_next_available_stream(callback=(libcoreclr.so`server_warning_callback(char const*, unsigned int) at ds-server.c:110))(char const*, unsigned int)) at ds-ipc.c:395
    frame #4: 0x00007ffff6a4eba7 libcoreclr.so`server_thread(data=0x0000000000000000) at ds-server.c:127
    frame #5: 0x00007ffff6b4d315 libcoreclr.so`CorUnix::CPalThread::ThreadEntry(pvParam=0x0000555555657640) at thread.cpp:1862
    frame #6: 0x00007ffff79b76db libpthread.so.0`start_thread + 219
    frame #7: 0x00007ffff6da171f libc.so.6`__GI___clone at clone.S:95

and without settings set target.env-vars DOTNET_EnableEventLog=1, it works fine:

(lldb) r
Process 1576 launched: './corerun' (x86_64)
Hello World!
Process 1576 exited with status = 0 (0x00000000) 

helloworld.dll was generated with:

$ /runtime/.dotnet/dotnet  new console -o /tmp/helloworld
$ cd /tmp/helloworld
$ /runtime/.dotnet/dotnet build -c Release

@janvorli
Copy link
Member

janvorli commented Aug 5, 2021

I wonder if it is yet another case of a mismatch between standard new operator and our overriden delete[], like we've seen in mscordaccore. Isn't this code by any chance using the STL? That would be bad.

@janvorli
Copy link
Member

janvorli commented Aug 5, 2021

Ok, seems no STL is there. But I can see the problem. The ResizeBuffer is in PAL and as such uses standard new operator. The code that asserts is built on top of PAL though (like runtime) and so it cannot use the standard delete operator and uses our override instead.
The reason why it works in release builds is that we add extra information into the memory around the allocation only in non-release builds. So for release builds, our ClrFree becomes just free and it works.

@janvorli
Copy link
Member

janvorli commented Aug 5, 2021

It seems the right fix would be to implement a function to delete the buffer in the eventproviderhelpers.cpp like the ResizeBuffer and call it instead of the delete. That way both new and delete would be guaranteed to be compatible.

@ghost
Copy link

ghost commented Aug 5, 2021

Tagging subscribers to this area: @tommcdon
See info in area-owners.md if you want to be subscribed.

Issue Details

Based on current main (5bef921), build product and tests in Debug configuration on Ubuntu 18.04 x64:

./build.sh
./src/tests/build.sh -p:LibrariesConfiguration=Debug

Running a simple hello world app works:

cd artifacts/bin/testhost/net6.0-Linux-Debug-x64/shared/Microsoft.NETCore.App/6.0.0
corerun /tmp/helloworld.dll

running the same app with event log enabled fails an assertion:

DOTNET_EnableEventLog=1 corerun /tmp/helloworld.dll

Assert failure(PID 543 [0x0000021f], Thread: 543 [0x021f]): !"Heap contamination detected! HeapFree was called on a heap other than the one that memory was allocated from.\n" "Possible cause: you used new (executable) to allocate the memory, but didn't use DeleteExecutable() to free it."
    File: /runtime/src/coreclr/utilcode/clrhost_nodependencies.cpp Line: 296
    Image: /runtime/artifacts/bin/testhost/net6.0-Linux-Debug-x64/shared/Microsoft.NETCore.App/6.0.0/corerun

corehost, OTOH, has no issue with DOTNET_EnableEventLog=1 (or COMPlus_EnableEventLog=1).

Author: am11
Assignees: -
Labels:

area-Diagnostics-coreclr, untriaged

Milestone: -

@josalem josalem removed the untriaged New issue has not been triaged by the area owner label Aug 5, 2021
@josalem
Copy link
Contributor

josalem commented Aug 5, 2021

To make sure we're on the same page, this is the ResizeBuffer you're talking about?

bool ResizeBuffer(char *&buffer, size_t& size, size_t currLen, size_t newSize, bool &fixedBuffer)
{
newSize = (size_t)(newSize * 1.5);
_ASSERTE(newSize > size); // check for overflow
if (newSize < 32)
newSize = 32;
// We can't use coreclr includes here so we use std::nothrow
// rather than the coreclr version
char *newBuffer = new (std::nothrow) char[newSize];
if (newBuffer == NULL)
return false;
memcpy(newBuffer, buffer, currLen);
if (!fixedBuffer)
delete[] buffer;
buffer = newBuffer;
size = newSize;
fixedBuffer = false;
return true;
}

Looks like EventPipe got around this by having ResizeBuffer re-defined on top of the PAL:

#include "common.h"
#include <stdlib.h>
#include <string.h>
#ifndef TARGET_UNIX
#include <windef.h>
#include <crtdbg.h>
#else
#include "pal.h"
#endif //TARGET_UNIX
bool ResizeBuffer(BYTE *&buffer, size_t& size, size_t currLen, size_t newSize, bool &fixedBuffer)
{
newSize = (size_t)(newSize * 1.5);
_ASSERTE(newSize > size); // check for overflow
if (newSize < 32)
newSize = 32;
BYTE *newBuffer = new (nothrow) BYTE[newSize];
if (newBuffer == NULL)
return false;
memcpy(newBuffer, buffer, currLen);
if (!fixedBuffer)
delete[] buffer;
buffer = newBuffer;
size = newSize;
fixedBuffer = false;
return true;
}

@janvorli
Copy link
Member

janvorli commented Aug 6, 2021

Yes, I was talking about the one in the eventproviderhelpers.cpp. If you added a DeleteBuffer function there and called it instead of using the delete in the generated eventprovdotnetruntime.cpp, it would fix the issue. The way eventpipe did that would work as well, but it seems that the way I've suggested is kind of cleaner.

@josalem
Copy link
Contributor

josalem commented Aug 6, 2021

Thanks for the triage @janvorli. @am11 if you aren't planning to include that change in your PR, I'll go ahead and make it.

@josalem josalem self-assigned this Aug 6, 2021
@josalem josalem added this to the 6.0.0 milestone Aug 6, 2021
@am11
Copy link
Member Author

am11 commented Aug 6, 2021

I am not sure how would the prospective DeleteBuffer look like, since we can neither use ClrFree (with corehost) nor delete (which is a placement operator in corerun). 😅

On unix, based on git grep 'new.*std.*nothrow' ':(top,icase,exclude)*win*', we have only used std::nothrow in two other places one of which uses 'std::unique_ptr' and the other one ArrayHolder (both are based on RAII; automatically destroy object when control goes out of scope):

ArrayHolder<dyld_image_info> imageInfos = new (std::nothrow) dyld_image_info[dyldInfo.infoArrayCount];

std::unique_ptr<GCEvent::Impl> event(new (std::nothrow) GCEvent::Impl(false, initialState));

@janvorli
Copy link
Member

janvorli commented Aug 6, 2021

@am11, I am not sure I understand your question. The new function code would be in src/coreclr/pal/src/eventprovider/lttngprovider/eventproviderhelpers.cpp, so it would use the same delete [] operator as the ResizeBuffer uses. The generated eventprovdotnetruntime.cpp would then call the DeleteBuffer instead of calling delete [] operator.
Right now, one of the functions in the generated code looks like this:

extern "C" ULONG  FireEtXplatGCHeapStats(
    const unsigned __int64 GenerationSize0,
    const unsigned __int64 TotalPromotedSize0,
    const unsigned __int64 GenerationSize1,
    const unsigned __int64 TotalPromotedSize1,
    const unsigned __int64 GenerationSize2,
    const unsigned __int64 TotalPromotedSize2,
    const unsigned __int64 GenerationSize3,
    const unsigned __int64 TotalPromotedSize3,
    const unsigned __int64 FinalizationPromotedSize,
    const unsigned __int64 FinalizationPromotedCount,
    const unsigned int PinnedObjectCount,
    const unsigned int SinkBlockCount,
    const unsigned int GCHandleCount)
{
    if (!EventXplatEnabledGCHeapStats())
        return ERROR_SUCCESS;

    char stackBuffer[92];
    char *buffer = stackBuffer;
    size_t offset = 0;
    size_t size = 92;
    bool fixedBuffer = true;

    bool success = true;
    success &= WriteToBuffer(GenerationSize0, buffer, offset, size, fixedBuffer);
    success &= WriteToBuffer(TotalPromotedSize0, buffer, offset, size, fixedBuffer);
    success &= WriteToBuffer(GenerationSize1, buffer, offset, size, fixedBuffer);
    success &= WriteToBuffer(TotalPromotedSize1, buffer, offset, size, fixedBuffer);
    success &= WriteToBuffer(GenerationSize2, buffer, offset, size, fixedBuffer);
    success &= WriteToBuffer(TotalPromotedSize2, buffer, offset, size, fixedBuffer);
    success &= WriteToBuffer(GenerationSize3, buffer, offset, size, fixedBuffer);
    success &= WriteToBuffer(TotalPromotedSize3, buffer, offset, size, fixedBuffer);
    success &= WriteToBuffer(FinalizationPromotedSize, buffer, offset, size, fixedBuffer);
    success &= WriteToBuffer(FinalizationPromotedCount, buffer, offset, size, fixedBuffer);
    success &= WriteToBuffer(PinnedObjectCount, buffer, offset, size, fixedBuffer);
    success &= WriteToBuffer(SinkBlockCount, buffer, offset, size, fixedBuffer);
    success &= WriteToBuffer(GCHandleCount, buffer, offset, size, fixedBuffer);

    if (!success)
    {
        if (!fixedBuffer)
            delete[] buffer;
        return ERROR_WRITE_FAULT;
    }

    do_tracepoint(DotNETRuntime, GCHeapStats, offset, buffer);

    if (!fixedBuffer)
        delete[] buffer;

    return ERROR_SUCCESS;
}

So instead of

    if (!fixedBuffer)
        delete[] buffer;

it would use

    if (!fixedBuffer)
        DeleteBuffer(buffer);

@am11
Copy link
Member Author

am11 commented Aug 7, 2021

Something like this: am11@f07b947? we get an extra frame for DeleteBuffer, but it still ends up calling ClrFree with corerun:

$ DOTNET_EnableEventLog=1 lldb /runtime/artifacts/tests/coreclr/Linux.x64.Debug/Tests/Core_Root/corerun /tmp/helloworld/bin/release/net6.0/helloworld.dll 
(lldb) target create "/runtime/artifacts/tests/coreclr/Linux.x64.Debug/Tests/Core_Root/corerun"
Current executable set to '/runtime/artifacts/tests/coreclr/Linux.x64.Debug/Tests/Core_Root/corerun' (x86_64).
(lldb) settings set -- target.run-args  "/tmp/helloworld/bin/release/net6.0/helloworld.dll"
(lldb) r
Process 630 launched: '/runtime/artifacts/tests/coreclr/Linux.x64.Debug/Tests/Core_Root/corerun' (x86_64)

Assert failure(PID 630 [0x00000276], Thread: 630 [0x0276]): !"Heap contamination detected! HeapFree was called on a heap other than the one that memory was allocated from.\n" "Possible cause: you used new (executable) to allocate the memory, but didn't use DeleteExecutable() to free it."
    File: /runtime/src/coreclr/utilcode/clrhost_nodependencies.cpp Line: 296
    Image: /runtime/artifacts/tests/coreclr/Linux.x64.Debug/Tests/Core_Root/corerun

Process 630 stopped
* thread #1, name = 'corerun', stop reason = signal SIGTRAP
    frame #0: 0x00007ffff6b52ac9 libcoreclr.so`DBG_DebugBreak at debugbreak.S:9
   6   	
   7   	LEAF_ENTRY DBG_DebugBreak, _TEXT
   8   	        int3
-> 9   	        ret
   10  	LEAF_END_MARKED DBG_DebugBreak, _TEXT
   11  	
(lldb) bt all
* thread #1, name = 'corerun', stop reason = signal SIGTRAP
  * frame #0: 0x00007ffff6b52ac9 libcoreclr.so`DBG_DebugBreak at debugbreak.S:9
    frame #1: 0x00007ffff6aa493b libcoreclr.so`::DebugBreak() at debug.cpp:406
    frame #2: 0x00007ffff6814cd8 libcoreclr.so`::DbgAssertDialog(szFile="/runtime/src/coreclr/utilcode/clrhost_nodependencies.cpp", iLine=296, szExpr="!\"Heap contamination detected! HeapFree was called on a heap other than the one that memory was allocated from.\\n\" \"Possible cause: you used new (executable) to allocate the memory, but didn't use DeleteExecutable() to free it.\"") at debug.cpp:697
    frame #3: 0x00007ffff67c7872 libcoreclr.so`ClrFree(p=0x0000555555628930) at clrhost_nodependencies.cpp:295
    frame #4: 0x00007ffff67c7625 libcoreclr.so`operator delete[](p=0x0000555555628940) at clrhost_nodependencies.cpp:410
    frame #5: 0x00007ffff6bddc67 libcoreclr.so`::DeleteBuffer(buffer=0x0000555555628940) at eventproviderhelpers.cpp:18
    frame #6: 0x00007ffff6bab6b8 libcoreclr.so`::FireEtXplatRuntimeInformationStart(ClrInstanceID=0, Sku=2, BclMajorVersion=6, BclMinorVersion=0, BclBuildNumber=0, BclQfeNumber=0, VMMajorVersion=42, VMMinorVersion=42, VMBuildNumber=42, VMQfeNumber=42424, StartupFlags=8388611, StartupMode='\0', CommandLine=u"", ComObjectGuid=0x00007ffff5faf640, RuntimeDllPath=u"/runtime/artifacts/tests/coreclr/Linux.x64.Debug/Tests/Core_Root/libcoreclr.so") at eventprovdotnetruntime.cpp:4905
    frame #7: 0x00007ffff6664e65 libcoreclr.so`FireEtwRuntimeInformationStart(ClrInstanceID=0, Sku=2, BclMajorVersion=6, BclMinorVersion=0, BclBuildNumber=0, BclQfeNumber=0, VMMajorVersion=42, VMMinorVersion=42, VMBuildNumber=42, VMQfeNumber=42424, StartupFlags=8388611, StartupMode='\0', CommandLine=u"", ComObjectGuid=0x00007ffff5faf640, RuntimeDllPath=u"/runtime/artifacts/tests/coreclr/Linux.x64.Debug/Tests/Core_Root/libcoreclr.so", ActivityId=0x0000000000000000, RelatedActivityId=0x0000000000000000) at clretwallmain.h:2811
    frame #8: 0x00007ffff66515be libcoreclr.so`ETW::InfoLog::RuntimeInformation(type=0) at eventtrace.cpp:5111
    frame #9: 0x00007ffff6a7418f libcoreclr.so`EEStartupHelper() at ceemain.cpp:767
    frame #10: 0x00007ffff6a75bf4 libcoreclr.so`EEStartup(this=0x00007fffffffda18, p=0x0000000000000000)::$_1::operator()(void*) const at ceemain.cpp:1153
    frame #11: 0x00007ffff6a72fcb libcoreclr.so`EEStartup() at ceemain.cpp:1155
    frame #12: 0x00007ffff6a72dd2 libcoreclr.so`EnsureEEStarted() at ceemain.cpp:321
    frame #13: 0x00007ffff621fc1d libcoreclr.so`CorHost2::Start(this=0x0000555555600e10) at corhost.cpp:101
    frame #14: 0x00007ffff6159006 libcoreclr.so`::coreclr_initialize(exePath="/runtime/artifacts/tests/coreclr/Linux.x64.Debug/Tests/Core_Root/corerun", appDomainFriendlyName="corerun", propertyCount=4, propertyKeys=0x000055555558a070, propertyValues=0x000055555558a0f0, hostHandle=0x0000555555566068, domainId=0x0000555555566070) at unixinterface.cpp:251
    frame #15: 0x000055555555f127 corerun`run(config=0x00007fffffffe3f0) at corerun.cpp:352
    frame #16: 0x000055555555c616 corerun`main(argc=2, argv=0x00007fffffffe568) at corerun.cpp:554
    frame #17: 0x00007ffff6ca1bf7 libc.so.6`__libc_start_main(main=(corerun`main at corerun.cpp:546), argc=2, argv=0x00007fffffffe568, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007fffffffe558) at libc-start.c:310
    frame #18: 0x000055555555b02a corerun`_start + 42
  thread #2, name = 'corerun-ust'
    frame #0: 0x00007ffff79c2657 libpthread.so.0`__libc_recvmsg + 71
    frame #1: 0x00007ffff598bc23 liblttng-ust.so.0`ustcomm_recv_unix_sock + 179
    frame #2: 0x00007ffff5991241 liblttng-ust.so.0`___lldb_unnamed_symbol62$$liblttng-ust.so.0 + 2241
    frame #3: 0x00007ffff79b76db libpthread.so.0`start_thread + 219
    frame #4: 0x00007ffff6da171f libc.so.6`__GI___clone at clone.S:95
  thread #3, name = 'corerun-ust'
    frame #0: 0x00007ffff6d9b639 libc.so.6`syscall at syscall.S:38
    frame #1: 0x00007ffff5990bd3 liblttng-ust.so.0`___lldb_unnamed_symbol62$$liblttng-ust.so.0 + 595
    frame #2: 0x00007ffff79b76db libpthread.so.0`start_thread + 219
    frame #3: 0x00007ffff6da171f libc.so.6`__GI___clone at clone.S:95
  thread #4, name = 'corerun'
    frame #0: 0x00007ffff6d94cb9 libc.so.6`__GI___poll(fds=0x00007fffef374a20, nfds=1, timeout=-1) at poll.c:29
    frame #1: 0x00007ffff6b309c9 libcoreclr.so`CorUnix::CPalSynchronizationManager::ReadBytesFromProcessPipe(this=0x00005555555ff630, iTimeout=-1, pRecvBuf="", iBytes=1) at synchmanager.cpp:2233
    frame #2: 0x00007ffff6b2fb88 libcoreclr.so`CorUnix::CPalSynchronizationManager::ReadCmdFromProcessPipe(this=0x00005555555ff630, iPollTimeout=-1, pswcWorkerCmd=0x00007fffef374de4, pshridMarshaledData=0x00007fffef374dd8, pdwData=0x00007fffef374dd4) at synchmanager.cpp:2011
    frame #3: 0x00007ffff6b2e736 libcoreclr.so`CorUnix::CPalSynchronizationManager::WorkerThread(pArg=0x00005555555ff630) at synchmanager.cpp:1714
    frame #4: 0x00007ffff6b4d885 libcoreclr.so`CorUnix::CPalThread::ThreadEntry(pvParam=0x0000555555600350) at thread.cpp:1862
    frame #5: 0x00007ffff79b76db libpthread.so.0`start_thread + 219
    frame #6: 0x00007ffff6da171f libc.so.6`__GI___clone at clone.S:95
  thread #5, name = 'corerun'
    frame #0: 0x00007ffff6d94cb9 libc.so.6`__GI___poll(fds=0x00007fff70008ee0, nfds=1, timeout=-1) at poll.c:29
    frame #1: 0x00007ffff68a4053 libcoreclr.so`ipc_poll_fds(fds=0x00007fff70008ee0, nfds=1, timeout=4294967295) at ds-ipc-pal-socket.c:485
    frame #2: 0x00007ffff68a3e44 libcoreclr.so`ds_ipc_poll(poll_handles_data=0x00007fffeeb73b60, poll_handles_data_len=1, timeout_ms=4294967295, callback=(libcoreclr.so`server_warning_callback(char const*, unsigned int) at ds-server.c:110))(char const*, unsigned int)) at ds-ipc-pal-socket.c:1082
    frame #3: 0x00007ffff6a51ceb libcoreclr.so`ds_ipc_stream_factory_get_next_available_stream(callback=(libcoreclr.so`server_warning_callback(char const*, unsigned int) at ds-server.c:110))(char const*, unsigned int)) at ds-ipc.c:395
    frame #4: 0x00007ffff6a4f117 libcoreclr.so`server_thread(data=0x0000000000000000) at ds-server.c:127
    frame #5: 0x00007ffff6b4d885 libcoreclr.so`CorUnix::CPalThread::ThreadEntry(pvParam=0x000055555561e340) at thread.cpp:1862
    frame #6: 0x00007ffff79b76db libpthread.so.0`start_thread + 219
    frame #7: 0x00007ffff6da171f libc.so.6`__GI___clone at clone.S:95

@josalem
Copy link
Contributor

josalem commented Aug 11, 2021

It looks like you changed the geneventpipe.py script as well. I think the geneventpipe.py script fixed this same issue by redefining ResizeBuffer on top of the PAL:

bool ResizeBuffer(BYTE *&buffer, size_t& size, size_t currLen, size_t newSize, bool &fixedBuffer)
{
newSize = (size_t)(newSize * 1.5);
_ASSERTE(newSize > size); // check for overflow
if (newSize < 32)
newSize = 32;
BYTE *newBuffer = new (nothrow) BYTE[newSize];
if (newBuffer == NULL)
return false;
memcpy(newBuffer, buffer, currLen);
if (!fixedBuffer)
delete[] buffer;
buffer = newBuffer;
size = newSize;
fixedBuffer = false;
return true;
}

Does this still happen if you forward declare DeleteBuffer in genlttngprovider.py like ResizeBuffer is?

@am11
Copy link
Member Author

am11 commented Aug 12, 2021

CMIIW, but from the backtrace, it is hitting our corerun-specific placement delete[] operator:

frame #4: 0x00007ffff67c1625 libcoreclr.so`operator delete[](p=0x0000555555628e60) at clrhost_nodependencies.cpp:410

which we don't want, right? It is hitting with main branch, with DeleteBuffer in pal.h as well as forward declaration of DeleteBuffer approach too (without declaring DeleBuffer in pal.h).

We can try defining the ResizeBuffer in genlttngprovider.py like it is done in geneventpipe.py without DeleteBuffer. I'll try that next.

@am11
Copy link
Member Author

am11 commented Aug 12, 2021

We can try defining the ResizeBuffer in genlttngprovider.py like it is done in geneventpipe.py without DeleteBuffer. I'll try that next.

That didn't work either because geneventpipe.py and genlttngprovider.py seem to have different ways of using PAL. Former seem to generate source which builds as a runtime component (using typedefs from PAL BYTE* and friends). Latter is producing source which is building as part of PAL, where it is allowed to mix PAL and STL headers and they conflict in mysterious ways. I also tried commenting out placement operator in clrhost_nodependencies, which gave other runtime errors (free(): invalid size). I am not sure if I have fully understood the dynamics involved in this case.

if you aren't planning to include that change in your PR, I'll go ahead and make it.

Didn't mean to block; I gave it a fair try, but couldn't make it work. So please go ahead. 😅

@josalem
Copy link
Contributor

josalem commented Aug 12, 2021

Didn't mean to block; I gave it a fair try

Thanks for giving it a try! You did all the things I would have done as well. Seems there may be some extra complexity here. Since this is only happening in debug, I'll try to get it fixed in 6, but this is definitely a candidate for being bumped to 7.

@janvorli does @am11's experience trigger any top-of-mind thoughts for what might be going on?

@josalem josalem modified the milestones: 6.0.0, 7.0.0 Aug 12, 2021
@josalem
Copy link
Contributor

josalem commented Aug 12, 2021

pre-emptively moving this to 7. I'll hopefully be able to get this, but the debug only nature of it lowers the priority.

@am11
Copy link
Member Author

am11 commented Aug 23, 2021

related, just noticed that compiling runtime with gcc-11 gives a lots of these errors on current main branch:

  [ 25%] Building CXX object pal/src/CMakeFiles/coreclrpal.dir/misc/sysinfo.cpp.o
  [ 25%] Building CXX object pal/src/CMakeFiles/coreclrpal.dir/misc/time.cpp.o
  /runtime/artifacts/obj/coreclr/Linux.x64.Debug/vm/eventing/eventpipe/eventpipe/dotnetruntimeprivate.cpp: In function 'ULONG EventPipeWriteEventGCPerHeapHistory(LPCGUID, LPCGUID)':
  /runtime/artifacts/obj/coreclr/Linux.x64.Debug/vm/eventing/eventpipe/eventpipe/dotnetruntimeprivate.cpp:492:22: error: 'void operator delete [](void*)' called on unallocated object 'stackBuffer' [-Werror=free-nonheap-object]
    492 |             delete[] buffer;
        |                      ^~~~~~
  /runtime/artifacts/obj/coreclr/Linux.x64.Debug/vm/eventing/eventpipe/eventpipe/dotnetruntimeprivate.cpp:480:10: note: declared here
    480 |     BYTE stackBuffer[32];
        |          ^~~~~~~~~~~
  /runtime/artifacts/obj/coreclr/Linux.x64.Debug/vm/eventing/eventpipe/eventpipe/dotnetruntimeprivate.cpp:499:18: error: 'void operator delete [](void*)' called on unallocated object 'stackBuffer' [-Werror=free-nonheap-object]
    499 |         delete[] buffer;
        |                  ^~~~~~
  /runtime/artifacts/obj/coreclr/Linux.x64.Debug/vm/eventing/eventpipe/eventpipe/dotnetruntimeprivate.cpp:480:10: note: declared here
    480 |     BYTE stackBuffer[32];
        |          ^~~~~~~~~~~
  /runtime/artifacts/obj/coreclr/Linux.x64.Debug/vm/eventing/eventpipe/eventpipe/dotnetruntimeprivate.cpp: In function 'ULONG EventPipeWriteEventEEStartupStart(LPCGUID, LPCGUID)':
  /runtime/artifacts/obj/coreclr/Linux.x64.Debug/vm/eventing/eventpipe/eventpipe/dotnetruntimeprivate.cpp:1753:22: error: 'void operator delete [](void*)' called on unallocated object 'stackBuffer' [-Werror=free-nonheap-object]
   1753 |             delete[] buffer;
        |                      ^~~~~~
  /runtime/artifacts/obj/coreclr/Linux.x64.Debug/vm/eventing/eventpipe/eventpipe/dotnetruntimeprivate.cpp:1741:10: note: declared here
   1741 |     BYTE stackBuffer[32];
        |          ^~~~~~~~~~~
  /runtime/artifacts/obj/coreclr/Linux.x64.Debug/vm/eventing/eventpipe/eventpipe/dotnetruntimeprivate.cpp:1760:18: error: 'void operator delete [](void*)' called on unallocated object 'stackBuffer' [-Werror=free-nonheap-object]
   1760 |         delete[] buffer;
        |                  ^~~~~~
  /runtime/artifacts/obj/coreclr/Linux.x64.Debug/vm/eventing/eventpipe/eventpipe/dotnetruntimeprivate.cpp:1741:10: note: declared here
   1741 |     BYTE stackBuffer[32];
        |          ^~~~~~~~~~~
  /runtime/artifacts/obj/coreclr/Linux.x64.Debug/vm/eventing/eventpipe/eventpipe/dotnetruntimeprivate.cpp: In function 'ULONG EventPipeWriteEventEEStartupEnd(LPCGUID, LPCGUID)':
  /runtime/artifacts/obj/coreclr/Linux.x64.Debug/vm/eventing/eventpipe/eventpipe/dotnetruntimeprivate.cpp:1826:22: error: 'void operator delete [](void*)' called on unallocated object 'stackBuffer' [-Werror=free-nonheap-object]
   1826 |             delete[] buffer;
        |                      ^~~~~~
  /runtime/artifacts/obj/coreclr/Linux.x64.Debug/vm/eventing/eventpipe/eventpipe/dotnetruntimeprivate.cpp:1814:10: note: declared here
   1814 |     BYTE stackBuffer[32];
        |          ^~~~~~~~~~~

gcc 10 and 9 or clang don't have this analyzer. @janvorli, this seems like a good error to keep un-suppressed.


and a slightly related warning (from same family of new warnings I think):

  [ 28%] Building CXX object vm/wks/CMakeFiles/cee_wks_core.dir/__/codeversion.cpp.o
  [ 28%] Building CXX object vm/wks/CMakeFiles/cee_wks_core.dir/__/contractimpl.cpp.o
  /runtime/src/coreclr/vm/ceeload.cpp: In member function 'void Module::InitializeDynamicILCrst()':
  /runtime/src/coreclr/vm/ceeload.cpp:1662:16: error: 'void operator delete(void*)' called on pointer returned from a mismatched allocation function [-Werror=mismatched-new-delete]
   1662 |         delete pCrst;
        |                ^~~~~
  /runtime/src/coreclr/vm/ceeload.cpp:1658:97: note: returned from 'static void* Crst::operator new(size_t)'
   1658 |     Crst * pCrst = new Crst(CrstDynamicIL, CrstFlags(CRST_UNSAFE_ANYMODE | CRST_DEBUGGER_THREAD));
        |                                                                                                 ^
  /runtime/src/coreclr/vm/ceeload.cpp:1658:97: error: 'void operator delete(void*)' called on pointer returned from a mismatched allocation function [-Werror=mismatched-new-delete]
  /runtime/src/coreclr/vm/ceeload.cpp:1658:97: note: returned from 'static void* Crst::operator new(size_t)'
  [ 28%] Building CXX object vm/wks/CMakeFiles/cee_wks_core.dir/__/corhost.cpp.o

@am11
Copy link
Member Author

am11 commented Sep 2, 2021

related gcc 11 warnings have been disabled:

add_compile_options($<$<COMPILE_LANGUAGE:CXX>:-Wno-mismatched-new-delete>)
add_compile_options($<$<COMPILE_LANGUAGE:CXX>:-Wno-free-nonheap-object>)
add_compile_options($<$<COMPILE_LANGUAGE:CXX>:-Wno-placement-new>)

@AaronRobinsonMSFT
Copy link
Member

AaronRobinsonMSFT commented Sep 3, 2021

@am11 I believe this is the same or similar to #57577.

@am11
Copy link
Member Author

am11 commented Sep 3, 2021

@AaronRobinsonMSFT, yup, these seem to be similar reports of incorrect pairing of {de}allocators in different places.
The good news is that the three gcc 11 warnings are highlighting all places under coreclr on Linux, so we don't have to wait for the next mismatch to be identified by someone. I think if we un-suppress these three warnings and fix all the reported instances, it will fix both issues and make allocations usages robust.

@am11
Copy link
Member Author

am11 commented Sep 3, 2021

@josalem, while going through these compiler warnings, I noticed that in dotnetruntimeprivate.cpp we have generated functions like:

extern "C" ULONG EventPipeWriteEventExecExeEnd(
    LPCGUID ActivityId,
    LPCGUID RelatedActivityId)
{
    if (!EventPipeEventEnabledExecExeEnd())
        return ERROR_SUCCESS;

    BYTE stackBuffer[32];
    BYTE *buffer = stackBuffer;
    size_t offset = 0;
    size_t size = 32;
    bool fixedBuffer = true;
    bool success = true;



    if (!success)
    {
        if (!fixedBuffer)
            delete[] buffer;
        return ERROR_WRITE_FAULT;
    }

    EventPipeAdapter::WriteEvent(EventPipeEventExecExeEnd, (BYTE *)buffer, (unsigned int)offset, ActivityId, RelatedActivityId);

    if (!fixedBuffer)
        delete[] buffer;

    return ERROR_SUCCESS;
}

In this case if (!success) and if (!fixedBuffer) are dead blocks and we seem to have a memory leak as we never delete the buffer (because there was no call to WriteToBuffer.
I am not sure what is the bug, either:

  • we need to fix the condition in python so WriteToBuffer is emitted
  • this is genuinely a dead function, so we can skip generating the whole function.
  • something else?

@josalem
Copy link
Contributor

josalem commented Sep 3, 2021

I think I'm missing what is leaking. The buffer is stack allocated here, no? I would really hope the compiler would remove the dead code blocks and reduce this to just the call to WriteEvent.

You're right though, those appear to be dead code.

@am11
Copy link
Member Author

am11 commented Sep 4, 2021

I think I'm missing what is leaking.

Ah right, then why delete[] buffer at all? it's UB..

hope the compiler would remove the dead code blocks and reduce this to just the call to WriteEvent.

Yes compiler has enough mileage that it will produce fairly highly optimized code, but during the compilation different level of analysis are warning about various issues especially in exported methods, symbols of which we explicitly want to keep (tautological mistakes, dead code, deletion of unallocated buffer (which in grand scheme of things, tautologically never true but still incorrect in its own scope)) etc. is something we should fix. We have just suppressed all such warnings from all supported compilers, rather than fixing the source generator genEventPipe.py and friends, which is likely how the real bugs slipped away.

@josalem
Copy link
Contributor

josalem commented Sep 7, 2021

100% agree. Based on the structure of the generated code, those if statements should be something we know the answer to at generation time.

Ah right, then why delete[] buffer at all? it's UB..

Yeah... I saw that when I was reading the snippet. Luckily, it shouldn't hit the delete[] due to the pre-baked if statements, but still, it's troubling.

I think this is all pointing to the need to go through the generation scripts and update them.

@janvorli
Copy link
Member

I have looked into fixing the delete[] / new discrepancy. I believe the problem is caused by the fact that the new (std::nothrow) is an operator that we never define ourselves, thus the code always calls the one from the standard libraries. However, we do define the global delete[] operator in the VM. And since VM is linked statically with PAL, the linker links the code with our global delete[].
The fix is to use malloc / free instead. That doesn't have the problem I've mentioned. In fact, I've just remembered that I had to remove couple of usages of delete / new from PAL due to this kind of a problem in the past.

@janvorli
Copy link
Member

See the PR where I've removed usages of new / delete here: dotnet/coreclr#1822

@janvorli
Copy link
Member

I would still add the DeleteBuffer function though. I have also noticed that for some reason, the genEventPipe.py uses BYTE* and the other char* for the buffer. It seems it would make sense to unify them. And then there will be no need for the extra ResizeBuffer implementation in the genEventPipe.py generated code. It could use the same one from the eventproviderhelpers.cpp.

@josalem
Copy link
Contributor

josalem commented Sep 14, 2021

new (std::nothrow) is an operator that we never define ourselves

Is this a conscious decision or has it just never been needed before?

When I search for it in the runtime, I see lots of uses of new (nothrow). They seem to mostly be in the DAC and GC, but there are a few in VM as well.

@janvorli
Copy link
Member

That is not std::nothrow, but rather our own "nothrow" constant defined here:

const NoThrow nothrow = { 0 };

and declared here:
extern const NoThrow nothrow;

As such, it is not available in the PAL.

@davmason
Copy link
Contributor

@am11 @janvorli I am going through bugs, I don't see that a fix (at least not the one discussed here) was made, but when I attempt to recreate I do not see the same same error. I am setting DOTNET_EnableEventLog=1 and running a random C# app with a debug corerun.

Is there something I'm missing?

@am11
Copy link
Member Author

am11 commented Jul 14, 2022

@davmason, thanks for testing it, great news! I forgot about this issue. :)

On linux-x64 machine, I cannot reproduce it with latest main branch. I quickly tested it with 5bef921 (from top post; the state of main, when I originally found it) and reproduced on the same machine.

It is very possible that it has been fixed during the SDL compliance work by @AaronRobinsonMSFT and others (#66154).

Closing.

@am11 am11 closed this as completed Jul 14, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Aug 13, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

7 participants