DeepTauId throws (in event with zero taus?) #42444

VinInn · 2023-08-02T07:07:48Z

The original issue #40437 has taken a different path so I reopen here just for this.

Reminder of relevant posts:
#40437 (comment)
#40437 (comment)
#28358 (closed. WHY?)

a log file
https://cms-unified.web.cern.ch/cms-unified/joblogs/haozturk_ACDC0_Run2022D_BTagMu_10Dec2022_221221_171338_6693/8001/DataProcessing/020da37f-6871-4688-a86e-2b7e8a6bc683-26-0-logArchive/job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log

Here is the "smoking gun": when trying to reproduce the failure, the event turns out to have ZERO taus
#40437 (comment)

cmsbuild · 2023-08-02T07:08:04Z

A new Issue was created by @VinInn Vincenzo Innocente.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

VinInn · 2023-08-02T07:31:05Z

I still do not understand how memory corruption can modify the values in an otherwise empty vector (pointer, size, capacity, all 0) without causing a segfault before reaching the throw.
In my opinion there is a legitimate vector there with "something" in it that reasonably fakes a tau.

VinInn · 2023-08-02T10:22:18Z

The tau collection is a view: no way to survive even the first dereference.

slava77 · 2023-08-02T12:23:01Z

It sounds like there is also a reproducibility problem.
The PR tests cover the single-thread cases somewhat (even though not on the right SSE-only hardware).

Could it be a multithreaded reproducibility issue?
Perhaps just running reco comparisons on MT job outputs is enough to get an idea.

makortel · 2023-08-02T12:51:24Z

assign reconstruction

cmsbuild · 2023-08-02T12:51:41Z

New categories assigned: reconstruction

@mandrenguyen,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel · 2023-11-16T20:32:18Z

Let me tag @cms-sw/tau-pog-l2 also in this issue

makortel · 2023-11-16T20:33:28Z

Just for the record, I'm running into this issue while trying to run an HLT job through VTune. Without VTune the configuration works, but within VTune the job crashes in a few different ways, this exception being one of them.

makortel · 2023-11-17T01:16:38Z

Running VTune on a single-thread job in ASAN build resulted in

AddressSanitizer: CHECK failed: asan_allocator.cpp:188 "((old)) == ((kAllocBegMagic))" (0x302030303a303020, 0xcc6e96b9cc6e96b9) (tid=2772)
    #0 0x7fddac36d4ba in CheckUnwind ../../../../libsanitizer/asan/asan_rtl.cpp:67
    #1 0x7fddac38de65 in __sanitizer::CheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) ../../../../libsanitizer/sanitizer_common/sanitizer_termination.cpp:86
    #2 0x7fddac2d845b in __asan::LargeChunkHeader::Set(__asan::AsanChunk*) ../../../../libsanitizer/asan/asan_allocator.cpp:188
    #3 0x7fddac2d845b in __asan::LargeChunkHeader::Set(__asan::AsanChunk*) ../../../../libsanitizer/asan/asan_allocator.cpp:178
    #4 0x7fddac2d845b in __asan::QuarantineCallback::Recycle(__asan::AsanChunk*) ../../../../libsanitizer/asan/asan_allocator.cpp:204
    #5 0x7fddac2d8655 in __sanitizer::Quarantine<__asan::QuarantineCallback, __asan::AsanChunk>::DoRecycle(__sanitizer::QuarantineCache<__asan::QuarantineCallback>*, __asan::QuarantineCallback) ../../../../libsanitizer/sanitizer_common/sanitizer_quarantine.h:193
    #6 0x7fddac2d8b3d in __sanitizer::Quarantine<__asan::QuarantineCallback, __asan::AsanChunk>::Recycle(unsigned long, __asan::QuarantineCallback) ../../../../libsanitizer/sanitizer_common/sanitizer_quarantine.h:181
    #7 0x7fddac2d411c in __sanitizer::Quarantine<__asan::QuarantineCallback, __asan::AsanChunk>::Put(__sanitizer::QuarantineCache<__asan::QuarantineCallback>*, __asan::QuarantineCallback, __asan::AsanChunk*, unsigned long) ../../../../libsanitizer/sanitizer_common/sanitizer_quarantine.h:112
    #8 0x7fddac2d411c in __asan::Allocator::QuarantineChunk(__asan::AsanChunk*, void*, __sanitizer::BufferedStackTrace*) ../../../../libsanitizer/asan/asan_allocator.cpp:665
    #9 0x7fddac3655ad in operator delete(void*, unsigned long) ../../../../libsanitizer/asan/asan_new_delete.cpp:164
    #10 0x7fdd6f44320f in std::_Sp_counted_ptr_inplace<BasicSingleTrajectoryState, churn_allocator<BasicSingleTrajectoryState>, (__gnu_cxx::_Lock_policy)2>::_M_destroy() (/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_ASAN_X_2023-11-15-2300/lib/el8_amd64_gcc12/libTrackingToolsTrajectoryState.so+0x5e20f)
    #11 0x7fdc6843beed in CkfTrajectoryBuilder::limitedCandidates(std::shared_ptr<TrajectorySeed const> const&, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&, std::vector<Trajectory, std::allocator<Trajectory> >&) const (/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_ASAN_X_2023-11-15-2300/lib/el8_amd64_gcc12/libRecoTrackerCkfPattern.so+0xf4eed)
    #12 0x7fdc6843fbb6 in CkfTrajectoryBuilder::limitedCandidates(TrajectorySeed const&, TempTrajectory&, std::vector<Trajectory, std::allocator<Trajectory> >&) const (/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_ASAN_X_2023-11-15-2300/lib/el8_amd64_gcc12/libRecoTrackerCkfPattern.so+0xf8bb6)
    #13 0x7fdc6844079c in CkfTrajectoryBuilder::buildTrajectories(TrajectorySeed const&, std::vector<Trajectory, std::allocator<Trajectory> >&, unsigned int&, TrajectoryFilter const*) const (/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_ASAN_X_2023-11-15-2300/lib/el8_amd64_gcc12/libRecoTrackerCkfPattern.so+0xf979c)
    #14 0x7fdc683c9306 in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&)::{lambda(unsigned long)#1}::operator()(unsigned long) const (/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_ASAN_X_2023-11-15-2300/lib/el8_amd64_gcc12/libRecoTrackerCkfPattern.so+0x82306)
    #15 0x7fdc683d04fe in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&) (/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_ASAN_X_2023-11-15-2300/lib/el8_amd64_gcc12/libRecoTrackerCkfPattern.so+0x894fe)
    #16 0x7fddac0f6dd0 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) (/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_ASAN_X_2023-11-15-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so+0x986dd0)

That hints towards ASAN's own data structures being overwritten.
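One way to look at the two values in the CHECK message is to print their raw bytes. The following is a minimal sketch, assuming the little-endian byte order of x86_64; the helper name `as_ascii` is made up for illustration. It only inspects the numbers printed above and is not a diagnosis of where the overwrite came from.

```python
def as_ascii(value: int, width: int = 8) -> str:
    """Show an integer as it sits in little-endian memory; non-printable bytes become '.'."""
    raw = value.to_bytes(width, "little")
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


# First value in the CHECK message: what ASAN found in the chunk header.
found = 0x302030303A303020
# Second value in the CHECK message: the magic ASAN expected (kAllocBegMagic).
expected = 0xCC6E96B9CC6E96B9

print(repr(as_ascii(found)))     # ' 00:00 0'
print(repr(as_ascii(expected)))  # '..n...n.'
```

The value found in place of the magic is entirely printable ASCII, which would be consistent with the header having been overwritten by text. The string resembles the device and inode columns of a `/proc/<pid>/maps` line, but whether that is really the origin of these bytes is not established here.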

makortel · 2023-11-17T20:26:57Z

Running VTune on a single-thread job of cmsRunGlibC resulted in

*** Error in `cmsRunGlibC': corrupted size vs. prev_size: 0x0000000047f1cc00 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7f474)[0x7f8d1b99f474]
/lib64/libc.so.6(+0x816a4)[0x7f8d1b9a16a4]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0x9eadc44)[0x7f8c7f1a0c44]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0x9eae8bb)[0x7f8c7f1a18bb]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0x9fd2efd)[0x7f8c7f2c5efd]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0x9af7e27)[0x7f8c7edeae27]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0x97995d3)[0x7f8c7ea8c5d3]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0x71123b7)[0x7f8c7c4053b7]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZNK5Eigen30TensorContractionEvaluatorBaseINS_15TensorEvaluatorIKNS_19TensorContractionOpIKNS_5arrayINS_9IndexPairIlEELm1EEEKNS_17TensorReshapingOpIKNS_6DSizesIlLi2EEEKNS_18TensorImagePatchOpILln1ELln1EKNS_9TensorMapINS_6TensorIKfLi4ELi1ElEELi16ENS_11MakePointerEEEEEEEKNS8_ISB_SJ_EEKN10tensorflow33LaunchFusedConv2DWithOutputKernelIfE19OutputKernelWrapperEEENS_16ThreadPoolDeviceEEEE15evalGemmPartialILb1ELb1ELb0ELi0ELb1EEEvPflli+0x7ad)[0x7f8c7c455c6d]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN5Eigen8internal14TensorExecutorIKNS_14TensorAssignOpINS_9TensorMapINS_6TensorIfLi4ELi1ElEELi16ENS_11MakePointerEEEKNS_17TensorReshapingOpIKNS_6DSizesIlLi4EEEKNS_19TensorContractionOpIKNS_5arrayINS_9IndexPairIlEELm1EEEKNS8_IKNS9_IlLi2EEEKNS_18TensorImagePatchOpILln1ELln1EKNS3_INS4_IKfLi4ELi1ElEELi16ES6_EEEEEEKNS8_ISJ_SO_EEKN10tensorflow33LaunchFusedConv2DWithOutputKernelIfE19OutputKernelWrapperEEEEEEENS_16ThreadPoolDeviceELb1ELNS0_15TiledEvaluationE0EE3runERS15_RKS16_+0x95)[0x7f8c7c487365]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0x7195164)[0x7f8c7c488164]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN10tensorflow33LaunchFusedConv2DWithOutputKernelIfEclINS_19BiasAddOutputKernelIfNS_8IdentityEEEEEvRKT_PNS_15OpKernelContextERKNS_6TensorESD_PSB_+0x1ed)[0x7f8c7c4896ad]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN10tensorflow13FusedConv2DOpIN5Eigen16ThreadPoolDeviceEfE7ComputeEPNS_15OpKernelContext[39/1803]0x7f8c7c48e1e5]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(_ZN10tensorflow16ThreadPoolDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x4b)[0x7f8c74ae2f6b]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x1258ef1)[0x7f8c74bd0ef1]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x1259fd6)[0x7f8c74bd1fd6]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN3tsl6thread10ThreadPool8ScheduleESt8functionIFvvEE+0xfc)[0x7f8c81c8c99c]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0xcad7d78)[0x7f8c81dcad78]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x123e924)[0x7f8c74bb6924]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x124436e)[0x7f8c74bbc36e]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x124dfa5)[0x7f8c74bc5fa5]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x125820c)[0x7f8c74bd020c]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x1259fd6)[0x7f8c74bd1fd6]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN3tsl6thread10ThreadPool8ScheduleESt8functionIFvvEE+0xfc)[0x7f8c81c8c99c]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0xcad7d78)[0x7f8c81dcad78]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x1244271)[0x7f8c74bbc271]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x1248f08)[0x7f8c74bc0f08]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN10tensorflow13DirectSession11RunInternalElRKNS_10RunOptionsEPNS_18CallFrameInterfaceEPNS0_16ExecutorsAndKeysEPNS_11RunMetadataERKN3tsl6thread17ThreadPoolOptionsE+0x7fa)[0x7f8c81ddd61a]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN10tensorflow13DirectSession3RunERKNS_10RunOptionsERKSt6vectorISt4pairINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_6TensorEESaISD_EERKS4_ISB_SaISB_EESL_PS4_ISC_SaISC_EEPNS_11RunMetadataERKN3tsl6thread17ThreadPoolOptionsE+0x96e)[0x7f8c81de034e]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libPhysicsToolsTensorFlow.so(_ZN10tensorflow3runEPNS_7SessionERKSt6vectorISt4pairINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_6TensorEESaISB_EERKS2_IS9_SaIS9_EEPS2_ISA_SaISA_EERKN3tsl6thread17ThreadPoolOptionsE+0xa5)[0x7f8c876b54d5]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libPhysicsToolsTensorFlow.so(_ZN10tensorflow3runEPNS_7SessionERKSt6vectorISt4pairINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_6TensorEESaISB_EERKS2_IS9_SaIS9_EEPS2_ISA_SaISA_EEPN3tsl6thread19ThreadPoolInterfaceE+0x19)[0x7f8c876b5589]
/build/mkortela/debug/hlt/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginRecoTauTagRecoTauPlugins.so(+0xa729a)[0x7f8bf985f29a]
/build/mkortela/debug/hlt/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginRecoTauTagRecoTauPlugins.so(+0x9bb74)[0x7f8bf9853b74]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(_ZN3edm6stream21EDProducerAdaptorBase7doEventERKNS_19EventTransitionInfoEPNS_16ActivityRegistryEPKNS_20ModuleCallingContextE+0x141)[0x7f8d1e8f54b1]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(+0x1b8a19)[0x7f8d1e865a19]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(+0x1b8fb4)[0x7f8d1e865fb4]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreConcurrency.so(+0x5f28)[0x7f8d253faf28]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtbb.so.12(+0x28083)[0x7f8d1cb4c083]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(+0x13d21b)[0x7f8d1e7ea21b]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(_ZN3edm14EventProcessor11processRunsEv+0x205)[0x7f8d1e7f3bc5]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(_ZN3edm14EventProcessor15runToCompletionEv+0x23f)[0x7f8d1e7f418f]
cmsRunGlibC[0x4074ef]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtbb.so.12(+0x149bd)[0x7f8d1cb389bd]
cmsRunGlibC[0x408ed2]
cmsRunGlibC(main+0x14c)[0x40518c]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f8d1b942555]
cmsRunGlibC[0x405451]
======= Memory map: ========
00400000-00404000 r--p 00000000 00:31 245399302                          /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
00404000-00405000 r-xp 00004000 00:31 245399302                          /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
00405000-0040a000 r-xp 00005000 00:31 245399302                          /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
0040a000-0040d000 r--p 0000a000 00:31 245399302                          /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
0040d000-0040e000 r--p 0000c000 00:31 245399302                          /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
0040e000-0040f000 rw-p 0000d000 00:31 245399302                          /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
0040f000-00410000 rwxp 00000000 00:00 0
00419000-4d534000 rw-p 00000000 00:00 0                                  [heap]
7f8bd8000000-7f8bd8021000 rw-p 00000000 00:00 0
7f8bd8021000-7f8bdc000000 ---p 00000000 00:00 0
7f8bdee6c000-7f8be100b000 rw-p 00000000 00:00 0
7f8be305c000-7f8be314c000 rwxp 00000000 00:00 0
7f8be314c000-7f8be8000000 rw-p 00000000 00:00 0
7f8be8000000-7f8be8021000 rw-p 00000000 00:00 0
7f8be8021000-7f8bec000000 ---p 00000000 00:00 0
7f8bec03d000-7f8bec35d000 rwxp 00000000 00:00 0
7f8bec35d000-7f8bec379000 rw-p 00000000 00:00 0
7f8bec379000-7f8bec3c9000 rwxp 00000000 00:00 0
7f8bec3e5000-7f8bec435000 rwxp 00000000 00:00 0
7f8bec444000-7f8bec44a000 r--p 00000000 00:31 231860432                  /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginTrackingToolsTrajectoryCleaningPlugins.so
7f8bec450000-7f8bec4a0000 rwxp 00000000 00:00 0
7f8bec4a0000-7f8bec4a2000 r--p 00000000 00:31 231860432                  /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginTrackingToolsTrajectoryCleaningPlugins.so
7f8bec4a2000-7f8bec4a3000 r-xp 00002000 00:31 231860432                  /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginTrackingToolsTrajectoryCleaningPlugins.so
7f8bec4a3000-7f8bec4a4000 r--p 00003000 00:31 231860432                  /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginTrackingToolsTrajectoryCleaningPlugins.so
7f8bec4a4000-7f8bec4a5000 r--p 00003000 00:31 231860432                  /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginTrackingToolsTrajectoryCleaningPlugins.so
7f8bec4a5000-7f8bec4a6000 rw-p 00004000 00:31 231860432                  /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginTrackingToolsTrajectoryCleaningPlugins.so
7f8bec4a6000-7f8bec507000 rwxp 00000000 00:00 0
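To check which mapping the address in the glibc message (`0x0000000047f1cc00`) belongs to, the memory map can be searched programmatically. This is a minimal sketch, assuming the `start-end perms offset dev inode [path]` layout shown above; the function names are illustrative, and only two lines of the map are copied in.

```python
from typing import Optional, Tuple


def parse_maps_line(line: str) -> Tuple[int, int, str]:
    """Split one memory-map line into (start, end, path); path is '' for anonymous regions."""
    fields = line.split(None, 5)
    start, end = (int(part, 16) for part in fields[0].split("-"))
    path = fields[5].strip() if len(fields) > 5 else ""
    return start, end, path


def find_mapping(address: int, maps_text: str) -> Optional[str]:
    """Return the path (or '(anonymous)') of the region containing address, else None."""
    for line in maps_text.splitlines():
        if not line.strip():
            continue
        start, end, path = parse_maps_line(line)
        if start <= address < end:
            return path or "(anonymous)"
    return None


# Two lines copied from the memory map above.
MAPS = """\
0040f000-00410000 rwxp 00000000 00:00 0
00419000-4d534000 rw-p 00000000 00:00 0                                  [heap]
"""

print(find_mapping(0x0000000047F1CC00, MAPS))  # [heap]
```

The reported chunk lies inside the main `[heap]` region, as expected for a glibc malloc consistency check.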

makortel · 2023-11-17T21:25:30Z

Looks like memory corruption inside Tensorflow is not unheard of

(no idea if any of these are in any way related)

makortel · 2023-11-17T21:29:16Z

We updated tensorflow to 2.12 in 13_3_0_pre1 (built on Aug 3). I wonder if there would be any temporal correlation with the RECO profiling crashes?

makortel · 2023-11-17T21:36:26Z

At least in CMSSW_13_2_X_2023-11-17-1100 the IgProf job on step3 of workflow 136.889 succeeds, whereas in CMSSW_13_3_X_2023-11-17-1100 the job crashes (but of course there are many other changes in between)

makortel · 2023-11-17T21:38:46Z

(maybe I'm going in the wrong direction, given that DeepTauId throwing an exception on NaN has been known for several years, see #28358)

makortel · 2023-11-22T16:53:58Z

In my HLT menu test case with VTune, if I "disable" the two DeepTauId modules by adding an always-rejecting EDFilter before them, the job succeeds. Enabling either one, or both, of the DeepTauId modules makes the job crash most of the time.

kandrosov · 2023-11-23T08:29:28Z

Hi @makortel. By "succeed", do you mean that the job just finishes without crashing, or that the outputs are always identical? Especially for other taggers that use TF. In the past, we couldn't conclude with 100% certainty whether the memory corruption happens in the DeepTau module or occurs elsewhere and corrupts DeepTau-related memory. Is the machine on which the crash can be reproduced accessible to normal users? If yes, could you please share instructions on how to reproduce the crash?

makortel · 2023-11-29T19:42:38Z

Hi @kandrosov

By "succeed", do you mean just to finish without crashing, or the outputs are always identical?

I mean "finishes without crashing". I'm not checking the outputs in any way.

I sent you privately the HLT recipe that I used. By adding an event-rejecting EDFilter before/after the DeepTauId modules in the menu I see the following behavior:

if the EDFilter is right before DeepTauId module, the VTune profiling job (technically) succeeds
if the EDFilter is right after DeepTauId module, the VTune profiling job crashes

I also started looking into step3 of workflow 136.889, which is used in the IgProf profiling in IBs (and which is currently crashing in 13_3_X and 14_0_X). Running step3 as it is indeed also crashes in VTune, in a way that looks like memory corruption. If I run everything up to and including the DeepTauId modules there (*), the job still crashes. But, contrary to the HLT test case, if I run everything up to the modules DeepTauId consumes (**), but not DeepTauId, the job crashes as well. This behavior would indicate that something other than (or maybe in addition to?) DeepTauId causes memory corruption when the job is run through VTune.

I'll continue to investigate (albeit slowly).

(*) by adding a snippet

process.deepTauSequence = cms.Sequence(process.deepTau2017v2p1ForMini+process.deepTau2018v2p5ForMini)
process.tauPath = cms.Path(process.deepTauSequence, process.patAlgosToolsTask, process.patTask)
process.schedule = cms.Schedule(
    process.raw2digi_step,
    process.reconstruction_step,
    process.tauPath
)

to the end of the step3 configuration

(**) by adding a snippet

process.deepTauSequence = cms.Sequence(
    process.offlineSlimmedPrimaryVertices
    + process.packedPFCandidates
    + process.hpsPFTauTransverseImpactParameters
    + process.slimmedTausNoDeepIDs
    + process.slimmedElectrons
    + process.slimmedMuons
)
process.tauPath = cms.Path(process.deepTauSequence, process.patAlgosToolsTask, process.patTask)
process.schedule = cms.Schedule(
    process.raw2digi_step,
    process.reconstruction_step,
    process.tauPath
)

to the end of the step3 configuration

makortel · 2023-11-29T23:27:59Z

With step3 of 136.889 I got to the point that running initialStep and everything it consumes crashes (under VTune), while running any of the modules initialStep consumes individually technically works. The smoking gun here is that the initialStep module is of type TrackTfClassifier, which uses Tensorflow. The stack trace of the working thread contains a few interesting frames, even if it is mostly corrupted

Thread 1 (Thread 0x7f2b0d5aa740 (LWP 21048) "cmsRun"):
#0  0x00007f2b0f464ddd in poll () from /lib64/libc.so.6
#1  0x00007f2b16f75841 in ?? ()
#2  0x01007f2b0d5a4858 in ?? ()
#3  0x00007ffc13af4400 in ?? ()
#4  0x00007ffc13af4920 in ?? ()
#5  0x00007ffc13af4410 in ?? ()
#6  0x00007ffc13af44a0 in ?? ()
#7  0x00007f2b176b1060 in ?? ()
#8  0x0000002100000001 in ?? ()
#9  0x00007ffc13af4390 in ?? ()
#10 0x00007f2b0d582240 in ?? ()
#11 0x00007f2b13d241c0 in ?? ()
#12 0x000000006567babf in ?? ()
#13 0x0000000000098c43 in ?? ()
#14 0x00007f2a1c780320 in ?? ()
#15 0x00007f2a71cc753b in tensorflow::SimplePropagatorState::SimplePropagatorState(tensorflow::ImmutableExecutorState const&, long, tensorflow::ImmutableExecutorState::FrameInfo const&, bool) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2
#16 0x00007f2ad712db5f in full_read.constprop () from /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginFWCoreServicesPlugins.so
#17 0x00007f2ad70e650c in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginFWCoreServicesPlugins.so
#18 0x00007f2ad70e6e70 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginFWCoreServicesPlugins.so
#19 0x00007f2b1715b0fe in ?? ()
#20 0x00007ffc13af5730 in ?? ()
#21 0x00007f2b15106698 in ?? ()
#22 0x000000000000000b in ?? ()
#23 0x00007ffc13af5730 in ?? ()
#24 0x00007ffc13af5170 in ?? ()
#25 0x00007f2b15106060 in ?? ()
#26 0x00007ffc13af5270 in ?? ()
#27 0x00007f2b1715c416 in ?? ()
#28 0x000000000000000b in ?? ()
#29 0x00007ffc13af5600 in ?? ()
#30 0x0000000000000000 in ?? ()

Recipe to reproduce (at CERN) is more or less

cmsrel CMSSW_14_0_0_pre0
cd CMSSW_14_0_0_pre0/src
cmsenv
runTheMatrix -l 136.889 --ibeos
cd 136.889_RunMET2018D
cat >>step3_RAW2DIGI_L1Reco_RECO_SKIM_PAT_ALCA_DQM.py <<EOF
def make(name):
    print(f"Only up to {name}")
    process.testSequence = cms.Sequence(getattr(process, name))
make("initialStep") # should crash
# immediate dependencies of initialStep
#make("offlineBeamSpot")
#make("firstStepPrimaryVertices")
#make("initialStepTracks")
process.testPath = cms.Path(process.testSequence, process.patAlgosToolsTask, process.patTask)
process.schedule = cms.Schedule(process.raw2digi_step, process.reconstruction_step, process.testPath)
EOF

source /cvmfs/projects.cern.ch/intelsw/oneAPI/linux/x86_64/2023/vtune/2023.2.0/vtune-vars.sh
vtune -collect hotspots -- cmsRun step3_RAW2DIGI_L1Reco_RECO_SKIM_PAT_ALCA_DQM.py
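The manual toggling of the `make(...)` lines in the recipe above could also be scripted by generating one configuration fragment per candidate module. The following is a minimal sketch under that assumption; `TEMPLATE`, `CANDIDATES`, `snippet_for` and `build_variants` are hypothetical names, and the module names are the ones listed in the recipe. It only produces the text to append and does not run `cmsRun` itself.

```python
from typing import Dict, Iterable

# Fragment appended to the step3 configuration, mirroring the recipe above.
TEMPLATE = """\
process.testSequence = cms.Sequence(process.{name})
process.testPath = cms.Path(process.testSequence, process.patAlgosToolsTask, process.patTask)
process.schedule = cms.Schedule(process.raw2digi_step, process.reconstruction_step, process.testPath)
"""

# initialStep and its immediate dependencies, as named in the recipe.
CANDIDATES = ["initialStep", "offlineBeamSpot", "firstStepPrimaryVertices", "initialStepTracks"]


def snippet_for(name: str) -> str:
    """Return the fragment that limits the job to one module (plus what it consumes)."""
    if not name.isidentifier():
        raise ValueError(f"not a valid module label: {name!r}")
    return TEMPLATE.format(name=name)


def build_variants(base_config: str, names: Iterable[str]) -> Dict[str, str]:
    """Map each module label to the full configuration text for that test."""
    return {name: base_config.rstrip("\n") + "\n" + snippet_for(name) for name in names}


variants = build_variants("# contents of the original step3 configuration\n", CANDIDATES)
print(variants["initialStep"])
```

Each variant can then be written to its own file and run under the profiler, so that the outcome of every candidate is recorded in the same way.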

makortel · 2023-11-29T23:33:59Z

To be again more specific for DeepTauId, I took workflow 140.201 that runs re-MINI, and was able to reproduce the behavior I saw in the earlier HLT job, i.e. including DeepTauId crashes (under VTune), but running only up to the modules consumed by the DeepTauId modules technically works.

Recipe to reproduce at CERN

cmsrel CMSSW_14_0_0_pre0
cd CMSSW_14_0_0_pre0/src
cmsenv
runTheMatrix -l 140.201 --ibeos
cd 140.201_RunJetMET2022D_reMINI
cat >>step2_PAT.py <<EOF
process.deepTauSequence = cms.Sequence(process.deepTau2017v2p1ForMini+process.deepTau2018v2p5ForMini)
# uncomment for the modules consumed by DeepTauId
#process.deepTauSequence = cms.Sequence(process.offlineSlimmedPrimaryVertices+process.packedPFCandidates+process.slimmedTausNoDeepIDs+process.slimmedElectrons+process.slimmedMuons)
process.tauPath = cms.Path(process.deepTauSequence, process.patAlgosToolsTask, process.patTask)
process.schedule = cms.Schedule(process.tauPath)
EOF

source /cvmfs/projects.cern.ch/intelsw/oneAPI/linux/x86_64/2023/vtune/2023.2.0/vtune-vars.sh
export SITECONFIG_PATH=/cvmfs/cms-ib.cern.ch/SITECONF/local
vtune -collect hotspots -- cmsRun step2_PAT.py

makortel · 2023-11-30T15:18:04Z

I repeated my tests in #42444 (comment) and #42444 (comment) with IgProf (*), and see exactly the same behavior. I.e. running up to the modules that initialStep / DeepTauId consume technically works, but running initialStep / DeepTauId results in a crash.

(*) replace

source /cvmfs/projects.cern.ch/intelsw/oneAPI/linux/x86_64/2023/vtune/2023.2.0/vtune-vars.sh
vtune -collect hotspots -- <command>

with

igprof -d -t cmsRun -pp -z -o igprof.perf.gz <command>

FYI @gartung . I think we should add crash detection to the IB profiling jobs. Something along the lines of: after detecting a crash in some step of the workflow, do not run the subsequent steps, and communicate the failure to the IB dashboard (e.g. similarly to the HLT validation tests that show a red background when there are failures).
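The crash-detection idea could look roughly like the following. This is a minimal sketch, not how the IB profiling scripts are actually written: `run_steps` is a hypothetical helper, the demo commands are stand-ins for the real workflow steps, and reporting to the dashboard is left out because its interface is not described here.

```python
import subprocess
import sys
from typing import List, Sequence, Tuple


def run_steps(steps: Sequence[Sequence[str]]) -> Tuple[bool, List[Tuple[int, int]]]:
    """Run commands in order and stop at the first one that exits with a non-zero code.

    Returns (all_ok, [(step_index, returncode), ...]) for the steps that were run.
    On POSIX a negative returncode means the process was killed by that signal.
    """
    results: List[Tuple[int, int]] = []
    for index, command in enumerate(steps):
        returncode = subprocess.run(list(command)).returncode
        results.append((index, returncode))
        if returncode != 0:
            return False, results  # skip all subsequent steps
    return True, results


if __name__ == "__main__":
    # Stand-in commands: the second one fails, so the third must not run.
    demo = [
        [sys.executable, "-c", "print('step 0 ok')"],
        [sys.executable, "-c", "import sys; sys.exit(134)"],
        [sys.executable, "-c", "print('step 2 should not run')"],
    ]
    ok, results = run_steps(demo)
    if ok:
        print("all steps succeeded")
    else:
        index, code = results[-1]
        print(f"stopped after step {index} with exit code {code}")
```

The returned list gives the failing step and its exit code, which is the information a dashboard would need to flag the workflow.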

makortel · 2023-11-30T20:17:18Z

With IgProf I managed to run it together with Valgrind, and got (on 136.889 step 3 customized to run initialStep and its dependencies)

valgrind: m_mallocfree.c:278 (mk_plain_bszB): Assertion 'bszB != 0' failed.
valgrind: This is probably caused by your program erroneously writing past the
end of a heap block and corrupting heap metadata.  If you fix any
invalid writes reported by Memcheck, this assertion failure will
probably go away.  Please try that before reporting this as a bug.


host stacktrace:
==11218==    at 0x5803F638: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218==    by 0x5803F757: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218==    by 0x5803F8DE: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218==    by 0x58046B96: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218==    by 0x58047679: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218==    by 0x5800485C: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218==    by 0x58004CE7: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218==    by 0x58004EB9: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218==    by 0x580933DB: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218==    by 0x580DC945: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable (lwpid 11218)
==11218==    at 0x402EF11: operator new(unsigned long) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==11218==    by 0x55DEC210: tensorflow::OpKernelContext::allocate_output(int, tensorflow::TensorShape const&, tensorflow::Tensor**, tsl::AllocatorAttributes) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x55DEC6D0: tensorflow::OpKernelContext::allocate_output(int, tensorflow::TensorShape const&, tensorflow::Tensor**) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x4A65B582: tensorflow::FusedMatMulOp<Eigen::ThreadPoolDevice, float>::Compute(tensorflow::OpKernelContext*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_cc.so.2.12.0)
==11218==    by 0x5606CF6A: tensorflow::ThreadPoolDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x5615AEF0: tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::ProcessInline(tensorflow::SimplePropagatorState::TaggedNodeReadyQueue*, long) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x5615BFD5: tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::Process(tensorflow::SimplePropagatorState::TaggedNode, long) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x4F8BC99B: tsl::thread::ThreadPool::Schedule(std::function<void ()>) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_cc.so.2.12.0)
==11218==    by 0x4F9FAD77: std::_Function_handler<void (std::function<void ()>), tensorflow::DirectSession::RunInternal(long, tensorflow::RunOptions const&, tensorflow::CallFrameInterface*, tensorflow::DirectSession::ExecutorsAndKeys*, tensorflow::RunMetadata*, tsl::thread::ThreadPoolOptions const&)::{lambda(std::function<void ()>)#6}>::_M_invoke(std::_Any_data const&, std::function<void ()>&&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_cc.so.2.12.0)
==11218==    by 0x56140923: void tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::RunTask<std::_Bind<void (tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::*(tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>*, tensorflow::SimplePropagatorState::TaggedNode, long))(tensorflow::SimplePropagatorState::TaggedNode, long)> >(std::_Bind<void (tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::*(tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>*, tensorflow::SimplePropagatorState::TaggedNode, long))(tensorflow::SimplePropagatorState::TaggedNode, long)>&&, int) [clone .constprop.0] (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x5614636D: tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::ScheduleReady(absl::lts_20220623::InlinedVector<tensorflow::SimplePropagatorState::TaggedNode, 8ul, std::allocator<tensorflow::SimplePropagatorState::TaggedNode> >*, tensorflow::SimplePropagatorState::TaggedNodeReadyQueue*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x5614FFA4: tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::NodeDone(tsl::Status const&, absl::lts_20220623::InlinedVector<tensorflow::SimplePropagatorState::TaggedNode, 8ul, std::allocator<tensorflow::SimplePropagatorState::TaggedNode> >*, tensorflow::NodeExecStatsInterface*, tensorflow::SimplePropagatorState::TaggedNodeReadyQueue*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x5615A20B: tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::ProcessInline(tensorflow::SimplePropagatorState::TaggedNodeReadyQueue*, long) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x5615BFD5: tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::Process(tensorflow::SimplePropagatorState::TaggedNode, long) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x4F8BC99B: tsl::thread::ThreadPool::Schedule(std::function<void ()>) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_cc.so.2.12.0)
==11218==    by 0x4F9FAD77: std::_Function_handler<void (std::function<void ()>), tensorflow::DirectSession::RunInternal(long, tensorflow::RunOptions const&, tensorflow::CallFrameInterface*, tensorflow::DirectSession::ExecutorsAndKeys*, tensorflow::RunMetadata*, tsl::thread::ThreadPoolOptions const&)::{lambda(std::function<void ()>)#6}>::_M_invoke(std::_Any_data const&, std::function<void ()>&&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_cc.so.2.12.0)
==11218==    by 0x56146270: tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::ScheduleReady(absl::lts_20220623::InlinedVector<tensorflow::SimplePropagatorState::TaggedNode, 8ul, std::allocator<tensorflow::SimplePropagatorState::TaggedNode> >*, tensorflow::SimplePropagatorState::TaggedNodeReadyQueue*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x5614AF07: tensorflow::(anonymous namespace)::ExecutorImpl::RunAsync(tensorflow::Executor::Args const&, std::function<void (tsl::Status const&)>) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x4FA0D619: tensorflow::DirectSession::RunInternal(long, tensorflow::RunOptions const&, tensorflow::CallFrameInterface*, tensorflow::DirectSession::ExecutorsAndKeys*, tensorflow::RunMetadata*, tsl::thread::ThreadPoolOptions const&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_cc.so.2.12.0)
==11218==    by 0x4FA1034D: tensorflow::DirectSession::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*, tsl::thread::ThreadPoolOptions const&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_cc.so.2.12.0)
==11218==    by 0x42F1D4D4: tensorflow::run(tensorflow::Session*, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tsl::thread::ThreadPoolOptions const&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libPhysicsToolsTensorFlow.so)
==11218==    by 0x42F1D588: tensorflow::run(tensorflow::Session*, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tsl::thread::ThreadPoolInterface*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libPhysicsToolsTensorFlow.so)
==11218==    by 0x5F36B736: (anonymous namespace)::TfDnn::operator()(std::vector<reco::Track, std::allocator<reco::Track> > const&, reco::BeamSpot const&, std::vector<reco::Vertex, std::allocator<reco::Vertex> > const&) const (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginRecoTrackerFinalTrackSelectorsPlugins.so)
==11218==    by 0x5F36C085: TrackMVAClassifier<(anonymous namespace)::TfDnn, void>::computeMVA(std::vector<reco::Track, std::allocator<reco::Track> > const&, reco::BeamSpot const&, std::vector<reco::Vertex, std::allocator<reco::Vertex> > const&, std::vector<std::pair<float, bool>, std::allocator<std::pair<float, bool> > >&) const (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginRecoTrackerFinalTrackSelectorsPlugins.so)
==11218==    by 0x5F3E9AF7: TrackMVAClassifierBase::produce(edm::Event&, edm::EventSetup const&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libRecoTrackerFinalTrackSelectors.so)
==11218==    by 0x4C6C4B0: edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so)
==11218==    by 0x4C50F1D: edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so)
==11218==    by 0x4BDCA18: std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so)
==11218==    by 0x4BDCFB3: edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so)
==11218==    by 0x41E7F27: tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreConcurrency.so)
==11218==    by 0x640B290: tbb::detail::r1::task_dispatcher::execute_and_wait(tbb::detail::d1::task*, tbb::detail::d1::wait_context&, tbb::detail::d1::task_group_context&) (task_dispatcher.h:322)
==11218==    by 0x4B6121A: edm::FinalWaitingTask::wait() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so)
==11218==    by 0x4B6ABC4: edm::EventProcessor::processRuns() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so)
==11218==    by 0x4B6B18E: edm::EventProcessor::runToCompletion() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so)
==11218==    by 0x4074EE: tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRun)
==11218==    by 0x63F79BC: tbb::detail::r1::task_arena_impl::execute(tbb::detail::d1::task_arena_base&, tbb::detail::d1::delegate_base&) (arena.cpp:688)
==11218==    by 0x408ED1: main::{lambda()#1}::operator()() const (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRun)
==11218==    by 0x40518B: main (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRun)
client stack range: [0x1FFEFAF000 0x1FFEFF6FFF] client SP: 0x1FFEFF1060
valgrind stack range: [0x1008AA6000 0x1008BA5FFF] top usage: 15736 of 1048576

makortel · 2023-11-30T21:28:52Z

Just shooting in the dark, I increased the stack size per thread from 10 MB to 20 MB with process.options.sizeOfStackForThreadsInKB = 20*1024, and with that I was able to run workflow 136.889 step 3 customized to run initialStep through IgProf. The full step 3 configuration, however, still crashed. So maybe stack size does not play a role.
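For reference, the stack-size change described above could be packaged as a customise-style function. This is only a sketch: `customise_stack_size` is a hypothetical name, and a `SimpleNamespace` stands in for a real `cms.Process` so that the snippet runs outside CMSSW. The only setting taken from the comment is `process.options.sizeOfStackForThreadsInKB`, which is expressed in KB, hence the factor of 1024.

```python
from types import SimpleNamespace


def customise_stack_size(process, size_mb=20):
    """Set the per-thread stack size on a CMSSW-style process object.

    The option is given in KB, so 20 MB becomes 20 * 1024.
    (Hypothetical helper name, not part of CMSSW.)
    """
    process.options.sizeOfStackForThreadsInKB = size_mb * 1024
    return process


# Stand-in for a real cms.Process (assumption: only the attribute path matters here).
# It starts from the 10 MB value mentioned in the comment.
process = SimpleNamespace(options=SimpleNamespace(sizeOfStackForThreadsInKB=10 * 1024))
customise_stack_size(process)
print(process.options.sizeOfStackForThreadsInKB)  # 20480
```

Inside a real configuration the same assignment would be applied directly to the `process` object of the job.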

VinInn · 2023-12-01T09:14:24Z

The heap-allocator "meta-data" is most probably corrupted. Is this with jemalloc? Maybe TensorFlow makes assumptions that jemalloc does not comply with (alignment?).

makortel · 2023-12-01T14:22:25Z

The heap-allocator "meta-data" is most probably corrupted. Is this with jemalloc? Maybe TensorFlow makes assumptions that jemalloc does not comply with (alignment?).

My recent tests were indeed with jemalloc. My HLT test case crashed with glibc malloc as well (#42444 (comment)) when run through VTune. I could test the offline workflows and IgProf with glibc malloc too.

gartung · 2024-01-17T14:04:24Z

OneDNN can set the instruction set architecture used for the JITed code through an environment variable ONEDNN_MAX_CPU_ISA

https://www.intel.com/content/www/us/en/docs/onednn/developer-guide-reference/2023-1/cpu-dispatcher-control.html#DOXID-DEV-GUIDE-CPU-DISPATCHER-CONTROL

smuzaffar · 2024-01-17T21:54:04Z

OneDNN can set the instruction set architecture used for the JITed code through an environment variable ONEDNN_MAX_CPU_ISA

https://www.intel.com/content/www/us/en/docs/onednn/developer-guide-reference/2023-1/cpu-dispatcher-control.html#DOXID-DEV-GUIDE-CPU-DISPATCHER-CONTROL

@gartung, did you try setting it to SSE41, and do you still get segfaults?

gartung · 2024-01-17T22:48:34Z

The oneDNN library must be compiled with cmake option ONEDNN_ENABLE_MAX_CPU_ISA=1. Not sure how Tensorflow compiles oneDNN.

smuzaffar · 2024-01-17T23:03:42Z

The oneDNN library must be compiled with cmake option ONEDNN_ENABLE_MAX_CPU_ISA=1. Not sure how Tensorflow compiles oneDNN.

ONEDNN_ENABLE_MAX_CPU_ISA is ON by default. I have checked the TF sources and they do not set it explicitly, so it should already be enabled.

gartung · 2024-01-18T15:56:58Z

Adding TF_ENABLE_ONEDNN_OPTS=0 and ONEDNN_MAX_CPU_ISA=SSE41 to the environment did not prevent the segfault.

gartung · 2024-01-18T20:21:32Z

JIT profiling can be disabled in the code by passing 0 to dnnl_set_jit_profiling_flags
https://www.intel.com/content/www/us/en/docs/onednn/developer-guide-reference/2023-1/service.html#DOXID-GROUP-DNNL-API-SERVICE-1GA51EF634E4F201A12D32E573955943F48

gartung · 2024-01-18T20:54:54Z

There is an example of running perf on the jitted code with the appropriate environment variables
https://www.intel.com/content/www/us/en/docs/onednn/developer-guide-reference/2023-1/profiling-onednn-performance.html#EXAMPLE-PROFILING-WITH-LINUX-PERF

gartung · 2024-01-19T16:12:02Z

Setting JITDUMPDIR=. ONEDNN_JIT_PROFILE=14 ONEDNN_MAX_CPU_ISA=SSE41 TF_ENABLE_ONEDNN_OPTS=0 in the profiling script before running Igprof has so far allowed the Jenkins job to complete without segfaulting.
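For readers wondering what `ONEDNN_JIT_PROFILE=14` selects: the value is a bitmask. The sketch below decodes it using the flag values documented for oneDNN's `DNNL_JIT_PROFILE_*` constants; these values are quoted from memory here and are worth double-checking against the oneDNN version bundled with TensorFlow. `decode_jit_profile` is a made-up helper, not part of oneDNN.

```python
# Bit values assumed from oneDNN's DNNL_JIT_PROFILE_* definitions.
JIT_PROFILE_FLAGS = {
    1: "VTUNE",
    2: "LINUX_PERFMAP",
    4: "LINUX_JITDUMP",
    8: "LINUX_JITDUMP_USE_TSC",
}


def decode_jit_profile(value):
    """Return the names of the flags set in an ONEDNN_JIT_PROFILE value."""
    return [name for bit, name in JIT_PROFILE_FLAGS.items() if value & bit]


print(decode_jit_profile(14))  # ['LINUX_PERFMAP', 'LINUX_JITDUMP', 'LINUX_JITDUMP_USE_TSC']
print(decode_jit_profile(1))   # ['VTUNE']
```

Under that reading, 14 asks for Linux perf integration through both perfmap and jitdump with TSC timestamps, while 1 is the VTune integration.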

gartung · 2024-01-20T15:33:24Z

I was incorrect about the environment variables before. I think I had TF_ENABLE_ONEDNN_OPTS=1. There were two jobs that were killed after 8 hours stuck in step 4. I can only assume that restriction to SSE41 slowed the inference down significantly. I changed the environment variables to just JITDUMPDIR=. ONEDNN_JIT_PROFILE=14 and ended up with segfaults. I am trying this set of environment variables next: JITDUMPDIR=. ONEDNN_JIT_PROFILE=14 ONEDNN_MAX_CPU_ISA=AVX2 TF_ENABLE_ONEDNN_OPTS=1

gartung · 2024-01-21T01:32:18Z

I made one more try with ONEDNN_CPU_ISA_HINTS=PREFER_YMM and got an abort in tensorflow

----- Begin Fatal Exception 21-Jan-2024 02:19:38 CET-----------------------
An exception of category 'InvalidRun' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 3 stream: 0
   [1] Running path 'AODSIMoutput_step'
   [2] Prefetching for module AsciiOutputModule/'AODSIMoutput'
   [3] Prefetching for module ReducedRecHitCollectionProducer/'reducedEcalRecHitsEB'
   [4] Prefetching for module InterestingDetIdCollectionProducer/'interestingEcalDetIdRefinedEB'
   [5] Prefetching for module PFEGammaProducer/'particleFlowEGamma'
   [6] Prefetching for module PFBlockProducer/'particleFlowBlock'
   [7] Prefetching for module PFElecTkProducer/'pfTrackElec'
   [8] Prefetching for module GsfTrackProducer/'electronGsfTracks'
   [9] Prefetching for module CkfTrackCandidateMaker/'electronCkfTrackCandidates'
   [10] Prefetching for module ElectronSeedMerger/'electronMergedSeeds'
   [11] Prefetching for module ElectronSeedProducer/'ecalDrivenElectronSeeds'
   [12] Prefetching for module PFECALSuperClusterProducer/'particleFlowSuperClusterHGCal'
   [13] Prefetching for module PFClusterProducer/'particleFlowClusterHGCal'
   [14] Prefetching for module TrackstersMergeProducer/'ticlTrackstersMerge'
   [15] Calling method for module TrackstersProducer/'ticlTrackstersCLUE3DHigh'
Exception Message:
error while running session: ABORTED: Operation received an exception:Status: 2, message: could not execute a primitive, in file tensorflow/core/kernels/mkl/mkl_matmul_op_fused.cc:272
	 [[{{node model/dense2/Relu}}]]

smuzaffar · 2024-01-30T14:10:23Z

FYI, we now have CMSSW_14_0_MKLDNN0_X IBs where tensorflow is built with tensorflow_mkldnn_contraction_kernel=0 . WF 140.201 step2 runs fine under vtune and igprof

gartung · 2024-01-30T15:50:11Z

The results I got from running run-ib-profiling still showed segfaults, although they are not in modules that use TensorFlow
https://cmssdt.cern.ch/SDT/jenkins-artifacts/igprof/CMSSW_14_0_MKLDNN0_X_2024-01-28-2300/el8_amd64_gcc12/profiling/

gartung · 2024-01-30T16:15:55Z

@smuzaffar I saw this commit while looking at libeigen on gitlab. Could it be related?
"Don't crash on empty tensor contraction."
https://gitlab.com/libeigen/eigen/-/commit/b0f877f8e01e90a5b0f3a79d46ea234899f8b499

valsdav · 2024-01-30T17:33:18Z

The DeepTauId model manipulates very sparse tensors. Could it be that Eigen treats zero tensors as empty tensors and then contracts them?

This may be correlated with the fact that this crash does not happen on other TF models.
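To make the distinction in the question concrete: an empty tensor holds no elements at all (some dimension is 0), while a zero tensor has elements that all happen to be 0. The sketch below uses plain Python with made-up shapes; it says nothing about how Eigen or the DeepTauId model actually handle either case, and the helper names are illustrative only.

```python
from math import prod


def is_empty(shape):
    """A tensor is empty when it holds no elements, i.e. some dimension is 0."""
    return prod(shape) == 0


def is_all_zero(values):
    """A zero tensor has at least one element, and every element equals 0."""
    values = list(values)
    return len(values) > 0 and all(v == 0 for v in values)


# Hypothetical shapes, chosen only for illustration.
print(is_empty((0, 43)))       # True: a batch of zero rows
print(is_empty((3, 43)))       # False
print(is_all_zero([0.0] * 6))  # True: sparse input, but not empty
print(is_all_zero([]))         # False: nothing to inspect
```

A batch dimension of 0 would be the natural way for an empty tensor to arise, whereas very sparse inputs would only give tensors that are mostly or entirely zero.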

makortel · 2024-01-30T18:51:25Z

This may be correlated with the fact that this crash does not happen on other TF models.

There were some occurrences in TrackMVAClassifier too.

makortel · 2024-01-30T21:07:30Z

The results I got from running run-ib-profiling still showed segfaults, although they are not in modules that use TensorFlow https://cmssdt.cern.ch/SDT/jenkins-artifacts/igprof/CMSSW_14_0_MKLDNN0_X_2024-01-28-2300/el8_amd64_gcc12/profiling/

I vaguely recall we saw some cases before too where the crash was in a destructor, like in

Thread 1 (Thread 0x7fdc9b61c680 (LWP 3057894) "cmsRun"):
#3  0x00007fdc942bf720 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_MKLDNN0_X_2024-01-28-2300/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007fdc2f5ea773 in edm::Wrapper<std::vector<TrajectorySeed, std::allocator<TrajectorySeed> > >::~Wrapper() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_MKLDNN0_X_2024-01-28-2300/lib/el8_amd64_gcc12/pluginRecoTrackerTkSeedGeneratorPlugins.so
#6  0x00007fdc9eb4d2a7 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_MKLDNN0_X_2024-01-28-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so
#7  0x00007fdc9ebf22c6 in edm::DataManagingProductResolver::resetProductData_(bool) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_MKLDNN0_X_2024-01-28-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so
#8  0x00007fdc9ebe395b in edm::Principal::clearPrincipal() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_MKLDNN0_X_2024-01-28-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so
#9  0x00007fdc9eb5daad in edm::EventPrincipal::clearEventPrincipal() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_MKLDNN0_X_2024-01-28-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so
#10 0x00007fdc9eb97397 in edm::FunctorWaitingTask<edm::waiting_task::detail::WaitingTaskChain<edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::processEventAsyncImpl(edm::WaitingTaskHolder, unsigned int)::{lambda(auto:1)#4}>, edm::waiting_task::detail::Conditional<edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::processEventAsyncImpl(edm::WaitingTaskHolder, unsigned int)::{lambda(auto:1)#3}> >, edm::waiting_task::detail::Conditional<edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::processEventAsyncImpl(edm::WaitingTaskHolder, unsigned int)::{lambda(auto:1)#2}> >, edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::processEventAsyncImpl(edm::WaitingTaskHolder, unsigned int)::{lambda(auto:1)#1}> >::runLast(edm::WaitingTaskHolder)::{lambda(std::__exception_ptr::exception_ptr const*)#1}>::execute() [clone .lto_priv.0] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_MKLDNN0_X_2024-01-28-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so

https://cmssdt.cern.ch/SDT/jenkins-artifacts/igprof/CMSSW_14_0_MKLDNN0_X_2024-01-28-2300/el8_amd64_gcc12/profiling/11834.21/step3_igprof_cpu.log

My interpretation was/is that this kind of crash is compatible with memory corruption, which seems to me to be the overarching theme of these problems.

smuzaffar · 2024-01-31T14:00:39Z

@smuzaffar I saw this commit while looking at libeigen on gitlab. Could it be related? "Don't crash on empty tensor contraction." https://gitlab.com/libeigen/eigen/-/commit/b0f877f8e01e90a5b0f3a79d46ea234899f8b499

@gartung, I have tested this Eigen fix (cms-externals/eigen-git-mirror#8), but both IgProf and VTune still crash. I tried running #42444 (comment) after setting up the environment from /cvmfs/cms-ci.cern.ch/week0/cms-externals/eigen-git-mirror/8/37124/CMSSW_14_0_X_2024-01-30-2300

gartung · 2024-02-04T19:31:15Z

I changed run-ib-profiling to run Vtune instead of Igprof and there are segfaults running workflows involving Tensorflow
https://cmssdt.cern.ch/SDT/jenkins-artifacts/igprof/CMSSW_14_0_X_2024-02-04-0000/el8_amd64_gcc12/profiling/

smuzaffar · 2024-02-16T09:16:20Z

cms-sw/cmsdist#9021 includes the "Don't crash on empty tensor contraction" fix. Note that this fix is already in the Eigen version used by the TF_X IBs (which are based on TensorFlow 2.15 plus a newer Eigen).

makortel · 2024-09-06T14:42:13Z

FYI, we now have CMSSW_14_0_MKLDNN0_X IBs where tensorflow is built with tensorflow_mkldnn_contraction_kernel=0 . WF 140.201 step2 runs fine under vtune and igprof

I started to wonder whether a TensorFlow built with tensorflow_mkldnn_contraction_kernel=0 would be ABI compatible with our production TF build. (I am thinking about the feasibility of keeping the present TF+mkldnn setup in production while allowing it to be switched easily to another build that works with profilers.)

makortel · 2024-09-11T21:26:46Z

I ran some tests again with VTune (2024.2) on CMSSW_14_1_0_pre7 on EL8 natively.

With 12634.21 (2023 TTBar+PU MC ProdLike) step3 I ran 4 attempts with all succeeding (I didn't check the profiles though).

With 136.889 (2018D MET data) step3 I reproduce a failure, although now with exceptions of

----- Begin Fatal Exception 11-Sep-2024 22:02:40 CEST-----------------------
An exception of category 'Configuration' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing module: class=PhotonMVAValueMapProducer label='photonMVAValueMapProducer'
Exception Message:
failed to parse ""
----- End Fatal Exception -------------------------------------------------

or

----- Begin Fatal Exception 11-Sep-2024 22:28:49 CEST-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing module: class=PhotonMVAValueMapProducer label='photonMVAValueMapProducer'
   Additional Info:
      [a] Fatal Root Error: @SUB=TFile::TFile
file /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_0_pre7/external/el8_amd64_gcc12/data/RecoEgamma/PhotonIdentification/data/MVA/Fall17/EE_V2.weights.root can not be opened for reading Too many open files

----- End Fatal Exception -------------------------------------------------

I tested the environment variables

ONEDNN_MAX_CPU_ISA=SSE41
ONEDNN_JIT_PROFILE=1
TF_ENABLE_ONEDNN_OPTS=0

separately and all together, and every combination resulted in a failure.
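A scan like the one described (each variable alone, then all together) could be scripted roughly as follows. This is a sketch under assumptions: the variable names and values are the ones quoted above, `env_variants` and `run_variant` are hypothetical helpers, and the command is a do-nothing placeholder that would be replaced by the actual profiler invocation.

```python
import os
import subprocess
import sys

# Settings quoted in the comment above.
SETTINGS = {
    "ONEDNN_MAX_CPU_ISA": "SSE41",
    "ONEDNN_JIT_PROFILE": "1",
    "TF_ENABLE_ONEDNN_OPTS": "0",
}


def env_variants(settings):
    """Return (label, overrides) pairs: each variable alone, then all together."""
    variants = [(name, {name: value}) for name, value in settings.items()]
    variants.append(("all", dict(settings)))
    return variants


def run_variant(command, overrides):
    """Run command with overrides layered on the current environment; return its exit code."""
    env = {**os.environ, **overrides}
    return subprocess.run(command, env=env).returncode


if __name__ == "__main__":
    # Placeholder command: a child Python process that does nothing.
    command = [sys.executable, "-c", "pass"]
    for label, overrides in env_variants(SETTINGS):
        print(label, run_variant(command, overrides))
```

Layering the overrides on a copy of `os.environ` keeps each attempt independent, so a setting from one variant cannot leak into the next.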

fwyzard · 2024-10-22T21:24:57Z

I started to wonder whether a TensorFlow built with tensorflow_mkldnn_contraction_kernel=0 would be ABI compatible with our production TF build. (I am thinking about the feasibility of keeping the present TF+mkldnn setup in production while allowing it to be switched easily to another build that works with profilers.)

Would it make sense (and work ?) to wrap the calls to TensorFlow with

#include <ittnotify.h>

...

__itt_pause();   // Pause profiling
// call to TensorFlow goes here
__itt_resume();  // Resume profiling

?

fwyzard · 2024-10-23T09:18:35Z

Mhm, no, because __itt_pause()/__itt_resume() affect the whole application, not just the calling thread :-/

gartung · 2024-10-23T19:49:14Z

I have a draft pull request for cmsdist that sets enable-tf-mkldnn in the rpm spec file
cms-sw/cmsdist#9471 (comment)
Running workflows 23834.21 and 12634.21 on cmsdev40 with export TF_ENABLE_ONEDNN_OPTS=1 has not produced any crashes.

makortel · 2024-10-24T22:10:39Z

With the CMSSW installation provided by cms-sw/cmsdist#9471 (comment) workflow 136.889 step 3 + VTune still fails for me, with or without TF_ENABLE_ONEDNN_OPTS=1.

@gartung I thought disabling the OneDNN JITting helped earlier?

gartung · 2024-10-25T13:09:59Z

I thought so too. Maybe it was a different TensorFlow build option.

mbluj mentioned this issue Aug 2, 2023

Use make_unique in tau modules #42447

Merged

makortel mentioned this issue Aug 2, 2023

Failures in Run 3 data reprocessing #40437

Open

This was referenced Mar 4, 2024

Updated root to tip of branch master cms-sw/cmsdist#9034

Closed

Dealing with RooFit vectorization target #44308

Open

DeepTauId failures in RelVals (Incompatible shapes) #44333

Closed

jfernan2 mentioned this issue Jul 2, 2024

Remove data workflows from profiling in PRs cms-sw/cms-bot#2280

Merged
