DeepTauId throws (in event with zero taus?) #42444

Open
VinInn opened this issue Aug 2, 2023 · 114 comments

Comments

@VinInn
Contributor

VinInn commented Aug 2, 2023

The original issue #40437 has taken a different path so I reopen here just for this.

Reminder of relevant posts:
#40437 (comment)
#40437 (comment)
#28358 (closed. WHY?)

a log file
https://cms-unified.web.cern.ch/cms-unified/joblogs/haozturk_ACDC0_Run2022D_BTagMu_10Dec2022_221221_171338_6693/8001/DataProcessing/020da37f-6871-4688-a86e-2b7e8a6bc683-26-0-logArchive/job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log

Here is the "smoking gun": when trying to reproduce the failure, the event happens to have ZERO taus
#40437 (comment)

@cmsbuild
Contributor

cmsbuild commented Aug 2, 2023

A new Issue was created by @VinInn Vincenzo Innocente.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@VinInn
Contributor Author

VinInn commented Aug 2, 2023

I still do not understand how a memory corruption can modify the values in an otherwise empty vector (pointer, size, capacity, all 0) without causing a segfault before reaching the throw.
In my opinion there is a legitimate vector there with "something" in it that reasonably fakes a tau.

@VinInn
Contributor Author

VinInn commented Aug 2, 2023

The tau collection is a view: no way to survive even the first dereference.

@slava77
Contributor

slava77 commented Aug 2, 2023

It also sounds like there is a reproducibility problem.
The PR tests cover the single-thread cases somewhat (even though not on the right SSE-only hardware).

Could it be a multithreaded reproducibility issue?
Perhaps just running reco comparisons on MT job outputs is enough to get an idea.

@makortel
Contributor

makortel commented Aug 2, 2023

assign reconstruction

@cmsbuild
Contributor

cmsbuild commented Aug 2, 2023

New categories assigned: reconstruction

@mandrenguyen,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

makortel commented Nov 16, 2023

Let me tag @cms-sw/tau-pog-l2 also in this issue

@makortel
Contributor

Just for the record, I'm hitting this issue while trying to run an HLT job through VTune. Without VTune the configuration works, but within VTune the job crashes in a few different ways, this exception being one of them.

@makortel
Contributor

Running VTune on a single-thread job in ASAN build resulted in

AddressSanitizer: CHECK failed: asan_allocator.cpp:188 "((old)) == ((kAllocBegMagic))" (0x302030303a303020, 0xcc6e96b9cc6e96b9) (tid=2772)
    #0 0x7fddac36d4ba in CheckUnwind ../../../../libsanitizer/asan/asan_rtl.cpp:67
    #1 0x7fddac38de65 in __sanitizer::CheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) ../../../../libsanitizer/sanitizer_common/sanitizer_termination.cpp:86
    #2 0x7fddac2d845b in __asan::LargeChunkHeader::Set(__asan::AsanChunk*) ../../../../libsanitizer/asan/asan_allocator.cpp:188
    #3 0x7fddac2d845b in __asan::LargeChunkHeader::Set(__asan::AsanChunk*) ../../../../libsanitizer/asan/asan_allocator.cpp:178
    #4 0x7fddac2d845b in __asan::QuarantineCallback::Recycle(__asan::AsanChunk*) ../../../../libsanitizer/asan/asan_allocator.cpp:204
    #5 0x7fddac2d8655 in __sanitizer::Quarantine<__asan::QuarantineCallback, __asan::AsanChunk>::DoRecycle(__sanitizer::QuarantineCache<__asan::QuarantineCallback>*, __asan::QuarantineCallback) ../../../../libsanitizer/sanitizer_common/sanitizer_quarantine.h:193
    #6 0x7fddac2d8b3d in __sanitizer::Quarantine<__asan::QuarantineCallback, __asan::AsanChunk>::Recycle(unsigned long, __asan::QuarantineCallback) ../../../../libsanitizer/sanitizer_common/sanitizer_quarantine.h:181
    #7 0x7fddac2d411c in __sanitizer::Quarantine<__asan::QuarantineCallback, __asan::AsanChunk>::Put(__sanitizer::QuarantineCache<__asan::QuarantineCallback>*, __asan::QuarantineCallback, __asan::AsanChunk*, unsigned long) ../../../../libsanitizer/sanitizer_common/sanitizer_quarantine.h:112
    #8 0x7fddac2d411c in __asan::Allocator::QuarantineChunk(__asan::AsanChunk*, void*, __sanitizer::BufferedStackTrace*) ../../../../libsanitizer/asan/asan_allocator.cpp:665
    #9 0x7fddac3655ad in operator delete(void*, unsigned long) ../../../../libsanitizer/asan/asan_new_delete.cpp:164
    #10 0x7fdd6f44320f in std::_Sp_counted_ptr_inplace<BasicSingleTrajectoryState, churn_allocator<BasicSingleTrajectoryState>, (__gnu_cxx::_Lock_policy)2>::_M_destroy() (/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_ASAN_X_2023-11-15-2300/lib/el8_amd64_gcc12/libTrackingToolsTrajectoryState.so+0x5e20f)
    #11 0x7fdc6843beed in CkfTrajectoryBuilder::limitedCandidates(std::shared_ptr<TrajectorySeed const> const&, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&, std::vector<Trajectory, std::allocator<Trajectory> >&) const (/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_ASAN_X_2023-11-15-2300/lib/el8_amd64_gcc12/libRecoTrackerCkfPattern.so+0xf4eed)
    #12 0x7fdc6843fbb6 in CkfTrajectoryBuilder::limitedCandidates(TrajectorySeed const&, TempTrajectory&, std::vector<Trajectory, std::allocator<Trajectory> >&) const (/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_ASAN_X_2023-11-15-2300/lib/el8_amd64_gcc12/libRecoTrackerCkfPattern.so+0xf8bb6)
    #13 0x7fdc6844079c in CkfTrajectoryBuilder::buildTrajectories(TrajectorySeed const&, std::vector<Trajectory, std::allocator<Trajectory> >&, unsigned int&, TrajectoryFilter const*) const (/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_ASAN_X_2023-11-15-2300/lib/el8_amd64_gcc12/libRecoTrackerCkfPattern.so+0xf979c)
    #14 0x7fdc683c9306 in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&)::{lambda(unsigned long)#1}::operator()(unsigned long) const (/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_ASAN_X_2023-11-15-2300/lib/el8_amd64_gcc12/libRecoTrackerCkfPattern.so+0x82306)
    #15 0x7fdc683d04fe in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&) (/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_ASAN_X_2023-11-15-2300/lib/el8_amd64_gcc12/libRecoTrackerCkfPattern.so+0x894fe)
    #16 0x7fddac0f6dd0 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) (/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_ASAN_X_2023-11-15-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so+0x986dd0)

That hints towards ASAN's own data structures being overwritten.
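As a side observation on the failed check (not a diagnosis): read as a little-endian 64-bit word, which is how x86_64 stores it in memory, the first value in the CHECK message consists entirely of printable ASCII bytes. The sketch below decodes it with Python's standard `struct` module; the helper name `decode_le_word` is made up for illustration.

```python
import struct

def decode_le_word(value: int) -> str:
    """Return the 8 bytes of a 64-bit word, in x86_64 memory order, as ASCII text."""
    raw = struct.pack("<Q", value)  # little-endian unsigned 64-bit
    # Replace non-printable bytes with '.' so binary values stay readable.
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)

# First operand of the failed CHECK in the ASAN report above.
overwritten = 0x302030303A303020
# The expected magic value reported on the same line, for comparison.
expected_magic = 0xCC6E96B9CC6E96B9

print(repr(decode_le_word(overwritten)))     # ' 00:00 0'
print(repr(decode_le_word(expected_magic)))  # '..n...n.'
```

The text ` 00:00 0` resembles the device and inode fields of an anonymous mapping in a `/proc/<pid>/maps`-style listing, the same format glibc prints in the memory map of its abort message. That would be consistent with text having been written over the allocator's bookkeeping, but it only hints at what overwrote the header, not at which code did it.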

@makortel
Contributor

Running VTune on a single-thread job of cmsRunGlibC resulted in

*** Error in `cmsRunGlibC': corrupted size vs. prev_size: 0x0000000047f1cc00 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7f474)[0x7f8d1b99f474]
/lib64/libc.so.6(+0x816a4)[0x7f8d1b9a16a4]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0x9eadc44)[0x7f8c7f1a0c44]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0x9eae8bb)[0x7f8c7f1a18bb]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0x9fd2efd)[0x7f8c7f2c5efd]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0x9af7e27)[0x7f8c7edeae27]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0x97995d3)[0x7f8c7ea8c5d3]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0x71123b7)[0x7f8c7c4053b7]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZNK5Eigen30TensorContractionEvaluatorBaseINS_15TensorEvaluatorIKNS_19TensorContractionOpIKNS_5arrayINS_9IndexPairIlEELm1EEEKNS_17TensorReshapingOpIKNS_6DSizesIlLi2EEEKNS_18TensorImagePatchOpILln1ELln1EKNS_9TensorMapINS_6TensorIKfLi4ELi1ElEELi16ENS_11MakePointerEEEEEEEKNS8_ISB_SJ_EEKN10tensorflow33LaunchFusedConv2DWithOutputKernelIfE19OutputKernelWrapperEEENS_16ThreadPoolDeviceEEEE15evalGemmPartialILb1ELb1ELb0ELi0ELb1EEEvPflli+0x7ad)[0x7f8c7c455c6d]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN5Eigen8internal14TensorExecutorIKNS_14TensorAssignOpINS_9TensorMapINS_6TensorIfLi4ELi1ElEELi16ENS_11MakePointerEEEKNS_17TensorReshapingOpIKNS_6DSizesIlLi4EEEKNS_19TensorContractionOpIKNS_5arrayINS_9IndexPairIlEELm1EEEKNS8_IKNS9_IlLi2EEEKNS_18TensorImagePatchOpILln1ELln1EKNS3_INS4_IKfLi4ELi1ElEELi16ES6_EEEEEEKNS8_ISJ_SO_EEKN10tensorflow33LaunchFusedConv2DWithOutputKernelIfE19OutputKernelWrapperEEEEEEENS_16ThreadPoolDeviceELb1ELNS0_15TiledEvaluationE0EE3runERS15_RKS16_+0x95)[0x7f8c7c487365]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0x7195164)[0x7f8c7c488164]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN10tensorflow33LaunchFusedConv2DWithOutputKernelIfEclINS_19BiasAddOutputKernelIfNS_8IdentityEEEEEvRKT_PNS_15OpKernelContextERKNS_6TensorESD_PSB_+0x1ed)[0x7f8c7c4896ad]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN10tensorflow13FusedConv2DOpIN5Eigen16ThreadPoolDeviceEfE7ComputeEPNS_15OpKernelContext[39/1803]0x7f8c7c48e1e5]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(_ZN10tensorflow16ThreadPoolDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x4b)[0x7f8c74ae2f6b]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x1258ef1)[0x7f8c74bd0ef1]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x1259fd6)[0x7f8c74bd1fd6]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN3tsl6thread10ThreadPool8ScheduleESt8functionIFvvEE+0xfc)[0x7f8c81c8c99c]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0xcad7d78)[0x7f8c81dcad78]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x123e924)[0x7f8c74bb6924]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x124436e)[0x7f8c74bbc36e]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x124dfa5)[0x7f8c74bc5fa5]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x125820c)[0x7f8c74bd020c]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x1259fd6)[0x7f8c74bd1fd6]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN3tsl6thread10ThreadPool8ScheduleESt8functionIFvvEE+0xfc)[0x7f8c81c8c99c]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0xcad7d78)[0x7f8c81dcad78]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x1244271)[0x7f8c74bbc271]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x1248f08)[0x7f8c74bc0f08]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN10tensorflow13DirectSession11RunInternalElRKNS_10RunOptionsEPNS_18CallFrameInterfaceEPNS0_16ExecutorsAndKeysEPNS_11RunMetadataERKN3tsl6thread17ThreadPoolOptionsE+0x7fa)[0x7f8c81ddd61a]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN10tensorflow13DirectSession3RunERKNS_10RunOptionsERKSt6vectorISt4pairINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_6TensorEESaISD_EERKS4_ISB_SaISB_EESL_PS4_ISC_SaISC_EEPNS_11RunMetadataERKN3tsl6thread17ThreadPoolOptionsE+0x96e)[0x7f8c81de034e]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libPhysicsToolsTensorFlow.so(_ZN10tensorflow3runEPNS_7SessionERKSt6vectorISt4pairINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_6TensorEESaISB_EERKS2_IS9_SaIS9_EEPS2_ISA_SaISA_EERKN3tsl6thread17ThreadPoolOptionsE+0xa5)[0x7f8c876b54d5]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libPhysicsToolsTensorFlow.so(_ZN10tensorflow3runEPNS_7SessionERKSt6vectorISt4pairINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_6TensorEESaISB_EERKS2_IS9_SaIS9_EEPS2_ISA_SaISA_EEPN3tsl6thread19ThreadPoolInterfaceE+0x19)[0x7f8c876b5589]
/build/mkortela/debug/hlt/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginRecoTauTagRecoTauPlugins.so(+0xa729a)[0x7f8bf985f29a]
/build/mkortela/debug/hlt/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginRecoTauTagRecoTauPlugins.so(+0x9bb74)[0x7f8bf9853b74]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(_ZN3edm6stream21EDProducerAdaptorBase7doEventERKNS_19EventTransitionInfoEPNS_16ActivityRegistryEPKNS_20ModuleCallingContextE+0x141)[0x7f8d1e8f54b1]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(+0x1b8a19)[0x7f8d1e865a19]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(+0x1b8fb4)[0x7f8d1e865fb4]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreConcurrency.so(+0x5f28)[0x7f8d253faf28]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtbb.so.12(+0x28083)[0x7f8d1cb4c083]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(+0x13d21b)[0x7f8d1e7ea21b]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(_ZN3edm14EventProcessor11processRunsEv+0x205)[0x7f8d1e7f3bc5]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(_ZN3edm14EventProcessor15runToCompletionEv+0x23f)[0x7f8d1e7f418f]
cmsRunGlibC[0x4074ef]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtbb.so.12(+0x149bd)[0x7f8d1cb389bd]
cmsRunGlibC[0x408ed2]
cmsRunGlibC(main+0x14c)[0x40518c]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f8d1b942555]
cmsRunGlibC[0x405451]
======= Memory map: ========
00400000-00404000 r--p 00000000 00:31 245399302                          /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
00404000-00405000 r-xp 00004000 00:31 245399302                          /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
00405000-0040a000 r-xp 00005000 00:31 245399302                          /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
0040a000-0040d000 r--p 0000a000 00:31 245399302                          /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
0040d000-0040e000 r--p 0000c000 00:31 245399302                          /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
0040e000-0040f000 rw-p 0000d000 00:31 245399302                          /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
0040f000-00410000 rwxp 00000000 00:00 0
00419000-4d534000 rw-p 00000000 00:00 0                                  [heap]
7f8bd8000000-7f8bd8021000 rw-p 00000000 00:00 0
7f8bd8021000-7f8bdc000000 ---p 00000000 00:00 0
7f8bdee6c000-7f8be100b000 rw-p 00000000 00:00 0
7f8be305c000-7f8be314c000 rwxp 00000000 00:00 0
7f8be314c000-7f8be8000000 rw-p 00000000 00:00 0
7f8be8000000-7f8be8021000 rw-p 00000000 00:00 0
7f8be8021000-7f8bec000000 ---p 00000000 00:00 0
7f8bec03d000-7f8bec35d000 rwxp 00000000 00:00 0
7f8bec35d000-7f8bec379000 rw-p 00000000 00:00 0
7f8bec379000-7f8bec3c9000 rwxp 00000000 00:00 0
7f8bec3e5000-7f8bec435000 rwxp 00000000 00:00 0
7f8bec444000-7f8bec44a000 r--p 00000000 00:31 231860432                  /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginTrackingToolsTrajectoryCleaningPlugins.so
7f8bec450000-7f8bec4a0000 rwxp 00000000 00:00 0
7f8bec4a0000-7f8bec4a2000 r--p 00000000 00:31 231860432                  /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginTrackingToolsTrajectoryCleaningPlugins.so
7f8bec4a2000-7f8bec4a3000 r-xp 00002000 00:31 231860432                  /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginTrackingToolsTrajectoryCleaningPlugins.so
7f8bec4a3000-7f8bec4a4000 r--p 00003000 00:31 231860432                  /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginTrackingToolsTrajectoryCleaningPlugins.so
7f8bec4a4000-7f8bec4a5000 r--p 00003000 00:31 231860432                  /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginTrackingToolsTrajectoryCleaningPlugins.so
7f8bec4a5000-7f8bec4a6000 rw-p 00004000 00:31 231860432                  /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginTrackingToolsTrajectoryCleaningPlugins.so
7f8bec4a6000-7f8bec507000 rwxp 00000000 00:00 0

@makortel
Contributor

We updated tensorflow to 2.12 in 13_3_0_pre1 (built on Aug 3). I wonder if there would be any temporal correlation with the RECO profiling crashes?

@makortel
Contributor

makortel commented Nov 17, 2023

At least in CMSSW_13_2_X_2023-11-17-1100 the IgProf job on step3 of workflow 136.889 succeeds, whereas in CMSSW_13_3_X_2023-11-17-1100 the job crashes (but of course there are many other changes in between)

@makortel
Contributor

makortel commented Nov 17, 2023

(maybe I'm going in a wrong direction, given that the DeepTauId throwing an exception on nan is known for several years #28358)

@makortel
Contributor

In my HLT menu test case with VTune, if I "disable" the two DeepTauId modules by adding an always-rejecting EDFilter before them, the job succeeds. Enabling either one, or both, of the DeepTauId modules makes the job crash most of the time.

@kandrosov
Contributor

Hi @makortel. By "succeed", do you mean just finishing without crashing, or that the outputs are always identical? Especially for other taggers that use TF. In the past, we couldn't conclude with 100% certainty whether the memory corruption happens in the DeepTau module or occurs elsewhere and corrupts DeepTau-related memory. Is the machine on which one could reproduce the crash accessible to normal users? If yes, could you please share instructions on how one can reproduce the crash?

@makortel
Contributor

Hi @kandrosov

By "succeed", do you mean just to finish without crashing, or the outputs are always identical?

I mean "finishes without crashing". I'm not checking the outputs in any way.

I sent you privately the HLT recipe that I used. By adding an event-rejecting EDFilter before/after the DeepTauId modules in the menu I see the following behavior:

  • if the EDFilter is right before DeepTauId module, the VTune profiling job (technically) succeeds
  • if the EDFilter is right after DeepTauId module, the VTune profiling job crashes

I also started looking into step3 of workflow 136.889, which is used in the IgProf profiling in IBs (and is currently crashing in 13_3_X and 14_0_X). Running step3 as it is indeed also crashes in VTune, in a way that looks like memory corruption. If I run everything up to and including the DeepTauId modules there (*), the job still crashes. But, contrary to the HLT test case, if I run everything up to the modules DeepTauId consumes (**), but not DeepTauId itself, the job crashes as well. This behavior would indicate that something other than (or maybe in addition to?) DeepTauId causes memory corruption when the job is run through VTune.

I'll continue to investigate (albeit slowly).


(*) by adding a snippet

process.deepTauSequence = cms.Sequence(process.deepTau2017v2p1ForMini+process.deepTau2018v2p5ForMini)
process.tauPath = cms.Path(process.deepTauSequence, process.patAlgosToolsTask, process.patTask)
process.schedule = cms.Schedule(
    process.raw2digi_step,
    process.reconstruction_step,
    process.tauPath
)

to the end of the step3 configuration


(**) by adding a snippet

process.deepTauSequence = cms.Sequence(
    process.offlineSlimmedPrimaryVertices
    + process.packedPFCandidates
    + process.hpsPFTauTransverseImpactParameters
    + process.slimmedTausNoDeepIDs
    + process.slimmedElectrons
    + process.slimmedMuons
)
process.tauPath = cms.Path(process.deepTauSequence, process.patAlgosToolsTask, process.patTask)
process.schedule = cms.Schedule(
    process.raw2digi_step,
    process.reconstruction_step,
    process.tauPath
)

to the end of the step3 configuration

@makortel
Contributor

With the step3 of 136.889 I got to the point that running initialStep and everything it consumes crashes (under VTune), while running any of the modules initialStep consumes individually technically works. The smoking gun here is that initialStep module is of type TrackTfClassifier that uses Tensorflow. The stack trace of the working thread contains a few interesting frames, even if it is mostly corrupted

Thread 1 (Thread 0x7f2b0d5aa740 (LWP 21048) "cmsRun"):
#0  0x00007f2b0f464ddd in poll () from /lib64/libc.so.6
#1  0x00007f2b16f75841 in ?? ()
#2  0x01007f2b0d5a4858 in ?? ()
#3  0x00007ffc13af4400 in ?? ()
#4  0x00007ffc13af4920 in ?? ()
#5  0x00007ffc13af4410 in ?? ()
#6  0x00007ffc13af44a0 in ?? ()
#7  0x00007f2b176b1060 in ?? ()
#8  0x0000002100000001 in ?? ()
#9  0x00007ffc13af4390 in ?? ()
#10 0x00007f2b0d582240 in ?? ()
#11 0x00007f2b13d241c0 in ?? ()
#12 0x000000006567babf in ?? ()
#13 0x0000000000098c43 in ?? ()
#14 0x00007f2a1c780320 in ?? ()
#15 0x00007f2a71cc753b in tensorflow::SimplePropagatorState::SimplePropagatorState(tensorflow::ImmutableExecutorState const&, long, tensorflow::ImmutableExecutorState::FrameInfo const&, bool) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2
#16 0x00007f2ad712db5f in full_read.constprop () from /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginFWCoreServicesPlugins.so
#17 0x00007f2ad70e650c in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginFWCoreServicesPlugins.so
#18 0x00007f2ad70e6e70 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginFWCoreServicesPlugins.so
#19 0x00007f2b1715b0fe in ?? ()
#20 0x00007ffc13af5730 in ?? ()
#21 0x00007f2b15106698 in ?? ()
#22 0x000000000000000b in ?? ()
#23 0x00007ffc13af5730 in ?? ()
#24 0x00007ffc13af5170 in ?? ()
#25 0x00007f2b15106060 in ?? ()
#26 0x00007ffc13af5270 in ?? ()
#27 0x00007f2b1715c416 in ?? ()
#28 0x000000000000000b in ?? ()
#29 0x00007ffc13af5600 in ?? ()
#30 0x0000000000000000 in ?? ()

Recipe to reproduce (at CERN) is more or less

cmsrel CMSSW_14_0_0_pre0
cd CMSSW_14_0_0_pre0/src
cmsenv
runTheMatrix -l 136.889 --ibeos
cd 136.889_RunMET2018D
cat >>step3_RAW2DIGI_L1Reco_RECO_SKIM_PAT_ALCA_DQM.py <<EOF
def make(name):
    print(f"Only up to {name}")
    process.testSequence = cms.Sequence(getattr(process, name))
make("initialStep") # should crash
# immediate dependencies of initialStep
#make("offlineBeamSpot")
#make("firstStepPrimaryVertices")
#make("initialStepTracks")
process.testPath = cms.Path(process.testSequence, process.patAlgosToolsTask, process.patTask)
process.schedule = cms.Schedule(process.raw2digi_step, process.reconstruction_step, process.testPath)
EOF

source /cvmfs/projects.cern.ch/intelsw/oneAPI/linux/x86_64/2023/vtune/2023.2.0/vtune-vars.sh
vtune -collect hotspots -- cmsRun step3_RAW2DIGI_L1Reco_RECO_SKIM_PAT_ALCA_DQM.py

@makortel
Contributor

To be again more specific for DeepTauId, I took workflow 140.201 that runs re-MINI, and was able to reproduce the behavior I saw in the earlier HLT job, i.e. including DeepTauId crashes (under VTune), but running only up to the modules consumed by the DeepTauId modules technically works.

Recipe to reproduce at CERN

cmsrel CMSSW_14_0_0_pre0
cd CMSSW_14_0_0_pre0/src
cmsenv
runTheMatrix -l 140.201 --ibeos
cd 140.201_RunJetMET2022D_reMINI
cat >>step2_PAT.py <<EOF
process.deepTauSequence = cms.Sequence(process.deepTau2017v2p1ForMini+process.deepTau2018v2p5ForMini)
# uncomment for the modules consumed by DeepTauId
#process.deepTauSequence = cms.Sequence(process.offlineSlimmedPrimaryVertices+process.packedPFCandidates+process.slimmedTausNoDeepIDs+process.slimmedElectrons+process.slimmedMuons)
process.tauPath = cms.Path(process.deepTauSequence, process.patAlgosToolsTask, process.patTask)
process.schedule = cms.Schedule(process.tauPath)
EOF

source /cvmfs/projects.cern.ch/intelsw/oneAPI/linux/x86_64/2023/vtune/2023.2.0/vtune-vars.sh
export SITECONFIG_PATH=/cvmfs/cms-ib.cern.ch/SITECONF/local
vtune -collect hotspots -- cmsRun step2_PAT.py
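When repeating such runs many times, it can help to scan the job log for the memory-corruption messages already quoted in this thread. The following is a minimal sketch under that assumption; the labels and the function name `classify_log` are hypothetical, and only the two message strings come from the reports above.

```python
import re
from typing import List

# Message texts copied from the reports earlier in this thread; labels are made up.
CORRUPTION_SIGNATURES = {
    "asan_check": re.compile(r"AddressSanitizer: CHECK failed"),
    "glibc_heap": re.compile(r"corrupted size vs\. prev_size"),
}

def classify_log(text: str) -> List[str]:
    """Return the labels of all corruption signatures found in a job log."""
    return [label for label, pattern in CORRUPTION_SIGNATURES.items()
            if pattern.search(text)]

sample = "*** Error in `cmsRunGlibC': corrupted size vs. prev_size: 0x0000000047f1cc00 ***"
print(classify_log(sample))          # ['glibc_heap']
print(classify_log("job finished"))  # []
```

An empty result does not show that a run was clean: the job crashes in a few different ways, and only these two messages are matched here.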

@makortel
Contributor

makortel commented Nov 30, 2023

I repeated my tests in #42444 (comment) and #42444 (comment) with IgProf (*), and see exactly the same behavior. I.e. running up to the modules that initialStep / DeepTauId consume technically works, but running initialStep / DeepTauId results in a crash.

(*) replace

source /cvmfs/projects.cern.ch/intelsw/oneAPI/linux/x86_64/2023/vtune/2023.2.0/vtune-vars.sh
vtune -collect hotspots -- <command>

with

igprof -d -t cmsRun -pp -z -o igprof.perf.gz <command>

FYI @gartung. I think we should add crash detection to the IB profiling jobs. Something along the lines of: after detecting a crash in some step of the workflow, do not run the subsequent steps, and communicate the failure to the IB dashboard (e.g. similarly to the HLT validation tests that show a red background when there are failures).
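The "stop at the first crashed step" part of this suggestion could look roughly like the sketch below. It is not the IB profiling code: `run_steps` and the step names are hypothetical, the commands are stand-ins, and it assumes the profiler wrapper propagates the exit status of the wrapped command, which would need to be checked. Reporting to the IB dashboard is not covered.

```python
import subprocess
import sys

def run_steps(steps):
    """Run (name, argv) steps in order and stop at the first failing step.

    Returns (completed_names, failed_name_or_None).
    """
    completed = []
    for name, argv in steps:
        result = subprocess.run(argv)
        # A negative return code means the process was killed by a signal
        # (e.g. SIGSEGV or SIGABRT); any non-zero code is treated as a failure.
        if result.returncode != 0:
            return completed, name
        completed.append(name)
    return completed, None

# Stand-in commands; a real job would run the profiler-wrapped workflow steps.
steps = [
    ("step1", [sys.executable, "-c", "pass"]),
    ("step2", [sys.executable, "-c", "import sys; sys.exit(134)"]),
    ("step3", [sys.executable, "-c", "pass"]),
]
completed, failed = run_steps(steps)
print(completed, failed)  # ['step1'] step2
```

Here `step3` is never started, and the name of the failed step is available to whatever reports the result.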

@makortel
Contributor

With IgProf I managed to run it together with Valgrind, and got (on 136.889 step 3 customized to run initialStep and its dependencies)

valgrind: m_mallocfree.c:278 (mk_plain_bszB): Assertion 'bszB != 0' failed.
valgrind: This is probably caused by your program erroneously writing past the
end of a heap block and corrupting heap metadata.  If you fix any
invalid writes reported by Memcheck, this assertion failure will
probably go away.  Please try that before reporting this as a bug.


host stacktrace:
==11218==    at 0x5803F638: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218==    by 0x5803F757: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218==    by 0x5803F8DE: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218==    by 0x58046B96: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218==    by 0x58047679: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218==    by 0x5800485C: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218==    by 0x58004CE7: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218==    by 0x58004EB9: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218==    by 0x580933DB: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218==    by 0x580DC945: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable (lwpid 11218)
==11218==    at 0x402EF11: operator new(unsigned long) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==11218==    by 0x55DEC210: tensorflow::OpKernelContext::allocate_output(int, tensorflow::TensorShape const&, tensorflow::Tensor**, tsl::AllocatorAttributes) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x55DEC6D0: tensorflow::OpKernelContext::allocate_output(int, tensorflow::TensorShape const&, tensorflow::Tensor**) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x4A65B582: tensorflow::FusedMatMulOp<Eigen::ThreadPoolDevice, float>::Compute(tensorflow::OpKernelContext*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_cc.so.2.12.0)
==11218==    by 0x5606CF6A: tensorflow::ThreadPoolDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x5615AEF0: tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::ProcessInline(tensorflow::SimplePropagatorState::TaggedNodeReadyQueue*, long) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x5615BFD5: tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::Process(tensorflow::SimplePropagatorState::TaggedNode, long) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x4F8BC99B: tsl::thread::ThreadPool::Schedule(std::function<void ()>) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_cc.so.2.12.0)
==11218==    by 0x4F9FAD77: std::_Function_handler<void (std::function<void ()>), tensorflow::DirectSession::RunInternal(long, tensorflow::RunOptions const&, tensorflow::CallFrameInterface*, tensorflow::DirectSession::ExecutorsAndKeys*, tensorflow::RunMetadata*, tsl::thread::ThreadPoolOptions const&)::{lambda(std::function<void ()>)#6}>::_M_invoke(std::_Any_data const&, std::function<void ()>&&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_cc.so.2.12.0)
==11218==    by 0x56140923: void tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::RunTask<std::_Bind<void (tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::*(tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>*, tensorflow::SimplePropagatorState::TaggedNode, long))(tensorflow::SimplePropagatorState::TaggedNode, long)> >(std::_Bind<void (tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::*(tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>*, tensorflow::SimplePropagatorState::TaggedNode, long))(tensorflow::SimplePropagatorState::TaggedNode, long)>&&, int) [clone .constprop.0] (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x5614636D: tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::ScheduleReady(absl::lts_20220623::InlinedVector<tensorflow::SimplePropagatorState::TaggedNode, 8ul, std::allocator<tensorflow::SimplePropagatorState::TaggedNode> >*, tensorflow::SimplePropagatorState::TaggedNodeReadyQueue*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x5614FFA4: tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::NodeDone(tsl::Status const&, absl::lts_20220623::InlinedVector<tensorflow::SimplePropagatorState::TaggedNode, 8ul, std::allocator<tensorflow::SimplePropagatorState::TaggedNode> >*, tensorflow::NodeExecStatsInterface*, tensorflow::SimplePropagatorState::TaggedNodeReadyQueue*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x5615A20B: tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::ProcessInline(tensorflow::SimplePropagatorState::TaggedNodeReadyQueue*, long) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x5615BFD5: tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::Process(tensorflow::SimplePropagatorState::TaggedNode, long) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x4F8BC99B: tsl::thread::ThreadPool::Schedule(std::function<void ()>) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_cc.so.2.12.0)
==11218==    by 0x4F9FAD77: std::_Function_handler<void (std::function<void ()>), tensorflow::DirectSession::RunInternal(long, tensorflow::RunOptions const&, tensorflow::CallFrameInterface*, tensorflow::DirectSession::ExecutorsAndKeys*, tensorflow::RunMetadata*, tsl::thread::ThreadPoolOptions const&)::{lambda(std::function<void ()>)#6}>::_M_invoke(std::_Any_data const&, std::function<void ()>&&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_cc.so.2.12.0)
==11218==    by 0x56146270: tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::ScheduleReady(absl::lts_20220623::InlinedVector<tensorflow::SimplePropagatorState::TaggedNode, 8ul, std::allocator<tensorflow::SimplePropagatorState::TaggedNode> >*, tensorflow::SimplePropagatorState::TaggedNodeReadyQueue*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x5614AF07: tensorflow::(anonymous namespace)::ExecutorImpl::RunAsync(tensorflow::Executor::Args const&, std::function<void (tsl::Status const&)>) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x4FA0D619: tensorflow::DirectSession::RunInternal(long, tensorflow::RunOptions const&, tensorflow::CallFrameInterface*, tensorflow::DirectSession::ExecutorsAndKeys*, tensorflow::RunMetadata*, tsl::thread::ThreadPoolOptions const&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_cc.so.2.12.0)
==11218==    by 0x4FA1034D: tensorflow::DirectSession::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*, tsl::thread::ThreadPoolOptions const&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_cc.so.2.12.0)
==11218==    by 0x42F1D4D4: tensorflow::run(tensorflow::Session*, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tsl::thread::ThreadPoolOptions const&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libPhysicsToolsTensorFlow.so)
==11218==    by 0x42F1D588: tensorflow::run(tensorflow::Session*, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tsl::thread::ThreadPoolInterface*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libPhysicsToolsTensorFlow.so)
==11218==    by 0x5F36B736: (anonymous namespace)::TfDnn::operator()(std::vector<reco::Track, std::allocator<reco::Track> > const&, reco::BeamSpot const&, std::vector<reco::Vertex, std::allocator<reco::Vertex> > const&) const (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginRecoTrackerFinalTrackSelectorsPlugins.so)
==11218==    by 0x5F36C085: TrackMVAClassifier<(anonymous namespace)::TfDnn, void>::computeMVA(std::vector<reco::Track, std::allocator<reco::Track> > const&, reco::BeamSpot const&, std::vector<reco::Vertex, std::allocator<reco::Vertex> > const&, std::vector<std::pair<float, bool>, std::allocator<std::pair<float, bool> > >&) const (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginRecoTrackerFinalTrackSelectorsPlugins.so)
==11218==    by 0x5F3E9AF7: TrackMVAClassifierBase::produce(edm::Event&, edm::EventSetup const&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libRecoTrackerFinalTrackSelectors.so)
==11218==    by 0x4C6C4B0: edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so)
==11218==    by 0x4C50F1D: edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so)
==11218==    by 0x4BDCA18: std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so)
==11218==    by 0x4BDCFB3: edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so)
==11218==    by 0x41E7F27: tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreConcurrency.so)
==11218==    by 0x640B290: tbb::detail::r1::task_dispatcher::execute_and_wait(tbb::detail::d1::task*, tbb::detail::d1::wait_context&, tbb::detail::d1::task_group_context&) (task_dispatcher.h:322)
==11218==    by 0x4B6121A: edm::FinalWaitingTask::wait() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so)
==11218==    by 0x4B6ABC4: edm::EventProcessor::processRuns() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so)
==11218==    by 0x4B6B18E: edm::EventProcessor::runToCompletion() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so)
==11218==    by 0x4074EE: tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRun)
==11218==    by 0x63F79BC: tbb::detail::r1::task_arena_impl::execute(tbb::detail::d1::task_arena_base&, tbb::detail::d1::delegate_base&) (arena.cpp:688)
==11218==    by 0x408ED1: main::{lambda()#1}::operator()() const (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRun)
==11218==    by 0x40518B: main (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRun)
client stack range: [0x1FFEFAF000 0x1FFEFF6FFF] client SP: 0x1FFEFF1060
valgrind stack range: [0x1008AA6000 0x1008BA5FFF] top usage: 15736 of 1048576

@makortel
Contributor

makortel commented Nov 30, 2023

Just shooting in the dark, I increased the stack size per thread from 10 MB to 20 MB with process.options.sizeOfStackForThreadsInKB = 20*1024, and with that I was able to run workflow 136.889 step 3 customized to run initialStep through IgProf. The full step 3 configuration, however, still crashed. So maybe stack size does not play a role.
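A minimal sketch of the stack-size customization mentioned above. Only `process.options.sizeOfStackForThreadsInKB` is taken from the comment; the function name `customise_thread_stack_size` and the `SimpleNamespace` stand-in for a real `cms.Process` are hypothetical, used so that the snippet runs outside CMSSW.

```python
from types import SimpleNamespace


def customise_thread_stack_size(process, size_mb=20):
    """Set the per-thread stack size, given in MB, on a process-like object.

    The option itself is expressed in KB, hence the factor of 1024.
    """
    process.options.sizeOfStackForThreadsInKB = size_mb * 1024
    return process


# Hypothetical stand-in for a cms.Process, starting from the 10 MB value
# mentioned above; in a real job the configuration provides `process`.
process = SimpleNamespace(
    options=SimpleNamespace(sizeOfStackForThreadsInKB=10 * 1024)
)
customise_thread_stack_size(process, size_mb=20)
print(process.options.sizeOfStackForThreadsInKB)
```

In a real configuration the same assignment would presumably be applied directly to the job's `process` object; the helper only makes the MB-to-KB conversion explicit.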

@VinInn
Contributor Author

VinInn commented Dec 1, 2023

The heap-allocator "meta-data" are most probably corrupted. Is this with jemalloc? Maybe TensorFlow makes assumptions that jemalloc does not comply with (alignment?).

@makortel
Contributor

makortel commented Dec 1, 2023

The heap-allocator "meta-data" are most probably corrupted. Is this with jemalloc? Maybe TensorFlow makes assumptions that jemalloc does not comply with (alignment?).

My recent tests were indeed with jemalloc. My HLT test case crashed with glibc malloc as well (#42444 (comment)) when run through VTune. I could test the offline workflows and IgProf with glibc malloc too.

@gartung
Member

gartung commented Jan 17, 2024

oneDNN can set the maximum instruction set architecture used for its JIT-compiled code through the environment variable ONEDNN_MAX_CPU_ISA:

https://www.intel.com/content/www/us/en/docs/onednn/developer-guide-reference/2023-1/cpu-dispatcher-control.html#DOXID-DEV-GUIDE-CPU-DISPATCHER-CONTROL

@smuzaffar
Contributor

oneDNN can set the maximum instruction set architecture used for its JIT-compiled code through the environment variable ONEDNN_MAX_CPU_ISA:

https://www.intel.com/content/www/us/en/docs/onednn/developer-guide-reference/2023-1/cpu-dispatcher-control.html#DOXID-DEV-GUIDE-CPU-DISPATCHER-CONTROL

@gartung , did you try setting it to SSE41 and do you still get segfaults?

@gartung
Member

gartung commented Jan 17, 2024

The oneDNN library must be compiled with cmake option ONEDNN_ENABLE_MAX_CPU_ISA=1. Not sure how Tensorflow compiles oneDNN.

@smuzaffar
Contributor

The oneDNN library must be compiled with cmake option ONEDNN_ENABLE_MAX_CPU_ISA=1. Not sure how Tensorflow compiles oneDNN.

ONEDNN_ENABLE_MAX_CPU_ISA is ON by default. I have checked the TF sources and they do not set it explicitly, so it should already be enabled.

@gartung
Member

gartung commented Jan 18, 2024

Adding TF_ENABLE_ONEDNN_OPTS=0 and ONEDNN_MAX_CPU_ISA=SSE41 to the environment did not prevent the segfault.
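A minimal sketch of how a job could be launched with these two variables added to its environment. The variable names and values come from the comment above; the helpers `env_with_overrides` and `run_with_env` are hypothetical, and the child command is a stand-in Python one-liner rather than the actual `cmsRun` invocation.

```python
import os
import subprocess
import sys

# Variables and values as quoted in the comment above.
ONEDNN_OVERRIDES = {
    "TF_ENABLE_ONEDNN_OPTS": "0",
    "ONEDNN_MAX_CPU_ISA": "SSE41",
}


def env_with_overrides(overrides, base=None):
    """Return a copy of the base environment with the overrides applied."""
    env = dict(os.environ if base is None else base)
    env.update(overrides)
    return env


def run_with_env(command, overrides):
    """Run `command` in a child process that sees the overridden variables."""
    return subprocess.run(
        command,
        env=env_with_overrides(overrides),
        capture_output=True,
        text=True,
        check=False,
    )


# Stand-in child process that just echoes what it received.
child = [
    sys.executable,
    "-c",
    "import os; print(os.environ['TF_ENABLE_ONEDNN_OPTS'], "
    "os.environ['ONEDNN_MAX_CPU_ISA'])",
]
result = run_with_env(child, ONEDNN_OVERRIDES)
print(result.stdout.strip())
```

Passing the environment explicitly to the child keeps the parent shell untouched, which makes it easier to compare runs with different settings.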

@gartung
Member

gartung commented Jan 18, 2024

@gartung
Member

gartung commented Jan 18, 2024

There is an example of running perf on the JIT-compiled code with the appropriate environment variables:
https://www.intel.com/content/www/us/en/docs/onednn/developer-guide-reference/2023-1/profiling-onednn-performance.html#EXAMPLE-PROFILING-WITH-LINUX-PERF

@gartung
Member

gartung commented Jan 19, 2024

Setting JITDUMPDIR=. ONEDNN_JIT_PROFILE=14 ONEDNN_MAX_CPU_ISA=SSE41 TF_ENABLE_ONEDNN_OPTS=0 in the profiling script before running Igprof has so far allowed the Jenkins job to complete without segfaulting.

@gartung
Member

gartung commented Jan 20, 2024

I was incorrect about the environment variables before. I think I had TF_ENABLE_ONEDNN_OPTS=1. There were two jobs that were killed after 8 hours stuck in step 4. I can only assume that restriction to SSE41 slowed the inference down significantly. I changed the environment variables to just JITDUMPDIR=. ONEDNN_JIT_PROFILE=14 and ended up with segfaults. I am trying this set of environment variables next: JITDUMPDIR=. ONEDNN_JIT_PROFILE=14 ONEDNN_MAX_CPU_ISA=AVX2 TF_ENABLE_ONEDNN_OPTS=1

@gartung
Member

gartung commented Jan 21, 2024

I made one more try with ONEDNN_CPU_ISA_HINTS=PREFER_YMM and got an abort in TensorFlow:

----- Begin Fatal Exception 21-Jan-2024 02:19:38 CET-----------------------
An exception of category 'InvalidRun' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 3 stream: 0
   [1] Running path 'AODSIMoutput_step'
   [2] Prefetching for module AsciiOutputModule/'AODSIMoutput'
   [3] Prefetching for module ReducedRecHitCollectionProducer/'reducedEcalRecHitsEB'
   [4] Prefetching for module InterestingDetIdCollectionProducer/'interestingEcalDetIdRefinedEB'
   [5] Prefetching for module PFEGammaProducer/'particleFlowEGamma'
   [6] Prefetching for module PFBlockProducer/'particleFlowBlock'
   [7] Prefetching for module PFElecTkProducer/'pfTrackElec'
   [8] Prefetching for module GsfTrackProducer/'electronGsfTracks'
   [9] Prefetching for module CkfTrackCandidateMaker/'electronCkfTrackCandidates'
   [10] Prefetching for module ElectronSeedMerger/'electronMergedSeeds'
   [11] Prefetching for module ElectronSeedProducer/'ecalDrivenElectronSeeds'
   [12] Prefetching for module PFECALSuperClusterProducer/'particleFlowSuperClusterHGCal'
   [13] Prefetching for module PFClusterProducer/'particleFlowClusterHGCal'
   [14] Prefetching for module TrackstersMergeProducer/'ticlTrackstersMerge'
   [15] Calling method for module TrackstersProducer/'ticlTrackstersCLUE3DHigh'
Exception Message:
error while running session: ABORTED: Operation received an exception:Status: 2, message: could not execute a primitive, in file tensorflow/core/kernels/mkl/mkl_matmul_op_fused.cc:272
	 [[{{node model/dense2/Relu}}]]

@smuzaffar
Contributor

FYI, we now have CMSSW_14_0_MKLDNN0_X IBs where TensorFlow is built with tensorflow_mkldnn_contraction_kernel=0. WF 140.201 step2 runs fine under VTune and IgProf.

@gartung
Member

gartung commented Jan 30, 2024

The results I got from running run-ib-profiling still showed segfaults, although they are not in modules that use TensorFlow:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/igprof/CMSSW_14_0_MKLDNN0_X_2024-01-28-2300/el8_amd64_gcc12/profiling/

@gartung
Member

gartung commented Jan 30, 2024

@smuzaffar I saw this commit while looking at libeigen on gitlab. Could it be related?
"Don't crash on empty tensor contraction."
https://gitlab.com/libeigen/eigen/-/commit/b0f877f8e01e90a5b0f3a79d46ea234899f8b499

@valsdav
Contributor

valsdav commented Jan 30, 2024

The DeepTauId model manipulates very sparse tensors. Could it be that Eigen treats zero tensors as empty tensors and then contracts them?
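To make the distinction in that question concrete, here is a small pure-Python sketch (not Eigen or TensorFlow code) of a contraction applied to an all-zero input and to an empty input. The helper names `contract`, `is_empty`, and `is_all_zero`, as well as the example shapes, are hypothetical and only illustrate the two cases.

```python
def contract(batch, weights):
    """Matrix product of `batch` (rows x K) with `weights` (K x N), as nested lists."""
    n_inner = len(weights)
    n_cols = len(weights[0]) if weights else 0
    return [
        [sum(row[k] * weights[k][j] for k in range(n_inner)) for j in range(n_cols)]
        for row in batch
    ]


def is_empty(tensor):
    """True if the tensor has no elements at all (some dimension is zero)."""
    return len(tensor) == 0 or len(tensor[0]) == 0


def is_all_zero(tensor):
    """True if the tensor has elements and every one of them is zero."""
    return not is_empty(tensor) and all(v == 0 for row in tensor for v in row)


weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # K = 3, N = 2
zero_batch = [[0.0, 0.0, 0.0]]                   # one row, all values zero
empty_batch = []                                 # zero rows, e.g. no candidates

print(is_empty(zero_batch), contract(zero_batch, weights))
print(is_empty(empty_batch), contract(empty_batch, weights))
```

An all-zero input still has a well-defined shape and produces a full row of zeros, whereas an empty input has a zero-sized dimension and produces no rows at all. Whether Eigen ever conflates the two is exactly the open question; this sketch does not answer it.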

This may be correlated with the fact that this crash does not happen on other TF models.

@makortel
Contributor

This may be correlated with the fact that this crash does not happen on other TF models.

There were some occurrences in TrackMVAClassifier as well.

@makortel
Contributor

The results I got from running run-ib-profiling still showed segfaults, although they are not in modules that use TensorFlow: https://cmssdt.cern.ch/SDT/jenkins-artifacts/igprof/CMSSW_14_0_MKLDNN0_X_2024-01-28-2300/el8_amd64_gcc12/profiling/

I vaguely recall we saw some cases before too where the crash was in a destructor, like in

Thread 1 (Thread 0x7fdc9b61c680 (LWP 3057894) "cmsRun"):
#3  0x00007fdc942bf720 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_MKLDNN0_X_2024-01-28-2300/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007fdc2f5ea773 in edm::Wrapper<std::vector<TrajectorySeed, std::allocator<TrajectorySeed> > >::~Wrapper() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_MKLDNN0_X_2024-01-28-2300/lib/el8_amd64_gcc12/pluginRecoTrackerTkSeedGeneratorPlugins.so
#6  0x00007fdc9eb4d2a7 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_MKLDNN0_X_2024-01-28-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so
#7  0x00007fdc9ebf22c6 in edm::DataManagingProductResolver::resetProductData_(bool) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_MKLDNN0_X_2024-01-28-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so
#8  0x00007fdc9ebe395b in edm::Principal::clearPrincipal() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_MKLDNN0_X_2024-01-28-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so
#9  0x00007fdc9eb5daad in edm::EventPrincipal::clearEventPrincipal() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_MKLDNN0_X_2024-01-28-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so
#10 0x00007fdc9eb97397 in edm::FunctorWaitingTask<edm::waiting_task::detail::WaitingTaskChain<edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::processEventAsyncImpl(edm::WaitingTaskHolder, unsigned int)::{lambda(auto:1)#4}>, edm::waiting_task::detail::Conditional<edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::processEventAsyncImpl(edm::WaitingTaskHolder, unsigned int)::{lambda(auto:1)#3}> >, edm::waiting_task::detail::Conditional<edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::processEventAsyncImpl(edm::WaitingTaskHolder, unsigned int)::{lambda(auto:1)#2}> >, edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::processEventAsyncImpl(edm::WaitingTaskHolder, unsigned int)::{lambda(auto:1)#1}> >::runLast(edm::WaitingTaskHolder)::{lambda(std::__exception_ptr::exception_ptr const*)#1}>::execute() [clone .lto_priv.0] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_MKLDNN0_X_2024-01-28-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so

https://cmssdt.cern.ch/SDT/jenkins-artifacts/igprof/CMSSW_14_0_MKLDNN0_X_2024-01-28-2300/el8_amd64_gcc12/profiling/11834.21/step3_igprof_cpu.log

My interpretation was, and still is, that this kind of crash is compatible with memory corruption, which seems to me to be the overarching theme of these problems.

@smuzaffar
Contributor

@smuzaffar I saw this commit while looking at libeigen on gitlab. Could it be related? "Don't crash on empty tensor contraction." https://gitlab.com/libeigen/eigen/-/commit/b0f877f8e01e90a5b0f3a79d46ea234899f8b499

@gartung , I have tested this eigen fix cms-externals/eigen-git-mirror#8 but both igprof and vtune still crash. I tried running #42444 (comment) after setting the env from /cvmfs/cms-ci.cern.ch/week0/cms-externals/eigen-git-mirror/8/37124/CMSSW_14_0_X_2024-01-30-2300

@gartung
Member

gartung commented Feb 4, 2024

I changed run-ib-profiling to run VTune instead of IgProf, and there are segfaults when running workflows involving TensorFlow:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/igprof/CMSSW_14_0_X_2024-02-04-0000/el8_amd64_gcc12/profiling/

@smuzaffar
Contributor

cms-sw/cmsdist#9021 includes the "Don't crash on empty tensor contraction" fix. Note that this fix is already in the Eigen version used by the TF_X IBs (which are based on TensorFlow 2.15 + a newer Eigen).

@makortel
Contributor

makortel commented Sep 6, 2024

FYI, we now have CMSSW_14_0_MKLDNN0_X IBs where TensorFlow is built with tensorflow_mkldnn_contraction_kernel=0. WF 140.201 step2 runs fine under VTune and IgProf.

I started to wonder whether a TensorFlow built with tensorflow_mkldnn_contraction_kernel=0 would be ABI compatible with our production TF build. (I am thinking about the feasibility of keeping the present TF+mkldnn setup in production, while allowing it to be easily switched to another build that would work with profilers.)

@makortel
Contributor

I ran some tests again with VTune (2024.2) on CMSSW_14_1_0_pre7 on EL8 natively.

With 12634.21 (2023 TTBar+PU MC ProdLike) step3 I ran 4 attempts with all succeeding (I didn't check the profiles though).

With 136.889 (2018D MET data) step3 I reproduce a failure, although now with exceptions of

----- Begin Fatal Exception 11-Sep-2024 22:02:40 CEST-----------------------
An exception of category 'Configuration' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing module: class=PhotonMVAValueMapProducer label='photonMVAValueMapProducer'
Exception Message:
failed to parse ""
----- End Fatal Exception -------------------------------------------------

or

----- Begin Fatal Exception 11-Sep-2024 22:28:49 CEST-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing module: class=PhotonMVAValueMapProducer label='photonMVAValueMapProducer'
   Additional Info:
      [a] Fatal Root Error: @SUB=TFile::TFile
file /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_0_pre7/external/el8_amd64_gcc12/data/RecoEgamma/PhotonIdentification/data/MVA/Fall17/EE_V2.weights.root can not be opened for reading Too many open files

----- End Fatal Exception -------------------------------------------------
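The "Too many open files" message corresponds to the process hitting its file-descriptor limit (`RLIMIT_NOFILE`). As a hedged diagnostic sketch, the limit and the current usage can be inspected as below. The helper names `open_fd_count` and `fd_headroom` are hypothetical, the `/proc/self/fd` lookup is Linux-specific, and nothing here establishes why the limit is reached under the profiler.

```python
import os
import resource


def open_fd_count():
    """Number of descriptors open in this process, or None if /proc is unavailable."""
    try:
        return len(os.listdir("/proc/self/fd"))
    except OSError:
        return None


def fd_headroom():
    """Report the soft and hard RLIMIT_NOFILE values and the current usage."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return {"soft_limit": soft, "hard_limit": hard, "open": open_fd_count()}


print(fd_headroom())
```

Comparing these numbers for the same job with and without the profiler attached would show whether the profiler itself contributes to the descriptor count.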

I tested the environment variables

  • ONEDNN_MAX_CPU_ISA=SSE41
  • ONEDNN_JIT_PROFILE=1
  • TF_ENABLE_ONEDNN_OPTS=0

separately and all together, and every combination resulted in a failure.
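The test matrix described above (each variable alone, then all three together) can be sketched as follows. The variable names and values are those listed in the comment; the helpers `build_test_matrix` and `as_shell_prefix` are hypothetical and only print the environment prefixes, without running any job.

```python
# Variables and values as listed above.
VARIABLES = {
    "ONEDNN_MAX_CPU_ISA": "SSE41",
    "ONEDNN_JIT_PROFILE": "1",
    "TF_ENABLE_ONEDNN_OPTS": "0",
}


def build_test_matrix(variables):
    """Return each variable on its own, followed by all of them together."""
    singles = [{name: value} for name, value in variables.items()]
    return singles + [dict(variables)]


def as_shell_prefix(overrides):
    """Format the overrides as a NAME=value prefix for a shell command."""
    return " ".join(f"{name}={value}" for name, value in overrides.items())


for overrides in build_test_matrix(VARIABLES):
    print(as_shell_prefix(overrides))
```

This yields four configurations in total; each printed prefix could be placed in front of the profiled command.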

@fwyzard
Contributor

fwyzard commented Oct 22, 2024

I started to wonder whether a TensorFlow built with tensorflow_mkldnn_contraction_kernel=0 would be ABI compatible with our production TF build. (I am thinking about the feasibility of keeping the present TF+mkldnn setup in production, while allowing it to be easily switched to another build that would work with profilers.)

Would it make sense (and work ?) to wrap the calls to TensorFlow with

#include <ittnotify.h>

...

__itt_pause();   // Pause profiling
// call to TensorFlow goes here
__itt_resume();  // Resume profiling

?

@fwyzard
Contributor

fwyzard commented Oct 23, 2024

Mhm, no, because __itt_pause()/__itt_resume() affect the whole application, not just the calling thread :-/
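The problem can be illustrated with a toy model. The sketch below does not call the ITT API; `ProcessWideCollector` is a hypothetical stand-in for a profiler whose pause/resume state is shared by the whole process. It shows how a pause issued by one thread also drops samples from an unrelated thread.

```python
import threading


class ProcessWideCollector:
    """Toy stand-in for a profiler whose pause/resume state is process-wide."""

    def __init__(self):
        self._lock = threading.Lock()
        self._paused = False
        self.samples = []

    def pause(self):
        with self._lock:
            self._paused = True

    def resume(self):
        with self._lock:
            self._paused = False

    def record(self, label):
        # Samples are kept only while collection is not paused.
        with self._lock:
            if not self._paused:
                self.samples.append(label)


collector = ProcessWideCollector()
paused = threading.Event()
other_done = threading.Event()


def inference_thread():
    collector.pause()      # intended to hide only this thread's work
    paused.set()
    other_done.wait()      # the "inference" is still in progress here
    collector.resume()
    collector.record("inference thread, after resume")


def unrelated_thread():
    paused.wait()
    collector.record("unrelated module")  # dropped: the pause is global
    other_done.set()


threads = [
    threading.Thread(target=inference_thread),
    threading.Thread(target=unrelated_thread),
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(collector.samples)
```

Only the sample recorded after the resume survives; the work done by the unrelated thread during the pause is lost, which is the situation described above for a multi-threaded job.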

@gartung
Member

gartung commented Oct 23, 2024

I have a draft pull request for cmsdist that sets enable-tf-mkldnn in the rpm spec file
cms-sw/cmsdist#9471 (comment)
Running workflows 23834.21 and 12634.21 on cmsdev40 with export TF_ENABLE_ONEDNN_OPTS=1 has not produced any crashes.

@makortel
Contributor

makortel commented Oct 24, 2024

With the CMSSW installation provided by cms-sw/cmsdist#9471 (comment) workflow 136.889 step 3 + VTune still fails for me, with or without TF_ENABLE_ONEDNN_OPTS=1.

@gartung I thought disabling the OneDNN JITting helped earlier?

@gartung
Member

gartung commented Oct 25, 2024

I thought so too. Maybe it was a different TensorFlow build option.
