Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactoring of TICL EnergyRegression and PID model #45821

Conversation

Moanwar
Copy link
Contributor

@Moanwar Moanwar commented Aug 28, 2024

This PR adds CNN models trained using TICLv5 reconstruction information through a simple CNN-based approach. Two separate models were developed: one for trackster energy regression and another for particle ID. These models are integrated into different reconstruction steps, specifically CLU3D (pattern recognition) and trackster linking. The models are saved in ONNX format to optimize processing time.

Additionally, we have moved the inference to the TracksterProducer and TracksterLinksProducer, instead of running the inference within the algorithm. This was achieved by creating a new TICL inference plugin system, which allows for the inclusion of more models without conflicting with the old ones. We have also converted the old TensorFlow model of TICLv4 used for PID and regression to ONNX format and created a separate inference plugin, TracksterInferenceByCNNv4.

-This PR needs to be tested with the following cms-data PR: cms-data/RecoHGCal-TICL#6.

-Additionally, this PR should be tested on the .203 workflow :

-test parameters:
workflow_opts= -w upgrade
workflow = 29888.203,29688.203

Tagging @felicepantaleo @waredjeb @hatakeyamak

@cmsbuild
Copy link
Contributor

cmsbuild commented Aug 28, 2024

cms-bot internal usage

@cmsbuild
Copy link
Contributor

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @Moanwar for master.

It involves the following packages:

  • HLTrigger/Configuration (hlt)
  • RecoHGCal/TICL (reconstruction, upgrade)

@Martin-Grunewald, @cmsbuild, @jfernan2, @mandrenguyen, @mmusich, @srimanob, @subirsarkar can you please review it and eventually sign? Thanks.
@Martin-Grunewald, @SohamBhattacharya, @VourMa, @apsallid, @felicepantaleo, @forthommel, @hatakeyamak, @lecriste, @missirol, @mmusich, @rovere, @sameasy, @silviodonato, @sobhatta, @youyingli this is something you requested to watch as well.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@mmusich
Copy link
Contributor

mmusich commented Aug 28, 2024

test parameters:

@felicepantaleo
Copy link
Contributor

@cmsbuild please test

RecoHGCal/TICL/interface/TracksterInferenceByCNNv4.h Outdated Show resolved Hide resolved
RecoHGCal/TICL/interface/TracksterInferenceByDNN.h Outdated Show resolved Hide resolved
@@ -0,0 +1,22 @@
#ifndef RecoHGCal_TICL_TracksterInferenceByANN_H__
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of this plugin? It is also used in the configurations above, but its implementation is empty.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This additional empty plugin is set up in anticipation of future models that might arrive later, so we prepared the configuration for it in advance.

minNumLayerCluster = cms.int32(5),
type = cms.string('FastJet')
),
pluginInferenceAlgoTracksterInferenceByDNN = cms.PSet(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused. I see that here you use the pluginInferenceAlgoTracksterInferenceByDNN with a TICLv4 model, while in the TracksterProducer and the linking variant, there is also the plugin TracksterInferenceByCNNv4. If we configure the DNN one, should we expect regression?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Marco, so this additional empty plugin is set up in anticipation of future models that might arrive later, so we prepared the configuration for it in advance. Thanks!

@cmsbuild
Copy link
Contributor

+1

Size: This PR adds an extra 12KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7fb892/41549/summary.html
COMMIT: c0d98f4
CMSSW: CMSSW_14_2_X_2024-09-16-2300/el8_amd64_gcc12
Additional Tests: HLT_P2_TIMING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/45821/41549/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7fb892/41549/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7fb892/41549/git-merge-result

  • DAS Queries: The DAS query tests failed, see the summary page for details.

  • HLT P2 Timing: chart

Comparison Summary

Summary:

  • You potentially added 696 lines to the logs
  • Reco comparison results: 2175 differences found in the comparisons
  • DQMHistoTests: Total files compared: 46
  • DQMHistoTests: Total histograms compared: 3532200
  • DQMHistoTests: Total failures: 522
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 3531657
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.004 KiB( 45 files compared)
  • DQMHistoSizes: changed ( 29888.203 ): 0.004 KiB MessageLogger/Warnings
  • Checked 202 log files, 171 edm output root files, 46 DQM output files
  • TriggerResults: found differences in 4 / 44 workflows

@mmusich
Copy link
Contributor

mmusich commented Sep 18, 2024

HLT P2 Timing: chart

the phase-2 HLT timing shows a rather hefty reduction in timing from 5547.1 ms/ev base to 5355.9 ms/ev target, which is a -3.4% reduction. Is it what you expect from the changes proposed here?

@Moanwar
Copy link
Contributor Author

Moanwar commented Sep 18, 2024

Hi @mmusich, Indeed, this is one of the expected outcomes intended by this new plugin system. First, it allows the dynamic use of different ML models for PID and regression based on the stages, enabling smooth transitions between them as well. Additionally, using ONNX model formats helps optimize prediction time, making this a good result. Thanks!

@mmusich
Copy link
Contributor

mmusich commented Sep 18, 2024

+hlt

@jfernan2
Copy link
Contributor

+1

@Moanwar
Copy link
Contributor Author

Moanwar commented Sep 24, 2024

+Upgrade

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @antoniovilela, @rappoccio, @mandrenguyen, @sextonkennedy (and backports should be raised in the release meeting by the corresponding L2)
Notice This PR was tested with additional Pull Request(s), please also merge them if necessary: cms-data/RecoHGCal-TICL#6

@antoniovilela
Copy link
Contributor

Merged cms-data/RecoHGCal-TICL#6
cmsdist: cms-sw/cmsdist#9435

@antoniovilela
Copy link
Contributor

+1

@cmsbuild cmsbuild merged commit 715dac7 into cms-sw:master Sep 25, 2024
13 checks passed
@mmusich
Copy link
Contributor

mmusich commented Sep 26, 2024

@Moanwar @cms-sw/hcal-dpg-l2

I suspect this PR might be responsible of widespread failures across builds of the wf 32034.0_TTbar_14TeV+2026D115, see e.g. log:

Thread 5 (Thread 0x1499a37ff700 (LWP 776810) "cmsRun"):
#0  0x00001499fa343ac1 in poll () from /lib64/libc.so.6
#1  0x00001499f6201527 in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02856/el8_amd64_gcc12/cms/cmssw/CMSSW_14_2_X_2024-09-25-2300/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2  0x00001499f6201724 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02856/el8_amd64_gcc12/cms/cmssw/CMSSW_14_2_X_2024-09-25-2300/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00001499257fe9fa in ticl::PatternRecognitionbyCA<TICLGenericTile<std::array<TICLLayerTileT<ticl::TileConstantsHFNose>, 16ul> > >::filter(std::vector<ticl::Trackster, std::allocator<ticl::Trackster> >&, std::vector<ticl::Trackster, std::allocator<ticl::Trackster> > const&, ticl::PatternRecognitionAlgoBaseT<TICLGenericTile<std::array<TICLLayerTileT<ticl::TileConstantsHFNose>, 16ul> > >::Inputs const&, std::unordered_map<int, std::vector<int, std::allocator<int> >, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::vector<int, std::allocator<int> > > > >&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02856/el8_amd64_gcc12/cms/cmssw/CMSSW_14_2_X_2024-09-25-2300/lib/el8_amd64_gcc12/pluginRecoHGCalTICLPlugins.so
#5  0x0000149925888630 in TrackstersProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02856/el8_amd64_gcc12/cms/cmssw/CMSSW_14_2_X_2024-09-25-2300/lib/el8_amd64_gcc12/pluginRecoHGCalTICLPlugins.so
#6  0x00001499fcdd2e93 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02856/el8_amd64_gcc12/cms/cmssw/CMSSW_14_2_X_2024-09-25-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so
#7  0x00001499fcdba4dc in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02856/el8_amd64_gcc12/cms/cmssw/CMSSW_14_2_X_2024-09-25-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so
#8  0x00001499fcd3e929 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02856/el8_amd64_gcc12/cms/cmssw/CMSSW_14_2_X_2024-09-25-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so
#9  0x00001499fcd3ee21 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02856/el8_amd64_gcc12/cms/cmssw/CMSSW_14_2_X_2024-09-25-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so
#10 0x00001499fcac0238 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02856/el8_amd64_gcc12/cms/cmssw/CMSSW_14_2_X_2024-09-25-2300/lib/el8_amd64_gcc12/libFWCoreConcurrency.so
#11 0x00001499fb4b2b3b in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x1498f0b26800, waiter=..., this=0x1499f8bc9500) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#12 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (t=0x0, waiter=..., this=0x1499f8bc9500) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#13 tbb::detail::r1::arena::process (tls=..., this=<optimized out>) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/arena.cpp:137
#14 tbb::detail::r1::market::process (this=<optimized out>, j=...) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/market.cpp:599
#15 0x00001499fb4b4cee in tbb::detail::r1::rml::private_worker::run (this=0x1499f6b74000) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/private_server.cpp:271
#16 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x1499f6b74000) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-2391c941213c757dc9a1835b31681235/tbb-v2021.9.0/src/tbb/private_server.cpp:221
#17 0x00001499fa5ef1ca in start_thread () from /lib64/libpthread.so.0
#18 0x00001499fa24a8d3 in clone () from /lib64/libc.so.6

can you have a look?

@felicepantaleo
Copy link
Contributor

felicepantaleo commented Sep 27, 2024

@Moanwar it's HFnose reconstruction which is not maintained since a long time. But this does not mean we should break it ;-)

You can test it in d115.


)

from Configuration.ProcessModifiers.ticl_v5_cff import ticl_v5
ticl_v5.toModify(ticlTrackstersCLUE3DHigh.pluginPatternRecognitionByCLUE3D, computeLocalTime = cms.bool(True))
ticl_v5.toModify(ticlTrackstersCLUE3DHigh.pluginPatternRecognitionByCLUE3D, usePCACleaning = cms.bool(True))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was usePCACleaning set to False?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sameasy fyi

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants