Name	Name	Last commit message	Last commit date
parent directory ..
ATLAS_Higgs_opendata	ATLAS_Higgs_opendata
CMS_Higgs_opendata	CMS_Higgs_opendata
Dimuon_mass_spectrum	Dimuon_mass_spectrum
HEP_benchmark	HEP_benchmark
LHCb_opendata	LHCb_opendata
misc	misc
README.md	README.md
Spark_Root_data_preparation.md	Spark_Root_data_preparation.md
Uproot_example.md	Uproot_example.md

Apache Spark for High Energy Physics

This collects a few simple examples of how Apache Spark can be used in the domain of High Energy Physics data analysis.
Most of the examples are just for education purposes, use a small subset of data and can be run on laptop-sized computing resources.
See also the blog post Can High Energy Physics Analysis Profit from Apache Spark APIs?

1. Dimuon mass spectrum analysis

This is a sort of "Hello World!" example for High Energy Physics analysis.
The implementations proposed here using Apache Spark APIs are a direct "Spark translation" of a tutorial using ROOT DataFrame

Data

These examples use CMS open data from 2012:
- DOI: 10.7483/OPENDATA.CMS.YLIC.86ZZ and DOI: 10.7483/OPENDATA.CMS.M5AD.Y3V3)
Data has been converted and made available for this work in snappy-compressed Apache Parquet and Apache ORC formats
You can download the following datasets:
- 61 million events (2 GB)
  - original files in ROOT format: root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root
    - see notes on how to access data using the XRootD protocol (root://) and how to read it.
  - dataset converted to Parquet: Run2012BC_DoubleMuParked_Muons.parquet
  - dataset converted to ORC: Run2012BC_DoubleMuParked_Muons.orc
- 6.5 billion events (200 GB, this is the 2GB dataset repeated 105 times)
  - original files, in ROOT format root://eospublic.cern.ch//eos/root-eos/benchmark/CMSOpenDataDimuon
  - dataset converted to Parquet: CMSOpenDataDimuon_large.parquet
    - download using wget -r -np -R "index.html*" -e robots=off https://sparkdltrigger.web.cern.ch/sparkdltrigger/CMSOpenDataDimuon_large.parquet/
  - dataset converted to ORC: CMSOpenDataDimuon_large.orc
    - download using wget -r -np -R "index.html*" -e robots=off https://sparkdltrigger.web.cern.ch/sparkdltrigger/CMSOpenDataDimuon_large.orc/

Notebooks

Multiple notebook solutions are provided, to illustrate different approaches with Apache Spark.
Notes on the execution environment:

The notebooks use the dataset with 61 million events (Except the SCALE test that uses 6.5 billion events)
Compatibility and tests: these notebooks have been developed with Spark 3.2.1 and tested up to Spark 3.5.1
- For this we use Apache Spark vectorized reader for complex types in Parquet, mapInArrow UDF (introduced in Spark 3.3.0).
- The server used for testing and measuring the execution time has 4 physical CPU cores, SSD disk storage , and 32 GB of RAM (which is more than the minimum requirements for running this workload)

Notebook	Run Time	Short description
1. DataFrame API, Parquet	11 sec	The analysis is implemented using Apache Spark DataFrame API. This uses the dataset in Apache Parquet format
2. DataFrame API, ORC	11 sec	Same as 1a., with the exception that this uses the dataset in Apache ORC format
3. Spark SQL, Parquet	11 sec	The analysis is implemented using Spark SQL
4. Scala UDF	12 sec	Implementation that mixes DataFrame API and Scala UDF. The dimuon invariant mass formula computation is implemented using a UDF written in Scala. Link to Scala UDF code
5a. Pandas UDF flattened data	19 sec	Implementation that mixes DataFrame API and Python Pandas UDF. The dimuon invariant mass formula computation is implemented using a Pandas UDF
5b. Pandas UDF data arrays	82 sec	Same as 5a, but the Pandas UDF in this case uses data in arrays and lists
6. MapInArrow flattened data	22 sec	This uses mapInArrow, introduced in SPARK-37227
7. RumbleDB on Spark	155 sec	This implementation runs with RumbleDB query engine on top of Apache Spark. RumbleDB implements the JSONiq language.
8. DataFrame API at scale, Parquet	(*)35 sec	(*)This has processed 6.5 billion events, at scale on a cluster using 200 CPU cores.
Dimuon spectrum analysis on Colab	-	You can run this on Google's Colaboratory
Dimuon spectrum analysis on CERN SWAN	-	You can run this on CERN SWAN (requires CERN SSO credentials)

2. HEP analysis benchmark

Here you will find an implementation of the High Energy Physics benchmark tasks using Apache Spark.
It follows the IRIS-HEP benchmark specifications and solutions linked there.
Solutions to the benchmark tasks are also directly inspired by the article Evaluating Query Languages and Systems for High-Energy Physics Data.

Data

For this exercise we use CMS open data from 2012, made available via the CERN opendata portal: DOI:10.7483/OPENDATA.CMS.IYVQ.1J0W
Data has been converted and made available for this work in snappy-compressed Apache Parquet and Apache ORC formats
A list of the downloadable datasets for this analysis:
53 million events (16 GB), original files in ROOT format: root://eospublic.cern.ch//eos/root-eos/benchmark/Run2012B_SingleMu.root
- see notes on how to access data using the XRootD protocol (root://) and how to read it.
53 million events (16 GB), converted to Parquet: Run2012BC_DoubleMuParked_Muons.parquet
- download using wget -r -np -R "index.html*" -e robots=off https://sparkdltrigger.web.cern.ch/sparkdltrigger/Run2012B_SingleMu.parquet/
7 million events (2 GB) Parquet format Run2012B_SingleMu_sample.parquet

Notebooks

Notes on the execution environment:

The notebooks can be run using the datasets with 53 million events for more accuracy or with the reduced dataset with 7 million events for faster execution
- Compatibility and tests: these notebooks have been developed with Spark 3.2.1 and tested up to Spark 3.5.1
- Notable Spark features used: Apache Spark vectorized reader for complex types in Parquet, mapInArrow UDF.
- The server used for testing and measuring the execution time has 4 physical CPU cores, SSD disk storage , and 32 GB of RAM (which is more than the minimum requirements for running this workload)

Notebook	Short description
Benchmark tasks 1 to 5, Parquet and SparkHistogram	The analysis is implemented using Apache Spark DataFrame API. It uses the dataset in Apache Parquet format and the histogram generation is done using the sparkhistogram package.
Benchmark task 6	Three different solution are provided. This is the hardest task to implement in Spark. The proposed solutions use also Scala UDFs: link to the Scala UDF code.
Benchmark task 7	Two different solutions provided, one using the explode function, the other with Spark's higher order functions for array processing.
Benchmark task 8	This combines Spark DataFrame API for filtering and Scala UDFs for processing. Link to the Scala UDF code.
Benchmark tasks 1 to 5 on Colab	You can run this on Google's Colaboratory.
Benchmark tasks 1 to 5 on CERN SWAN	You can run this on CERN SWAN (requires CERN SSO credentials).

3. ATLAS Higgs boson analysis - outreach-style

This is an example analysis of the Higgs boson detection via the decay channel H → ZZ* → 4l From the decay products measured at the ATLAS experiment and provided as open data, you will be able to produce a few histograms, comparing experimental data and Monte Carlo (simulation) data. From there you can infer the invariant mass of the Higgs boson.
Disclaimer: this is for educational purposes only, it is not the code nor the data of the official Higgs boson discovery paper.
It is based on the original work on ATLAS outreach notebooks and derived work at this repo and this work Reference: ATLAS paper on the discovery of the Higgs boson (mostly Section 4 and 4.1)

Data

The original data in ROOT format is from the ATLAS Open Datasets
- direct link: ATLAS open data events selected with at least four leptons (electron or muon)
The notebooks presented here use datasets from the original open data events converted to snappy-compressed Apache Parquet format.
- Download from: ATLAS Higgs notebook opendata in Parquet format
  - download all files (200 MB) using wget -r -np -R "index.html*" -e robots=off https://sparkdltrigger.web.cern.ch/sparkdltrigger/ATLAS_Higgs_opendata/

Notebooks

Compatibility and tests: these notebooks have been developed with Spark 3.2.1 and tested up to Spark 3.5.1
These analyses use a very small dataset and are mostly intended to show how the Spark API can be applied in this context, rather than its performance and scalability.

Notebook	Short description
*1. ATLAS opendata Higgs H-ZZ-4l basic analysis**	Basic analysis with experiment (detector) data.
*2. H-ZZ-4l basic analysis with additional filters and cuts**	Analysis with extra cuts and data operations, this uses Monte Carlo (simulation) data.
*3. H-ZZ-4l reproduce Fig 2 of the paper - with experiment data and monte carlo**	Analysis with extra cuts and data operations, this uses experiment (detector) data and Monte Carlo (simulation) data for signal and backgroud. It roughly reproduces Figure 2 of the ATLAS Higgs paper.
*Run ATLAS opendata Higgs H-ZZ-4l basic analysis on Colab**	Basic analysis with experiment (detector) data. This notebook opens on Google's Colab.
*Run ATLAS opendata Higgs H-ZZ-4l basic analysis on CERN SWAN**	Basic analysis with experiment (detector) data. This notebook opens on CERN's SWAN notebook service (requires CERN SSO credentials)

4. CMS Higgs boson analysis - outreach-style

This is an example analysis of the Higgs boson detection via the decay channel H → ZZ* → 4l Disclaimer: this is for educational purposes only, it is not the code nor the data of the official Higgs boson discovery paper.
It is based on the original work on cms opendata notebooks and this derived work
Reference: link to the original article with CMS Higgs boson discovery

Data

The notebooks presented here use datasets from the original open data events converted to snappy-compressed Apache Parquet format.
- Download from: CMS Higgs notebook opendata in Parquet format
- CLI command to download all files (12 GB):
  wget -r -np -R "index.html*" -e robots=off https://sparkdltrigger.web.cern.ch/sparkdltrigger/CMS_Higgs_opendata/
- The original data is from CMS open data and is stored in ROOT format
  - See notes on how to access data using the XRootD protocol (root://) and how to read ROOT data, download from this URL: root://eospublic.cern.ch//eos/root-eos/cms_opendata_2012_nanoaod

Notebooks

Notebook	Short description
*CMS opendata Higgs H-ZZ-4l, basic with monte carlo signal**	Basic analysis and visualization with Monte Carlo (simulation) data.
TBC

5. LHCb matter antimatter asymmetries analysis - outreach-style

This notebook provides an example of how to use Spark to perform a simple analysis using high energy physics data from a LHC experiment. Credits:

The original text of this notebook, including all exercises, analysis, explanations and data have been developed by the LHCb collaboration and are authored and shared by the LHCb collaboration in their open data and outreach efforts. See links:

Data

The notebook presented here uses datasets in Apache Parquet format:
- (1.2 GB) download from: LHCb_opendata_notebook_data
- download all files using wget -r -np -R "index.html*" -e robots=off https://sparkdltrigger.web.cern.ch/sparkdltrigger/LHCb_opendata/
The original work uses LHCb open data made available via the CERN opendata portal: PhaseSpaceSimulation.root, B2HHH_MagnetDown.root B2HHH_MagnetUp.root

Notebooks

Notebook	Short description
LHCb outreach analysis	LHCb analysis notebook using open data
Run LHCb opendata analysis notebook on Colab	This notebook opens on Google's Colab
Run LHCb opendata analysis notebook on CERN SWAN	This notebook opens on CERN SWAN notebook service (requires CERN SSO credentials)

Notes on reading and converting data from ROOT format

How to convert from ROOT format to Apache Parquet or ORC

High Energy Physics uses the ROOT data format extensively. Here are a few notes on how to read and convert data from ROOT format to Apache Parquet or ORC:

Spark and the Laurelin library, as detailed in note on converting from ROOT format
Python toolkits, notably uproot and awkward arrays, as in this example of how to use uproot

How to read files via the XRootD protocol

This allows to read files from URLs like root://eospublic.cern.ch/..
- It is used extensively for HEP data
- CERN users can read files stored in EOS also using CERNBOx
- Use Apache Spark to read root:// URLs using the Hadoop-XRootD connector
- Use CLI to access XRootD using the toolset from XRootD project
  - CLI example: xrdcp root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root .
- CERN users can read files stored in EOS also using CERNBOx

Spark ML examples with Physics data

A few related references on the topic of ML tools used with HEP data:

Machine learning is a key technique in HEP, here are a couple of basic examples on how to build classifiers for HEP events
Higgs boson classifier
Particle classfier

Physics references

A few links with additional details on the terms and formulas used:

https://github.com/iris-hep/adl-benchmarks-index/blob/master/reference.md
http://edu.itp.phys.ethz.ch/hs10/ppp1/2010_11_02.pdf
https://en.wikipedia.org/wiki/Invariant_mass
"Facts And Mysteries In Elementary Particle Physics", by Martinus Veltman, WSPC, 2018.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Spark_Physics

Spark_Physics

README.md

Apache Spark for High Energy Physics

Contents:

1. Dimuon mass spectrum analysis

Data

Notebooks

2. HEP analysis benchmark

Data

Notebooks

3. ATLAS Higgs boson analysis - outreach-style

Data

Notebooks

4. CMS Higgs boson analysis - outreach-style

Data

Notebooks

5. LHCb matter antimatter asymmetries analysis - outreach-style

Data

Notebooks

Notes on reading and converting data from ROOT format

How to convert from ROOT format to Apache Parquet or ORC

How to read files via the XRootD protocol

Spark ML examples with Physics data

Physics references

Files

Spark_Physics

Directory actions

More options

Directory actions

More options

Latest commit

History

Spark_Physics

Folders and files

parent directory

README.md

Apache Spark for High Energy Physics

Contents:

1. Dimuon mass spectrum analysis

Data

Notebooks

2. HEP analysis benchmark

Data

Notebooks

3. ATLAS Higgs boson analysis - outreach-style

Data

Notebooks

4. CMS Higgs boson analysis - outreach-style

Data

Notebooks

5. LHCb matter antimatter asymmetries analysis - outreach-style

Data

Notebooks

Notes on reading and converting data from ROOT format

How to convert from ROOT format to Apache Parquet or ORC

How to read files via the XRootD protocol

Spark ML examples with Physics data

Physics references