Refactor for post-unblinding data taking #74

Open
wants to merge 55 commits into base: main

55 commits
3284d61
add ann tier
ggmarshall Oct 9, 2024
26d52f2
allow more jobs
ggmarshall Oct 20, 2024
7918e83
pc cleanup
ggmarshall Oct 21, 2024
e9561bd
bump pkg versions
ggmarshall Oct 21, 2024
a3c0dae
add ml packages
ggmarshall Oct 21, 2024
818511d
refactor for new metadata, clean up patterns and some naming
ggmarshall Nov 27, 2024
41c326b
update rules for pattern changes
ggmarshall Nov 27, 2024
1698eb1
add debug mode functionality
ggmarshall Nov 27, 2024
b840444
os to pathlib.Path
ggmarshall Nov 27, 2024
323dd09
debugging
ggmarshall Nov 28, 2024
bbf65e9
move info from readme to docs
ggmarshall Nov 29, 2024
9639200
add ability to specify different file selections and cleanup
ggmarshall Dec 3, 2024
0cb28b6
updates for new meta, switch to detector keying in configs
ggmarshall Dec 3, 2024
3112518
merge ann rules
ggmarshall Dec 3, 2024
4f7e405
debugging
ggmarshall Dec 4, 2024
a2f2d7e
style: pre-commit fixes
pre-commit-ci[bot] Dec 4, 2024
ce2ad85
add isotopes where lines are from
ggmarshall Dec 5, 2024
bd9d596
Merge branch 'meta_refactor' of github.com:legend-exp/legend-dataflow…
ggmarshall Dec 5, 2024
2deac35
choose ctc based on no_ctc energy instead
ggmarshall Dec 5, 2024
97a0f8e
Fix a bunch of docs things
gipert Dec 26, 2024
4c6dffc
update blinding cal to new hpgecal
ggmarshall Dec 26, 2024
08e20e7
Try fixing RTD build
gipert Dec 27, 2024
1b68941
Merge branch 'meta_refactor' of github.com:legend-exp/legend-dataflow…
gipert Dec 27, 2024
603f3ec
Bug fix
gipert Dec 27, 2024
9f4d1c2
Remove unneeded sphinx ext
gipert Dec 27, 2024
1152316
add snakefile to profile
ggmarshall Dec 28, 2024
24fb2ed
add table format to config
ggmarshall Dec 28, 2024
c89b634
update to cal_groupings file
ggmarshall Dec 28, 2024
c5104b9
Merge branch 'meta_refactor' of github.com:legend-exp/legend-dataflow…
ggmarshall Dec 28, 2024
83fc329
add pyproject file
ggmarshall Dec 28, 2024
7cd0273
add logging config and cleanup config loading
ggmarshall Dec 31, 2024
59e273b
add param info to svm rule
ggmarshall Dec 31, 2024
2cc1232
move logging to function
ggmarshall Jan 8, 2025
72140e2
fix svm rules
ggmarshall Jan 8, 2025
5139f18
add dbetto dependency to configs
ggmarshall Jan 8, 2025
4dea274
Fix bugs in complete_run.py
gipert Jan 17, 2025
0c43924
Support using specialized build_raw script depending on DAQ extension
gipert Jan 17, 2025
8eba704
Updates to build_raw Snakefile to support latest dataflow changes
gipert Jan 17, 2025
e565e59
extension="*" does not work as expected, needs to be fixed in some ot…
gipert Jan 17, 2025
0be642f
Renaming, JIT compile daq2lh5 onstart
gipert Jan 18, 2025
4dcd0d2
Several fixes to build_raw.py scripts
gipert Jan 20, 2025
3c2a166
allow filelist globbing for daq fcio/orca files
ggmarshall Jan 20, 2025
378b82d
merges
ggmarshall Jan 20, 2025
1dcd027
have par catalog build support multiple file extensions, split out bu…
ggmarshall Jan 20, 2025
0438539
fix par catalog write
ggmarshall Jan 20, 2025
25a6183
fix daq filelist
ggmarshall Jan 20, 2025
325c920
allow filelist globbing for daq fcio/orca files
ggmarshall Jan 20, 2025
8197a3f
have par catalog build support multiple file extensions, split out bu…
ggmarshall Jan 20, 2025
48b326d
A lot of fixes in complete_run.py
gipert Jan 20, 2025
0b558dd
fix weird filelist len bug by moving to script
ggmarshall Jan 20, 2025
95f1759
Merge pull request #78 from legend-exp/fcio
gipert Jan 20, 2025
a43a9eb
merges
ggmarshall Jan 20, 2025
689164b
fix log import
ggmarshall Jan 20, 2025
2ac84b0
split out filelist write to workaround smk behaviour, cleanup catalog…
ggmarshall Jan 20, 2025
2c47ca9
Remove leftover print statements
gipert Jan 21, 2025
4 changes: 3 additions & 1 deletion .gitignore
@@ -77,7 +77,7 @@ instance/
.scrapy

# Sphinx documentation
/docs/build/
/docs/_build/
/docs/source/generated

# PyBuilder
@@ -113,3 +113,5 @@ venv.bak/

# mypy
.mypy_cache/

docs/source/api
22 changes: 22 additions & 0 deletions .readthedocs.yaml
@@ -0,0 +1,22 @@
version: 2

build:
  os: "ubuntu-22.04"
  tools:
    python: "3.12"
  commands:
    # FIXME: dependencies should not be explicitly listed here!
    - asdf plugin add uv
    - asdf install uv latest
    - asdf global uv latest
    - uv venv
    - uv pip install .[docs]
    - rm -rf docs/source/api
    - .venv/bin/python -m sphinx.ext.apidoc
      --private
      --module-first
      --force
      --output-dir docs/source/api
      scripts
    - .venv/bin/python -m sphinx -T -b html -d docs/_build/doctrees -D
      language=en docs/source $READTHEDOCS_OUTPUT/html
2 changes: 1 addition & 1 deletion .ruff.toml
@@ -12,7 +12,7 @@ lint.select = [
"PIE", # flake8-pie
"PL", # pylint
"PT", # flake8-pytest-style
# "PTH", # flake8-use-pathlib
"PTH", # flake8-use-pathlib
"RET", # flake8-return
"RUF", # Ruff-specific
"SIM", # flake8-simplify
4 changes: 3 additions & 1 deletion LICENSE.md
@@ -1,9 +1,11 @@
The legend-dataflow-hades package is licensed under the MIT "Expat" License:
The legend-dataflow package is licensed under the MIT "Expat" License:

> Copyright (c) 2021:
>
> Matteo Agostini <[email protected]>
> Oliver Schulz <[email protected]>
> George Marshall <[email protected]>
> Luigi Pertoldi <[email protected]>
>
> Permission is hereby granted, free of charge, to any person obtaining a copy
> of this software and associated documentation files (the "Software"), to deal
112 changes: 0 additions & 112 deletions README.md
@@ -3,115 +3,3 @@
Implementation of an automatic data processing flow for L200
data, based on
[Snakemake](https://snakemake.readthedocs.io/).


## Configuration

Data processing resources are configured via a single site-dependent (and
possibly user-dependent) configuration file, referred to as `config.json` in
the following. An arbitrary file name may be chosen.

Use the included [templates/config.json](templates/config.json) as a template
and adjust the base data paths as necessary. Note that, when running Snakemake,
the default path to the config file is `./config.json`.
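A configuration file stored elsewhere can be passed explicitly on the command
line via Snakemake's standard `--configfile` option:
```shell
$ snakemake --configfile=/path/to/your/config.json [...]
```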


## Key-Lists

Data generation is based on key-lists, which are flat text files
(extension ".keylist") containing one entry of the form
`{experiment}-{period}-{run}-{datatype}-{timestamp}` per line.

Key-lists can be auto-generated based on the available DAQ files
using Snakemake targets of the form

* `all-{experiment}.keylist`
* `all-{experiment}-{period}.keylist`
* `all-{experiment}-{period}-{run}.keylist`
* `all-{experiment}-{period}-{run}-{datatype}.keylist`

which will generate the list of available file keys for all l200 files, for a
specific period, for a specific period and run, and so on.

For example:
```shell
$ snakemake all-l200-myper.keylist
```
will generate a key-list with all files belonging to period `myper`.


## File-Lists

File-lists are flat files listing output files that should be generated,
with one file per line. A file-list will typically be generated for a given
data tier from a key-list, using the Snakemake targets of the form
`{label}-{tier}.filelist` (generated from `{label}.keylist`).

For file lists based on auto-generated key-lists like
`all-{experiment}-{period}-{tier}.filelist`, the corresponding key-list
(`all-{experiment}-{period}.keylist` in this case) will be created
automatically, if it doesn't exist.

Example:
```shell
$ snakemake all-mydet-mymeas-tier2.filelist
```

File-lists may of course also be derived from custom keylists, generated
manually or by other means, e.g. `my-dataset-raw.filelist` will be
generated from `my-dataset.keylist`.
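For example, for the hypothetical dataset name used above:
```shell
$ snakemake my-dataset-raw.filelist
```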


## Main output generation

Usually, the main output will be determined by a file-list, or equivalently by
a key-list and a data tier. The special output target `{label}-{tier}.gen` is
used to generate all files listed in `{label}-{tier}.filelist`. After the files
are created, the empty file `{label}-{tier}.gen` is created to mark the
successful data production.

Snakemake targets like `all-{experiment}-{period}-{tier}.gen` may be used
to automatically generate key-lists and file-lists (if not already present)
and produce all possible output for the given data tier, based on available
tier0 files which match the target.

Example:
```shell
$ snakemake all-mydet-mymeas-tier2.gen
```
Targets like `my-dataset-raw.gen` (derived from a key-list
`my-dataset.keylist`) are of course allowed as well.


## Monitoring

Snakemake supports monitoring by connecting to a
[panoptes](https://github.com/panoptes-organization/panoptes) server.

Run (e.g.)
```shell
$ panoptes --port 5000
```
in the background to run a panoptes server instance, which comes with a
GUI that can be accessed with a web browser on the specified port.

Then use the Snakemake option `--wms-monitor` to instruct Snakemake to push
progress information to the panoptes server:
```shell
snakemake --wms-monitor http://127.0.0.1:5000 [...]
```

## Using software containers

This dataflow doesn't use Snakemake's internal Singularity support, but
instead supports Singularity containers via
[`venv`](https://github.com/oschulz/singularity-venv) environments
for greater control.

To use this, the path to `venv` and the name of the environment must be set
in `config.json`.

This is only relevant when running Snakemake *outside* of the software
container, e.g. when using a batch system (see below). If Snakemake
and the whole workflow are run inside of a container instance, no
container-related settings in `config.json` are required.
124 changes: 53 additions & 71 deletions Snakefile
124 changes: 53 additions & 71 deletions Snakefile
@@ -10,18 +10,17 @@ This includes:
- the same for partition level tiers
"""

import pathlib
from pathlib import Path
import os
import json
import sys
import glob
from datetime import datetime
from collections import OrderedDict
import logging

import scripts.util as ds
from scripts.util.pars_loading import pars_catalog
from scripts.util.patterns import get_pattern_tier_raw
from scripts.util.pars_loading import ParsCatalog
from scripts.util.patterns import get_pattern_tier
from scripts.util.utils import (
subst_vars_in_snakemake_config,
runcmd,
@@ -31,6 +30,7 @@ from scripts.util.utils import (
metadata_path,
tmp_log_path,
pars_path,
det_status_path,
)

# Set with `snakemake --configfile=/path/to/your/config.json`
@@ -43,8 +43,9 @@ setup = config["setups"]["l200"]
configs = config_path(setup)
chan_maps = chan_map_path(setup)
meta = metadata_path(setup)
det_status = det_status_path(setup)
swenv = runcmd(setup)
part = ds.dataset_file(setup, os.path.join(configs, "partitions.json"))
part = ds.CalGrouping(setup, Path(det_status) / "cal_groupings.yaml")
basedir = workflow.basedir


@@ -66,38 +67,13 @@ include: "rules/psp.smk"
include: "rules/hit.smk"
include: "rules/pht.smk"
include: "rules/pht_fast.smk"
include: "rules/ann.smk"
include: "rules/evt.smk"
include: "rules/skm.smk"
include: "rules/blinding_calibration.smk"
include: "rules/qc_phy.smk"


# Log parameter catalogs in validity.jsonl files
hit_par_cat_file = os.path.join(pars_path(setup), "hit", "validity.jsonl")
if os.path.isfile(hit_par_cat_file):
os.remove(os.path.join(pars_path(setup), "hit", "validity.jsonl"))
pathlib.Path(os.path.dirname(hit_par_cat_file)).mkdir(parents=True, exist_ok=True)
ds.pars_key_resolve.write_to_jsonl(hit_par_catalog, hit_par_cat_file)

pht_par_cat_file = os.path.join(pars_path(setup), "pht", "validity.jsonl")
if os.path.isfile(pht_par_cat_file):
os.remove(os.path.join(pars_path(setup), "pht", "validity.jsonl"))
pathlib.Path(os.path.dirname(pht_par_cat_file)).mkdir(parents=True, exist_ok=True)
ds.pars_key_resolve.write_to_jsonl(pht_par_catalog, pht_par_cat_file)

dsp_par_cat_file = os.path.join(pars_path(setup), "dsp", "validity.jsonl")
if os.path.isfile(dsp_par_cat_file):
os.remove(dsp_par_cat_file)
pathlib.Path(os.path.dirname(dsp_par_cat_file)).mkdir(parents=True, exist_ok=True)
ds.pars_key_resolve.write_to_jsonl(dsp_par_catalog, dsp_par_cat_file)

psp_par_cat_file = os.path.join(pars_path(setup), "psp", "validity.jsonl")
if os.path.isfile(psp_par_cat_file):
os.remove(psp_par_cat_file)
pathlib.Path(os.path.dirname(psp_par_cat_file)).mkdir(parents=True, exist_ok=True)
ds.pars_key_resolve.write_to_jsonl(psp_par_catalog, psp_par_cat_file)


localrules:
gen_filelist,
autogen_output,
@@ -111,36 +87,48 @@ onstart:
shell('{swenv} python3 -B -c "import ' + pkg + '"')

# Log parameter catalogs in validity.jsonl files
hit_par_cat_file = os.path.join(pars_path(setup), "hit", "validity.jsonl")
if os.path.isfile(hit_par_cat_file):
os.remove(os.path.join(pars_path(setup), "hit", "validity.jsonl"))
pathlib.Path(os.path.dirname(hit_par_cat_file)).mkdir(parents=True, exist_ok=True)
ds.pars_key_resolve.write_to_jsonl(hit_par_catalog, hit_par_cat_file)

pht_par_cat_file = os.path.join(pars_path(setup), "pht", "validity.jsonl")
if os.path.isfile(pht_par_cat_file):
os.remove(os.path.join(pars_path(setup), "pht", "validity.jsonl"))
pathlib.Path(os.path.dirname(pht_par_cat_file)).mkdir(parents=True, exist_ok=True)
ds.pars_key_resolve.write_to_jsonl(pht_par_catalog, pht_par_cat_file)

dsp_par_cat_file = os.path.join(pars_path(setup), "dsp", "validity.jsonl")
if os.path.isfile(dsp_par_cat_file):
os.remove(dsp_par_cat_file)
pathlib.Path(os.path.dirname(dsp_par_cat_file)).mkdir(parents=True, exist_ok=True)
ds.pars_key_resolve.write_to_jsonl(dsp_par_catalog, dsp_par_cat_file)

psp_par_cat_file = os.path.join(pars_path(setup), "psp", "validity.jsonl")
if os.path.isfile(psp_par_cat_file):
os.remove(psp_par_cat_file)
pathlib.Path(os.path.dirname(psp_par_cat_file)).mkdir(parents=True, exist_ok=True)
ds.pars_key_resolve.write_to_jsonl(psp_par_catalog, psp_par_cat_file)
hit_par_cat_file = Path(pars_path(setup)) / "hit" / "validity.yaml"
if hit_par_cat_file.is_file():
hit_par_cat_file.unlink()
try:
Path(hit_par_cat_file).parent.mkdir(parents=True, exist_ok=True)
ParsKeyResolve.write_to_yaml(hit_par_catalog, hit_par_cat_file)
except NameError:
print("No hit parameter catalog found")

pht_par_cat_file = Path(pars_path(setup)) / "pht" / "validity.yaml"
if pht_par_cat_file.is_file():
pht_par_cat_file.unlink()
try:
Path(pht_par_cat_file).parent.mkdir(parents=True, exist_ok=True)
ParsKeyResolve.write_to_yaml(pht_par_catalog, pht_par_cat_file)
except NameError:
print("No pht parameter catalog found")

dsp_par_cat_file = Path(pars_path(setup)) / "dsp" / "validity.yaml"
if dsp_par_cat_file.is_file():
dsp_par_cat_file.unlink()
try:
Path(dsp_par_cat_file).parent.mkdir(parents=True, exist_ok=True)
ParsKeyResolve.write_to_yaml(dsp_par_catalog, dsp_par_cat_file)
except NameError:
print("No dsp parameter catalog found")

psp_par_cat_file = Path(pars_path(setup)) / "psp" / "validity.yaml"
if psp_par_cat_file.is_file():
psp_par_cat_file.unlink()
try:
Path(psp_par_cat_file).parent.mkdir(parents=True, exist_ok=True)
ParsKeyResolve.write_to_yaml(psp_par_catalog, psp_par_cat_file)
except NameError:
print("No psp parameter catalog found")


onsuccess:
from snakemake.report import auto_report

rep_dir = f"{log_path(setup)}/report-{datetime.strftime(datetime.utcnow(), '%Y%m%dT%H%M%SZ')}"
pathlib.Path(rep_dir).mkdir(parents=True, exist_ok=True)
Path(rep_dir).mkdir(parents=True, exist_ok=True)
# auto_report(workflow.persistence.dag, f"{rep_dir}/report.html")

with open(os.path.join(rep_dir, "dag.txt"), "w") as f:
@@ -181,26 +169,20 @@ onsuccess:
rule gen_filelist:
"""Generate file list.

It is a checkpoint so when it is run it will update the dag passed on the
files it finds as an output. It does this by taking in the search pattern,
using this to find all the files that match this pattern, deriving the keys
from the files found and generating the list of new files needed.
This rule is used as a "checkpoint", so when it is run it will update the
DAG based on the files it finds. It does this by taking in the search
pattern, using this to find all the files that match this pattern, deriving
the keys from the files found and generating the list of new files needed.
"""
input:
lambda wildcards: get_filelist(
wildcards,
setup,
get_pattern_tier_raw(setup),
ignore_keys_file=os.path.join(configs, "ignore_keys.keylist"),
analysis_runs_file=os.path.join(configs, "analysis_runs.json"),
get_search_pattern(wildcards.tier),
ignore_keys_file=Path(det_status) / "ignored_daq_cycles.yaml",
analysis_runs_file=Path(det_status) / "runlists.yaml",
),
output:
os.path.join(filelist_path(setup), "{label}-{tier}.filelist"),
run:
if len(input) == 0:
print(
"WARNING: No files found for the given pattern\nmake sure pattern follows the format: all-{experiment}-{period}-{run}-{datatype}-{timestamp}-{tier}.gen"
)
with open(output[0], "w") as f:
for fn in input:
f.write(f"{fn}\n")
temp(Path(filelist_path(setup)) / "{label}-{tier}.filelist"),
script:
"scripts/write_filelist.py"
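For context, here is a minimal sketch of what a filelist-writing helper like
the `scripts/write_filelist.py` referenced above could look like. This is a
hypothetical reconstruction based on the inline `run` block it replaces, not
the actual script from this PR:

```python
# Hypothetical sketch of scripts/write_filelist.py (not the actual PR content).
# Snakemake's `script:` directive injects a `snakemake` object carrying the
# rule's input and output; warn if no inputs matched the search pattern, then
# write one input path per line to the output filelist.
from pathlib import Path

inputs = snakemake.input  # noqa: F821 (provided by Snakemake at runtime)
output = Path(snakemake.output[0])  # noqa: F821

if len(inputs) == 0:
    print(
        "WARNING: No files found for the given pattern\n"
        "make sure pattern follows the format: "
        "all-{experiment}-{period}-{run}-{datatype}-{timestamp}-{tier}.gen"
    )

with output.open("w") as f:
    for fn in inputs:
        f.write(f"{fn}\n")
```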