Hextof lab loader #534

Open
wants to merge 24 commits into
base: main
Conversation

zain-sohail
Member

@zain-sohail zain-sohail commented Dec 19, 2024

This PR adds the lab loader requested in #503. I tried to make minimal changes to the FlashLoader to make this work. The only major addition is the loader-specific dataframe class; everything else stays approximately the same. So the lab data works with the flash loader, but with the beamline config set to cfel.

An example config is provided to make this work. Since I moved some hardcoded parameters (previously marked as TODO) into the config, I updated the config model slightly.
Test data for this loading configuration still needs to be set up. I ask @kutnyakhov to provide a public file for this. I am not sure whether a tutorial is necessary.

@zain-sohail zain-sohail changed the base branch from main to v1_feature_branch December 19, 2024 12:46
@zain-sohail zain-sohail marked this pull request as ready for review December 19, 2024 15:48
        else:
            raise ValueError(f"Unsupported core beamline: {core_beamline}")

    def _validate_h5_files(self, config, h5_paths: list[Path]) -> list[Path]:
Member Author

This validation was previously in BufferFilePaths, and we had discussed moving it out of there. I find this location better (the move was also necessary because of the restructuring).

Comment on lines -4 to -7
# TODO: move to config
MULTI_INDEX = ["trainId", "pulseId", "electronId"]
PULSE_ALIAS = MULTI_INDEX[1]
FORMATS = ["per_electron", "per_pulse", "per_train"]
Member Author

These have now been moved to the config/config model.
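
As a rough illustration of what reading these values from the config, rather than from module-level constants, could look like: this is a sketch, not the PR's code. The `index` key matches the test config quoted later in this conversation; the enclosing `dataframe` section, the `formats` key, and the helper name `get_index_settings` are assumptions made for the example.

```python
# Hypothetical config fragment. Only the "index" key is taken from the PR's
# test config; the "dataframe" section and "formats" key are assumed names.
config = {
    "dataframe": {
        "index": ["trainId", "pulseId", "electronId"],
        "formats": ["per_electron", "per_pulse", "per_train"],
    }
}


def get_index_settings(cfg: dict) -> tuple[list[str], str, list[str]]:
    """Return (multi_index, pulse_alias, formats) read from the config."""
    df_cfg = cfg["dataframe"]
    multi_index = list(df_cfg["index"])
    # Mirrors the removed constant PULSE_ALIAS = MULTI_INDEX[1].
    pulse_alias = multi_index[1]
    formats = list(df_cfg["formats"])
    return multi_index, pulse_alias, formats


multi_index, pulse_alias, formats = get_index_settings(config)
print(pulse_alias)  # pulseId
```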

@coveralls
Collaborator

coveralls commented Jan 16, 2025

Pull Request Test Coverage Report for Build 13419398366

Details

  • 71 of 124 (57.26%) changed or added relevant lines in 7 files are covered.
  • 3 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.6%) to 91.6%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| src/sed/loader/flash/loader.py | 3 | 4 | 75.0% |
| src/sed/loader/flash/utils.py | 9 | 10 | 90.0% |
| src/sed/loader/flash/buffer_handler.py | 30 | 34 | 88.24% |
| src/sed/loader/flash/dataframe.py | 22 | 69 | 31.88% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| src/sed/binning/numba_bin.py | 3 | 87.62% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 13167417292: | -0.6% |
| Covered Lines: | 7731 |
| Relevant Lines: | 8440 |

💛 - Coveralls

@zain-sohail zain-sohail mentioned this pull request Jan 30, 2025
12 tasks
@rettigl rettigl changed the base branch from v1_feature_branch to main February 5, 2025 21:58
@zain-sohail zain-sohail requested a review from Copilot April 6, 2025 16:10
@Copilot Copilot AI left a comment

Copilot reviewed 10 out of 11 changed files in this pull request and generated no comments.

Files not reviewed (1)
  • .cspell/custom-dictionary.txt: Language not supported
Comments suppressed due to low confidence (2)

src/sed/loader/flash/dataframe.py:423

  • The docstring for the df_train property refers to channels of type [per pulse], but the implementation uses 'per_train'. Please update the comment to match the code and maintain clarity.
        Returns a pandas DataFrame for given channel names of type [per pulse]

tests/data/loader/flash/config.yaml:57

  • [nitpick] For consistency and to avoid potential YAML parsing issues, consider quoting the index values as strings (e.g. ['trainId', 'pulseId', 'electronId']).
  index: [trainId, pulseId, electronId]

Member Author

@zain-sohail zain-sohail left a comment

Just a few comments. Is this local metadata scheme also important for the flash loader? If so, the code needs to be updated there as well.

@@ -26,6 +26,7 @@ class PathsModel(BaseModel):

raw: DirectoryPath
processed: Optional[Union[DirectoryPath, NewPath]] = None
meta: Optional[Union[DirectoryPath, NewPath]] = None
Member Author

Instead of adding a new entry to the config model, I'd suggest we just allow directory paths in

archiver_url: Optional[HttpUrl] = None

What do you think?
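
To make the suggestion concrete, here is a minimal plain-Python sketch of a single setting that accepts either an HTTP(S) URL or an existing directory. It deliberately does not use the pydantic config model from this PR; the function name `classify_metadata_source`, the return labels, and the example URL are hypothetical.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlparse


def classify_metadata_source(value: str) -> str:
    """Classify a config value as 'url' or 'directory' (illustrative only)."""
    parsed = urlparse(value)
    # Treat the value as a URL only if it has an HTTP(S) scheme and a host.
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    # Otherwise accept it only if it points to an existing directory.
    if Path(value).is_dir():
        return "directory"
    raise ValueError(f"Neither an HTTP(S) URL nor an existing directory: {value}")


print(classify_metadata_source("https://example.org/archiver/"))  # url
with tempfile.TemporaryDirectory() as tmp:
    print(classify_metadata_source(tmp))  # directory
```

In the actual config model, the equivalent would presumably be a union of the existing URL and directory types on `archiver_url`, but that is left to the PR.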

Collaborator

Fine with me. I just thought it would be one of the main folders inside the beamtime folder anyway.

processed_dir = Path(
self._config["core"]["paths"].get("processed", raw_dir.joinpath("processed")),
)
meta_dir = Path(
self._config["core"]["paths"].get("meta", raw_dir.joinpath("meta")),
Member Author

The path logic is confusing right now, as there are too many possibilities. I'd put the default as archiver_url in the lab default config, plus one automatic option.
To me it's not clear whether the meta path is 'meta/' or 'meta/fabtrack/' right now.

Collaborator

This part is also confusing to me: I don't see how you get from raw_dir to, e.g., processed_dir with raw_dir.joinpath("processed"), since that gives beamtime_dir/raw_dir/processed instead of beamtime_dir/processed, or does it?
Currently the meta path is 'meta/fabtrack/', as it comes from Fabiano's code, but it can probably be changed to just 'meta/' as soon as it is accepted/generalized by the IT guys.
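
The difference raised here can be checked directly with pathlib. The beamtime layout below is a made-up example, not the actual directory structure:

```python
from pathlib import PurePosixPath

# Hypothetical beamtime layout, used only for illustration.
raw_dir = PurePosixPath("/beamtime/raw")

nested = raw_dir.joinpath("processed")  # child of raw_dir
sibling = raw_dir.parent / "processed"  # sibling of raw_dir

print(nested)   # /beamtime/raw/processed
print(sibling)  # /beamtime/processed
```

So a default built with `raw_dir.joinpath(...)` ends up inside the raw folder; reaching a folder next to it requires going through `raw_dir.parent` first.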

self.metadata.update(self.parse_local_metadata())
else:
print("Metadata taken from SciCat")
self.metadata.update(self.parse_scicat_metadata(token) if collect_metadata else {})
Member Author

Not necessarily a big issue, but parse_scicat_metadata is called twice when SciCat metadata exists: once in the if condition and once in the else branch.
One way to avoid that could be:

scicat_metadata = self.parse_scicat_metadata(token) if collect_metadata else {}
self.metadata.update(scicat_metadata)
if len(scicat_metadata) == 0:
    print("No SciCat metadata available, checking local folder")
    self.metadata.update(self.parse_local_metadata())
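
A self-contained sketch of this fallback, to show that SciCat is queried only once. The class `MetadataCollector` and both stubbed parse methods are stand-ins invented for the example, not the loader's real implementation:

```python
class MetadataCollector:
    """Stand-in class; both parse methods are stubs returning canned data."""

    def __init__(self, scicat_result: dict):
        self.metadata: dict = {}
        self._scicat_result = scicat_result
        self.scicat_calls = 0  # counts how often SciCat is queried

    def parse_scicat_metadata(self, token=None) -> dict:
        self.scicat_calls += 1
        return dict(self._scicat_result)

    def parse_local_metadata(self) -> dict:
        return {"source": "local"}

    def collect(self, token=None, collect_metadata: bool = True) -> None:
        # Query SciCat once and reuse the result for the update and the check.
        scicat_metadata = self.parse_scicat_metadata(token) if collect_metadata else {}
        self.metadata.update(scicat_metadata)
        if len(scicat_metadata) == 0:
            print("No SciCat metadata available, checking local folder")
            self.metadata.update(self.parse_local_metadata())


with_scicat = MetadataCollector({"source": "scicat"})
with_scicat.collect()

without_scicat = MetadataCollector({})
without_scicat.collect()
```

Note that, as written, this pattern also falls back to the local folder when `collect_metadata` is false, which may or may not be the intended behaviour.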

Collaborator

Fine with me. I just wanted to implement a check: if SciCat entries are available, use them; if not, check the local folder, to stay compatible with older beamtimes.

  burl=self.url,
- url="Datasets",
+ url="datasets",#"Datasets",
Member Author

Did the API change?

Collaborator

Yes, all metadata was migrated to the generalized scicat.desy.de with a new API, where 'Datasets' was changed to 'datasets' :)
Hopefully it will also be available from outside DESY within the next few days.
