Provide complete example workflow #34

Open
2 of 8 tasks
dlersch opened this issue Apr 18, 2024 · 3 comments

@dlersch
Contributor

dlersch commented Apr 18, 2024

1.) We need a fully functional example workflow for the HUGS tutorial. The workflow needs to have:

  • Data Parser Module
  • Data Prep Module
  • Model Module
  • Analysis Module
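
As a rough illustration of how the four modules could chain together, here is a minimal sketch. All function names, the toy threshold "model", and the inline CSV are hypothetical placeholders and are not taken from the framework; the sketch only shows data flowing from parser to prep to model to analysis.

```python
import csv
import io
from statistics import mean

# Hypothetical stand-ins for the four modules; names are illustrative only.

def parse_data(csv_text):
    """Data parser: turn CSV text into a list of row dictionaries."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{key: float(value) for key, value in row.items()} for row in reader]

def prepare_data(rows, feature):
    """Data prep: shift one feature so that its mean is zero."""
    center = mean(row[feature] for row in rows)
    return [{**row, feature: row[feature] - center} for row in rows]

def run_model(rows, feature):
    """Model: predict label 1 when the centered feature is positive."""
    return [1 if row[feature] > 0 else 0 for row in rows]

def analyze(rows, predictions):
    """Analysis: fraction of predictions that match the true label."""
    hits = sum(int(row["label"]) == pred for row, pred in zip(rows, predictions))
    return hits / len(rows)

# Tiny made-up data set, used only to exercise the chain end to end.
EXAMPLE_CSV = "x,label\n1.0,0\n2.0,0\n5.0,1\n6.0,1\n"

rows = prepare_data(parse_data(EXAMPLE_CSV), "x")
predictions = run_model(rows, "x")
accuracy = analyze(rows, predictions)
```

Keeping each stage a plain function with one input and one output is in line with the KIS goal and makes each module easy to cover with its own unit test.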

2.) For each module, there must be evidence of:

  • Issue
  • Branch
  • Unit test
  • Pull request

3.) It is good practice to record code development, updates, and any other work in the issues. For example: "Started to implement module XYZ. Ran into a problem with such and such. Going to pause and run a quick literature search." This keeps everything transparent. Ideally, an issue tells the entire story of the work that has been done.

4.) The issues for each module should be linked to this one. For example, if we decide to use the CSVToPandasParser, we should link the corresponding issue here. The same goes for wiki pages.

5.) For the sake of efficiency and time management, we should follow the KIS (Keep It Simple) principle for code development.

@dlersch dlersch added this to the HUGS milestone Apr 18, 2024
@sgoldenCS
Contributor

Created a new branch based on the common CSV parser branch #24. We need to choose a CSV dataset to use for the example; I asked @dlersch for ideas. In the meantime, I will bring in a scaler module from the exp_hall repo and make sure it has unit tests.
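
For reference, the kind of unit test I have in mind for the scaler looks roughly like the sketch below. The `MinMaxScaler` class here is a hypothetical stand-in written for this example; it is not the scaler from the exp_hall repo, whose interface may differ.

```python
import unittest

class MinMaxScaler:
    """Hypothetical scaler: maps values linearly onto [0, 1]."""

    def fit(self, values):
        # Remember the observed range of the training values.
        self.low = min(values)
        self.high = max(values)
        return self

    def transform(self, values):
        span = self.high - self.low
        if span == 0:
            # Constant input: avoid dividing by zero.
            return [0.0 for _ in values]
        return [(value - self.low) / span for value in values]

class TestMinMaxScaler(unittest.TestCase):
    def test_values_are_mapped_onto_unit_interval(self):
        data = [2.0, 4.0, 6.0]
        scaled = MinMaxScaler().fit(data).transform(data)
        self.assertEqual(scaled, [0.0, 0.5, 1.0])

    def test_constant_input_does_not_divide_by_zero(self):
        data = [3.0, 3.0]
        scaled = MinMaxScaler().fit(data).transform(data)
        self.assertEqual(scaled, [0.0, 0.0])

# Run the tests programmatically so the snippet works as a plain script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMinMaxScaler)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

One test covers the normal case and one covers the edge case of constant input, which is the minimum I would want before opening the pull request.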

@dlersch
Contributor Author

dlersch commented Apr 19, 2024

@sgoldenCS and I had a fruitful discussion about a possible data set that is simple enough to analyze but still highlights the functionality of the workflow / DS framework. We came up with an NP-inspired classification problem: identifying two species, each characterized by three variables. The abundances of the two species are asymmetric, i.e. species 1 is statistically dominant over species 0. A plot of the corresponding distributions is shown below.

The data is 100% synthetic, so we do not need to worry about ownership rights. The classification problem is set up so that it fits nicely into the narrative of HUGS, but it has no direct ties to NP. The data is spread over 4 .csv files so that we can use @sgoldenCS's CSVParser right away. We might come up with a more challenging data set later, but for now we will stick with this one so that we can test and run the full workflow.

The data and the corresponding script for data generation are (for now) available on the ifarm:
/w/data_science-sciwork18/hugs24/example_data_hugs24
Each .csv file is ~18 MB, so we do not want to store them here on GitHub.
[Figure: variable_correlations_v0]
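
For anyone without ifarm access, a data set with the same overall shape (two species, three variables, asymmetric abundance, four .csv files) could be produced along the lines of the sketch below. This is not the actual generation script: the Gaussian centers, the 80/20 split, the event count, the column names, and the file names are all assumptions made up for illustration.

```python
import csv
import random
import tempfile
from pathlib import Path

# Assumed column names; the real files may use different ones.
FIELDS = ["var1", "var2", "var3", "species"]

def generate_events(n_events, fraction_species_1, rng):
    """Draw events for two species, each described by three variables."""
    events = []
    for _ in range(n_events):
        species = 1 if rng.random() < fraction_species_1 else 0
        # Assumed toy model: each species has its own Gaussian center.
        center = 1.0 if species == 1 else -1.0
        events.append({
            "var1": rng.gauss(center, 1.0),
            "var2": rng.gauss(center, 1.0),
            "var3": rng.gauss(center, 1.0),
            "species": species,
        })
    return events

def write_csv_files(events, directory, n_files):
    """Distribute the events over n_files CSV files and return their paths."""
    paths = []
    for index in range(n_files):
        path = directory / f"events_{index}.csv"
        with open(path, "w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=FIELDS)
            writer.writeheader()
            writer.writerows(events[index::n_files])
        paths.append(path)
    return paths

def read_csv_files(paths):
    """Read all files back into one list of row dictionaries (string values)."""
    rows = []
    for path in paths:
        with open(path, newline="") as handle:
            rows.extend(csv.DictReader(handle))
    return rows

rng = random.Random(42)  # fixed seed so the toy data is reproducible
events = generate_events(1000, 0.8, rng)
with tempfile.TemporaryDirectory() as tmp:
    paths = write_csv_files(events, Path(tmp), n_files=4)
    loaded = read_csv_files(paths)

n_species_1 = sum(int(row["species"]) for row in loaded)
n_species_0 = len(loaded) - n_species_1
```

Counting the species after reading the files back is a quick sanity check that the asymmetry survives the round trip through the four files.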

@sgoldenCS
Contributor

The model module is done except for its unit tests. I will add an analysis module in the branch linked to this issue, since it is the final step towards completion. I have pulled the changes from the model branch and main, so the branch is fully up to date before I add the analysis module.
I will complete the model unit tests once I have an implementation of the analysis module ready for the GSPDA workshop, which is tomorrow.
