clean up

rizavelioglu · Aug 19, 2022 · a1aecb5
commit a1aecb5
Showing 17 changed files with 22,183 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,140 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+pip-wheel-metadata/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+.python-version
+
+# pipenv
+#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+#   However, in case of collaboration, if having platform-specific dependencies or dependencies
+#   having no cross-platform support, pipenv may install dependencies that don't work, or not
+#   install all needed dependencies.
+#Pipfile.lock
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+##########
+# Datasets
+/data
+
+# Models
+/models
+
+# PyCharm
+.idea/
+.DS_Store
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2021 rizavelioglu
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/README.md b/README.md
@@ -0,0 +1,43 @@
+# ML4ProM
+
+Follow the notebooks to reproduce the results:
+- [./notebooks/1_EDA.ipynb](./notebooks/1_EDA.ipynb) performs **E**xploratory **D**ata **A**nalysis on each dataset and
+shows plots that help in understanding the datasets,
+- [./notebooks/2_training.ipynb](./notebooks/2_training.ipynb) 
+
+
+
+### How to train models and output results?
+From the project directory (`../ml4prom/`), run the following to list the available arguments:
+```bash
+python -m src.models.train_model -h
+```
+which returns:
+```
+  --debug DEBUG         When True, plots ROC-Curve & Confusion Matrix
+  --seq_encoding SEQ_ENCODING
+                        Possible encodings; 'one-hot' & 'n-gram' where n is an integer
+  --unique_traces UNIQUE_TRACES
+                        when True, duplicate traces(trace variants) are removed from dataset
+  --remove_biased_feats REMOVE_BIASED_FEATS
+                        when True, the biased features are removed from dataset, e.g. patient is dead in COVID dataset
+```
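The help text above shows boolean-like options that expect a value (`--debug DEBUG`). The train script itself is not part of this excerpt, so the following is only a minimal sketch of how such options could be declared with `argparse`. The `str2bool` helper, `build_parser`, and all default values are assumptions for illustration. A converter is used because `type=bool` would turn any non-empty string, including `"False"`, into `True`.

```python
import argparse


def str2bool(value: str) -> bool:
    """Hypothetical helper: map common true/false spellings to a bool."""
    lowered = value.strip().lower()
    if lowered in {"true", "t", "yes", "y", "1"}:
        return True
    if lowered in {"false", "f", "no", "n", "0"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Declare the four options listed in the help text (defaults are assumed)."""
    parser = argparse.ArgumentParser(prog="train_model")
    parser.add_argument("--debug", type=str2bool, default=False,
                        help="When True, plots ROC-Curve & Confusion Matrix")
    parser.add_argument("--seq_encoding", type=str, default="one-hot",
                        help="'one-hot' or 'n-gram' where n is an integer")
    parser.add_argument("--unique_traces", type=str2bool, default=False,
                        help="when True, duplicate traces are removed")
    parser.add_argument("--remove_biased_feats", type=str2bool, default=False,
                        help="when True, the biased features are removed")
    return parser


if __name__ == "__main__":
    # Parse an explicit argument list so the sketch runs without a shell.
    print(build_parser().parse_args(["--seq_encoding", "2-gram",
                                     "--unique_traces", "True"]))
```

With this declaration every boolean option needs an explicit value on the command line, which matches the `--debug DEBUG` style of the help output.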
+
+The following command runs the full pipeline:
+- load all datasets
+- apply preprocessing, e.g. remove biased features and duplicate traces
+- encode traces (sequences of events)
+- train ML models with StratifiedKFold cross-validation
+- write a `.csv` file with the accuracy scores to `./reports/`
+```bash
+python -m src.models.train_model --seq_encoding one-hot --remove_biased_feats True --unique_traces True
+```
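The preprocessing and encoding steps listed above can be illustrated with a small standard-library sketch. It is not the project's implementation: the function names, the toy traces, and the choice of n-gram counts as features are assumptions. With `n=1` the result is a per-activity count vector, which is comparable to, but not necessarily identical with, the `one-hot` option.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Trace = Tuple[str, ...]


def drop_duplicate_traces(traces: Sequence[Trace],
                          labels: Sequence[int]) -> Tuple[List[Trace], List[int]]:
    """Keep only the first occurrence of each trace variant."""
    seen = set()
    kept_traces: List[Trace] = []
    kept_labels: List[int] = []
    for trace, label in zip(traces, labels):
        if trace not in seen:
            seen.add(trace)
            kept_traces.append(trace)
            kept_labels.append(label)
    return kept_traces, kept_labels


def ngrams(trace: Trace, n: int) -> List[Trace]:
    """Return all contiguous windows of length n (empty if the trace is shorter)."""
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]


def encode_ngram(traces: Sequence[Trace], n: int) -> Tuple[List[Trace], List[List[int]]]:
    """Build a sorted n-gram vocabulary and one count vector per trace."""
    vocab = sorted({gram for trace in traces for gram in ngrams(trace, n)})
    index: Dict[Trace, int] = {gram: i for i, gram in enumerate(vocab)}
    rows: List[List[int]] = []
    for trace in traces:
        row = [0] * len(vocab)
        for gram, count in Counter(ngrams(trace, n)).items():
            row[index[gram]] = count
        rows.append(row)
    return vocab, rows


# Toy event log (hypothetical activities); the second trace duplicates the first.
traces = [("register", "triage", "discharge"),
          ("register", "triage", "discharge"),
          ("register", "icu", "discharge")]
labels = [0, 0, 1]

unique_traces, unique_labels = drop_duplicate_traces(traces, labels)
vocab, features = encode_ngram(unique_traces, n=2)
```

The resulting `features` matrix and `unique_labels` could then be passed to a cross-validation routine such as scikit-learn's `StratifiedKFold`, which the README names but which is not reproduced here.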
+
+
+
+---
+#### Future work
+- Use a time-series split for CV, see [scikit-learn](https://scikit-learn.org/stable/modules/cross_validation.html#time-series-split)
+- Encode the data after the train-test split, so that the encoding is fitted on the training set only, [see example code](https://stackoverflow.com/questions/55525195/do-i-have-to-do-one-hot-encoding-separately-for-train-and-test-dataset)
+- Check out [SHAP values](https://github.com/slundberg/shap)
+---
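The future-work item about encoding after the train-test split can be sketched as follows. The `ActivityEncoder` class and the toy traces are hypothetical and only illustrate the idea: the vocabulary is learned from the training traces, and activities that appear only in the test set are ignored instead of extending the feature space.

```python
from typing import Dict, List, Sequence, Tuple

Trace = Tuple[str, ...]


class ActivityEncoder:
    """Hypothetical binary activity encoder fitted on training traces only."""

    def __init__(self) -> None:
        self.index: Dict[str, int] = {}

    def fit(self, traces: Sequence[Trace]) -> "ActivityEncoder":
        # Learn the vocabulary from the training data alone.
        vocab = sorted({activity for trace in traces for activity in trace})
        self.index = {activity: i for i, activity in enumerate(vocab)}
        return self

    def transform(self, traces: Sequence[Trace]) -> List[List[int]]:
        # Mark each known activity that occurs in a trace; skip unseen ones.
        rows: List[List[int]] = []
        for trace in traces:
            row = [0] * len(self.index)
            for activity in trace:
                position = self.index.get(activity)
                if position is not None:
                    row[position] = 1
            rows.append(row)
        return rows


train_traces = [("register", "triage"), ("register", "discharge")]
test_traces = [("register", "icu")]  # "icu" never occurs in the training split

encoder = ActivityEncoder().fit(train_traces)
train_features = encoder.transform(train_traces)
test_features = encoder.transform(test_traces)
```

Fitting on the training split keeps the feature columns identical for both splits and avoids leaking information about the test set into the encoding.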