Explore a Gaussian Process Regression #3

Open
4 tasks
SarahAlidoost opened this issue Nov 25, 2024 · 3 comments
Comments

@SarahAlidoost
Member

SarahAlidoost commented Nov 25, 2024

We want to explore Gaussian Process regression models on the HybridLabs data. For that, we need a Python module that can be called to fit and predict. The module should:

  • be similar to PyCaret in letting us compare different models (see examples)
  • have a notebook in docs folder to show a simple example
  • have tests
  • have an easy interface and run fast (to be discussed)
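
As a rough illustration of the requirements above, here is a minimal sketch of a compare-then-fit-and-predict workflow built on scikit-learn. The helper `compare_models`, the model names, and the synthetic data are all hypothetical stand-ins (the real data is not public); this is one possible shape for the interface, not the agreed design.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, WhiteKernel
from sklearn.model_selection import cross_val_score


def compare_models(X, y, models, cv=3):
    """Return {name: mean cross-validated R^2}, best model first (hypothetical helper)."""
    scores = {
        name: cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
        for name, model in models.items()
    }
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


# Synthetic stand-in data: a noisy sine wave with a single input feature.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(120, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(120)

# Two candidate GP models that differ only in their kernel.
models = {
    "gpr_rbf": GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True),
    "gpr_matern": GaussianProcessRegressor(kernel=Matern(nu=1.5) + WhiteKernel(), normalize_y=True),
}
ranking = compare_models(X, y, models)

# Refit the best-ranked model on all data and predict with uncertainty.
best_name = next(iter(ranking))
best_model = models[best_name].fit(X, y)
mean, std = best_model.predict(X[:5], return_std=True)
```

The dictionary-in, ranking-out shape loosely mirrors how PyCaret compares models, while keeping the usual scikit-learn `fit`/`predict` calls visible.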

Data:
The data folder exp699_032024_TUDelft is available on Atlas SharePoint at Documents > HybridLabs > Example_data. NOTE: This folder is not public and must not be shared with others. A model trained on this data cannot be shared either. For now, there is no need to store the trained model.

The data folder exp699_032024_TUDelft includes:

  • readme.md: it contains the experiment details and ML input/output
  • channel.csv: it contains the variable names and units
  • exp699.mat: it contains the data, stored in MATLAB format.

For reading the data in Python, see this notebook.
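
For orientation, a minimal sketch of reading a MATLAB file with `scipy.io.loadmat`. The helper `load_mat_arrays`, the file `demo.mat`, and the variable name `signal` are hypothetical; the actual variable names in exp699.mat are described in readme.md and channel.csv, not here. Note that `loadmat` does not read MATLAB v7.3 (HDF5-based) files, so if exp699.mat was saved in that format an HDF5 reader would be needed instead.

```python
import os
import tempfile

import numpy as np
from scipy.io import loadmat, savemat


def load_mat_arrays(path):
    """Load a MATLAB (pre-v7.3) file and return only its data variables."""
    contents = loadmat(path, squeeze_me=True)
    # loadmat adds metadata keys such as __header__; drop them.
    return {key: value for key, value in contents.items() if not key.startswith("__")}


# Round-trip a tiny stand-in file, since exp699.mat itself is not public.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.mat")
    savemat(demo_path, {"signal": np.arange(6.0).reshape(3, 2)})
    data = load_mat_arrays(demo_path)
```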

Literature:

Some literature is available on Atlas SharePoint at Documents > HybridLabs > Literature. The two most related to FOWT are:

@SarahAlidoost SarahAlidoost added the data-driven models label Nov 27, 2024
@SarahAlidoost SarahAlidoost moved this to Backlog in FOWT-ML - Sprint 1 Nov 27, 2024
@SarahAlidoost SarahAlidoost moved this from Backlog to Ready in FOWT-ML - Sprint 1 Nov 27, 2024
@SarahAlidoost
Member Author

see #14

@vanlankveldthijs vanlankveldthijs self-assigned this Jan 20, 2025
@vanlankveldthijs vanlankveldthijs moved this from Ready to In progress in FOWT-ML - Sprint 1 Jan 24, 2025
@SarahAlidoost
Member Author

As found in #17, Gaussian process regression is computationally expensive on large datasets (many samples). Techniques like PCA can help a bit by reducing the number of features, but they are not enough on their own. There are other approaches to tackle the issue, e.g. sparse Gaussian processes, but scikit-learn does not natively support them. I suggest checking out other techniques and packages such as GPyTorch; see the GPyTorch Regression Tutorial.
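
To make the idea of a sparse Gaussian process concrete, here is a small NumPy sketch of the subset-of-regressors approximation, one of the simplest sparse variants. It is not what GPyTorch does internally and not a proposal for the module; the function names, fixed hyperparameters, and synthetic data are assumptions for illustration. The point is that only an m x m system is solved (m inducing inputs), rather than the n x n system of an exact GP.

```python
import numpy as np


def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel between the rows of A and B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)


def sor_predict_mean(X_train, y_train, Z, X_test, noise_var=0.01, length_scale=1.0):
    """Subset-of-regressors predictive mean using m inducing inputs Z."""
    K_mm = rbf_kernel(Z, Z, length_scale)        # m x m
    K_mn = rbf_kernel(Z, X_train, length_scale)  # m x n
    K_sm = rbf_kernel(X_test, Z, length_scale)   # n_test x m
    # Only an m x m linear system is solved; forming it costs O(n * m^2).
    A = noise_var * K_mm + K_mn @ K_mn.T
    weights = np.linalg.solve(A, K_mn @ y_train)
    return K_sm @ weights


# Synthetic stand-in data: 2000 noisy samples of a sine wave.
rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 10.0, size=(2000, 1))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.standard_normal(2000)

Z = np.linspace(0.0, 10.0, 11)[:, None]       # 11 inducing inputs on a grid
X_test = np.linspace(0.5, 9.5, 50)[:, None]
pred = sor_predict_mean(X_train, y_train, Z, X_test)
rmse = float(np.sqrt(np.mean((pred - np.sin(X_test[:, 0])) ** 2)))
```

Kernel hyperparameters are fixed here for brevity; a real sparse GP would also learn them and the inducing inputs, which is where a package like GPyTorch comes in.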

@vanlankveldthijs
Collaborator

Yeah. In fact it seems that PCA has no effect whatsoever on this particular issue, because the issue isn't caused by the size of the data (observations x features), but only by the number of observations. PCA does not reduce the number of observations.
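
A small NumPy sketch of this point, on synthetic data with hypothetical helper names: PCA shrinks the feature dimension, but the kernel matrix a GP has to factorize stays n_observations x n_observations.

```python
import numpy as np


def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel between the rows of A and B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)


def pca_reduce(X, n_components):
    """Project X onto its first principal components (via SVD)."""
    X_centered = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T


rng = np.random.default_rng(0)
X = rng.standard_normal((300, 20))          # 300 observations, 20 features
X_reduced = pca_reduce(X, n_components=3)   # 300 observations, 3 features

K_full = rbf_kernel(X, X)                   # shape (300, 300)
K_reduced = rbf_kernel(X_reduced, X_reduced)  # still shape (300, 300)
```

Computing each kernel entry gets cheaper with fewer features, but the cubic-cost factorization of the kernel matrix is unchanged.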

I did not explore any implementations outside of mlflow, which uses sklearn, because that was the objective for this sprint. I agree that other packages may provide out-of-the-box solutions for this issue.

@SarahAlidoost SarahAlidoost moved this from In progress to In review in FOWT-ML - Sprint 1 Jan 29, 2025