
Fixed GP fitness again #301

Merged
merged 1 commit into from
Jan 9, 2025

Conversation

jmafoster1
Contributor

The fitness function would still, very rarely, give a fitness that was a complex number. I think this was because some of the predicted values from candidate functions would evaluate sqrt(-1), which would then give complex distances, and thus a complex fitness. I now return float("inf") if the dtype of the predicted values is not the same as the dtype of the expected values, which should hopefully fix the problem in a robust way.
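A minimal sketch of the guard described above, assuming a fitness function shaped like the one discussed later in this thread (the names `fitness`, `y_expected`, and `y_estimates` are illustrative, not the project's actual identifiers):

```python
import numpy as np
import pandas as pd

def fitness(y_expected: pd.Series, y_estimates: np.ndarray) -> float:
    """Normalised RMSE, guarded against complex-valued predictions."""
    # If a candidate expression evaluated e.g. sqrt(-1), y_estimates will be
    # complex128 rather than float64; such candidates get the worst fitness.
    if np.asarray(y_estimates).dtype != y_expected.dtype:
        return float("inf")
    sqerrors = (y_expected - y_estimates) ** 2
    return float(
        np.sqrt(sqerrors.sum() / len(y_expected))
        / (y_expected.max() - y_expected.min())
    )
```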


github-actions bot commented Jan 9, 2025

🦙 MegaLinter status: ✅ SUCCESS

Descriptor Linter Files Fixed Errors Elapsed time
✅ PYTHON black 36 0 0.98s
✅ PYTHON pylint 36 0 5.87s

See detailed report in MegaLinter reports



codecov bot commented Jan 9, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.02%. Comparing base (f22af96) to head (1cd053e).
Report is 3 commits behind head on main.

@@           Coverage Diff           @@
##             main     #301   +/-   ##
=======================================
  Coverage   97.02%   97.02%           
=======================================
  Files          29       29           
  Lines        1849     1849           
=======================================
  Hits         1794     1794           
  Misses         55       55           
Files with missing lines Coverage Δ
...stimation/genetic_programming_regression_fitter.py 98.86% <100.00%> (ø)

Continue to review full report in Codecov by Sentry.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7bf3f4c...1cd053e. Read the comment docs.

@jmafoster1 jmafoster1 marked this pull request as ready for review January 9, 2025 09:28
@jmafoster1 jmafoster1 requested a review from f-allian January 9, 2025 09:28
@f-allian
Contributor

f-allian commented Jan 9, 2025

@jmafoster1 Can you explain in a bit more detail what the problem is here? It doesn't make sense for any of the predicted values to be complex, so I think hard-coding a condition to eliminate them is probably not the best way of resolving it.

@jmafoster1
Contributor Author

The problem is that candidate expressions are generated at random in GP, so there is the possibility of evaluating the square root of a negative number during GP. We cannot prevent this unless we remove the sqrt operator from the set of operators. Some versions of sqrt raise an exception if you try to evaluate a negative (which we catch in the fitness method), but the one I've been using returns an instance of np.complex128. This PR addresses that by giving candidate expressions an infinite fitness (i.e. really bad) if they produce an output that's of a different type to the observed output. Does that make sense?
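The differing `sqrt` behaviours described above can be seen directly. A small illustration, not project code:

```python
import math
import cmath
import numpy as np

# math.sqrt raises an exception, which a fitness method can catch...
try:
    math.sqrt(-1)
except ValueError as e:
    print("math.sqrt:", e)  # math domain error

# ...cmath.sqrt and numpy's sqrt on a complex input silently return
# a complex result rather than raising.
print(cmath.sqrt(-1))        # 1j
print(np.sqrt(complex(-1)))  # 1j, an np.complex128 scalar

# numpy's sqrt on a negative *real* input returns nan instead.
with np.errstate(invalid="ignore"):
    print(np.sqrt(-1.0))     # nan
```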

@f-allian
Contributor

f-allian commented Jan 9, 2025

> The problem is that candidate expressions are generated at random in GP, so there is the possibility of evaluating the square root of a negative number during GP. We cannot prevent this unless we remove the sqrt operator from the set of operators. Some versions of sqrt raise an exception if you try to evaluate a negative (which we catch in the fitness method), but the one I've been using returns an instance of np.complex128. This PR addresses that by giving candidate expressions an infinite fitness (i.e. really bad) if they produce an output that's of a different type to the observed output. Does that make sense?

This doesn't make much sense to me. If, for whatever strange reason, your y_estimates is yielding an array of complex numbers then your current formula for nrmse isn't appropriate. You would instead have to calculate the magnitude of the sum of squares, i.e:

Edit: I originally wrote

nrmse = np.abs(sqerrors.sum() / len(self.df)) / (self.df[self.outcome].max() - self.df[self.outcome].min())

but what I meant was:

sqerrors = np.abs(self.df[self.outcome] - y_estimates) ** 2
nrmse = np.sqrt(sqerrors.sum() / len(self.df)) / (self.df[self.outcome].max() - self.df[self.outcome].min())

Does that make sense?
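For reference, the magnitude-based variant suggested above can be written as a self-contained function. The `df`/`outcome`/`y_estimates` names follow the snippet in the comment; this is a sketch of the suggestion, not the project's code:

```python
import numpy as np
import pandas as pd

def nrmse_abs(df: pd.DataFrame, outcome: str, y_estimates: np.ndarray) -> float:
    """NRMSE that stays real even for complex predictions, via |error|**2."""
    # |a - b| ** 2 is real and non-negative whether the errors are real or complex.
    sqerrors = np.abs(df[outcome] - y_estimates) ** 2
    return float(
        np.sqrt(sqerrors.sum() / len(df))
        / (df[outcome].max() - df[outcome].min())
    )
```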

@jmafoster1
Contributor Author

My point here is that if it's returning an array of complex numbers, then the candidate expression is wrong, so should be assigned infinite fitness (we are minimising here, so fitness infinity is infinitely bad).

@f-allian
Contributor

f-allian commented Jan 9, 2025

> My point here is that if it's returning an array of complex numbers, then the candidate expression is wrong, so should be assigned infinite fitness (we are minimising here, so fitness infinity is infinitely bad).

Sorry, had a typo in my above comment (see above).

If some candidate expressions are wrong/complex dtypes, can you not filter them out instead and avoid doing all of this?

@jmafoster1
Contributor Author

Unfortunately not. Every individual in the population must have a fitness value assigned to it. Better individuals will persist across generations of the population, with poorer individuals being filtered out (based on fitness value). However, in this case, there is no easy way to generate guaranteed valid individuals (i.e. individuals which will always produce real values). The best we can do is give invalid individuals very poor fitness values so that they (hopefully) do not persist for long. It's a fairly standard practice in GP.
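A toy illustration of that standard practice, in plain Python rather than a GP toolkit (all names here are made up for the example):

```python
import math

def evaluate(individual) -> float:
    """Fitness = the candidate's value if real and finite, else infinity.

    Lower is better, so math.inf is the worst possible fitness.
    """
    try:
        value = individual()  # evaluate the candidate expression
    except (ValueError, OverflowError):
        return math.inf  # invalid individuals still get *a* fitness
    if isinstance(value, complex) or not math.isfinite(value):
        return math.inf  # complex or nan/inf output: treat as invalid
    return value

# Every individual gets a fitness; selection then filters the bad ones out.
population = [lambda: 3.0, lambda: math.sqrt(-1), lambda: complex(0, 1)]
fitnesses = [evaluate(ind) for ind in population]
survivors = [ind for ind, f in zip(population, fitnesses) if f < math.inf]
```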

@jmafoster1
Contributor Author

The ideal situation would be to do this in a strongly typed language, so we could guarantee that every individual was at least valid, but that's just a limitation of doing it in Python.

@f-allian
Contributor

f-allian commented Jan 9, 2025

> Unfortunately not. Every individual in the population must have a fitness value assigned to it. Better individuals will persist across generations of the population, with poorer individuals being filtered out (based on fitness value). However, in this case, there is no easy way to generate guaranteed valid individuals (i.e. individuals which will always produce real values). The best we can do is give invalid individuals very poor fitness values so that they (hopefully) do not persist for long. It's a fairly standard practice in GP.

It sounds like you've thought it through, but I can't quite agree with this approach. Assigning specific fitness values to a selected group of individuals is fine, and sounds like some form of regularisation. But it sounds like the fitness function/model you're employing is probably not well-constrained. I'll approve this PR, but it might be worth coming back to in the future, IMO.

@jmafoster1 jmafoster1 merged commit 641107f into main Jan 9, 2025
22 checks passed
@jmafoster1 jmafoster1 deleted the jmafoster1/fix-gp-fitness branch January 9, 2025 13:03
@jmafoster1
Contributor Author

Thanks Farhad. Yes, I'm not really a fan. DEAP has lots of limitations and weird workarounds like this, but it's the most established and best documented toolkit for genetic algorithms that I've found so far.
