Skip to content
This repository has been archived by the owner on Dec 11, 2022. It is now read-only.

Integration with MultivariateStats.jl, enabling PCA/whitening across many SDMLayers #132

Open
wants to merge 22 commits into
base: main
Choose a base branch
from

Conversation

gottacatchenall
Copy link
Member

@gottacatchenall gottacatchenall commented Oct 25, 2021

What the pull request does
Implements methods for taking a set of SDMLayers and applying multivariate transformations from JuliaStats/MultivariateStats.jl to do a variety of things: reduce the dimensionality of a data set (PCA, PPCA, KernelPCA), make layers without correlation (Whitening), or the other methods mvstats provides
Type of change

Please indicate the relevant option(s)

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ❇️ New feature (non-breaking change which adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📖 This change requires a documentation update

Checklist

  • The changes are documented
    • The docstrings of the different functions describe the arguments, purpose, and behavior
    • There are examples in the documentation website
  • The changes are tested
  • The changes do not modify the behavior of the previously existing functions
    • If they do, please explain why and how in the introduction paragraph
  • For new contributors - my name and information are added to .zenodo.json
  • The Project.toml field version has been updated
    • Change the last number for a v0.0.x series release, a bug fix, or a performance improvement
    • Change the middle number for additional features that do not break the current test suite (unless you fix a bug in the test suite)
    • Change the first number for changes that break the current test suite

@codecov
Copy link

codecov bot commented Oct 25, 2021

Codecov Report

Merging #132 (f2b8627) into main (b3b0f32) will increase coverage by 0.95%.
The diff coverage is 55.04%.

@@            Coverage Diff             @@
##             main     #132      +/-   ##
==========================================
+ Coverage   54.13%   55.08%   +0.95%     
==========================================
  Files          35       36       +1     
  Lines        1101     1120      +19     
==========================================
+ Hits          596      617      +21     
+ Misses        505      503       -2     
Flag Coverage Δ
unittests 55.08% <55.04%> (+0.95%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/datasets/earthenv/topography.jl 0.00% <0.00%> (ø)
src/datasets/layernames.jl 0.00% <0.00%> (ø)
src/datasets/worldclim/elevation.jl 0.00% <0.00%> (ø)
src/integrations/DataFrames.jl 0.00% <0.00%> (ø)
src/operations/mask.jl 0.00% <0.00%> (ø)
src/operations/sliding.jl 0.00% <0.00%> (ø)
src/pseudoabsences/radius.jl 0.00% <0.00%> (ø)
src/pseudoabsences/randomselection.jl 0.00% <0.00%> (ø)
src/pseudoabsences/surfacerangeenvelope.jl 0.00% <0.00%> (ø)
src/recipes/recipes.jl 12.90% <12.90%> (-0.17%) ⬇️
... and 27 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2e5dc79...f2b8627. Read the comment docs.

@tpoisot
Copy link
Member

tpoisot commented Oct 26, 2021

Can I be a nuisance? I'd like it a lot if we could do something like

W = fit(PCA, stack_of_layers)
smaller_stack_of_layers = transform(W, stack_of_layers)

In some cases, we don't want to reduce the dimensionality as much as we would like to e.g. whiten the data. I'd drop that in a file loaded when MultivariateStats is loaded, using Requires.

@tpoisot
Copy link
Member

tpoisot commented Oct 26, 2021

This makes me realize that we may want to have a new type, like SimpleSDMStack, to store layers that pass the _layers_are_compatible test. This is one we can use in mosaic (even though mosaic works when the overlap is incomplete), but would definitely work here. If we want to be smart about it, we can also have a 3d matrix and an array of names, and from this we extract the layers as required. I can suggest something in another PR.

@gottacatchenall
Copy link
Member Author

overloading fit and transform should be easy enough

@gottacatchenall gottacatchenall changed the title [new feature] dimensionality reduction across many SDMLayers via MultivariateStats.jl [new feature] mv transformations across many SDMLayers via MultivariateStats.jl Oct 26, 2021
Project.toml Outdated Show resolved Hide resolved
src/SimpleSDMLayers.jl Outdated Show resolved Hide resolved
src/SimpleSDMLayers.jl Outdated Show resolved Hide resolved
src/operations/transform.jl Outdated Show resolved Hide resolved
src/operations/transform.jl Outdated Show resolved Hide resolved
src/operations/transform.jl Outdated Show resolved Hide resolved
src/operations/transform.jl Outdated Show resolved Hide resolved
@gottacatchenall
Copy link
Member Author

gottacatchenall commented Oct 26, 2021

@tpoisot mostly there, current code works for PCA and PPCA but throws errors for KernelPCA and Whitening, although those could be a product of using randomly generated matrices without correlation in the current tests.

also given the function call structure it requires computing the input matrix in both fit and transform, might start to cause performance concerns for big/many layers

do you want to merge this into a different branch to implement SDMStacks?

@tpoisot
Copy link
Member

tpoisot commented Oct 28, 2021

I don't think this needs to have stacks yet - but we can merge, and add the stacks later?

@gottacatchenall
Copy link
Member Author

gottacatchenall commented Oct 28, 2021

just pushed docs and a type assert for PCA/PPCA for now, do you merge into a dev branch before merging to main and tagging? @tpoisot

@tpoisot
Copy link
Member

tpoisot commented Oct 28, 2021

So I don't think a type assert is required - Whitening fails sometimes because it requires a specific matrix type.

I'll review and merge/tag, there's no dev branch.

@gottacatchenall
Copy link
Member Author

removed the type assert

@tpoisot
Copy link
Member

tpoisot commented Nov 8, 2021

Can you add a documentation update?

@tpoisot
Copy link
Member

tpoisot commented Nov 9, 2021

there's a _make_input method that doesn't look defined

@gottacatchenall
Copy link
Member Author

looks like i dropped that when shifting to the common_keys = reduce(∩, keys.(layers)) call but didn't add it to fit, should work now

@tpoisot
Copy link
Member

tpoisot commented Nov 9, 2021

also fit(Whitening, layers) is still failing with non-posdef matrix on the same input data as the doc example

this probably "just" need a little explanation that sometimes Whitening giveth, and sometimes it taketh away

@gottacatchenall
Copy link
Member Author

fixed the whitening example (i think), needed a slightly different overload of transform

@gottacatchenall
Copy link
Member Author

actuatlly i might be overcomplicating it a bit

@gottacatchenall
Copy link
Member Author

The implementation of transform could be one function but runs into ambiguity with

transform(f::Whitening, x::AbstractVecOrMat{T} where T) in MultivariateStats
transform(proj::T, layers::AbstractVecOrMat{U}, kwargs...) where {T, U<:SimpleSDMLayer} in SimpleSDMLayers

@gottacatchenall
Copy link
Member Author

now implemented via a single function call, the only irritant being in the case that if SimpleSDMLayers and MultivariateStats are both in the namespace, it requires explicit usage of SDM.transform/MV.transform

@gottacatchenall gottacatchenall changed the title [new feature] mv transformations across many SDMLayers via MultivariateStats.jl Integration with MultivariateStats.jl, enabling PCA/whitening across many SDMLayers Nov 9, 2021
@tpoisot
Copy link
Member

tpoisot commented Nov 9, 2021

now implemented via a single function call, the only irritant being in the case that if SimpleSDMLayers and MultivariateStats are both in the namespace, it requires explicit usage of SDM.transform/MV.transform

Wait, there's a transform in SDM?

@gottacatchenall
Copy link
Member Author

gottacatchenall commented Nov 9, 2021

not at the moment, but implementing a transform method within the SDM namespace (only in the case MVStats is also loaded) is a way to avoid the method ambiguity issue above (which in the long term could probably be solved by a PR within MVStats.jl)

edit: no need to export if it is dependent on MVStats being loaded, it just requires SDMs.transform() which isn't ideal

essentially this is a problem with MVStats.jl dispatch though

@gottacatchenall
Copy link
Member Author

gottacatchenall commented Nov 9, 2021

In the long term I think the julianesque solution is for MVStats.jl to define all its (fit, transform) methods on different models (pca, ppca, whitening) around an a single method dispatched on a single abstract type, as opposed to their current definitions which are fit(::PCA, matrix) = ..., fit(::Whitening, matrix) = ..., etc.

@tpoisot
Copy link
Member

tpoisot commented Nov 9, 2021

So can't we temporarily solve this with generated code?

@gottacatchenall
Copy link
Member Author

macros are probably the feature of julia i understand the least, but if that's a neat way to resolve the transform function call ambiguity i'm all ears

@tpoisot
Copy link
Member

tpoisot commented Nov 10, 2021

Let me try

@tpoisot
Copy link
Member

tpoisot commented Nov 10, 2021

OK so I don't think it needed the code generation - I'll check the example page before releasing

@gottacatchenall
Copy link
Member Author

yeah it runs into the same dispatch ambiguity without defining a transform in the SDM namespace

@gottacatchenall
Copy link
Member Author

gottacatchenall commented Nov 10, 2021

there's also an inconsistency in how transform is defined across different mv transforms in MVStats.jl which should probably get raised as an issue there

@gottacatchenall
Copy link
Member Author

gottacatchenall commented Nov 10, 2021

seems there is an MVStats.jl issue about exactly this issue, not sure how active MVStats.jl development is

@tpoisot
Copy link
Member

tpoisot commented Nov 12, 2021

Mhhhh. Running test doesn't show a method error on my side, are you sure the test suite is complete?

@gottacatchenall
Copy link
Member Author

i think it doesn't fail because i use newlayers for both plots in the docs src. but clearly doesn't work because the plot under the whitening example in the docs is the same as the above plot and should only be 2 maps not 4

@gottacatchenall
Copy link
Member Author

running locally gives me the same dispatch ambiguity

@gottacatchenall
Copy link
Member Author

and i should 100% add a test of Whitening, the issue is how to get a predefined input that it doesn't randomly fail on w/ pos-def matrix problem

@gottacatchenall
Copy link
Member Author

Visited this again, still can't find a simple way around this ambiguity

MethodError: transform(::MultivariateStats.Whitening{Float64}, ::Vector{SimpleSDMResponse{Float64}}) is ambiguous. Candidates:
    transform(f::MultivariateStats.Whitening, x::AbstractVecOrMat) in MultivariateStats at /home/michael/.julia/packages/MultivariateStats/HTpHt/src/whiten.jl:38
    transform(proj::P, layers::AbstractVecOrMat{U}; kwargs...) where {U<:SimpleSDMLayer, P<:Union{MultivariateStats.PCA, MultivariateStats.Whitening}} in SimpleSDMLayers at /home/michael/code/SimpleSDMLayers.jl/src/integrations/MultivariateStats.jl:26
  Possible fix, define
    transform(::P, ::AbstractVecOrMat{U}) where {U<:SimpleSDMLayer, P<:MultivariateStats.Whitening}

The possible fix from the error message is implemented but doesn't actually fix the problem

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants