
Store only the covariance matrix in CoxModel, not the information #41

Open · wants to merge 4 commits into base: main
Conversation

ararslan
Member

@ararslan commented Jul 10, 2022

Currently we're storing both the (observed) Fisher information matrix and its inverse, the variance-covariance matrix, separately in the CoxModel object. We aren't actually using the information matrix for anything though, so we can save space by storing only the covariance matrix. If folks need the information matrix (and indeed, if we extend informationmatrix(::CoxModel; expected=false)), we can take the inverse as needed.

Alternatively, we could store a Cholesky decomposition of the information (or covariance), which is what this PR originally did.

Additionally, on main, the covariance matrix is obtained from the observed information via pinv, which allows obtaining a covariance matrix even in the case of rank deficiency in the model matrix, which doesn't really make sense. Instead, we can check for rank deficiency ahead of time and error out early before even attempting to fit a model.
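A minimal sketch of the approach described above (the helper names `check_full_rank` and `covariance_from_information` are illustrative, not the package's actual API):

```julia
using LinearAlgebra

# Hypothetical helper: reject rank-deficient model matrices before fitting,
# rather than papering over rank deficiency with `pinv` later.
function check_full_rank(M::AbstractMatrix)
    rank(M) < minimum(size(M)) && throw(ArgumentError("model matrix is not full rank"))
    return M
end

# With full rank guaranteed, the observed information is invertible, so the
# variance-covariance matrix can be obtained with a plain `inv` (no `pinv`).
covariance_from_information(H::AbstractMatrix) = Symmetric(inv(H))
```

Storing only the result of `covariance_from_information` is enough: if `informationmatrix(::CoxModel; expected=false)` is ever needed, it can be recovered by inverting the stored covariance.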

codecov-commenter commented Jul 10, 2022

Codecov Report

Base: 99.63% // Head: 99.64% // Increases project coverage by +0.00% 🎉

Coverage data is based on head (e2810db) compared to base (12164f5).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #41   +/-   ##
=======================================
  Coverage   99.63%   99.64%           
=======================================
  Files           6        6           
  Lines         276      283    +7     
=======================================
+ Hits          275      282    +7     
  Misses          1        1           
Impacted Files Coverage Δ
src/cox.jl 99.31% <100.00%> (+0.03%) ⬆️

☔ View full report at Codecov.

Currently we're storing both the (observed) Fisher information matrix
and its inverse, the variance-covariance matrix, separately in the
`CoxModel` object. Since we know the observed information is positive
semi-definite, we can instead store a (possibly pivoted) Cholesky
factorization of the information. That allows us to easily construct
both the original matrix and its inverse without needing to store both.

With only a few parameters (roughly ≤3) and information that is positive
semi-definite but not positive definite, this actually uses slightly more
memory than storing the matrices separately. For any more parameters,
however, regardless of whether the information is positive definite, this
saves a good amount of memory, and the difference grows with the number
of parameters.
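A toy illustration of this idea (not the PR's actual code): a single Cholesky factorization can reproduce both the information matrix and its inverse on demand.

```julia
using LinearAlgebra

H = [4.0 1.0; 1.0 3.0]       # toy observed information, positive definite
F = cholesky(Symmetric(H))   # the only object we'd need to store

information(F::Cholesky) = Matrix(F)  # reconstructs H from its factor
covariance(F::Cholesky) = inv(F)      # H⁻¹ without ever materializing H
```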
@andreasnoack
Member

I'm in favor of this, but I'm wondering if we should just use Cholesky. We only use this field for computing the uncertainty after the estimation, right? And the inverse calculation will fail in the singular case.

@ararslan
Member Author

We only use this field for computing the uncertainty after the estimation, right

AFAICT the Fisher information field isn't meaningfully used anywhere, but the covariance matrix field is indeed used for uncertainty. The fields store the Hessian and its inverse after fitting, they aren't being passed anywhere during fitting.

and the inverse calculation will fail in the singular case.

The covariance matrix is currently being computed via pinv, so we do get a result even in the singular case.
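For illustration, on a toy singular matrix the two behave as discussed here: `pinv` yields the Moore-Penrose pseudoinverse while `inv` throws.

```julia
using LinearAlgebra

H = [1.0 1.0; 1.0 1.0]   # rank-deficient toy "information" matrix
V = pinv(H)              # succeeds, returning the pseudoinverse H / 4
# inv(H) would throw SingularException; so would inverting a (non-pivoted)
# Cholesky factorization of H.
```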

@andreasnoack
Member

But with this PR it would fail, no? I'm not sure using pinv is sound when computing a variance estimate so failing might be better.

@ararslan
Member Author

But with this PR it would fail, no?

No, at least as written it would use a pivoted Cholesky factorization.

I'm not sure using pinv is sound when computing a variance estimate so failing might be better.

Yeah I agree, not sure why it uses pinv currently (though I guess it's likely specifically to handle this case). It would be nice to be able to determine that case ahead of time to throw an error early and avoid fitting the model altogether if possible.

@ararslan
Member Author

Would it be enough to add something like

rank(M) < min(size(M)...) && throw(ArgumentError("model matrix is not full rank"))

to the fit(CoxModel, ...) method?

Your point about only using the field for uncertainty is a very good one; I wonder whether it's even worth storing the decomposition of the Fisher information, since we primarily care about the covariance matrix. We could store that instead, either dense or factorized. Thoughts on that?

src/cox.jl Outdated
@@ -119,7 +118,7 @@ StatsAPI.dof(obj::CoxModel) = length(coef(obj))

StatsAPI.dof_residual(obj::CoxModel) = nobs(obj) - dof(obj)

-StatsAPI.vcov(obj::CoxModel) = obj.vcov
+StatsAPI.vcov(obj::CoxModel) = inv(obj.chol)
Member

My point is that this one will error in the rank deficient case since

julia> inv(cholesky([0 0; 0 1], Val(true), check=false))
ERROR: SingularException(2)
Stacktrace:
 [1] chknonsingular
   @ /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/LinearAlgebra/src/lapack.jl:49 [inlined]
 [2] potri!(uplo::Char, A::Matrix{Float64})
   @ LinearAlgebra.LAPACK /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/LinearAlgebra/src/lapack.jl:3118
 [3] inv(C::CholeskyPivoted{Float64, Matrix{Float64}})
   @ LinearAlgebra /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/LinearAlgebra/src/cholesky.jl:688
 [4] top-level scope
   @ REPL[150]:1

Member Author

Ohhh okay yeah, I see. And in that case then there's no point in using a pivoted factorization since we primarily care about the inverse. 👍

Could still be worth checking for collinearity ahead of time though.

@ararslan
Member Author

ararslan commented Jul 26, 2022

Based on our discussion, the title of the PR had become inaccurate (I've updated that and the OP). What's implemented is now the following:

  • Throw an error if X in fit(CoxModel, X, y) is rank deficient
  • Store only the covariance matrix, not the observed information
  • Store the covariance matrix as a Symmetric-wrapped matrix rather than just a matrix

Let me know what you think, @andreasnoack!
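The third bullet in the comment above can be illustrated with toy values: wrapping the stored covariance in `Symmetric` guarantees exact symmetry despite any floating-point noise in the computed inverse, and lets LinearAlgebra dispatch to symmetry-aware methods.

```julia
using LinearAlgebra

H = [4.0 1.0 0.5; 1.0 3.0 0.2; 0.5 0.2 2.0]  # toy observed information
V = Symmetric(inv(H))                         # stored covariance
# issymmetric(V) holds by construction, even if inv(H) itself has tiny
# asymmetries; operations such as eigen(V) then use symmetric algorithms.
```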

@ararslan changed the title "Store the Cholesky of the information in CoxModel" → "Store only the covariance matrix in CoxModel, not the information" Jul 27, 2022
@@ -224,6 +228,9 @@ Cox proportional hazard model estimate of coefficients. Returns a `CoxModel`
object.
"""
function StatsAPI.fit(::Type{CoxModel}, M::AbstractMatrix, y::AbstractVector; tol=1e-4, l2_cost=0)
if rank(M) < min(size(M)...)
throw(ArgumentError("model matrix is not full rank; some terms may be collinear"))
Member Author

Alternatively, we could add a dropcollinear keyword argument to mimic GLM. This is probably hot garbage but I think we could implement it naively within the existing framework as

r = rank(M)
if r < min(size(M)...)
    if dropcollinear
        # pivoted QR orders the columns so the first r are linearly independent
        F = qr(M, ColumnNorm())
        pivot = F.jpvt
        keepcols = pivot[1:r]
        dropcols = pivot[(r + 1):end]
        M = M[:, sort(keepcols)]
    else
        throw(ArgumentError("model matrix is not full rank and `dropcollinear=false`"))
    end
end

We could then use dropcols to either go back and re-add terms with a coefficient of 0 or emit some kind of informative warning.
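A toy run of that sketch (assumed data, not package code), where column 3 is a multiple of column 1:

```julia
using LinearAlgebra

M = [1.0 0.0 2.0;
     0.0 1.0 0.0;
     1.0 1.0 2.0]
r = rank(M)                    # 2: one column is redundant
F = qr(M, ColumnNorm())
keepcols = sort(F.jpvt[1:r])   # linearly independent columns to keep
dropcols = F.jpvt[(r + 1):end] # redundant columns to report or zero out
Mkeep = M[:, keepcols]         # full-rank reduced model matrix
```

Note that pivoting chooses which of the collinear columns survives by column norm, so the dropped column need not be the last one the user wrote.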

Member Author

For reference, R's coxph fits the model and keeps around the redundant terms but the result has all NA for the coefficients and statistics and whatnot for those terms.
