
Cox Regression #2

Merged
merged 26 commits into from
Dec 16, 2017

Conversation

piever
Contributor

@piever piever commented Aug 8, 2017

I've added Cox Regression (from my previous implementation in my own package, after some cleanup). It took longer than expected, but my old code needed quite a bit of polishing...

There are a few caveats though:

  • The formula needs to have an explicit 0+... for the regression to work
  • The code I wrote works both for Vector{EventTime} and EventTimeVector, except DataFrames + ModelFrame only work well together with EventTimeVector. Even if I add a DataFrames.upgrade_vector method for EventTimeVector, for some reason ModelFrame mangles it and only takes the time component of the vector. Also, DataArray conversion of an EventTimeVector does not really seem feasible, as the data part of a DataArray is supposed to be an Array, which EventTimeVector is not. Vector{EventTime}, on the other hand, seems to work just fine.
  • I'm not sure how to set up my fork to build the docs locally and check that everything is fine when I change/add things, how would that work?
  • In my package I have the Nelson-Aalen estimator for the cumulative hazard, which can be easily generalized to get an estimate of baseline cumulative hazard for a fitted Cox Model. It really is the same as Kaplan Meier except this line would be replaced by something like:
    chaz += d_i / n_i
    Maybe it makes more sense that you add this feature as you know better how to integrate it with the rest of the package, otherwise I can give it a shot.
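To make the one-line difference concrete, here's a small standalone sketch (toy code, not the package's internals) of how Kaplan-Meier and Nelson-Aalen can share the same counting loop, with only the per-time update differing — the `chaz += d_i / n_i` line above is the Nelson-Aalen version:

```julia
# Toy sketch: Kaplan-Meier vs Nelson-Aalen over grouped event times.
# d_i = events at time t_i, n_i = subjects at risk just before t_i.
function estimate(times, status; method::Symbol = :km)
    out = Float64[]
    surv, chaz = 1.0, 0.0
    for ti in sort(unique(times))
        n_i = count(t -> t >= ti, times)                         # at risk at ti
        d_i = count(j -> times[j] == ti && status[j], eachindex(times))
        if method == :km
            surv *= 1 - d_i / n_i   # Kaplan-Meier: product of survival fractions
            push!(out, surv)
        else
            chaz += d_i / n_i       # Nelson-Aalen: sum of hazard increments
            push!(out, chaz)
        end
    end
    return out
end

estimate([1, 2, 2, 3], [true, true, false, true])                # KM survival
estimate([1, 2, 2, 3], [true, true, false, true]; method = :na)  # NA cumulative hazard
```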

Everything else seems to work well: I get the same results as this paper (check page 6) on a test dataset that I have included in the tests. Performance seems pretty good, but I no longer have anything to compare it with, as I'm only set up to do Cox Regression in Julia on my machine: if you have some other setup (e.g. R or MATLAB) I'd be curious to see some benchmarking.

On the more technical side, I'm fitting Cox Regression by maximizing the "partial log-likelihood" (formulas here) using Efron's method in case of ties. I'm computing the value, gradient, and Hessian, and using Newton's algorithm with backtracking (implemented in the optimization.jl file; it really is such a simple algorithm that I don't think there's much need to depend on an optimization package, so I preferred to simply re-implement it).
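As a hypothetical minimal sketch of that scheme (the names and tolerances here are illustrative, not the PR's exact code): Newton's method where `fgh!(x, g, H)` returns the objective value and fills the gradient and Hessian in place, with an Armijo backtracking line search:

```julia
using LinearAlgebra

# Sketch: Newton's method with backtracking line search.
# fgh!(x, g, H) must return the objective value at x and fill g and H
# with the gradient and Hessian in place.
function newton_backtracking(fgh!, x; ρ = 0.5, c = 1e-4, tol = 1e-8, max_iter = 100)
    n = length(x)
    g, H = zeros(n), zeros(n, n)
    tmpg, tmpH = similar(g), similar(H)
    for _ in 1:max_iter
        y = fgh!(x, g, H)
        norm(g) < tol && return x
        dir = -(H \ g)               # Newton search direction
        slope = dot(g, dir)          # directional derivative, < 0 for descent
        step = 1.0
        # Armijo backtracking: shrink the step until sufficient decrease
        while fgh!(x .+ step .* dir, tmpg, tmpH) > y + c * step * slope
            step *= ρ
            step > 1e-10 || error("line search failed")
        end
        x = x .+ step .* dir
    end
    error("no convergence in $max_iter iterations")
end

# Example: minimize f(x) = (x₁ - 1)² + 2x₂²
fgh! = (x, g, H) -> begin
    g .= [2(x[1] - 1), 4x[2]]
    H .= [2.0 0.0; 0.0 4.0]
    (x[1] - 1)^2 + 2x[2]^2
end
newton_backtracking(fgh!, [5.0, -3.0])  # ≈ [1.0, 0.0]
```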

@piever
Contributor Author

piever commented Aug 8, 2017

Here's a code snippet to try it out:

using Survival, DataFrames
filepath = joinpath(Pkg.dir("Survival", "test"), "rossi.csv")
rossi = readtable(filepath)
rossi[:event] = EventTime.(rossi[:week], rossi[:arrest] .== 1)
outcome = coxph(@formula(event ~ 0 + fin + age + race + wexp + mar + paro + prio), rossi; tol = 1e-8)

@ararslan
Member

ararslan commented Aug 8, 2017

I'M JUST SO HAPPY. This is awesome, @piever, thank you! The 0.7 CI failures are due to some macro changes in Base that have been breaking everything.

I'll do a more thorough review of this soon. Thanks again!

@ararslan
Member

ararslan commented Aug 8, 2017

Maybe it makes more sense that you add this feature as you know better how to integrate it with the rest of the package

I've actually been meaning to generalize the internal Kaplan-Meier function to update a state object that can be specialized for different estimators. I realized when I went to implement cumulative incidence with competing risks that it's a trivial modification of that function. I hadn't looked into Nelson-Aalen yet but if it's in a similar boat then that's all the more reason for a generalization there. 🙂

@ararslan
Member

ararslan commented Aug 8, 2017

using Efron's method in case of ties.

👍 I had started working on Cox regression a while back and I had only done Breslow ties because I'm lazy. I do love that Norm Breslow himself has said not to use that estimator. 😂 It's just so easy to implement though!

I don't think there's much need to depend on some optimization package and I preferred to simply re-implement it

💯

The formula needs to have an explicit 0+... for the regression to work

I wonder if we could inspect the Formula object and remove an intercept if it exists.
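One possible sketch of that idea, operating on the right-hand-side expression (the real Formula type stores its RHS as an Expr, so the same transformation should carry over; treat this as an illustration, not the StatsModels API):

```julia
# Prepend an explicit 0 to a formula's right-hand side unless the user
# already suppressed the intercept with `0 + ...` or `-1 + ...`.
function force_no_intercept(rhs)
    if isa(rhs, Expr) && rhs.head == :call && rhs.args[1] == :+ &&
       (rhs.args[2] == 0 || rhs.args[2] == -1)
        return rhs             # already intercept-free
    end
    return :(0 + $rhs)         # force the 0 + ... form
end

force_no_intercept(:(fin + age))      # becomes :(0 + (fin + age))
force_no_intercept(:(0 + fin + age))  # unchanged
```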

As an aside, I'd be interested to hear your opinion of EventTimeVector in general. I'm not super convinced that it's all that convenient or useful and a while back I considered just ditching it altogether.

Member

@nalimilan nalimilan left a comment

Looks great! The details of the algorithm would take me far too much time to understand, but I have a few more basic comments.

test/runtests.jl Outdated
@@ -199,3 +201,23 @@ end
@test_throws DimensionMismatch fit(KaplanMeier, [1, 2], [true])
@test_throws ArgumentError fit(KaplanMeier, Float64[], Bool[])
end

@testset "Cox" begin
Member

I think this deserves more tests, with a few variants of the model covering all common scenarios and corner cases.

Member

More tests never hurt 🙂

Contributor Author

Agreed! I've just noticed that, as ModelFrame always gives you matrices of Float64, that's the only thing I've been testing, so I should probably make sure that other types behave well. What else do you have in mind?

test/runtests.jl Outdated

@test nobs(outcome) == size(rossi, 1)
@test dof(outcome) == 7
@test loglikelihood(outcome) > nullloglikelihood(outcome)
Member

I'd just hardcode the values in the file.

@@ -0,0 +1,17 @@
function newton_raphson(fgh!, x::AbstractArray{T}; ρ = 0.5, c = 1e-4, tol = 1e-4, max_iter = 1000) where T
grad= zeros(T, length(x))
Member

Space before = here and below.

Member

I've also noted a few missing spaces after commas here and there.

end
x = x + search_dir*step_size
end
error("Reached maximum number of iterations")
Member

Could use StatsBase.ConvergenceException.

src/coxph.jl Outdated
function _coxph(X::AbstractArray{T}, s::AbstractVector; l2_cost = zero(T), kwargs...) where T
c = CoxAux(X, s, l2_cost)

fgh! = (β,grad,hes, compute_derivatives) ->
Member

Can't you just pass _cox_fgh! to newton_raphson?

Contributor Author

_cox_fgh! has one extra argument (the CoxAux type with all the model info). As I plan to use newton_raphson for other fits related to survival analysis, I prefer to keep it as a generic minimization routine.

step_size *= ρ
step_size > 1e-10 || error("Linesearch failed! Problematic Hessian or Gradient?")
end
x = x + search_dir*step_size
Member

Wouldn't it make sense to update x in place?

src/coxph.jl Outdated
end
end
end
y += λ*(β' * β)
Member

Use .+=?

Member

Both the right- and left-hand sides of += are scalars in this case though; note that y is defined to be zero(T) and β'β will be a scalar since it's a RowVector times a Vector.

Member

I wonder how β'β compares in terms of performance with dot(β, β) and sum(x->x^2, β).

Member

I've done some testing, and the first two forms are essentially equivalent in speed. The sum approach is quite a bit slower though, which is surprising (maybe worth filing a bug against Base?).

Member

At least on 0.6, I see the opposite:

julia> x = rand(10_000);

julia> @benchmark $x'*$x
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.333 μs (0.00% GC)
  median time:      7.041 μs (0.00% GC)
  mean time:        7.718 μs (0.00% GC)
  maximum time:     39.177 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> @benchmark dot($x, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.545 μs (0.00% GC)
  median time:      6.647 μs (0.00% GC)
  mean time:        6.931 μs (0.00% GC)
  maximum time:     20.233 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> @benchmark sum(i->i^2, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.808 μs (0.00% GC)
  median time:      2.831 μs (0.00% GC)
  mean time:        2.912 μs (0.00% GC)
  maximum time:     10.364 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

Not surprising that the x'x form is equivalent in speed to dot(x, x), since @code_lowered shows that the former is calling dot. TIL. 🙂

I tried to benchmark on master too but BenchmarkTools seems completely broken on master.

src/coxph.jl Outdated

if compute_derivatives
for k1 in 1:length(β)
grad[k1] += 2*λ*β[k1]
Member

Double space.

src/cox_utils.jl Outdated
end


# fs = index first deaths, ls = index last deaths,
Member

Would be nice to use a more neutral term than "death". :-) Maybe "time-to-event" or just "time"?

src/cox_utils.jl Outdated
@@ -0,0 +1,78 @@
# Compute first and last
firsts(s) = [s[t].status && (t==1 || s[t] > s[t-1]) for t = 1:length(s)]
Member

Looks like you could create generators rather than arrays here, that would avoid a few allocations (not sure it matters).
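For instance, with a minimal stand-in type for EventTime (only a status field and an ordering are needed for the illustration), the comprehension could become a lazy generator:

```julia
# Minimal stand-in for EventTime, just for this example.
struct Ev
    time::Int
    status::Bool
end
# Order by (time, status), a simplified version of event-time ordering.
Base.isless(a::Ev, b::Ev) = (a.time, a.status) < (b.time, b.status)

# The comprehension rewritten as a generator: lazy, no intermediate Bool array.
firsts_gen(s) = (s[t].status && (t == 1 || s[t] > s[t-1]) for t in 1:length(s))

s = [Ev(1, true), Ev(1, true), Ev(2, false), Ev(3, true)]
collect(firsts_gen(s))  # Bool[true, false, false, true]
```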

@nalimilan
Member

The upgrade_vector issue should soon go away when we backport the DataTables improvements to DataFrames: the constructor will now just use whatever column type you pass without trying to "upgrade" it. OTOH, supporting missing values with EventTimeVector is going to require a custom implementation. Maybe it's better to leave this as an optimization for the compiler, which could decide to store a Vector{EventTime} as two separate vectors, one for each of the fields (which should save some memory due to alignment constraints).

src/cox_utils.jl Outdated
end

function CoxSum(values::Array{T,N}, fs, ls) where {T, N}
sums = zeros(T, size(values,1)+1, Base.tail(size(values))...)
Member

Probably just micro-optimizing here, but it might be a little more efficient to do it like

function CoxSum(values::Array{T,N}, fs, ls) where {T,N}
    tailsize = Base.tail(size(values))
    sums = zeros(T, size(values, 1) + 1, tailsize...)
    chunks = zeros(T, length(fs), tailsize...)
    CoxSum{T,N}(values, sums, chunks, copy(chunks))
end

test/runtests.jl Outdated
outcome_coefmat = coeftable(outcome)

filepath_coefs = joinpath(Pkg.dir("Survival", "test"), "expected_coefmat.jld")
expected_coefmat = JLD.load(filepath_coefs, "expected_coefmat")
Member

Is there any way we could just test against hard-coded values here rather than adding a test dependency on JLD? JLD has a tendency to break easily on Julia nightly since it relies a lot on internal details.

Member

In GLM.jl we just hardcode the coefficient values as inline arrays.

Contributor Author

Good point; also, because I keep getting annoying warnings from JLD, it's probably just better to type the values into the Julia file.

REQUIRE Outdated
@@ -1,3 +1,4 @@
julia 0.6.0-pre
StatsBase
Distributions
PositiveFactorizations
Member

I'm not familiar with this package. What does it provide that's used here?

Contributor Author

In order for Newton's algorithm to work far from a local minimum, where the Hessian may not be positive definite, it's good to have an approximation of the inverse Hessian that is positive definite, to ensure that the value of the objective function goes down; PositiveFactorizations is a small package that does just that: I use it here. Just using the inverse of the Hessian would probably still work, but less reliably. Optim also has it as a dependency, for the same reason I guess.

Member

Okay great, thanks for the explanation.

@piever
Contributor Author

piever commented Aug 9, 2017

Thanks for all the feedback!

I've actually been meaning to generalize the internal Kaplan-Meier function to update a state object that can be specialized for different estimators. I realized when I went to implement cumulative incidence with competing risks that it's a trivial modification of that function. I hadn't looked into Nelson-Aalen yet but if it's in a similar boat then that's all the more reason for a generalization there.

Nelson-Aalen gives you the cumulative hazard (i.e. -log(survival)) and it has the advantage that generalizing it to a Cox model fit is completely trivial. Some people actually estimate the survival function by exponentiating the Nelson-Aalen estimator. Instead, generalizing Kaplan-Meier to a Cox model fit is slightly more convoluted, especially in case of ties.
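The exponentiation trick mentioned above (often called the Fleming-Harrington estimator), on toy numbers:

```julia
# Survival from cumulative hazard: S(t) = exp(-H(t)), pointwise.
chaz = [0.25, 0.5833, 1.5833]   # hypothetical Nelson-Aalen values at event times
surv = exp.(-chaz)              # ≈ [0.779, 0.558, 0.205]
```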

As an aside, I'd be interested to hear your opinion of EventTimeVector in general. I'm not super convinced that it's all that convenient or useful and a while back I considered just ditching it altogether.

I'm not very fond of EventTimeVector, to be entirely honest. Creating a Vector{EventTime} is very easy with the dot broadcast syntax, plus it integrates nicely with DataFrames and DataArrays and already works in the case of missing data.

@piever
Contributor Author

piever commented Aug 9, 2017

I should have addressed most of the remarks. I've improved the type handling, so it should now accept matrices of any type without unexpected method errors (which I guess were one of the corner cases @nalimilan had in mind).

The remaining big thing left to do for Cox is an estimate of the baseline cumulative hazard, but I guess that should come together with an implementation of Nelson-Aalen (which I'd be happy to do in a separate PR, after this one is merged).

In terms of performance, I'm reasonably satisfied with this implementation (approx. 5 times faster than MATLAB on my colleague's computer on the dataset I'm testing with). The main thing it lacks is multithreading, but I thought that should wait until threading in Julia becomes a bit more stable.

@@ -13,5 +13,6 @@ function newton_raphson(fgh!, x::AbstractArray{T}; ρ = one(T)/2, c = 1e-4, tol
end
x .+= search_dir*step_size
end
error("Reached maximum number of iterations")
warn("Reached max iteration number: try increasing `max_iter` or `tol` parameters")
Member

Probably not necessary to warn before throwing an error. The ConvergenceException text should be reasonably informative.

docs/src/cox.md Outdated
\lambda(t|X_i) = \lambda_0(t)\exp(X_i\cdot\beta)
```

where $\lambda(t|X_i)$ is the estimated hazard for sample $i$, $\lambda_0$ is the baseline hazard, $X_i$ the vector of covariates for sample $i$ and $\beta$ are the coefficients of the regression.
Member

Documenter uses double backticks, e.g. ``i``, for inline math rather than TeX-style $i$.

Contributor Author

@piever piever Aug 9, 2017

I see, I'm a Documenter newbie: how can I look at the docs locally? I've tried:

MacBook-Pro:docs pietro$ mkdocs serve
INFO    -  Building documentation... 

Config file '/Users/pietro/.julia/v0.6/Survival/docs/mkdocs.yml' does not exist.

Contributor Author

Never mind, I can just run julia make.jl and open the generated index.html

Member

Oh, Documenter doesn't use mkdocs. On the command line, cd'd into the Survival.jl directory, you can do

julia -e 'include(joinpath("src", "Survival.jl")); include(joinpath("docs", "make.jl"))'

Just make sure you've done Pkg.add("Documenter") from within Julia first.

Member

Heh, you beat me to the explanation. 🙂

@ararslan
Member

ararslan commented Aug 9, 2017

In terms of performance I'm reasonably satisfied with this implementation (approx. 5 times faster than matlab on my colleague's computer on this dataset that I'm testing)

That's awesome. Nice work!!

The main thing it lacks is multithreading, but I thought that should wait until threading in julia becomes a bit more stable.

Yeah, agreed. Multithreading is still considered experimental in Julia, though some packages do use it, e.g. GLM. Probably not worth going down that road until it's solid though. Hopefully in Julia 1.0.

I'm not very fond of the EventTimeVector to be entirely honest.

Okay thanks, I appreciate your input. Maybe we should just get rid of it then in a future PR. Sometimes convenience is more important than speed. (A good example of that is DataArrays, which are convenient but slow, versus NullableArrays, which are fast but very inconvenient.)

@piever
Contributor Author

piever commented Aug 9, 2017

Okay thanks, I appreciate your input. Maybe we should just get rid of it then in a future PR. Sometimes convenience is more important than speed. (A good example of that is DataArrays, which are convenient but slow, versus NullableArrays, which are fast but very inconvenient.)

Don't get me wrong: I'm impressed that you got a more efficient version of Vector{EventTime} to behave just like Vector{EventTime}, to the extent that, even though I had designed Cox for Vector{EventTime}, it still worked when I fed it an EventTimeVector. It's mainly that I believe a lot of users could be confused by having both. Also, having it interact with the rest of the Julia data ecosystem could be tricky:

  • missing-data support would have to be reimplemented
  • I'm not sure how it would work to write/read an EventTimeVector to CSV with the existing tools (i.e. CSV or TextParse or DataFrames.readtable). Instead, adding a simple parser for EventTime would allow writing and reading it easily just by specifying the type in the CSV reader
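As a sketch of that parser idea (the "time+" suffix for right-censored observations, e.g. "50+", is my assumed serialization, not an existing format in the package, and the type here is a stand-in for EventTime):

```julia
# Stand-in for Survival.jl's EventTime, for illustration only.
struct SimpleEventTime{T}
    time::T
    status::Bool   # true = event observed, false = right-censored
end

# Parse "50" as an observed event at t = 50 and "50+" as censored at t = 50.
function parse_eventtime(str::AbstractString)
    censored = endswith(str, "+")
    t = parse(Float64, censored ? chop(str) : str)
    return SimpleEventTime(t, !censored)
end

parse_eventtime("50+")  # SimpleEventTime(50.0, false)
parse_eventtime("12")   # SimpleEventTime(12.0, true)
```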

@ararslan
Member

ararslan commented Aug 9, 2017

I agree completely. Like I said I was never that convinced that it was useful to have EventTimeVector. (Plus I never did benchmarks, so who knows whether it was actually more efficient than Vector{EventTime} 😛)

@piever piever changed the title RFC: Cox Regression Cox Regression Aug 14, 2017
@piever
Contributor Author

piever commented Aug 14, 2017

Please let me know if there is more feedback I need to incorporate. As far as I'm concerned, I'm happy with the current version :)

@ararslan
Member

The implementation looks great to me. Tests for more corner cases would be good but I'm not really sure what to suggest offhand. Maybe @nalimilan would have some ideas?

@piever
Contributor Author

piever commented Aug 17, 2017

Actually there still is one thing to figure out: the test using lower-precision floating point converges with a given accuracy in a given number of iterations on my machine but not on Travis; I should maybe use a larger tolerance in the test. The only other tests that come to mind are:

  • using interaction terms
  • using a nonzero L2 cost

@ararslan
Member

Actually there still is one thing to figure out: the test using lower precision floating points converges with a given accuracy in a given number of iterations on my machine but not on Travis

Huh, interesting. Travis runs on x86-64 Linux, dunno what the other specs are for their VMs. What is the machine you're testing that on?

@piever
Contributor Author

piever commented Aug 17, 2017

I was also surprised, this is my versioninfo:

julia> versioninfo()
Julia Version 0.6.0
Commit 903644385b (2017-06-19 13:05 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i5-4278U CPU @ 2.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

Maybe they have some different libraries (MKL versus openblas or something like that). I guess the only thing to do is lower the accuracy required when fitting with Float32 in the tests to make sure it also passes on Travis.

@ararslan
Member

This is Travis':

$ julia -e 'versioninfo()'
Julia Version 0.6.0
Commit 9036443 (2017-06-19 13:05 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, ivybridge)

Travis uses the official Julia binaries, so it should be OpenBLAS rather than MKL. Kinda mysterious... I can't imagine that it would be an Ivy Bridge vs. Haswell thing.

@nalimilan
Member

I can't imagine that it would be an Ivy Bridge vs. Haswell thing.

Doesn't sound impossible to me. Maybe we're just unlucky that a small difference affects convergence. Float32 is a fragile type, isn't it? What's funny is that it works fine here on my skylake-avx512 under Linux... Maybe remove one or two variables from the formula?

Regarding cases to test, I would try to cover a few scenarios, like: model without independent variables, model with a single continuous variable, model with a single categorical variable, model with both. Same for interactions: interaction between continuous variables, interaction between categorical variables, interaction between continuous and categorical. Of course there's no limit to the number of combinations one can try, so you need to stop at some point. :-)

src/coxph.jl Outdated
"""
fit(::Type{CoxModel}, M::AbstractMatrix, y::AbstractVector; kwargs...)

Given a matrix M of predictors and a corresponding vector of events, compute the
Member

M

src/Survival.jl Outdated

abstract type AbstractEstimator end
abstract type NonparametricEstimator <: AbstractEstimator end

include("eventtimes.jl")
include("kaplanmeier.jl")
include("cox_utils.jl")
Member

@ararslan will know best, but why not group all three Cox-related files into a single file? It doesn't sound too big, and that will keep the package cleaner when more models are supported.

Member

Yeah, I like the idea of the Cox stuff grouped as a single file.

src/coxph.jl Outdated
end

function _cox_fgh!(β, grad, hes, c::CoxAux{T}, compute_derivatives)::T where T
#get relevant quantities to compute negative loglikelihood, gradient and hessian
Member

It's more common to put comments about the whole function before its signature.

@piever
Contributor Author

piever commented Aug 31, 2017

Regarding cases to test, I would try to cover a few scenarios, like: model without independent variables, model with a single continuous variable, model with a single categorical variable, model with both. Same for interactions: interaction between continuous variables, interaction between categorical variables, interaction between continuous and categorical. Of course there's no limit to the number of combinations one can try, so you need to stop at some point. :-)

Good intuition, there actually is an issue with categorical variables: when I do @formula(y ~ 0 + x) and x is categorical, I get one term per value, which is actually overcomplete. Do you know a clean solution?

I'm actually tempted to say that it is a bug in @formula, because even when it is explicitly requested not to have an intercept, it effectively puts one back in the case of PooledDataArrays

@nalimilan
Member

Glad to hear that being a pain can sometimes be useful! Actually, it's perfectly expected that 0 + x gives a full-rank matrix when x is categorical. This is the correct specification, e.g., for a linear regression model without a separate intercept (but where the effects are included in the dummies).

Anyway, the 0 + hack was not ideal for Cox models, so this seems to indicate we need a deeper solution. Maybe there should be an argument when building the model matrix to say that the model family does not allow estimating the intercept/baseline hazard.

@kleinschmidt Do you have ideas about the best way of supporting this? Maybe we should open an issue in StatsModels.jl.

@piever
Contributor Author

piever commented Sep 2, 2017

I'm happy to take a look at StatsModels and see if there's a way to add an option to avoid the intercept for continuous and categorical variables, and try to make that option the default in the case of Cox regression via a dependency on StatsModels. A possible design is to add a function in StatsModels:

add_intercept(T) = true

that determines whether an intercept should be automatically included or not. Then all we'd need to do is overload it in Survival with:

StatsModels.add_intercept(::Type{CoxModel}) = false

But first it'd be good to understand a bit better what the dependency situation with DataFrames/DataTables/StatsModels is now and in the immediate future; I'm still a bit confused about it.

@nalimilan
Member

But first it'd be good to understand a bit better what's the dependency situation with DataFrames/DataTables/StatModels now and in the immediate future, I'm still a bit confused about it.

StatsModels used to depend on DataTables, but it's going to switch to DataFrames once it is ported from DataArrays to Nulls. Anyway, that doesn't really change its principle, which is that packages implementing models do not need to depend on either of these packages, and yet get the formula interface for free.

StatsModels.add_intercept is an interesting approach, but that would force packages to depend on StatsModels, and therefore on DataFrames/DataTables. We could define the function in StatsBase instead to avoid this. Better to file an issue against StatsModels with this suggestion to discuss it.

@kleinschmidt
Member

kleinschmidt commented Sep 2, 2017

I'm trying to catch up, but it's not clear to me what the problem with a full-rank matrix from ~0+x is. Perhaps it would be best to open an issue over at JuliaStats/StatsModels.jl and explain what the desired behavior is? The current behavior is this:

  • With ~1+... or ~..., an intercept column is generated and default (dummy coded) contrasts are used for categorical variables (n-1 indicator columns are generated for n levels)
  • With ~0+..., no intercept column is generated, and the first categorical variable in a main effect is "promoted" to full-rank dummy coding (n indicator columns for n levels) in order to avoid a rank-deficient model matrix.
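To make the two codings concrete, here's a small standalone illustration (toy code, not StatsModels internals) of full-rank vs reduced dummy coding for a 3-level variable:

```julia
levels_of(x) = unique(x)

# Full-rank dummy coding: n indicator columns for n levels.
full_dummies(x) = [xi == l ? 1.0 : 0.0 for xi in x, l in levels_of(x)]

# Default (reduced) dummy coding: drop the first level's column (n - 1 columns);
# the dropped level becomes the reference absorbed by the intercept.
reduced_dummies(x) = full_dummies(x)[:, 2:end]

x = ["a", "b", "c", "a"]
full_dummies(x)     # 4×3 matrix
reduced_dummies(x)  # 4×2 matrix
```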

I think there's pretty good consensus about getting rid of the default intercept term, which would make the ~x equivalent to ~0+x instead of ~1+x, and get rid of the 0+x altogether. I want to make that change in the first proper release of StatsModels since it's a pretty breaking change...

One possibility is that if you specify contrasts manually, it'll use them even if they're nonsensical for a regression analysis (generating an overcomplete or deficient matrix...). Or we could add a flag that would do that.

@piever
Contributor Author

piever commented Dec 15, 2017

No problem, I've just rebased, added tests for categorical and continuous cases (and all three pairwise combinations) and added the correct StatsModels dependency. Should be good to go now.

@nalimilan
Member

Great! I'll let @ararslan do the honours...

step_size = one(T)
while fgh!(x+search_dir*step_size, grad, hes, false) > y+c*step_size*(grad' * search_dir)
step_size *= ρ
step_size > 1e-10 || error("Linesearch failed! Problematic Hessian or Gradient?")
Member

We could get coverage to 100% if we could come up with a test that hits this as well as the ConvergenceException below. Is that doable?

Contributor Author

@piever piever Dec 15, 2017

Sure, I'll add a small testset for the Newton algorithm tomorrow.

Member

@test_throws

Member

Sounds great. Other than that, I think this is good to go. You've done such excellent work here!!

@@ -1,3 +1,5 @@
julia 0.6
StatsBase
Distributions
PositiveFactorizations
StatsModels 0.2.1
Member

Heh, technically the StatsModels tag should have been 0.3.0 since it introduces a new feature, but whatevs.

Member

We'll have to make a decision about this, but in my book major releases are only needed when breaking backward compatibility. Otherwise, almost all of our releases would need to be major ones...

Member

@ararslan ararslan Dec 15, 2017

My understanding is that, per semantic versioning, patch releases (i.e. the z in x.y.z) can only contain fixes which do not introduce new behaviors or change existing behaviors. Minor releases (the y) should be used when introducing new features or when making a backwards-incompatible change only if x = 0. Major releases (the x) should be used for backwards-incompatible changes when x >= 1.

Member

Yeah, but before 1.0 we use the patch release as the minor release.

Member

Okay. I don't actually care that much anyway. 😛

@piever
Contributor Author

piever commented Dec 16, 2017

Added tests for Newton's algorithm: should be back to full coverage now.

@piever
Contributor Author

piever commented Dec 16, 2017

I've also added a fix to the docs (fit needed a type signature as both KaplanMeier and Cox are using it in the API). Cox could potentially have more extensive documentation, but I'd rather have this merged first (before it gets "conflicts" again) and add more docs later, especially because there's still a fair amount of stuff to port from AcceleratedFailure.

@ararslan ararslan merged commit 3cbcaf0 into JuliaStats:master Dec 16, 2017
@ararslan
Member

Thanks so much for this!

@piever piever deleted the cox branch December 19, 2017 01:39