partial rewrite

JuliaSurv · Dec 2, 2024 · 6ae6354 · 6ae6354
1 parent 7df430d
commit 6ae6354

Showing 3 changed files with 30 additions and 40 deletions.
diff --git a/.vscode/ltex.disabledRules.en-US.txt b/.vscode/ltex.disabledRules.en-US.txt
@@ -0,0 +1,2 @@
+HAZARD
+LARGE
diff --git a/docs/src/example.md b/docs/src/example.md
@@ -2,18 +2,17 @@
 
 ## Introduction and datasets
 
-We will illustrate with an example using the dataset `colrec`, which comprises $5971$ patients diagnosed with colon or rectal cancer  between $1994$ and $2000$. This dataset is sourced from the Slovenia cancer registry. Given the high probability that the patients are Slovenian, we will be using the Slovenian mortality table `slopop` as reference for the populational rates. Subsequently, we can apply various non-parametric estimators for net survival analysis.
+We will illustrate with an example using the dataset `colrec`, which comprises $5971$ patients diagnosed with colon or rectal cancer between $1994$ and $2000$. This dataset is sourced from the Slovenia cancer registry. Given the high probability that the patients are Slovenian, we will be using the Slovenian mortality table `slopop` as reference for the population mortality rates. Subsequently, we can apply various non-parametric estimators for net survival analysis.
 
 !!! note "N.B." 
     Mortality tables may vary in structure, with options such as the addition or removal of specific covariates. To confirm that the mortality table is in the correct format, please refer to [`RateTables.jl`'s documentation](https://JuliaSurv.github.io/RateTables.jl/), or directly extract it from there.
 
 ### Cohort details
 
-The patients in the study are diagnosed between January 1st 1994 and December 31st 2000. Before we move on to the survival probabilities, it is important to be aware of how your data is distributed and of what it comprises. 
+The patients in the study are diagnosed between January 1st $1994$ and December 31st $2000$. Before we move on to the survival probabilities, it is important to be aware of how your data is distributed and of what it comprises. 
 
 ```@example 2
 using NetSurvival, RateTables, DataFrames
-
 first(colrec,10)
 ```
 
@@ -27,26 +26,19 @@ The study can be considered diverse in terms of age seeing as the patients are b
 
 ```@example 2
 using Plots 
-
-plot(
-    histogram(colrec.age./365.241, label="Age"),
-    histogram(colrec.time./365.241, label="Follow-up time")
-)
+plot(histogram(colrec.age./365.241, label="Age"),
+    histogram(colrec.time./365.241, label="Follow-up time"))
 ```
 
-The graph above show us that although it has a wide range of patients within all age groups, it is mostly centered around older adults and elderly, with the majority of the patients being between 60 and 80 years old. 
-
-Looking at the second graph that details the distribution of the follow-up times, we notice that the values quickly drop. Unfortunately, this is a common theme in cancer studies. 
+The graph above shows that although the dataset covers a wide range of ages, it is mostly centered around older adults and the elderly, with the majority of the patients being between $60$ and $80$ years old. In the second graph, which details the distribution of the follow-up times, we notice that the values drop quickly. Unfortunately, this is a common theme in cancer studies.
 
-Let's take a look at the `sex` variable now: 
+Let's now look at the `sex` variable by counting the male and female patients:
 
 ```@example 2
 combine(groupby(colrec, :sex), nrow)
 ```
 
-This dataframe shows the number of male and female patients. There isn't too big of a difference between the two. We can say this study includes both gender relatively equally, thus, reducing bias. 
-
-With these two observations, it is also worth noting that colorectal cancer is most common with men and people older than 50.
+There isn't too big of a difference between the two. We can say this study includes both genders relatively equally, thus reducing bias. With these two observations, it is also worth noting that colorectal cancer is most common among men and people older than $50$.
 
 In total, we note that we have $5971$ patients. By taking a look at the `status` variable, we can determine the deaths and censorship:
 
@@ -70,16 +62,16 @@ We will be using the mortality table `slopop` taken from the `RateTables.jl` pac
 slopop
 ``` 
 
-By examining `slopop`, we notice it contains information regarding `age` and `year`, as expected for mortality tables. Additionally, it incorporates the covariate sex, which has two possible entries (`:male` or `:female`). The ratetable is then three dimensional, with the covariate `sex` added. For example, the daily hazard rate for a woman turning $45$ on the January 1st $2006$ can be accessed through the following command:
+The `show` method of the `RateTable` type displays the additional covariate `sex` that the rate table has on top of the (mandatory) `age` and `year` variables. The `sex` variable has two modalities, `:male` and `:female`. The rate table is therefore three-dimensional. For example, the daily hazard rate for a woman turning $45$ on January 1st $2006$ can be accessed through the following command:
 
 ```@example 2
-daily_hazard(slopop, 45*365.241, 2006*365.241; sex=:female)
+λ = daily_hazard(slopop, 45*365.241, 2006*365.241; sex=:female)
 ``` 
 
-Making the survival probability easily calculated with:
+The daily survival probability can then be computed with:
 
 ```@example 2
-exp(-(daily_hazard(slopop, 45*365.241, 2006*365.241; sex=:female))*365)
+exp(-λ)
 ``` 
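
Review note: this hunk changes the quantity shown from a yearly survival probability (`exp(-λ*365)` in the removed line) to a daily one (`exp(-λ)`). The sketch below shows how the two relate when the daily hazard is assumed constant over the year. The value of `λ` and the helper `survival` are hypothetical and only serve the illustration; in the docs, `λ` comes from `daily_hazard(slopop, ...)`.

```julia
# Hypothetical daily hazard, used only for illustration.
# In the docs, this value is returned by `daily_hazard(slopop, ...)`.
λ = 3.0e-6

# Survival probability over `days` days under a constant daily hazard `λ`.
survival(λ, days) = exp(-λ * days)

daily  = survival(λ, 1)        # same quantity as `exp(-λ)`
yearly = survival(λ, 365.241)  # one-year survival under the same constant hazard

println("daily: ", daily, ", yearly: ", yearly)
```

Under this constant-hazard assumption, the yearly probability is the daily one raised to the power of the number of days, which is why the two versions of the example give very different numbers.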
 
 ## Overall and expected survival
@@ -88,7 +80,6 @@ For this part, we will be using the `Survival.jl` package to apply the Kaplan Me
 
 ```@example 2
 using Survival 
-
 km = fit(KaplanMeier, colrec.time./365.241, colrec.status)
 plot(km.events.time, km.survival, label=false, title = "Kaplan-Meier Estimator for the Overall Survival")
 ```
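
Review note: for readers unfamiliar with what `fit(KaplanMeier, ...)` estimates, here is a minimal sketch of the product-limit computation in plain Julia. The function `kaplan_meier` and the toy data are hypothetical and are not part of `Survival.jl`; the package should be preferred in practice.

```julia
# Minimal Kaplan-Meier product-limit estimator (illustrative only).
# `status[i]` is true for an observed death and false for a censored observation.
function kaplan_meier(time::AbstractVector{<:Real}, status::AbstractVector{Bool})
    order = sortperm(time)
    t, d = time[order], status[order]
    at_risk = length(t)
    surv = 1.0
    times = Float64[]
    estimates = Float64[]
    i = 1
    while i <= length(t)
        j = i
        deaths = 0
        # Group all observations that share the same time.
        while j <= length(t) && t[j] == t[i]
            deaths += d[j]
            j += 1
        end
        # The estimate only changes at times where at least one death occurs.
        if deaths > 0
            surv *= 1 - deaths / at_risk
            push!(times, t[i])
            push!(estimates, surv)
        end
        # Everyone observed at this time (dead or censored) leaves the risk set.
        at_risk -= j - i
        i = j
    end
    return times, estimates
end

# Toy data (hypothetical, not taken from `colrec`).
times, estimates = kaplan_meier([2.0, 3.0, 3.0, 5.0, 8.0],
                                [true, true, false, true, false])
println(times)
println(estimates)
```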
@@ -97,23 +88,21 @@ The graph above indicates a significant dip in survival probability within the f
 
 ## Estimated net survival
 
-In this part, we are interested in the first $5$ years of the study. We will thus limit the follow-up time to $5$ years, meaning we will censor all individuals with a follow-up time that is higher than this. Then, we will apply the different net survival methods.
+We will restrict ourselves to the first $5$ years of the study. For that, let us re-censor the dataset as follows: 
 
 ```@example 2 
-colrec.time5 .= 0.0
-colrec.status5 .= Bool(true)
 for i in 1:nrow(colrec)
-    colrec.time5[i] = min(colrec.time[i], round(5*365.241))
-    if colrec.time[i] > 5*365.241
-        colrec.status5[i] = false
+    if colrec.time[i] > 1826 # five years
+        colrec.status[i] = false
+        colrec.time[i] = 1826
     end
 end
 ```
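
Review note: the new loop overwrites `colrec.time` and `colrec.status` in place, so the original follow-up is no longer available to later `@example 2` blocks. If that ever matters, a non-mutating variant could look like the sketch below. The helper `censor_at` and the toy vectors are hypothetical, and the sketch assumes `status` is a `Bool` vector.

```julia
# Administrative censoring at `horizon` (same time unit as `time`).
# Returns new vectors and leaves the inputs untouched.
function censor_at(time::AbstractVector{<:Real}, status::AbstractVector{Bool}, horizon::Real)
    new_time = min.(time, horizon)             # cap follow-up at the horizon
    new_status = status .& (time .<= horizon)  # events after the horizon become censored
    return new_time, new_status
end

# Toy data (hypothetical, not taken from `colrec`).
time = [100.0, 1826.0, 2500.0, 3000.0]
status = [true, true, true, false]

time5, status5 = censor_at(time, status, 1826)
println(time5)
println(status5)
```

This produces the same censoring rule as the loop in the hunk (observations with `time > 1826` are capped and marked as censored) while keeping the original vectors intact.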
 
-Now that we have defined our own time and status variables according to the observations made, we can apply the different non parametric methods for relative survival.
+We can now apply the different non-parametric methods to compute the relative survival.
 
 ```@example 2
-e1 = fit(EdererI, @formula(Surv(time5,status5)~1), colrec, slopop)
+e1 = fit(EdererI, @formula(Surv(time,status)~1), colrec, slopop)
 ```
 
 With the EdererI method, after $1826$ days have passed, we can say that the survival rate at this mark is around $0.456$, in the hypothetical world where patients can only die of cancer.
@@ -125,25 +114,24 @@ println(crude_e1.Mₒ[1826], " , ", crude_e1.Mₑ[1826], " , ", crude_e1.Mₚ[18
 
 Out of the 0.63 patients that have died, according to the EdererI method, 0.51 died because of colorectal cancer and 0.12 died of other causes.
 
-
 ```@example 2
-e2 = fit(EdererII, @formula(Surv(time5,status5)~1), colrec, slopop)
+e2 = fit(EdererII, @formula(Surv(time,status)~1), colrec, slopop)
 ```
 
-Similarily, the EdererII method, also known as the conditional method, shows that at the $5$ year mark, the survival probability is of $0.44$ in this hypothetical world.
+Similarly, the EdererII method, also known as the conditional method, shows that at the $5$-year mark, the survival probability is $0.44$ in this hypothetical world.
 
 ```@example 2
 crude_e2 = CrudeMortality(e2)
 println(crude_e2.Mₒ[1826], " , ", crude_e2.Mₑ[1826], " , ", crude_e2.Mₚ[1826])
 ```
 
-Here, out of the 0.63 patients that have dued, 0.53 are due to colorectal cancer and 0.1 due to other causes.
+Here, out of the 0.63 proportion of patients who have died, 0.53 died because of colorectal cancer and 0.1 of other causes.
 
 ```@example 2
-pp = fit(PoharPerme, @formula(Surv(time5,status5)~1), colrec, slopop)
+pp = fit(PoharPerme, @formula(Surv(time,status)~1), colrec, slopop)
 ```
 
-We conclude for the Poher-Perme method, that in a world where cancer patients could only die due to cancer, only 41% of these patients would still be alive $5$ year after their diagnosis.
+We conclude from the Pohar Perme method that, in a world where cancer patients could only die of cancer, only 41% of these patients would still be alive $5$ years after their diagnosis. The Pohar Perme estimator is the best estimator of the excess hazard under the standard hypotheses.
 
 ```@example 2
 crude_pp = CrudeMortality(pp)
@@ -171,22 +159,22 @@ p2 = plot!(pp.grid, crude_pp.Mₚ, label = "Population Mortality Rate")
 plot(p1,p2)
 ```
 
-Looking at the graph, and the rapid dip it takes, it is evident that the first $5$ years are crucial and that the survival probability is highly affected in these years. Additionnally, the crude mortality graph allows us to see how much of this curve is due to the colorectacl cancer studied versus other undefined causes. It is clear that the large majority is due to the cancer.
+Looking at the graph, and the rapid dip it takes, it is evident that the first $5$ years are crucial and that the survival probability is highly affected in these years. Additionally, the crude mortality graph allows us to see how much of this curve is due to the colorectal cancer studied versus other undefined causes. It is clear that the large majority is due to the cancer.
 
 ## Net survival with respect to covariates
 
 We are now interested in comparing the different groups of patients defined by various covariates. 
 
 ```@example 2
-pp_sex = fit(PoharPerme, @formula(Surv(time5,status5)~sex), colrec, slopop)
+pp_sex = fit(PoharPerme, @formula(Surv(time,status)~sex), colrec, slopop)
 pp_males = pp_sex[pp_sex.sex .== :male,:estimator][1]
 pp_females = pp_sex[pp_sex.sex .== :female,:estimator][1]
 ```
 
 When comparing at time $1826$, we notice that the survival probability is slightly inferior for men than for women ($0.433 < 0.449$). It is also more probable for the women to die from other causes than the men seeing as $0.0255 > 0.025$. Still, the differences are minimal. Let's confirm this with the Grafféo log-rank test:
 
 ```@example 2
-test_sex = fit(GraffeoTest, @formula(Surv(time5,status5)~sex), colrec, slopop)
+test_sex = fit(GraffeoTest, @formula(Surv(time,status)~sex), colrec, slopop)
 ```
 
 The p-value is indeed above $0.05$. We cannot reject the null hypothesis $H_0$ and thus we dismiss the differences between the two sexes.
@@ -195,7 +183,7 @@ As for the age, we will define two different groups: individuals aged 65 and abo
 
 ```@example 2
 colrec.age65 .= ifelse.(colrec.age .>= 65*365.241, :old, :young)
-pp_age65 = fit(PoharPerme, @formula(Surv(time5,status5)~age65), colrec, slopop)
+pp_age65 = fit(PoharPerme, @formula(Surv(time,status)~age65), colrec, slopop)
 pp_young = pp_age65[pp_age65.age65 .== :young, :estimator][1]
 pp_old = pp_age65[pp_age65.age65 .== :old, :estimator][1]
 ```
@@ -207,7 +195,7 @@ It is also worth noting that their chances of dying from other causes is higher
 When applying the Grafféo test, we get the results below:
 
 ```@example 2
-test_age65 = fit(GraffeoTest, @formula(Surv(time5,status5)~age65), colrec, slopop)
+test_age65 = fit(GraffeoTest, @formula(Surv(time,status)~age65), colrec, slopop)
 ```
 
 The p-value is well under $0.05$, meaning we reject the $H_0$ hypothesis and must admit there are differences between the individuals aged 65 and above and the others.

diff --git a/docs/src/index.md b/docs/src/index.md
@@ -8,7 +8,7 @@ The `NetSurvival.jl` package provides the necessary tools to perform estimations
 
 By integrating observed data from the target population with historical population mortality data (usually sourced from national census datasets), Net Survival allows the extraction of the specific mortality hazard associated with the particular disease, even under the missing indicatrix issue. The concept of relative survival analysis dates back several decades to the seminal article by Ederer, Axtell, and Cutler in 1961 [Ederer1961](@cite) and the one by Ederer and Heise in 1959 [Ederer1959](@cite).
 
-For years, the Hakulinen estimator (1977) [Hakulinen1977](@cite) and the Ederer I and II estimators were widely regarded as the gold standard for non-parametric survival curve estimation. However, the introduction of the Pohar-Perme, Stare, and Estève estimator in 2012 [PoharPerme2012](@cite) resolved several issues inherent in previous estimators, providing a reliable and consistent non-parametric estimator for net survival analysis.
+For years, the Hakulinen estimator (1977) [Hakulinen1977](@cite) and the Ederer I and II estimators were widely regarded as the gold standard for non-parametric survival curve estimation. However, the introduction of the Pohar Perme, Stare, and Estève estimator in 2012 [PoharPerme2012](@cite) resolved several issues inherent in previous estimators, providing a reliable and consistent non-parametric estimator for net survival analysis.
 
 ## Features