
Full package review for v0.3.0 #394

Closed
wants to merge 1,144 commits into from

Conversation

joshwlambert (Member)

This PR is to provide a platform to review the entirety of the package.

Once this review concludes I will release v0.3.0 on GitHub and submit to CRAN.

Please see the NEWS.md file for an overview of changes between v0.2.0 and v0.3.0. If you would prefer a partial package review showing only the changes between v0.2.0 and v0.3.0, please let me know and I can open one.

This PR is unconventional: it is not intended for merging or for additional commits (unless minor). Instead, comments will be converted to issues, and these will be addressed in their own PRs.

joshwlambert and others added 30 commits June 10, 2024 15:33
@chartgerink (Member) left a comment


Thanks @joshwlambert for opening the full package review! It's really good work and a big achievement 🙌


I honestly do not have much to remark. I left some comments for your consideration, but nothing major. I am sure I missed minor issues that can still be improved, but it looks good overall. It is a lot of code, so I am hoping for more eyes on this because I'll inevitably miss aspects (I too get tired 😄).

I will leave you with two more general questions:

  1. Is there a general testing strategy that you applied? There is a lot of testing right now, and it would help me understand whether there are any additional strategic considerations to make. For example, did you ensure that for every positive (success) test you also had a negative (failure) test, to ensure it is not giving a false negative?
  2. You know you may get comments on the use of for loops instead of vectorizing 😄 I honestly do not mind, but to preempt this question, could you expand on the use of for loops instead of vectorizing in some places?

#' @return A boolean `logical` whether the object is a valid `<epiparameter>`
#' object.
#' @export
test_epiparameter <- function(x) { # nolint cyclocomp_linter
Member


I've seen this nolint a few times now; might it be easier to be more explicit by setting the cyclocomp parameter to allow deeper complexity? Or to leave a comment on why this nolint is relevant in each specific case?

Member Author

I like having the lintr check the cyclomatic complexity as I feel it does help me not to write overly complex functions. The default is 15 (https://lintr.r-lib.org/reference/cyclocomp_linter.html) which I think I'll leave the same for now. As you can see I don't comply with the cyclocomp linter too strictly by occasionally using # nolint cyclocomp_linter.

I think the best approach would either be to comply with the linter more strictly and remove the # nolint flags or use them sparingly like the current approach. I think all other Epiverse-TRACE packages use the default linter so it would be good to try and stay consistent across the organisation.
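For reference, if stricter package-wide compliance were preferred, the threshold could be raised once in the `.lintr` config instead of using per-line flags. A hypothetical sketch (the limit of 25 is arbitrary, not a recommendation):

```
linters: linters_with_defaults(
    cyclocomp_linter = cyclocomp_linter(complexity_limit = 25L)
  )
```

This trades per-call-site visibility (`# nolint` marks exactly which functions exceed the default) for a single documented project-wide setting.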

lapply(lst, function(x) {
if (nse_subject %in% names(x)) {
# <epiparameter> is only nested once so no need for recursive search
eval(expr = condition, envir = x)
Member

Feels slightly vulnerable to evaluate an expression without adding some checks to it. It is an opportunity for some code to be injected unknowingly. If you can add a check to ensure it's (semi) in the format you would expect, that would make the function more robust as well.

Member Author

I like this suggestion, but nothing comes immediately to mind as to how we could check these conditions before they get evaluated. Do you have ideas for how to do this?
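One possible direction (a rough sketch only; `safe_eval` and its whitelist are hypothetical names, not epiparameter code) is to inspect the unevaluated expression and reject anything that calls functions outside an allowed set before handing it to `eval()`:

```r
# Hypothetical sketch, not epiparameter code: only eval() the condition
# if every function it calls is on a whitelist.
safe_eval <- function(condition, envir) {
  # all.names() returns function and variable names; all.vars() only
  # variables, so the difference is the set of functions being called
  called <- setdiff(all.names(condition), all.vars(condition))
  allowed <- c("==", "!=", "<", ">", "<=", ">=", "&", "|", "%in%", "c", "(")
  if (!all(called %in% allowed)) {
    stop("condition contains disallowed calls: ",
         toString(setdiff(called, allowed)))
  }
  eval(expr = condition, envir = envir)
}

safe_eval(quote(disease == "SARS"), envir = list(disease = "SARS"))  # TRUE
try(safe_eval(quote(system("ls")), envir = list()))  # rejected before eval
```

This does not make arbitrary expressions safe in general, but it narrows the surface to comparison-style conditions of the form the database filtering expects.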

@TimTaylor self-requested a review October 16, 2024 08:57
@TimTaylor (Contributor) left a comment

Hi @joshwlambert. I just wanted to get down some, predominantly high-level, comments about the epiparameter class before any CRAN release. As I had limited time I have had to focus on R/epiparameter.R.

I'd like to take a further look at the rest of the package and the vignettes (at a glance they look very thorough) but this won't happen in the near future.

@TimTaylor (Contributor), Oct 16, 2024

General comments on this file:

  • At present new_epiparameter() is only used within epiparameter(). Is it worth pulling the functionality into epiparameter() itself? A minimal constructor is normally useful when you need to quickly recreate an object after a method has dropped its class.
  • chkDots() seems to be used in some methods but not others.
  • I see you use assert_epiparameter() at the start of some epiparameter methods in other files (e.g. convert_params_to_summary_stats.epiparameter()). Is this a "belts and braces" approach or do you think there is something fragile about the epiparameter class that makes it likely to be broken by users?
  • As discussed offline, I wonder if there is an intermediate more tidy-like tabular data structure that would be useful. I'm not sure but leaving here for my own pondering ...
library(jsonlite)
library(dplyr)

dat <- read_json(
  path = system.file(
    "extdata",
    "parameters.json",
    package = "epiparameter",
    mustWork = TRUE
  )
)

out <- lapply(
  dat,
  function(x) {
    x[4:9] <- lapply(x[4:9], list)
    as_tibble(x)
  }
)

(out <- bind_rows(out))
#> # A tibble: 125 × 9
#>    disease           pathogen epi_name probability_distribu…¹ summary_statistics
#>    <chr>             <chr>    <chr>    <list>                 <list>            
#>  1 Adenovirus        Adenovi… incubat… <named list [2]>       <named list [8]>  
#>  2 Human Coronavirus Human_C… incubat… <named list [2]>       <named list [8]>  
#>  3 SARS              SARS-Co… incubat… <named list [2]>       <named list [8]>  
#>  4 Influenza         Influen… incubat… <named list [2]>       <named list [8]>  
#>  5 Influenza         Influen… incubat… <named list [2]>       <named list [8]>  
#>  6 Influenza         Influen… incubat… <named list [2]>       <named list [8]>  
#>  7 Measles           Measles… incubat… <named list [2]>       <named list [8]>  
#>  8 Parainfluenza     Parainf… incubat… <named list [2]>       <named list [8]>  
#>  9 RSV               RSV      incubat… <named list [2]>       <named list [8]>  
#> 10 Rhinovirus        Rhinovi… incubat… <named list [2]>       <named list [8]>  
#> # ℹ 115 more rows
#> # ℹ abbreviated name: ¹​probability_distribution
#> # ℹ 4 more variables: citation <list>, metadata <list>,
#> #   method_assessment <list>, notes <list>

Contributor

EDIT: Updated the comment above as I forgot to save it before submitting the review.

}
}

if (epi_name == "offspring_distribution") {
Contributor

epi_name appears to be free-form, but this check is quite specific. Is it a remnant of a past design? Is it worth being more restrictive on the values allowed for epi_name?

Member Author

I have made some improvements to this in PR #401 to try and match non-delay distributions, which currently include offspring distributions and case fatality risks.

The data dictionary used in the JSON validation workflow uses an enum field to check that the epi_name is within a predefined set https://github.com/epiverse-trace/epiparameter/blob/main/inst/extdata/data_dictionary.json#L23.

However, this is a temporary solution and needs to be made more scalable as more non-delay distributions are added to the package, and to accommodate users creating <epiparameter> objects with a variety of parameter names. We could impose that the epi_name argument must be within the same set as defined in the data dictionary when specified by the user in epiparameter(), but I don't think there would be much benefit, and it could hinder users and add development burden when adding more parameter types.

Happy to discuss this further to find a more optimal solution. This can optionally be filed as an issue to continue discussion after the package review.
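As a sketch of what such a user-facing restriction could look like (the function name and the set of allowed values below are placeholders, not the actual enum from inst/extdata/data_dictionary.json):

```r
# Hypothetical sketch: validate `epi_name` against a predefined set,
# mirroring the enum check in the JSON data dictionary. The `allowed`
# vector is a placeholder, not the real dictionary contents.
check_epi_name <- function(epi_name) {
  allowed <- c(
    "incubation period", "serial interval", "generation time",
    "offspring distribution", "case fatality risk"
  )
  if (!is.character(epi_name) || length(epi_name) != 1L ||
        !epi_name %in% allowed) {
    stop("`epi_name` must be one of: ", toString(allowed))
  }
  invisible(epi_name)
}

check_epi_name("offspring distribution")  # passes silently
```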

)

# call epiparameter validator
assert_epiparameter(epiparameter)
Contributor

Is this really needed? It seems like you have done this validation within this function itself.

Member Author

I agree this is superfluous; I like having it to ensure the object created is valid. I think it is impossible to reach the assert_epiparameter() call with an invalid <epiparameter> object, but I've added it as an extra layer of defence.

As the time to run this extra function is negligible, I do not think there is much reason to remove it, but if users were creating many thousands or millions of <epiparameter> objects this would be a good first thing to remove for a speed gain.

If there are other reasons to remove that I've overlooked please let me know.

@joshwlambert (Member Author)

Thank you both for the helpful comments and suggestions @chartgerink & @TimTaylor! I've made several PRs with improvements resulting from this review, which are either linked in my responses or at the bottom of this PR (or both).

In response to some other questions and comments:

  1. Is there a general testing strategy that you applied? There is a lot of testing right now, and it would help me understand whether there are any additional strategic considerations to make. For example, did you ensure that for every positive (success) test you also had a negative (failure) test, to ensure it is not giving a false negative?

There currently is no testing strategy. I usually try to write unit tests for both exported and internal functions covering expected behaviour under the most common usage scenarios, plus a couple of failure expectations to check that the function errors nicely.

Given the namespace of this package is quite large (relative to other Epiverse-TRACE packages), I've thought about moving to only testing exported functions. Some tests have been added to check that issues are resolved correctly, so it has been a fairly organic process of adding tests over time. I'd be happy to discuss possible testing strategies to make the process a bit more formal.
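As a sketch of the positive/negative pairing idea from question 1 (`scale_to_unit` is a made-up example function, and the checks use base R so the snippet is self-contained; in the package these would be testthat expect_*() calls):

```r
# Made-up example function with an input check that errors nicely
scale_to_unit <- function(x) {
  if (!is.numeric(x)) stop("`x` must be numeric")
  x / max(x)
}

# positive (success) test: valid input gives the expected value
stopifnot(identical(scale_to_unit(c(1, 2, 4)), c(0.25, 0.5, 1)))

# negative (failure) test: invalid input fails, with the intended message
msg <- tryCatch(scale_to_unit("a"), error = function(e) conditionMessage(e))
stopifnot(grepl("must be numeric", msg))
```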

  1. You know you may get comments on the use of for loops instead of vectorizing 😄 I honestly do not mind, but to preempt this question, could you expand on the use of for loops instead of vectorizing in some places?

There is often no formal logic for when to use vectorisation versus loops. Usually when I write code I will write loops (that's just how my brain works). Then, once the code is drafted, I'll see how it works and whether it can be optimised, usually by vectorising loops. Sometimes I've found I'm not able to vectorise, so the loop remains; other times I've found the loop more readable and left it as is.

If there are specific functions that contain either loops or vectorised statements that you think could be converted to the other, please raise them as an issue and I'd be happy to update (e.g., epiverse-trace/simulist#150).
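To illustrate the trade-off with a toy example (nothing here is epiparameter code), the same computation written as a for loop and vectorised:

```r
# Toy comparison: cumulative mean as an explicit loop...
cumulative_mean_loop <- function(x) {
  out <- numeric(length(x))
  for (i in seq_along(x)) {
    out[i] <- mean(x[seq_len(i)])
  }
  out
}

# ...and vectorised; same result, the choice is readability vs speed
cumulative_mean_vec <- function(x) cumsum(x) / seq_along(x)

x <- c(2, 4, 6, 8)
identical(cumulative_mean_loop(x), cumulative_mean_vec(x))  # TRUE
```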

At present new_epiparameter() is only used within epiparameter(). Is it worth pulling the functionality into epiparameter() itself? A minimal constructor is normally useful when you need to quickly recreate an object after a method has dropped its class.

Currently, the split in functionality is: epiparameter() does the input checking and handles the parameter uncertainty after the <epiparameter> object has been constructed in new_epiparameter(), while new_epiparameter() calculates the distribution parameters from summary statistics, if available. I could potentially merge them as, like you say, new_epiparameter() is only ever called within epiparameter(). I'd be interested to hear your opinions on when to have a minimal class constructor and whether it should do no more than apply the class attribute. I saw you had a similar issue in {incidence2} (reconverse/incidence2#103), so it would be good to get your thoughts, as I assume a similar design choice could be taken across packages.
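For reference, a generic sketch of the helper/constructor split under discussion (the class and field names are illustrative, not the <epiparameter> internals):

```r
# minimal constructor: only applies the class attribute, no checks
new_record <- function(l = list()) {
  structure(l, class = "record")
}

# user-facing helper: validates inputs, then constructs
record <- function(disease, epi_name) {
  stopifnot(is.character(disease), is.character(epi_name))
  new_record(list(disease = disease, epi_name = epi_name))
}

obj <- record("SARS", "incubation period")
# the minimal constructor makes it cheap to restore the class after a
# method (e.g. unclass() or `[`) has dropped it
restored <- new_record(unclass(obj))
inherits(restored, "record")  # TRUE
```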

chkDots() seems to be used in some methods but not others.

chkDots() should only be used in methods where extra arguments should not be passed via .... I will check whether they are being used consistently and whether any other methods should call it.
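A minimal sketch of how chkDots() is typically used in an S3 method that accepts ... only for compatibility with the generic (the `record` class and method are made up):

```r
# Made-up S3 print method: `...` exists only to match the generic's
# signature, so chkDots() warns if anything is actually passed via it
print.record <- function(x, ...) {
  chkDots(...)  # warns if arguments passed via ... would be ignored
  cat("disease:", x$disease, "\n")
  invisible(x)
}

obj <- structure(list(disease = "SARS"), class = "record")
print(obj)           # prints normally
print(obj, foo = 1)  # also warns that `foo` is disregarded
```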

I see you use assert_epiparameter() at the start of some epiparameter methods in other files (e.g. convert_params_to_summary_stats.epiparameter()). Is this a "belts and braces" approach or do you think there is something fragile about the epiparameter class that makes it likely to be broken by users?

Yes on the latter. It is in case an <epiparameter> method is dispatched but the object has been invalidated by a user between being imported/constructed and the generic being called. I currently haven't implemented any subsetting or assignment operator methods for <epiparameter> (e.g. $, [, etc.), so this is to offset that.

As discussed offline, I wonder if there is an intermediate more tidy-like tabular data structure that would be useful. I'm not sure but leaving here for my own pondering ...

This relates to #362, I will copy over the code chunk to that issue to continue discussion there. I won't make any changes with regard to this point before the release, but can give it some thought over the next development cycle.
