Skip to content

PyOpenSci REVIEW - config and download default directories #51

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

Robaina
Copy link
Owner

@Robaina Robaina commented Jan 26, 2023

This PR addresses:

Addressing reviewer @Batalex comments in PyOpenSci:

CLI

  • The defaults paths are quite inconvenient if I run the cli from the installed package rather than the source directory, the config file as well as the downloaded database ends up in /home/<user>/miniconda3/envs/Pynteny/lib/python3.10/site-packages/

Right. Regarding the config file, I think I should perhaps rename it to avoid confusion. It is not really intended to be modified by the user (like a proper config file). Rather, it's been used to store session information, such as whether or not the PGAP database has already been downloaded as well as the path to PGAP. This way, I find it ok to store this file within the installation directory since the user will not interact with the file.

Regarding the default dir in which to download the PGAG database, what's a better alternative to the installation directory for the default path? The user can still select a different location via the flag pynteny download --outdir. The advantage I see placing the PGAP's database in the installation directory is that it can be easily removed when uninstalling pynteny with conda. And I don't see a clear disadvantage in placing the PGAP database in the install directory by default. Could you elaborate on why it is inconvenient?

pynteny/subcommands.py

  • wget is quite old and does not seem to be active. How about replacing it with a more maintained alternative? e.g. httpx, requests

I find python-wget useful and very easy to use (e.g download file and show a progress bar with a single method download. However, it is true that it hasn't been updated since 2015.

I'm getting an error with requests

EOFError: Compressed file ended before the end-of-stream marker was reached

it seems like the file doesn't get downloaded in its totality. I have tried httpx and it gives me this error:

httpx.ReadTimeout: The read operation timed out

I think we can stick with python-wget for now, it works well. I think we can worry about it once it stops working.

Not really, changed!

@Robaina Robaina added enhancement New feature or request code review labels Jan 26, 2023
@Robaina Robaina self-assigned this Jan 26, 2023
@Robaina Robaina linked an issue Jan 26, 2023 that may be closed by this pull request
@codecov
Copy link

codecov bot commented Jan 26, 2023

Codecov Report

Base: 97.50% // Head: 97.50% // No change to project coverage 👍

Coverage data is based on head (ed67965) compared to base (e2a37ef).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #51   +/-   ##
=======================================
  Coverage   97.50%   97.50%           
=======================================
  Files           5        5           
  Lines         160      160           
=======================================
  Hits          156      156           
  Misses          4        4           
Impacted Files Coverage Δ
tests/test_integration_search.py 94.73% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@Batalex
Copy link
Contributor

Batalex commented Jan 26, 2023

What about adding the option to choose the directory where to download the database / write the config file? I think this may fix the issue

Purely on a user experience basis, adding the option might be good, but the default behavior should be changed IMO.
The more you ease the first contact experience, as I call it, the more likely users will stick with your tool.

@Robaina
Copy link
Owner Author

Robaina commented Jan 26, 2023

What about adding the option to choose the directory where to download the database / write the config file? I think this may fix the issue

Purely on a user experience basis, adding the option might be good, but the default behavior should be changed IMO. The more you ease the first contact experience, as I call it, the more likely users will stick with your tool.

Makes sense that the first contact experience should be smooth. Still not sure which default directory would guarantee that, i.e., we could select the current working directory as the default directory to download the PGAP database, however, I'm not sure that's the best place either.

I personally like the PGAP database downloaded in the install directory, as I won't need to do anything with these files, Pynteny will take care of fetching them. On the other hand, storing the database in the installation directory is less prone to errors due to relocation of the database. If relocated, pynteny wouldn't know where to look to retrieve the HMM files within the PGAP database. The user would have to modify the config file to tell pynteny the new location.

Btw, I forgot about it, but it was already possible to select a different directory with pynteny download --outdir

@Batalex
Copy link
Contributor

Batalex commented Jan 26, 2023

I don't have access to a computer right on, how much does the database weight again?

If manageable, how do you feel about packaging the compressed file within the package itself?

@Batalex
Copy link
Contributor

Batalex commented Jan 26, 2023

At least we should figure out a way to delete the database if the package is uninstalled from the environment

@Robaina
Copy link
Owner Author

Robaina commented Jan 27, 2023

At least we should figure out a way to delete the database if the package is uninstalled from the environment

i see the problem you are pointing to now. Config file and database remain in the environment after removing pynteny. This is definitely inconvenient. I'll relocate these files inside the package itself, hopefully, they will get removed when uninstalling. Will check this and let you know. Thanks...

EDITED:

Ok, they are not removed when uninstalling pynteny ("Would not remove (might be manually added):". Will try any of the other options, then

@Robaina Robaina marked this pull request as ready for review January 27, 2023 13:03
@Robaina Robaina closed this Jan 27, 2023
@Robaina Robaina reopened this Jan 27, 2023
@Robaina Robaina force-pushed the 49-pyopensci-review-pyntenydownload-and-config-default-paths branch from 45193e6 to 8215ff6 Compare January 27, 2023 15:36
@Batalex
Copy link
Contributor

Batalex commented Jan 29, 2023

Ok, I might have figured something out in the end. Here is a summary of the possible approaches.

Packaging the compressed file within the package

Pros:

  • No need to download the files afterward
  • The files are deleted with the package
  • The code base can be simplified: the download subcommand and the config file could be removed (as well as the wget dependencies).

Cons:

  • Well, that would add over 300 MB to the package, translating to a bigger git repository and longer download/install times for users. Plus, some users might not be interested in this specific database so this extensive archive would be deadweight for them.
  • I you have multiple versions of Pynteny on your workstation (because of virtual environments), the data file is duplicated.

Adding empty files with the same names

So I thought of something else: to "cheat" Python into uninstalling the file, we can add empty files with the database + metadata names to the package. The download subcommand would then replace those files inside the virtual environment.

Pros:

  • The files are deleted with the package
  • The code base could be simplified: the workflow could be streamlined. The default configuration file would not be needed anymore.

Cons:

  • I you have multiple versions of Pynteny on your workstation (because of virtual environments), the data file is duplicated. However, since downloading the file is not mandatory, you could download the database only once and use the configuration flags/file for all Pynteny installs.
  • Updating the package inside the venv would overwrite the database, meaning that the user would download the database again

Using a single, documented path

Something like ~/.pynteny.

Pros:

  • No leftover files inside the venv after uninstall
  • No overwrite on updates

Cons:

  • Removal is manual

Do you have a preference among those three propositions?

@Robaina
Copy link
Owner Author

Robaina commented Jan 30, 2023

Hi @Batalex,

thanks for these suggestions, these are very creative, particularly the one cheating Python!

so, I have given it some thought. While I like the idea of installing the database within the package, I think the best and easiest option is to download the database somewhere else and make download location a required argument, so the user selects where to download it. The documentation will include a note clarifying that the database cannot be relocated.

At the end this HMM database is optional. This way it is the user who is responsible for its deletion once used. Will make the adjustments to the code in the following days.

@Batalex
Copy link
Contributor

Batalex commented Jan 30, 2023

You're welcome. I will update the review comment to check what has been done since then!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
code review enhancement New feature or request
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

PyOpenSci REVIEW: pynteny.download and config default paths
2 participants