Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enhance download_and_extract #8216

Merged
merged 12 commits into from
Dec 21, 2024
Merged

Conversation

Jerome-Hsieh
Copy link
Contributor

@Jerome-Hsieh Jerome-Hsieh commented Nov 16, 2024

Fixes #5463

Description

According to issue, the error messages are not very intuitive.
I think maybe we can check if the file name matches the downloaded file’s base name before starting the download.
If it doesn’t match, it will notify user.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
  • Quick tests passed locally by running ./runtests.sh --quick --unittests --disttests.
  • In-line docstrings updated.
  • Documentation updated, tested make html command in the docs/ folder.

@ericspod
Copy link
Member

Hi @Jerome-Hsieh If you're having issues with putting this PR together we can discuss how to resolve them, but please do not close and then open a new PR. Thanks for the contribution!

monai/apps/utils.py Outdated Show resolved Hide resolved
monai/apps/utils.py Outdated Show resolved Hide resolved
monai/apps/utils.py Outdated Show resolved Hide resolved
monai/apps/utils.py Outdated Show resolved Hide resolved
Signed-off-by: jerome_Hsieh <[email protected]>
Signed-off-by: jerome_Hsieh <[email protected]>
Signed-off-by: jerome_Hsieh <[email protected]>
Signed-off-by: jerome_Hsieh <[email protected]>
@Jerome-Hsieh
Copy link
Contributor Author

@ericspod @KumoLiu
Here's newest change. like we discussed before. I check the filepath every time at the very beginning.
If no extension is provided, the system will automatically add the appropriate file extension with a warning.
A warning will also appear if provided extension doesn't match the downloaded file's extension .

@KumoLiu
Copy link
Contributor

KumoLiu commented Dec 5, 2024

@ericspod @KumoLiu Here's newest change. like we discussed before. I check the filepath every time at the very beginning. If no extension is provided, the system will automatically add the appropriate file extension with a warning. A warning will also appear if provided extension doesn't match the downloaded file's extension .

Hi @Jerome-Hsieh, thank you for the quick update. I tested your latest changes, but neither "." nor "./test" as the file path seems to work. Is this the expected behavior? In my opinion, both should work. The logic could be adjusted so that if the file path is a directory and not empty, it automatically captures the name and downloads the file into the specified directory. What do you think?

url = "https://github.com/Project-MONAI/MONAI-extra-test-data/releases/download/0.8.1/MedNIST.tar.gz"
monai.apps.utils.download_and_extract(url, filepath="./test")

@Jerome-Hsieh
Copy link
Contributor Author

Hi @KumoLiu my test results:

>>> url = "https://github.com/Project-MONAI/MONAI-extra-test-data/releases/download/0.8.1/MedNIST.tar.gz"
>>> monai.apps.utils.download_and_extract(url, filepath=".")
2024-12-06 18:06:56,333 - INFO - Expected md5 is None, skip md5 check for file ..
2024-12-06 18:06:56,334 - INFO - File exists: ., skipped downloading.
2024-12-06 18:06:56,334 - INFO - Non-empty folder exists in ., skipped extracting.

and

>>> monai.apps.utils.download_and_extract(url, filepath="./test")
2024-12-06 18:08:41,246 - WARNING - filepath=./test, which missing file extension. Auto-appending extension to: ./test.tar.gz
test.tar.gz: 59.0MB [00:10, 6.07MB/s]
2024-12-06 18:08:51,443 - INFO - Downloaded: test.tar.gz
2024-12-06 18:08:51,443 - INFO - Expected md5 is None, skip md5 check for file test.tar.gz.
2024-12-06 18:08:51,443 - INFO - Writing into directory: ..

If the results like above, that's expected behavior.

Your opinion sounds like
url=example.com/file.tar.gz, filepath=./test
the program will work :./test/file.tar.gz.
If I don't misunderstanding your thoughts, this method although can automatically captures the name, not only capture the strange filename make user confused if the URL don't have appropriate naming, but also can't allow user set the filename they want.

@mingxin-zheng
Copy link
Contributor

Regarding file downloading from URL, sometimes we can infer the name from the content deposition. Here is one example snippet: https://github.com/Project-MONAI/VLM/blob/7be688aa457a1806f908eb758f2f3ee816fea017/m3/demo/experts/utils.py#L135 @KumoLiu

@Jerome-Hsieh
Copy link
Contributor Author

Thank @mingxin-zheng provide information to find file name.
But I want to say that at first my initial approach to this issue that filepath should be a compressed file rather than a directory.
I wanted to focus on making filepath valid. Method provided by @KumoLiu it's great, but which somewhat deviates from my original intent.
Sorry for not providing a clear explanation.

Signed-off-by: jerome_Hsieh <[email protected]>
@Jerome-Hsieh
Copy link
Contributor Author

Hi @KumoLiu the newest commit I change the logic that if the file path is a directory and not empty, it will automatically capture the name from content deposition, if it doses't have any file name, will use url basename and download the file into the specified directory then gives a warning.
Besides, I also add if set the wrong file extension, it will raise the error and stop downloading.
Please let me know if you have any suggestions. Thx!

@KumoLiu
Copy link
Contributor

KumoLiu commented Dec 16, 2024

Hi @KumoLiu the newest commit I change the logic that if the file path is a directory and not empty, it will automatically capture the name from content deposition, if it doses't have any file name, will use url basename and download the file into the specified directory then gives a warning. Besides, I also add if set the wrong file extension, it will raise the error and stop downloading. Please let me know if you have any suggestions. Thx!

Hi @Jerome-Hsieh, thank you for the update. After our discussion in the development meeting, I realized that there is an output_dir argument in download_and_extract. I apologize for any confusion my previous comments may have caused. It would be great to proceed by considering only the extension as before. Sorry again for the misunderstanding.

monai/apps/utils.py Outdated Show resolved Hide resolved
@Jerome-Hsieh
Copy link
Contributor Author

Jerome-Hsieh commented Dec 16, 2024

Hi @KumoLiu the newest commit I change the logic that if the file path is a directory and not empty, it will automatically capture the name from content deposition, if it doses't have any file name, will use url basename and download the file into the specified directory then gives a warning. Besides, I also add if set the wrong file extension, it will raise the error and stop downloading. Please let me know if you have any suggestions. Thx!

Hi @Jerome-Hsieh, thank you for the update. After our discussion in the development meeting, I realized that there is an output_dir argument in download_and_extract. I apologize for any confusion my previous comments may have caused. It would be great to proceed by considering only the extension as before. Sorry again for the misunderstanding.

Hi @KumoLiu I see, and what do you think about previous version a9a0171 I commit ?
Could you please identify any areas that require improvement or could be further enhance ? Thx!

@KumoLiu
Copy link
Contributor

KumoLiu commented Dec 16, 2024

Hi @KumoLiu I see, and what do you think about previous version a9a0171 I commit ? Could you please identify any areas that require improvement or could be further enhance ? Thx!

Hi @Jerome-Hsieh, a9a0171 looks good. The only update needed is how to extract the file extension from the URL.

For example, given a URL like "https://drive.google.com/u/1/uc?id=1KntZge40tAHgyXmHYVqZZ5d2p_4Qr2l5&export=download", you can use get_filename_from_url to determine the exact extension based on the filename. What do you think? Please also include the tests, thanks!

@KumoLiu KumoLiu closed this Dec 16, 2024
@KumoLiu KumoLiu reopened this Dec 16, 2024
Signed-off-by: jerome_Hsieh <[email protected]>
@Jerome-Hsieh
Copy link
Contributor Author

Hi @Jerome-Hsieh, a9a0171 looks good. The only update needed is how to extract the file extension from the URL.

For example, given a URL like "https://drive.google.com/u/1/uc?id=1KntZge40tAHgyXmHYVqZZ5d2p_4Qr2l5&export=download", you can use get_filename_from_url to determine the exact extension based on the filename. What do you think? Please also include the tests, thanks!

@KumoLiu the URL you provide can download but can't extract, I extract manually still fail.
And I found that download large file from google drive there's no content deposition to get filename, so I use web parsing to find the filename, when download small file just use the same way to get filename.
Here's my test URL "https://drive.google.com/uc?id=18u7FlNtqn5K1H308bGZwp6YNAKxpJz8-"

Signed-off-by: jerome_Hsieh <[email protected]>
Signed-off-by: jerome_Hsieh <[email protected]>
Signed-off-by: jerome_Hsieh <[email protected]>
monai/apps/utils.py Outdated Show resolved Hide resolved
Signed-off-by: jerome_Hsieh <[email protected]>
@Jerome-Hsieh
Copy link
Contributor Author

@KumoLiu Thanks a lot for your assistance!It looks more cleaner than before!

@KumoLiu
Copy link
Contributor

KumoLiu commented Dec 19, 2024

/build

@KumoLiu KumoLiu requested a review from ericspod December 19, 2024 10:01
@KumoLiu
Copy link
Contributor

KumoLiu commented Dec 19, 2024

@ericspod do you have any further comments here? Otherwise I will try to merge this one, thanks.

@KumoLiu KumoLiu enabled auto-merge (squash) December 21, 2024 14:18
@KumoLiu KumoLiu merged commit efff647 into Project-MONAI:dev Dec 21, 2024
28 checks passed
@Jerome-Hsieh Jerome-Hsieh deleted the issue5463 branch December 21, 2024 14:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

parameter filepath of download_and_extract
4 participants