
Use multithreaded YouTube downloader from zimscraperlib #301

Closed

Conversation

dan-niles
Collaborator

Close #118

Use the newer multithreaded YouTube downloader from zimscraperlib to download videos, thumbnails, and subtitles.

Changes:

  • Remove "batch" downloading of videos and the download_video_files_batch method.
  • Update the download_video, download_thumbnail, and download_subtitle methods to use the new multithreaded downloader from scraperlib.
  • Update the above methods to download files asynchronously, as sketched below.
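
For context, a minimal sketch of how this downloader is typically driven (assuming the `YoutubeDownloader`/`BestMp4` API from zimscraperlib's `download` module; the video ID and paths are illustrative, not this PR's actual code):

```python
from pathlib import Path

from zimscraperlib.download import BestMp4, YoutubeDownloader

# one shared thread pool for every video/thumbnail/subtitle download
downloader = YoutubeDownloader(threads=4)

video_id = "aaaaaaaaaaa"  # hypothetical video ID
options = BestMp4.get_options(
    target_dir=Path("videos") / video_id,
    filepath=Path("video.mp4"),
)

# wait=False submits the job to the pool and returns a Future immediately
future = downloader.download(video_id, options, wait=False)

future.result()  # blocks until done; re-raises any yt-dlp error
downloader.shutdown()
```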

@dan-niles dan-niles self-assigned this Aug 30, 2024

codecov bot commented Aug 30, 2024

Codecov Report

Attention: Patch coverage is 0% with 77 lines in your changes missing coverage. Please review.

Project coverage is 1.54%. Comparing base (eedd0b2) to head (69cb71c).
Report is 3 commits behind head on main.

Files with missing lines           | Patch % | Lines
-----------------------------------|---------|--------------
scraper/src/youtube2zim/scraper.py | 0.00%   | 77 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##            main    #301      +/-   ##
========================================
+ Coverage   1.52%   1.54%   +0.01%     
========================================
  Files         11      11              
  Lines       1112    1102      -10     
  Branches     166     158       -8     
========================================
  Hits          17      17              
+ Misses      1095    1085      -10     


@dan-niles dan-niles marked this pull request as ready for review August 30, 2024 19:05
@dan-niles dan-niles requested a review from rgaudin August 30, 2024 19:07
@rgaudin rgaudin requested review from benoit74 and removed request for rgaudin August 31, 2024 14:25
@rgaudin
Member

rgaudin commented Aug 31, 2024

Changed reviewer, as @benoit74 is coming back.

Collaborator

@benoit74 benoit74 left a comment


This is not working exactly as intended: when a video fails to download, it is not counted as failed, and the final ZIM is hence created with no videos at all (in my test, where I not-so-intentionally had an issue with yt-dlp).

download_video and download_thumbnail now return a future, so one has to check its status after completion (not an easy feat, tbh).
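
For reference, a sketch of what that status bookkeeping could look like (same assumed zimscraperlib API as in the PR description; IDs and paths are illustrative):

```python
from concurrent.futures import wait
from pathlib import Path

from zimscraperlib.download import BestMp4, YoutubeDownloader

downloader = YoutubeDownloader(threads=2)
video_ids = ["aaaaaaaaaaa", "bbbbbbbbbbb"]  # hypothetical IDs

# map each Future back to its video ID so failures can be attributed
futures = {
    downloader.download(
        video_id,
        BestMp4.get_options(target_dir=Path("videos") / video_id),
        wait=False,
    ): video_id
    for video_id in video_ids
}

wait(futures.keys())  # block until every download has finished or raised

succeeded = [vid for fut, vid in futures.items() if fut.exception() is None]
failed = [vid for fut, vid in futures.items() if fut.exception() is not None]
downloader.shutdown()
```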

@dan-niles dan-niles force-pushed the use-multithreaded-yt-downloader-from-zimscraperlib branch 2 times, most recently from 9d3e4e4 to b1304aa Compare September 5, 2024 06:36
@dan-niles dan-niles force-pushed the use-multithreaded-yt-downloader-from-zimscraperlib branch from b1304aa to 962b6b3 Compare September 6, 2024 12:24
@dan-niles dan-niles marked this pull request as draft September 6, 2024 18:02
@dan-niles
Collaborator Author

@benoit74 I've updated the code to handle errors correctly. It now properly returns both the succeeded and failed video_ids. I've also listed below the time taken to create a ZIM of the openZIM_testing channel with different concurrency values (without using the S3 cache).

Concurrency | Time (s)
----------- | --------
1           | 40.16
2           | 25.29
4           | 21.11

However, I realized that most ZIMs on the Zimfarm are built using the S3 cache. With this approach, only the videos that are not in the S3 cache, and are therefore downloaded directly with yt-dlp, benefit from the parallelization. This is because I'm using the ThreadPoolExecutor in YoutubeDownloader from python-scraperlib, rather than downloading in batches as suggested in #118 (comment). This change will therefore only provide a performance gain when videos are missing from the S3 cache, as sketched below.
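
```python
# Illustrative sketch of the flow described above; the two helpers are
# hypothetical stand-ins, not real kiwixstorage or zimscraperlib calls.

def download_from_s3_cache(s3_storage, video_id) -> bool:
    """Hypothetical: return True if the video could be fetched from the S3 cache."""
    ...

def options_for(video_id) -> dict:
    """Hypothetical: build the yt-dlp options for this video."""
    ...

def schedule_downloads(video_ids, downloader, s3_storage):
    """Return {future: video_id}, with futures created for S3 cache misses only."""
    futures = {}
    for video_id in video_ids:
        if download_from_s3_cache(s3_storage, video_id):
            # S3 hit: fetched synchronously; the thread pool is never involved
            continue
        # S3 miss: only these submissions share YoutubeDownloader's ThreadPoolExecutor
        future = downloader.download(video_id, options_for(video_id), wait=False)
        futures[future] = video_id
    return futures
```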

Given that most downloads rely on the S3 cache, would it be better to keep the current approach of downloading in batches and using the YoutubeDownloader from scraperlib without concurrent execution?

WDYT?

@benoit74
Collaborator

benoit74 commented Sep 9, 2024

> Given that most downloads rely on the S3 cache, would it be better to keep the current approach of downloading in batches and using the YoutubeDownloader from scraperlib without concurrent execution?

This is indeed a very important point! The current implementation allows concurrent download of both S3-cached videos and YouTube videos, as well as concurrent re-encoding.

I'm sorry not to have noticed this before. I feel a bit ashamed to have wasted your time on this. Thank you for speaking up; it would have been a bad move indeed!

I propose we close the issue and the PR as "non-relevant", WDYT? Or, at the least, we should enhance the scraperlib first, but that is not a small change to implement.

@dan-niles
Collaborator Author

No worries at all! I should have considered this earlier, before moving forward with the implementation. I'm fine with closing this issue as "non-relevant". That said, I agree that using the YouTube downloader from scraperlib prevents us from achieving concurrency across S3 downloads and video re-encoding.

@benoit74 benoit74 closed this Sep 10, 2024